Search This Blog

Saturday, September 27, 2014

Q30 Scores: Arbitrary Relative Numbers - Not A Measure of Contamination

Melba Ketchum makes much of Q30 scores in claiming her samples were uncontaminated, and therefore her highly variable results are real. But "Real what?" is the question.  Ladies and children may be reading this, so I won't answer directly.  Let's start with the basics and demystify this over-touted "statistic." It's not even a real probability, I will show.

You will not learn too much from an advertising "Tech Note" from Illumina(R) like "Quality scores from Next-Generation Sequencing" (http://res.illumina.com/documents/products/technotes/technote_q-scores.pdf ); but read this first; it's short, has some basic graphs, and has one important equation:

Q = -10 log P

That's log base 10.  (Those who paid attention in my general chemistry classes will notice that it has a form like the definition of pH, except for the leading "10.").  Q is the quality "score" for a single base in a sequence - NOT a "probability." P is also  NOT a true "probability" in the statistical sense.  What is P then?  The Devil's in the details.

Admittedly I was shocked to learn that statistical probability of sequencing error is NOT involved in P or Q.  From Ewing and Green (http://genome.cshlp.org/content/8/3/186.full.pdf+html ):


"The following four parameters were found to be particularly effective at discriminating errors from correct base-calls. In each case, smaller parameter values correspond to higher quality (more accurate sequence).

[NOTE:  Probabilities work oppositely.  The HIGHER the number, the more likely the event, AND the MAXIMUM value is one. HVH]

1. Peak spacing. The ratio of the largest peak-to-peak spacing, in a window of seven peaks centered on the current one, to the smallest peak-to -peak spacing. The minimum possible value of one corresponds to evenly spaced peaks.

2. Uncalled/called ratio. The ratio of the amplitude of the largest uncalled peak, in a window of seven peaks around the current one, to the smallest called peak; if there is no uncalled peak, the largest of the three uncalled trace array values at the location of the called base peak is used instead. 
[An uncalled peak is a peak in the signal
that was not assigned to a predicted location by phred (Ewing et al. 1998) and thus does not result in a base call.] If the called base is an N, Phred assigns a large value of 100.0. Note that this is not what is sometimes called the signal to noise ratio, as uncalled peaks may be true peaks missed by the base-calling program rather than noise in the conventional sense. The minimum parameter value is 0 for traces with no uncalled peaks.

3. Same as 2, but using a window of three peaks.

4. Peak resolution. The number of bases between the current base and the nearest unresolved base, times -1 (to force the parameter to have the right direction). (A base is unresolved if it is called as N or if for at least one of its neighboring bases, there is no point between the two corresponding peaks at which the signal is less than the signal at each peak). The minimum possible parameter value is half the number of bases in the trace, times -1, and the maximum value is 0."

[NOTE: Phred is a computer program.]

Don't be alarmed if this sounds like gibberish.  My point here is that these four parameters ARE ARBITRARY, not fundamental, and the calculations which follow are CURVE-FITTING.  The final result is a Q for each base.  Q30 is the percentage of individual base Q's which exceed 30 by the above equation.  It is not a probability, but a relative number.

What is the effect of a sample containing DNA from multiple species?  Is the Q30 for the major genome affected by the presence of the other genomes?  NOT NECESSARILY.  The contaminant species must be in significant concentrations and must react with the primer, e.g. a universal or non-specific primer.  The electropherogram 
peak shapes may then be affected, which could affect the four parameters above and could lower Qs. Statements that impurities lower Q30 to 40-50% are over generalizations.  Melba's Q30 of 85% isn't all that great.  Pure, single species Q30 values are usually >95%.  (See first reference above). And remember, this is a logarithmic scale.


Another important fact is that mtDNA is present in the cell at about 1000 X the concentration of nDNA.  This means that mtDNA sequencing requires only about 1/1000 as much sample as nDNA.   Therefore, a contaminant may show up in a mtDNA test and not show up in the nDNA sequencing.  This is the most likely explanation for some of the mixed results in the Ketchum study.

So what's the bottom line?  Melba's whining about Q30 scores is largely irrelevant to her conclusions.  She must have mentioned the number dozens of times in her appeals to the editors of Nature.  They weren't impressed.  Neither am I, and neither should you be.

Next, I'll examine the Nature peer reviews and Melba's responses.  Incredibly, the reviewers totally missed the boat in some cases, but their "arm-chair"* assessments were generally on target.

*Definition (mine): "arm-chair", adj.  An opinion which is not based on any in depth investigation, rather based on overall impressions and/or generalizations from past experience.