Paper 1, Part III


DISCUSSION

   Conservation of sequences (genes) among panda, dog, and human has been determined previously.[10]  Fully 846 Mbp (60.0%) of the three 1.4 Gbp sequences were conserved among all three species at the 90%ID level.  The non-conserved fractions are 5.9%, 13.3%, and 27.2% for panda, dog, and human, respectively.  Although we made our comparisons at the 95+%ID level, we found corresponding statistics in Table 6 under normalized % coverage.  For S26 (which matches panda best), dog and human have the highest and third highest relative % coverage, reflecting the fact that the panda genome is the most conserved of the three.  For S140 (which matches dog best), panda and human have the next highest relative % coverages.  Finally, for S31 (which matches human best), the relative % coverages for dog and panda are the lowest, reflecting the fact that the human genome is the least conserved of the three.

   Expanding on a few previous general principles regarding the use of the BLAST™ summary output to identify a species from a DNA sequence,[11] we found that the following more detailed principles applied:

1. One cannot match what is not in the database.  A species with relatively shorter sequence entries will be pushed down the hit list (ordered by score), possibly to the point of not being reported as a result (below the “Maximum target sequences” lowest score).[12]

2.  One cannot find what one does not search for.  Overly narrow search criteria or preconceived notions can create false impressions.

3. Shorter sequence ranges can be significant if they match the database very well (99%+).  These may not appear or be obvious in preliminary searches because of their relatively lower scores.  Sort the results on the second moment, and these entries will move toward the top of the hit list.

4. If a species level search yields a relatively sparse hit list, expand the search for the suspected species to the genus and family level (step back). Good matches to closely related species at these levels may indicate that the species of interest is relatively under-represented in the database compared to its kin.  Compare the total number of database entries for each group through searches of the database by group names.

5. Short but contiguous hits can combine to give matches over significantly long sequence ranges.  To find these, sort the hits by Qstart, smallest to largest (column G), then by Qend, largest to smallest (column H).

6. A long hit list that contains relatively unrelated species with similar scores is not necessarily the sign of a previously unknown species. It could signal conserved genes, common gene spacers, or that the species of interest is not well represented in the database (if at all).  (See Principle 4.)

7. Hits with relatively long sequence lengths and high scores can have unacceptably low %ID.  Look at the individual %ID numbers in the downloaded Excel hit list, and remember that humans match chimpanzees at about 95%.

8. Nearly everything in this kind of analysis is relative (see the preceding principles).  Expand the scope of searches to get the proper perspective on scores, matching sequence length, % identity, mismatches, and gaps, especially as they relate to established phylogeny, i.e. the relative similarity of species.

9. Nucleotide, Genomes (chromosome), Genome plus Transcription (human), Reference Genomic Sequence and Shotgun Assembly databases should all be searched.  The Genome and Reference Genomic Sequence Databases have more sequence information for each species; however, they cover far fewer species than the Nucleotide Database.

10.  The NCBI databases are “moving targets,” as new sequence data are entered continually.  For example, the polar bear sequences in Table 1 were entered in the Nucleotide Database in June 2013, well after this study was initiated, and were subsequently moved to the Transcriptome Shotgun Assembly Database.

11.  Complex hit lists involving many species should be downloaded to a spreadsheet and enhanced by calculating moments about an appropriate numerical axis. An axis should be chosen so that a larger moment implies a better match.  Discard data below the axis numerical value.  Sorting by second moment (Equn. 1.2), followed by first moment (Equn. 1.1), should reveal best group matches.  Comparing average moments (Equns. 1.3 and 1.4) of candidate groups reveals an overall best match (Table 6).

 12.  Scatter charts of match variables, e.g. %ID and second moment, calculated for each hit and displayed across the entire unknown sequence, may reveal subtle overall match differences between candidate groups. 
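
   The sort described in Principle 5 can also be carried out outside a spreadsheet.  The following is a minimal Python sketch, not the procedure used in this study: the Hit record, its field names, the function names, the max_gap and min_pident parameters, and the example values are all hypothetical and are introduced only for illustration.  It orders hits by Qstart (smallest to largest), then by Qend (largest to smallest), and merges hits whose query ranges overlap or abut into longer combined ranges.  Hits below a chosen %ID can be excluded first, in the spirit of Principle 7.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One row of a downloaded hit list (field names are hypothetical)."""
    subject: str    # accession or species label
    qstart: int     # query start position, 1-based inclusive
    qend: int       # query end position, 1-based inclusive
    pident: float   # percent identity of the hit


def sort_hits(hits: List[Hit]) -> List[Hit]:
    """Order hits by query start ascending, then by query end descending."""
    return sorted(hits, key=lambda h: (h.qstart, -h.qend))


def merge_ranges(hits: List[Hit], max_gap: int = 0,
                 min_pident: float = 0.0) -> List[Tuple[int, int]]:
    """Merge hits whose query ranges overlap or abut (within max_gap bases).

    Hits with pident below min_pident are ignored before merging.
    Returns a list of (start, end) query ranges in ascending order.
    """
    merged: List[Tuple[int, int]] = []
    for h in sort_hits([x for x in hits if x.pident >= min_pident]):
        if merged and h.qstart <= merged[-1][1] + 1 + max_gap:
            # This hit continues the current range; extend its end if needed.
            start, end = merged[-1]
            merged[-1] = (start, max(end, h.qend))
        else:
            # A gap separates this hit from the previous range; start a new one.
            merged.append((h.qstart, h.qend))
    return merged


# Made-up example hits, not data from this study.
example_hits = [
    Hit("A", 501, 900, 99.2),
    Hit("B", 1, 300, 99.5),
    Hit("C", 301, 500, 98.9),
    Hit("D", 1, 120, 97.0),
    Hit("E", 2000, 2400, 99.0),
]

print(merge_ranges(example_hits))
```

   In this made-up example, the three short hits B, C, and A abut one another and combine into a single range covering query positions 1 to 900, while hit E remains a separate range.  A nonzero max_gap would also join ranges separated by a few unmatched bases; whether that is appropriate depends on the question being asked.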

   Failure to follow these principles can lead to misidentifications and incorrect conclusions.  For example, concluding that there is a “significant homology” from the BLAST™ summary results page alone, without examining the individual hit list (downloaded in Excel) or making comparison searches against other species, is a subjective judgment (Principles 2 and 8).  All mammals are related and have some homology in their DNA.  A better, quantifiable comparison needs to be made.

CONCLUSIONS

   Samples 26, 31, and 140 are not from the same species, nor are they subspecies.

   Sample 26 is from a bear, most likely a black bear, Ursus americanus.  This was also the previous conclusion of an independent investigation[6] of a duplicate sample using human and black bear primers.  Searches limited to human, other primates, the Canis genus, and all other species, produced poorer matches.  It is possible, but not likely, that the sample originates from a previously unknown or unreported bear species or black bear hybrid. 

   Sample 31 is genus Homo, most likely Homo sapiens.  Matches to other primates, to Canis, and to all other species were poorer.  The possibility that it could be a previously unknown, very closely related species or subspecies of the Homo genus could not be excluded, but is unlikely because the matches to human were so close.  There is no significant mosaic of human and primate-like sequences as claimed in the Ketchum conclusion (2); other primates were consistently poorer matches than human over the entire S31 sequence.

   Sample 140 is from a domestic dog, Canis lupus familiaris, or a similar Canis species.  Over each of the sequence ranges of the top 15 Canis hits, the “all other” categories also bested both human and other primates but were not close to the Canis matches, further supporting the conclusion that the sample is not human or even primate.    

   An NCBI database search for Neanderthal and Denisovan nuclear DNA sequences produced none of the latter and only five very short (<90 bp) “environmental” sequences of the former.  Such sparse or non-existent database coverage is certainly not enough to support the Ketchum conclusion (1) (database principle 1).

   In summary, none of the three Ketchum conclusions is supported by our nuclear DNA sequence interpretations of Samples 26, 31, and 140, which are from a black bear, a human, and a dog, respectively.  No new species of primate could be proven to exist based on these data, and no new phylogeny is suggested.

   The methodology described here avoids pitfalls and incorrect conclusions based on the results of nuDNA sequence searches against a database of known species.  The calculation of moments (Equns. 1.1 - 1.4), overall %ID (Equn. 1.5), and overall % coverage (Equn. 1.6) by candidate group (or taxon) more clearly differentiates the groups than any other summary statistics from BLAST™ results.

   Microsoft Excel spreadsheets are suitable for all the computations and graphics required for this kind of work.  Examples are available from the author upon request.  Specialized software to perform these calculations and sorts could, in principle, reduce the steps and effort involved.

   Some changes to the NCBI databases and the BLAST™ software are suggested for this kind of species identification.  Adding the taxonomy identification (taxid) field to the NCBI hit list in Excel would be highly desirable: it would allow hit lists to be sorted by species, genus, or family, eliminating the need for separate searches or for line-by-line identifications from accession numbers, which themselves require additional searches.  Also, the ability to search all NCBI databases in one submission would save time and avoid overlooking valuable data, especially for the non-expert.  At present these databases are partly redundant, are not adequately linked, and require separate searches for species identifications.  Nevertheless, they are invaluable in wildlife forensics.

ACKNOWLEDGMENTS

   The author acknowledges the inspiration of anonymous Internet bloggers, whose highly variable and mixed results highlighted the need for this study, and the NCBI helpdesk for critical information about the databases.  The author received no financial support for this work.

CONFLICT OF INTEREST

   The author declares that there are no conflicting interests.

REFERENCES

 

 

[1]        Parson, W.; Pegoraro, K.; Niederstatter, H.; Folger, M.; Steinlechner, M.  Species identification by means of the cytochrome b gene.  Int. J. Leg. Med., 2000, 114 (1-2), 23-28.

[2]        Hebert, P. D. N.; Ratnasingham, S.; de Waard, J. R.  Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species.  Proc. R. Soc. Lond. B, 2003, 270 (S1), S96-S99.

[3]        Linacre, A. M. T.; Tobe, S. S.  Wildlife DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester (UK), 2013;  pp. 110-126.

[4]        Linacre, A. M. T.; Tobe, S. S.  Wildlife DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester (UK), 2013; pp. 127-158.

[5]        Ketchum, M. S. et al.  Novel North American Hominins: Next Generation Sequencing of Three Whole Genomes and Associated Studies.  DeNovo, 2013, 1:1.  http://www.advancedsciencefoundation.org/#!novel-north-american-hominins/cayh.

[6]        Khan, T.; White, B.  Final Report on the Analysis of Samples Submitted by Tyler Huggins, Wildlife Forensic DNA Laboratory Case File 12-019; Trent University Oshawa: Peterborough, Ontario, Canada, 2012.

 

[7]        Madden, T.  The BLAST Sequence Analysis Tool.  In The NCBI Handbook; McEntyre, J.; Ostell, J., Eds.; National Center for Biotechnology Information: Bethesda, MD, 2003; Chapter 16.  http://www.ncbi.nlm.nih.gov/books/NBK21097/.

[8]        Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J.  Basic local alignment search tool.  J. Mol. Biol., 1990, 215 (3), 403-410.

[9]        Linacre, A. M. T.; Tobe, S. S.  Wildlife DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester (UK), 2013; pp. 111-123.

[10]      Li, R. et al.  The sequence and de novo assembly of the giant panda genome.  Nature, 2010, 463, 311-317.

[11]      Linacre, A. M. T.; Tobe, S. S.  Wildlife DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester (UK), 2013; pp. 160-170.

[12]      Linacre, A. M. T.; Tobe, S. S.  Wildlife DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester (UK), 2013; Chap. 4, footnote 1, p. 172.

 
