DISCUSSION
Conservation of sequences (genes) among panda, dog, and human has been determined previously.[10]
Fully 846 Mbp (60.0%) of the three 1.4 Gbp sequences were conserved among all three species
at the 90%ID level. The non-conserved fractions are 5.9%, 13.3%, and 27.2% for panda, dog,
and human, respectively.
Although we made our comparisons at the 95+%ID level, we found corresponding
statistics in Table 6 under normalized % coverage. For S26 (which matches panda best), dog and
human have the highest and third highest relative % coverage, reflecting the
fact that the panda genome is the most conserved of the three. For S140 (which matches dog best), panda
and human have the next highest relative % coverages. Finally, for S31 (which matches human best),
the relative % coverages for dog and panda are the lowest, reflecting the fact
that the human genome is the least conserved of the three.
Expanding on a few previous general principles regarding the use of the BLAST™ summary
output to identify a species from a DNA sequence,[11] we found that the following more
detailed principles applied:
1. One cannot match what is not in the database. A species with relatively shorter sequence
entries will be pushed down the hit list (ordered by score), possibly far enough to fall
below the lowest score retained under the “Maximum target sequences” setting and so not be
reported at all.[12]
2. One cannot find what one does not search for. Overly narrow search criteria or
preconceived notions can create false impressions.
3. Shorter sequence ranges can be significant if they match the database very well (99%+).
Because of their relatively lower scores, these may not appear, or may not be obvious, in
preliminary searches. Sorting the results on the second moment moves these entries toward
the top of the hit list.
4. If a species level
search yields a relatively sparse hit list, expand the search for the
suspected species to the genus and family level (step back). Good matches to
closely related species at these levels may indicate that the species of
interest is relatively under-represented in the database compared to its
kin. Compare the total number of database
entries for each group through searches of the database by group names.
5. Short but contiguous hits can combine to give matches over significantly long sequence
ranges. To find these, sort the hits by Qstart, smallest to largest (column G), then by
Qend, largest to smallest (column H).
6. A long hit list that contains relatively unrelated species with similar scores is not
necessarily the sign of a previously unknown species. It could signal conserved genes,
common gene spacers, or that the species of interest is not well represented in the
database, if at all (see Principle 4).
7. Hits with relatively long sequence lengths and high scores can have unacceptably low %ID.
Look at the individual %ID numbers in the downloaded Excel hit list, and remember that
human and chimpanzee sequences match at about 95%.
8. Nearly everything in this kind of analysis is relative (see the preceding principles).
Expand the scope of searches to get the proper perspective on scores, matching sequence
length, % identity, mismatches, and gaps, especially as they relate to established
phylogeny, i.e. the relative similarity of species.
9. Nucleotide, Genomes (chromosome), Genome plus Transcription (human), Reference Genomic
Sequence and Shotgun Assembly databases should all be searched. The Genome and Reference
Genomic Sequence Databases have more sequence information for each species; however, they
cover far fewer species than the Nucleotide Database.
10. The NCBI databases are “moving targets,” as new sequence data are entered continually.
For example, the polar bear sequences in Table 1 were entered in the Nucleotide Database in
June 2013, well after this study was initiated, and were subsequently moved to the
Transcriptome Shotgun Assembly Database.
11. Complex hit lists involving many species should be downloaded to a spreadsheet and
enhanced by calculating moments about an appropriate numerical axis. Choose the axis so
that a larger moment implies a better match, and discard data below the axis value. Sorting
by second moment (Equn. 1.2), followed by first moment (Equn. 1.1), should reveal the best
group matches. Comparing average moments (Equns. 1.3 and 1.4) of candidate groups reveals
an overall best match (Table 6).
12. Scatter charts of match variables, e.g. %ID and second moment,
calculated for each hit and displayed across the entire unknown sequence, may
reveal subtle overall match differences between candidate groups.
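The sort described in Principle 5 can also be done outside a spreadsheet. The following is a minimal sketch, not the author's procedure: it assumes each hit has been reduced to a (Qstart, Qend) pair of query coordinates, and the function name `merge_contiguous_hits`, the `max_gap` parameter, and the sample coordinates are all hypothetical. It sorts by Qstart ascending and Qend descending, as in Principle 5, and then merges hits that overlap or abut into longer matched ranges.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (qstart, qend), 1-based inclusive query coordinates


def merge_contiguous_hits(hits: List[Hit], max_gap: int = 0) -> List[Hit]:
    """Sort hits by qstart ascending, then qend descending, and merge hits
    that overlap or lie within max_gap bases of the current span."""
    ordered = sorted(hits, key=lambda h: (h[0], -h[1]))
    merged: List[Hit] = []
    for qstart, qend in ordered:
        if merged and qstart <= merged[-1][1] + max_gap + 1:
            # This hit overlaps, abuts, or sits inside the current span.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, qend))
        else:
            # A gap larger than max_gap starts a new span.
            merged.append((qstart, qend))
    return merged


# Made-up coordinates for illustration only.
hits = [(500, 620), (1, 200), (150, 320), (321, 480), (10, 90)]
spans = merge_contiguous_hits(hits)
print(spans)  # [(1, 480), (500, 620)]
```

With `max_gap=0` only overlapping or directly adjacent hits are joined; a larger value tolerates small unmatched stretches between hits, which is a judgment call rather than something the text prescribes.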
Failure to follow these principles can lead to misidentifications and incorrect
conclusions. For example, concluding that there is a “significant homology” from the
BLAST™ summary results page alone, without examining the individual hit list (downloaded
in Excel) or running comparison searches against other species, is a subjective judgment
(Principles 2 and 8). All mammals are related and have some homology in their DNA. A
better, quantifiable comparison needs to be made.
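One simple way to look past the summary score, in the spirit of Principles 3 and 7, is to triage the individual hits by %ID. The sketch below is illustrative only: the field names (`score`, `pident`, `length`), the function name `triage_by_identity`, and the sample rows are assumptions, and the 95 and 99 %ID thresholds are simply the levels mentioned in the text.

```python
from typing import Dict, List

Hit = Dict[str, float]


def triage_by_identity(
    hits: List[Hit], accept_pid: float = 95.0, strong_pid: float = 99.0
) -> Dict[str, List[Hit]]:
    """Bucket hits by %ID, ignoring score, so that short near-perfect hits
    are surfaced and long low-identity hits are set aside."""
    buckets: Dict[str, List[Hit]] = {"strong": [], "acceptable": [], "rejected": []}
    for hit in hits:
        if hit["pident"] >= strong_pid:
            buckets["strong"].append(hit)
        elif hit["pident"] >= accept_pid:
            buckets["acceptable"].append(hit)
        else:
            buckets["rejected"].append(hit)
    for group in buckets.values():
        # Within each bucket, list the best %ID first.
        group.sort(key=lambda h: h["pident"], reverse=True)
    return buckets


# Made-up rows: the highest-scoring hit has the lowest %ID.
example_hits = [
    {"score": 900.0, "pident": 88.5, "length": 700.0},
    {"score": 150.0, "pident": 99.6, "length": 85.0},
    {"score": 400.0, "pident": 96.2, "length": 240.0},
]
result = triage_by_identity(example_hits)
for name, group in result.items():
    print(name, [(h["pident"], h["score"]) for h in group])
```

In this made-up list the top hit by score lands in the rejected bucket, while the short 85-base hit is the only strong match.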
CONCLUSIONS
Samples 26, 31, and 140 are not from the
same species, nor are they subspecies.
Sample 26 is from a bear, most likely a
black bear, Ursus americanus. This was also the previous conclusion of an
independent investigation[6] of a duplicate
sample using human and black bear primers.
Searches limited to human, other primates, the Canis genus, and all other species produced poorer matches. It is possible, but not likely, that the
sample originates from a previously unknown or unreported bear species or black
bear hybrid.
Sample 31 is genus Homo, most likely Homo sapiens. Matches to other primates, to Canis,
and to all other species were poorer. The possibility that it could be a previously
unknown, very closely related species or subspecies of the Homo genus could not be
excluded, but is unlikely because the matches to human were so close. There is no
significant mosaic of human and primate-like sequences as claimed in the Ketchum
conclusion (2); other primates were consistently poorer matches than human over the
entire S31 sequence.
Sample 140 is from a domestic dog, Canis lupus familiaris, or a similar Canis species. Over each of the sequence ranges of the top 15
Canis hits, the “all other”
categories also bested both human and other primates but were not close to the Canis matches, further supporting the
conclusion that the sample is not human or even primate.
An NCBI database search for Neanderthal and Denisovan nuclear DNA sequences produced none
of the latter and only five very short (<90 bp) “environmental” sequences of the former.
Database coverage this sparse (Neanderthal) or non-existent (Denisovan) is certainly not
enough to support the Ketchum conclusion (1) (database principle 1).
In summary, none of the three Ketchum conclusions is supported by our nuclear DNA sequence
interpretations of Samples 26, 31, and 140, which are from a black bear, a human, and a
dog, respectively. No new species of primate could be proven to exist based on these data,
and no new phylogeny is suggested.
The methodology described here helps avoid pitfalls and incorrect conclusions based on the
results of nuDNA sequence searches against a database of known species. The calculation of
moments (Equns. 1.1 - 1.4), overall %ID (Equn. 1.5), and overall % coverage (Equn. 1.6) by
candidate group (or taxon) more clearly differentiates the groups than any other summary
statistics from BLAST™ results.
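The per-group calculation can be sketched in a few lines of code. Equns. 1.1 - 1.4 are defined elsewhere in the paper and are not reproduced in this section, so the definitions used below are stand-ins chosen only for illustration: a first moment of length × (%ID − axis), a second moment of length × (%ID − axis)², an axis at 95 %ID, and a plain mean per candidate group. The names `hit_moments` and `group_average_moments` and the sample hits are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

AXIS_PID = 95.0  # assumed axis value; a larger moment then means a better match


def hit_moments(pident: float, length: int, axis: float = AXIS_PID) -> Tuple[float, float]:
    """Return (first, second) moments of one hit about the axis.
    ASSUMED forms, not the paper's Equns. 1.1 and 1.2."""
    distance = pident - axis
    return length * distance, length * distance * distance


def group_average_moments(
    hits: Iterable[Tuple[str, float, int]], axis: float = AXIS_PID
) -> Dict[str, Tuple[float, float]]:
    """hits are (group, pident, length) rows. Rows below the axis are
    discarded; the rest are averaged per candidate group."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for group, pident, length in hits:
        if pident < axis:
            continue  # discard data below the axis value
        m1, m2 = hit_moments(pident, length, axis)
        entry = sums[group]
        entry[0] += m1
        entry[1] += m2
        entry[2] += 1
    return {g: (s[0] / s[2], s[1] / s[2]) for g, s in sums.items()}


# Made-up hits for two candidate groups.
sample_hits = [
    ("bear", 99.0, 100),
    ("bear", 97.0, 200),
    ("dog", 96.0, 100),
    ("dog", 94.0, 500),  # below the axis, so discarded
]
averages = group_average_moments(sample_hits)
print(averages)  # {'bear': (400.0, 1200.0), 'dog': (100.0, 100.0)}
```

Under these assumed definitions the second moment rewards hits far above the axis more strongly than the first, which is consistent with the advice to sort on the second moment before the first.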
Microsoft Excel spreadsheets are suitable for all the computations and graphics required
for this kind of work. Examples are available from the author upon request. Specialized
software to perform these calculations and sorts could, in principle, reduce the steps and
effort involved.
Some changes to the NCBI databases and the BLAST™ software are suggested for this kind of
species identification. Adding the taxonomy identification (taxid) field to the NCBI hit
list in Excel would be highly desirable for sorting hit lists by species, genus, or
family. It would eliminate the need for separate searches or for line-by-line
identifications from accession numbers, which themselves require additional searches.
Also, the ability to search all NCBI databases in one submission would save time and avoid
overlooking valuable data, especially for the non-expert. These databases are partly
redundant and not adequately linked, and they require separate searches for species
identifications. However, they are invaluable in wildlife forensics.
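To illustrate what a taxid column would make possible, the sketch below groups a tab-separated hit list by taxonomy ID. It assumes the hits are available as text with the columns query, subject, %ID, alignment length, and taxid, in that order; the column names follow the `staxids` specifier of the command-line BLAST+ tabular output, in which several IDs may be joined by ";". The function name `summarize_by_taxid` and the sample rows are made up. This is not the format of the Excel download discussed above.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Tuple

# Assumed column order of the tab-separated hit list.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "staxids"]


def summarize_by_taxid(tsv_text: str) -> Dict[str, Tuple[int, float]]:
    """Return {taxid: (number of hits, mean %ID)} for a tab-separated hit list."""
    per_taxid = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        # If a hit lists several taxids separated by ';', use the first one.
        taxid = row["staxids"].split(";")[0]
        per_taxid[taxid].append(float(row["pident"]))
    return {t: (len(v), sum(v) / len(v)) for t, v in per_taxid.items()}


# Made-up rows; 9606 is the NCBI taxid for human and 9615 for domestic dog.
sample = "\n".join([
    "S31\tacc1\t99.0\t300\t9606",
    "S31\tacc2\t97.0\t250\t9606",
    "S31\tacc3\t92.0\t400\t9615",
])
summary = summarize_by_taxid(sample)
print(summary)  # {'9606': (2, 98.0), '9615': (1, 92.0)}
```

Grouping by genus or family would additionally require a lookup from taxid to lineage, which is not shown here.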
ACKNOWLEDGMENTS
The author
acknowledges the inspiration of anonymous Internet bloggers, whose highly
variable and mixed results highlighted the need for this study, and the NCBI
helpdesk for critical information about the databases. The author received no financial support for
this work.
CONFLICT OF INTEREST
The author declares that there are no
conflicting interests.
REFERENCES
[1] Parson, W.; Pegoraro, K.; Niederstatter, H.; Folger, M.; Steinlechner, M. Species
identification by means of the cytochrome b gene. Int. J. Leg. Med., 2000, 114 (1-2), 23-28.
[2] Hebert, P. D. N.; Ratnasingham, S.; de Waard, J. R. Barcoding animal life: cytochrome c
oxidase subunit 1 divergences among closely related species. Proc. R. Soc. Lond. B, 2003,
270 (S1), S96-S99.
[3] Linacre, A. M. T.; Tobe, S. S. Wildlife
DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester
(UK), 2013; pp. 110-126.
[4] Linacre, A. M. T.; Tobe, S. S. Wildlife
DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester
(UK), 2013; pp. 127-158.
[5] Ketchum, M. S. et al. Novel North American Hominins: Next Generation Sequencing of
Three Whole Genomes and Associated Studies. DeNovo, 2013, 1:1.
http://www.advancedsciencefoundation.org/#!novel-north-american-hominins/cayh.
[6] Khan, T.; White, B. Final
Report on the Analysis of Samples Submitted by Tyler Huggins, Wildlife Forensic
DNA Laboratory Case File 12-019; Trent University Oshawa: Peterborough,
Ontario, Canada, 2012.
[7] Madden, T. The BLAST Sequence Analysis
Tool, In The NCBI Handbook; McEntyre
J; Ostell J., Eds.; National Center for Biotechnology Information: Bethesda,
MD, 2003; Chapter 16. http://www.ncbi.nlm.nih.gov/books/NBK21097/.
[8] Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local
alignment search tool. J. Mol. Biol., 1990, 215 (3), 403-410.
[9] Linacre, A. M. T.; Tobe, S. S. Wildlife
DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester
(UK), 2013; pp. 111-123.
[10] Li, R. et al. The sequence and de novo assembly of the giant panda
genome. Nature, 2010, 463, 311-317.
[11] Linacre, A. M. T.; Tobe, S. S. Wildlife
DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester
(UK), 2013; pp. 160-170.
[12] Linacre, A. M. T.; Tobe, S. S. Wildlife
DNA Analysis: Applications in Forensic Science; Wiley-Blackwell: Chichester
(UK), 2013; Chap. 4, footnote 1, p. 172.