Bigfoot Claims: Paper 1, Part I

Methodology and New Metrics for Distinguishing Related Species from Incomplete nuDNA Sequences.

By Haskell V. Hart

May, 13, 2014

Duplication of any part of this document is permitted for noncommercial use only. Republication may be permitted by expressed written permission of the author if proper credit is given.

Abstract: The case of identifying a species from a relatively short nuDNA sequence (<0.5% of genome) can be complicated by the conservation of genes in related species. Unstructured or unrestricted database searches (e.g. BLAST™ from NCBI/NLM/NIH) will typically yield a long hit list of relatively good matches to numerous related species, genuses and even different families. The problem is, “Which is the best match, and what level of specification (family, genus, or species) is justified.” We develop a methodology using a case example of three purported “novel North American hominin” nuDNA sequences. We employ database queries in BLAST™, and introduce the concept of moments to identify the best matching group (or taxon) among candidate groups with similar statistics in the hit list. The methodology applies well to wildlife forensics, anthropology, and criminalistics, and can be implemented with an Internet connection and Microsoft® Excel.

INTRODUCTION

Compared to nuDNA, species identification is more often attempted with mtDNA, because the amount available in a cell is many times more than the nuDNA, the whole genome is over 10⁵ times smaller than the nuclear genome, allowing for whole genome analysis in a short time, and because mtDNA is more hypervariable, allowing for easier identification of specific individuals or populations. Barcoding with cytochrome-b[1] or cytochrome-c oxidase I[2] genes is a particularly popular current method.[3] However, rapid advances in sequencing technology are closing the gap between mtDNA and nuDNA in practical terms. For example, use of species specific primers for STR (short tandem repeat) analysis at specific loci is also popular.[4] Furthermore, in the case of a new species, nuDNA sequencing, preferably of the whole genome, is required, particularly if there is no holotype.

The publication of “Novel North American Hominins: Next Generation Sequencing of Three Whole Genomes and Associated Studies”[5], which contains an overview of the subject, by Melba Ketchum, et al., February, 2013, has prompted new interest in the question of whether, in fact, such creatures, variously known as bigfoot, sasquatch, yeti, abominable snowman, yowie (and other local names), exist. To date no holotype specimen (normally required for the recognition of a new species) has been produced. However, based on the analysis of DNA in this landmark publication, the authors claimed to have proven the existence of such a hominin, for which the name, Homo sapiens cognatus, was granted at ZooBank, although such an assignment does not confer recognition of existence.

Nuclear DNA sequences of three samples, 26, 31, and 140, were published by Ketchum, et al.:

Sample 26 (S26) is a small piece of fur and attached tissue from California. The sequence contains 2,726,786 base pairs (bp). A second, independent laboratory analysis with black bear and human primers, showed that it contained black bear nuclear DNA and human and black bear mitochondrial DNA.[6] Nuclear DNA was not sequenced in that study.

Sample 31 (S31), from Alabama, came from a paper plate in a food trap intended to capture DNA of the animal(s) attempting to retrieve the food. No further details were given on the nature of the food or the trap. The sequence contains 527,761 bp.

Sample 140 (S140) is blood from the inside of a gutter downspout in Illinois which had holes in the pattern of teeth marks and was torn from a house, requiring considerable force. The sequence contains 2,101,957 bp.

None of these sequences is actually a “whole genome,” which would be over a thousand times longer.

The relevant conclusions of Ketchum et al. were:

(1) “All known ape and relic hominin species such as Neanderthal and Denisovan were excluded as being contributors to both the nuclear and mitochondrial sequences.”

and the somewhat contradictory,

(2) “Analysis of whole genome sequence and analysis of preliminary phylogeny trees from the Sasquatch indicated that the species possesses a novel mosaic pattern of nuclear DNA comprising novel sequences that are related to primates interspersed with sequences that are closely homologous to humans.”

and finally,

(3) “Nevertheless, the data conclusively proves that the Sasquatch exist as an extant hominin and are a direct maternal descendent of modern humans.”

Since no results of the database searches which led Ketchum et al. to their conclusions were published, this study was initially launched to validate their remarkable claims concerning these three sequences. In the process of reinterpreting the original sequences, it became clear that basic principles of nucleotide database search-match were not previously followed. Also, multiple different unpublished interpretations by others revealed that a consistent methodology was not widely practiced, and not surprisingly, inconsistent results were obtained. Therefore we further focused the study on establishing a clearly defined methodology in this area which would be useful to forensic scientists who may not be experts in databases.

This paper is not about the “existence of sasquatch,” or the mtDNA results of Ketchum et al.

COMPUTER METHODS

We downloaded the published nuclear DNA sequences of Samples 26, 31, and 140 in FASTA format from the Ketchum paper on the DeNovo journal website.[3]

We conducted database searches on the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov) with the use of their BLAST™ (Basic Local Alignment Search Tool) software, as was done in the Ketchum paper. The scope and content of the databases, the BLAST™ software algorithms and parameters, and the results output are well described on the NCBI website and in related publications.[7, 8] We used the following values of input parameters, except where noted.

Optimize for: Highly similar sequences (Megablast)
Max target sequences: 5000 used throughout (default 100)
Expect threshold: 10 (default)
Word Size: 28 (default), except when system required 64.
Max matches in a query range: 0 (unlimited by default)
Match/Mismatch Scores: 1,-2 (default)
Gap Costs: Linear (default)
Filter: Low complexity regions (default)
Mask: Mask for lookup table only (default)

We downloaded tables of hits as Microsoft® Excel 2010 files for sorting and comparisons. Through the “Organism” field or the Boolean logic language in the “Entrez Query” input field, we applied additional match criteria to some searches, for example: all Ursidae (bears) only. No additional or unusual search techniques were employed. Only standard options were employed, so the results can be easily duplicated.

It is worth noting here that even the most recent comprehensive reference on wildlife DNA analysis[9] does not address downloading the BLAST^TM hit list to Excel for further processing. One is only taught how to read the BLAST™ summary of results, which we initially found to be misleading in our study.

We employed a strategy which combines “step back” and “drill down” searches for each sample. Specifically, we encompassed the broader phylogeny (all other, AO) as well as limited searches to the most specific family, genus, or species possible. In this way, absolutely no potential match was missed, and the most possible species details were discovered.

Another important strategy combined comparisons of scores and %ID by sorting results first by one, then by the other of these two critical evaluation parameters. The best match, represented by multiple hits, appeared on top of both sorts.

Moments of functions or distributions are used in mathematics, statistics, and physics (e.g., the moment of inertia) to describe dispersion of data about a point or a line (axis). Drawing on these analogies, we define first and second moments about a %ID axis as new comparison statistics for groups or taxa (defined here as a family, genus or species or a combination of these). We chose the 95%ID axis because it limits the data set to a manageable number of reasonably similar groups with highest %ID. Hits with %ID below 95% were deleted from the hit list.

We define first (m1ji) and second (m2ji) moments
of a single hit (i) within the j-th group (or taxon) with a

For the entire distribution of hits in the j-th group (Nj in number) the respective first and second moments, M1j and M2j, are defined by the averages:

with and defined in Equns. (1.1) and (1.2), respectively. Note that these are weighted (by score) average moments analogous to the moment of inertia (a second moment) ≡ mass x (distance)². Calculation of group moments is simply a matter of adding two more columns (M and N) to the downloaded BLAST™ hit list in Excel and calculating the new cells as: M1 = L1 *(C1-95) and N1 = L1*(C1-95)^2, etc., then summing the M and N columns and dividing each by the number of hits (Nj) to get the M1j and M2j for each group (j) (Equns. 1.3 and 1.4, respectively). Group (j) moments are easiest to calculate for hit lists from searches limited to specific groups (j). Otherwise, the hit list must be hand sorted by group, using the accession numbers, which must be individually looked up, which is not practical.

Computed for all candidate groups, {M1j, M2j} reduce long lists of hits across the unknown sequence to two numbers for each candidate group, which can be compared across the groups. We found that ordering groups by {M2j} was more discriminating than by {M1j}, but that there was no difference in the order. The second moment puts more weight on the %ID than the first moment. The highest group moments , correspond to the group (k) with the best overall match to the entire DNA sequence.

Average overall group (j) %ID, Pj,av is calculated by:

where L_ji = sequence length (bp) of the i-th match (hit) in the j-th group.

Overall coverage of hits as a percentage of query sequence was calculated by first removing the duplicates and adjusting Qstart_j_i overlaps, then sorting by Qstart and then by Qend (“Start” and “End” in the following tables) of the hits. Gaps between hits were identified and summed. The coverage, %C_j, for the j-th group, was calculated by:

where L is the query sequence length in bp, and the summation is over all hits with numerically ordered unique Qstart_ji to Qend_ji hit ranges. The result is N_j the overall % coverage for the group j. BLAST™ summary pages now list the % coverage for the hit with highest score for each database entry (until recently it was the total coverage of all hits on a database entry), but these data are not in the downloaded Excel hit list file, so they cannot be readily summed by group.

We list valuable principles of this type of database search and resulting comparisons in the DISCUSSION section.

RESULTS

Sample 26

To start with, we queried the S26 sequence against the Nucleotide Database with BLAST™, producing a bewildering hit list of many different mammals. A wrong conclusion at this point would be that it must be a new species, since it doesn’t match any existing species solely. Retrospectively we discovered that adding the first and second moment columns to the hit list (in Excel) and sorting by second moment (Equn. 1.2), then first moment (Equn. 1.1), results in five of the top six hits being the giant panda (Ailuropoda melanoleuca). These entries rose from 129-th, 199-th, 247-th, 707-th, and 888-th places in the hit list sorted by score. Another wrong conclusion at this point would be that since pandas are not extant to California, there must be something wrong with the query sequence, e.g. contamination or sequencing errors. Next, we ran the entire S26 sequence against the Human Genome + Transcription (G + T) Database. The top 21 hits were all on human Chromosome 11. “Query cover” was as high as 15% of S26 on one database entry (representing numerous individual hits). “Total score” was as high as 6.756 x 10⁵ for one entry. “Max. ident.” (maximum % identity) was 100% for each of the top 21 hits. This table, by itself, gives the false impression that S26 is human or human-like (another wrong conclusion). However, “Max score” does not necessarily correspond to the same sequence range as “Max. ident.” (maximum %ID), which may be for a very short sequence. Furthermore, “Total score,” though an impressive, large number, may be the sum of many short (and possibly unimportant) sequences and/or relatively low %ID hits. Finally, the “E value” (the probability of a random match) is always low for any significant sequence length, N, because (1/4)^N will be very small, often below the numerical limit of the computer (10^-179), represented by a zero in BLAST™ output. In summary, the BLAST™ summary results page cannot be taken as the sole ultimate proof of identity; it is only a starting point.

Near the end of this study, NCBI changed the “Max. ident.” field to “Ident.”, now defined as the %ID of the sequence segment with the “Max. score.“ This is a much more useful number. Also “Query cover” now refers only to the sequence range of the “Max. score.” This is not a useful number in this study, 0% rounded off for all the relevant hits for long query sequences with relatively short sequence matches. A rerun of the above S26 vs. human G + T yielded output with the 21 hits/matches with the highest scores averaging only 94%ID. Usually >99%ID is required for a species match.

Table 1 compares the top 15 human hits by score and the top 15 Ursidae (bear family) hits to other primates (OP), the genus Canis (dogs, wolves, coyotes), and all other (AO) species not previously included. Here Ursidae were the best match over 15 of the total 17 sequence ranges. The dog is also a better match than human in most cases. Other primates show results more like humans than dogs or bears. There were no black bear matches over any of the sequence ranges in Table 1.

*Table 1. S26 top 15 human hits and top 15 Ursidae* hits by score.**

S26 vs.	Match(a)	% ID	LEN		MIS	GAP	Start(b)	End(c)	SCORE

human-N(1)*(d)*	BC038508.1	93.32	2142		114	13	189026	191141	3136
OP(f)-N: chimpanzee	XM_003951836.1	93.18	2142		117	10	189026	191141	3120
Canis-N: dog	NM_001097982.2	95.36	2089		72	7	189026	191089	3297
polar bear(1)-TSA	GAJD01008005.1	98.69	2134		0	12	189028	191136	3760
AO(g)-N: Pacific walrus(h)	XM_004394587.1	96.58	2136		45	12	189026	191136	3515

human-N(2)	NG_012881.1	95.85	1495		33	14	362696	364161	2390
OP-RS: orangutan	NC_012602.1	95.52	1497		35	15	362696	364161	2364
dog-RS	NC_006603.3	composite			10	10	362697	364161	~2595
giant panda-RS(4)	NW_003217713.1	97.32	1494		9	5	362698	364161	2508
AO-N: mouse	AC158579.7	92.91	1100		43	17	363059	364130	1567

human-N(3)	NM_001039618.2	92.27	1695		73	34	1655921	1657569	2351
OP-N: n. white-cheeked gibbon	XM_003254601.2	92.96	1690		73	24	1655921	1657569	2420
dog-RS	NC_006603.3	94.89	1682		35	21	1655921	1657569	2582
giant panda-RS(2)	NW_003217478.1	96.45	1688		7	13	1655920	1657569	2736
AO-N: cattle	BC142239.1	92.60	1555		57	31	1655921	1657446	2182

human-N(4)	Z83001.1	98.71	1239		15	1	690618	691855	2198
OP-RS: chimpanzee	NC_006478.3	98.79	1239		14	1	690618	691855	2204
Canis-N: dog	AC203200.2	99.03	1238		11	1	690618	691854	2218
giant panda-RS(10)	NW_003217489.1	99.35	1237		8	0	690618	691854	2241
AO-N: pig	CU462945.6	98.45	1229		17	2	690624	691851	2163

human-G+T(5)	NT_009237.18	93.90	1460		67	19	312074	313518	2183
OP-N: gorilla	XM_004050754.1	94.44	1403		63	14	312127	313518	2145
dog-RS	NC_006603.3	94.18	1444		40	13	312098	313519	2161
giant panda-RS(3)	NW_003217713.1	98.00	1453		18	6	312074	313519	2512
AO-N: Florida manatee	XM_004369676.1	92.18	1420		73	25	312127	313522	1973

human-N(7)	NM_001278163.1	94.76	1354		35	15	759948	761288	2074
OP-N: n. white-cheeked gibbon	XM_003254384.2	95.20	1353		34	12	759948	761288	2109
dog-RS	NC_006600.3	97.55	1349		20	5	759948	761288	2296
polar bear-N(6)-TSA	GAJD01025080.1	98.60	1359		1	5	759948	761288	2388
AO-N: Pacific walrus	XM_004406102.1	97.94	1357		12	6	759948	761288	2337

human-G+T(6)	NW_004078072.1	92.53	1459		94	7	1761457	1762900	2076
OP-N: chimpanzee	XM_001136982.3	92.53	1459		94	7	1761457	1762900	2076
Canis-N: dog	XM_843414.2	94.65	1459		63	6	1761457	1762900	2248
Ursidae-N(5): giant panda	XM_002920733.1	97.67	1460		19	4	1761456	1762900	2494
AO-N: Pacific walrus	XM_004410744.1	95.27	1459		54	4	1761457	1762900	2298

human-N(8)	NM_001039618.2	93.39	1376		54	15	1657663	1659004	2002
OP-N: small-eared galago	XM_003780852.1	94.43	1364		48	10	1657662	1659004	2073
dog-RS	NC_006603.3	composite					1657662	1659004	~2336
giant panda-RS(7)	XR_097023.1	incomplete composite					1657663	1658930	~2312
AO-N: star-nosed mole	XM_004683259.1	93.51		1371	47	17	1657663	1659004	2001

human-N(9)	NG_012881.1	94.27		1308	63	6	326857	328153	1989
OP-RS: gorilla	NC_018435.1	94.11		1308	65	4	326857	328153	1978
dog-RS	NC_006603.3	97.38		1300	29	4	326857	328153	2207
giant panda-RS(8)	NW_003217713.1	98.69		1300	13	3	326857	328153	2303
AO-N: mouse	AC107371.17	84.86		1268	148	25	326867	328112	1238

human-N(10)	AC080100.7	95.03		1268	48	8	855386	856643	1978
OP-N: n. white-cheeked gibbon	XM_003254447.2	95.39		1064	43	3	855581	856643	1688
dog-RS	NC_006600.3	96.37		1267	30	7	855387	856643	2071
polar bear-TSA(9):	GAJD01033484.1	99.05		1267	2	5	855387	856643	2265
AO-N: Pacific walrus	XM_004404764.1	97.71		1265	18	6	855387	856643	2165

human-N(11)	AC093262.2	98.03		1117	19	2	673215	674328	1938
OP-RS: crab-eating macaque	NC_022285.1	98.21		1118	15	2	673215	674327	1949
dog-RS	NC_006600.3	99.1		1114	10	0	673215	674328	2002
Ursidae-TSA(15): giant panda	NW_003217489.1	99.28		1114	5	1	673215	674328	2010
AO-N: pig	CU468610.6	98.57		1115	10	4	673218	674328	1965

human-N(12)	NG_027813.1	96.28		1156	42	1	2258575	2259729	1895
OP-N: pygmy chimpanzee	DQ977225.1	96.37		1156	41	1	2258575	2259729	1901
Canis-N: dog	XM_536554.3	95.30		1149	54	0	2258573	2259721	1823
Ursidae-N(13):giant panda	XM_002927883.1	99.65		1149	4	0	2258573	2259721	2100
AO-N: Pacific walrus	XM_004412631.1	98.95		1147	12	0	2258575	2259721	2052

human-N(13)	AL122015.17	96.75		1109	36	0	756650	757758	1849
OP-N: n. white-cheeked gibbon	XM_003254385.2	96.74		1104	35	1	756646	757748	1838
dog-RS	NC_006600.3	97.66		1109	26	0	756650	757758	1905
giant panda-RS	NW_003217489.1	98.20		1110	17	1	756650	757759	1936
AO-N: Pacific walrus	XM_004406102.1	98.91		1099	12	0	756650	757748	1964

human-N(14)	NG_012881.1	95.22		1172	30	11	359243	360388	1831
OP-RS: small-eared galago	NW_003852396.1	95.4		1173	27	3	359243	360388	1842
dog-RS	NC_006603.3	97.44		1171	5	1	359243	360388	1973
giant panda-RS	NW_003217713.1	97.78		1171	1	1	359243	360388	1995
AO-N: mouse	AC158579.7	90.81		1175	73	16	359243	360388	1539

human-N(15)	NM_002233.3	94.77		1185	45	12	602053	603228	1829
OP-N: olive baboon	XM_003910099.1	95.27		1185	45	8	602053	603228	1868
Canis-N: dog	XM_542545.2	93.07		895	56	4	602341	603229	1304
giant panda-RS(12)	NW_003218343.1	98.99		1184	3	2	602054	603228	2111
AO-N: Pacific walrus	XM_004412889.1	97.30		1183	23	3	602055	603228	1999

human-G+T	NW_004078072.1	93.87		1175	65	5	1835440	1836608	1764
OP-N: gorilla	XM_004052006.1	93.79		1175	66	5	1835440	1836608	1759
Canis-N: dog	XM_533970.3	92.99		1155	76	4	1835459	1836608	1679
polar bear-TSA(11)	GAJD01015683.1	99.23		1174	4	3	1835440	1836608	2113
AO-N: Pacific walrus	XM_004398087.1	97.36		1174	26	3	1835440	1836608	1991

human-N	AP000609.5	93.29		1147	62	7	1508092	1509231	1677
OP-N: chimpanzee	XM_003808137.1	93.29		1147	60	9	1508090	1509231	1676
dog-RS	NC_006603.3	95.45		1144	30	8	1508103	1509231	1805
polar bear-TSA(14)	GAJD01035976.1	98.96		1151	0	2	1508093	1509231	2049
AO-N: domestic ferret	XM_004795929.1	96.04		1161	16	11	1508093	1509231	1862

Table 1. Hits are grouped by matching sequence ranges of the top 15 hits by score and separated by blank rows. (a) NCBI accession number. (b) First position of matching query (S26) sequence range. (c) Final position of matching query (S26) sequence range. (d) "N" indicates Nucleotide Database. Relative score (1-15) in parentheses. (e) "G + T" indicates human Genomic plus Transcription Database. "TSA" indicates Transcriptome Shotgun Assembly Database. "RS" indicates Reference Genomic Sequence Database. (f) OP = Other Primates (ex Homo sapiens). (g) AO = All Other (ex primates, ex Canis, ex Ursidae). (h) Common species names for each match after colon. Same format and abbreviations followed in Tables 4 and 5.

A summary of Table 1 results is presented in Table 2. Clearly Ursidae, represented here by the panda (Ailuropoda melanoleuca) and the polar bear (Ursus maritimus), are the best overall match by all four criteria: number of best scores, average rank, average score, and average %ID, over the top 15 human and top 15 Ursidae hits by score.

Table 2. Comparison of hits by groups: summaries of Tables 1, 4, and 5.

Group	Best Scores(a)	Average Rank(b)	Std. Dev.(c)	Ave. Score	Ave. %ID(d)

	S26 Top Hits from Table 1.
Ursidae	15	1.1	0.3	2325	98.4
Canis (dog)	1	2.9	1.2	2146	96.0
All Other	1	3.2	1.6	1930	95.2
Other Primates	0	3.9	0.9	2075	94.9
Human	0	3.9	0.8	2068	94.7

	S31 Top Hits from Table 4(e)
Human (15) *(f)*	13.5	1.1	0.21	373.4	99.8
Other Primates(15)	1.5	1.9	0.21	360.3	98.7
Canis (dog) (2)	0	3.6	0.28	278.5	97.5
All Other (6)	0	3.4	0.28	176.0	87.2

	S140 Top Hits from Table 5
Canis (dog)	17	1.0	0.0	1848	98.5
All Other	0	2.4	0.8	1689	95.3
Other Primates	0	3.0	0.5	1608	94.4
Human	0	3.6	0.6	1603	94.3

Table 2. (a) Number of best scores. (b) Groups were ranked by score (1 highest, 5 lowest) in each hit sequence range. (c) Of Average Rank. "N-1" formula used. (d) Equn. 1.5. (e) Fungus and bacteria hits removed. (f) Number of hits in parentheses. All ties were averaged, e.g. tie for first is 0.5 for each best score and 1.5 for each hit rank.

CONTINUED

Bigfoot Claims

Pages

Paper 1, Part I

No comments:

Post a Comment