
Friday, June 17, 2016

More Inconsistencies and Evidence for Contamination in Ketchum et al. Supplementary Figures


ABSTRACT

Scientific results should be consistent; otherwise, more experiments are needed to clarify discrepancies. In Ketchum et al. (2013), Supplementary Figures 1, 2, and 3, which are mitochondrial DNA phylotrees for samples 26, 31, and 140, respectively, are inconsistent with Supplementary Figures 7, 8, and 9, respectively. The haplogroups of the closest relatives do not agree between Supp. Figs. 1 and 7 (S26), 2 and 8 (S31), or 3 and 9 (S140). The most likely explanation is contamination by at least one human in each case.

INTRODUCTION


The Ketchum et al. (2013) mitochondrial DNA results have been reviewed previously (Paper 2 at right), and the Ketchum claim that they are all 100% human was found to be an overstatement. In fact, eight of 18 samples with complete mitochondrial sequences had too many mutations from the closest haplogroup to be statistically probable (less than 1% probability). Also, eight of 11 samples for which only HVR-1 (hypervariable region 1) mutations were listed were phylogenetically ambiguous, i.e., alternate haplogroups were equally likely. From those results, it was concluded either that the samples were contaminated and/or degraded, or that any possible hybridization events would have to have been followed by subsequent mutations along nonhuman evolutionary lines and on a different time scale. The results in the current paper prove that S31 has human contamination by an individual with a different haplogroup than previously reported for that sample. Samples 26 and 140 have previously been shown to be from a black bear and a dog, respectively, from nuclear DNA matches (see Paper 1 at right). They are now shown here to each be contaminated by two humans of different haplogroups.

METHODS

This study involves extracting data from six circular phylotrees, Supp. Figs. 1, 2, 3, 7, 8, and 9 from Ketchum et al. (2013), for comparison. Phylotrees of this kind are generated from the query results (hits) in BLAST(TM) through the "Distance tree of results" option. The goal was to determine the haplogroup [1] of the nearest match to each query: S26 in Supp. Figs. 1 and 7, S31 in Supp. Figs. 2 and 8, and S140 in Supp. Figs. 3 and 9. Phylotrees in Supp. Figs. 1, 2, and 3 were generated from complete mitochondrial sequences produced by Family Tree DNA. Phylotrees in Supp. Figs. 7, 8, and 9 [2] were generated from supercontigs, but the details of which mitochondrial genes were employed were not stated.

The title of the nearest phylotree branch tip was searched in GenBank(R) (the NCBI databases), and the accession number was retrieved. A BLAST(TM) [3] alignment of this accession with rCRS (Revised Cambridge Reference Sequence, GenBank accession NC_012920.1) produced a set of rCRS-based mutations, as seen in the "Graphics" option of the results page. From these mutations a haplogroup was determined using the programs FASTmtDNA and mtDNAable, as previously described (Paper 2 at right).
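The mutation-listing step can be illustrated with a toy function. This is only a sketch: it assumes the two sequences have already been aligned (gaps written as "-") and that the alignment starts at reference position 1. The function name `list_variants` and the output notation (e.g. "3T" for a substitution, "4.1G" for an insertion, "7d" for a deletion) are illustrative choices, not the output format of BLAST, FASTmtDNA, or mtDNAable.

```python
def list_variants(ref_aligned: str, query_aligned: str) -> list[str]:
    """List differences of a query from a reference, given a gapped pairwise alignment.

    Positions are 1-based reference coordinates; '-' marks an alignment gap.
    Assumes the alignment begins at reference position 1 (hypothetical simplification).
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0      # last reference position consumed so far
    ins_count = 0    # number of consecutive inserted bases after ref_pos
    for r, q in zip(ref_aligned.upper(), query_aligned.upper()):
        if r == "-":                       # base present in query only: insertion
            ins_count += 1
            variants.append(f"{ref_pos}.{ins_count}{q}")
            continue
        ref_pos += 1
        ins_count = 0
        if q == "-":                       # base present in reference only: deletion
            variants.append(f"{ref_pos}d")
        elif q != r:                       # substitution
            variants.append(f"{ref_pos}{q}")
    return variants


# Toy example with made-up sequences (not real mtDNA):
print(list_variants("ACGT-ACGT", "ACTTGAC-T"))
```

A list of differences like this is what a haplogroup-calling program would then compare against the defining mutations of each branch of the phylotree.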

RESULTS

Sample 26

Table 1 presents the results for S26. The nearest haplogroup from Supp. Fig. 1 (H1) closely matches that determined by Family Tree DNA (H1a). However, the Supp. Fig. 7 result, T2b, is far removed. Interestingly, this is the haplogroup of the human contamination determined by two independent studies (Cassidy, 2013; Khan and White, 2012; see the Tyler Huggins Report at right). Also to be noted is that S26 is one of the samples previously found to have too many extra mutations (16) to be called "modern human," according to the accepted mtDNA phylotree (van Oven, 2010) and a Poisson distribution of mutations (Paper 2 at right). Given that the nuDNA of this sample matched a black bear (Paper 1 at right; Cassidy, 2013; Khan and White, 2012; Sykes et al., 2014; see the Tyler Huggins Report and Sykes Paper at right), it can be concluded that there are two sources of human contamination in this sample, with haplogroups H1a/H5e and T2b.



Table 1. S26
Nearest Matches to S26 mtDNA in Ketchum phylotrees

| Accession title | Supp. Fig. | Accession | Mis. vs. S26 | Hap. |
|---|---|---|---|---|
| Homo sapiens clone 3760 mitochondrion, complete genome | 1 | JQ703795.1 | 16 | H1 |
| Homo sapiens isolate NEC20 mitochondrion, complete genome | 7 | JQ664540.1 | 22 | T2b |
| From Ketchum Supp. Data 2: | | | | H1a |
| S26 from mtDNAable: | | | | H5e |



Columns left to right: Accession title, Ketchum et al. Supplementary Figure number,
Accession number (GenBank), Mismatches vs.S26, Haplogroup.
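The Poisson argument mentioned for S26 can be sketched in a few lines. The calculation asks how likely it is to see at least k extra mutations if they accumulate at random with an expected count of lambda. The expected count is not stated in this post, so the value used below is a purely hypothetical placeholder; only the count of 16 extra mutations comes from the text.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k


EXTRA_MUTATIONS_S26 = 16         # from the text above
ASSUMED_EXPECTED_COUNT = 6.0     # hypothetical value, NOT taken from Paper 2

p = poisson_tail(EXTRA_MUTATIONS_S26, ASSUMED_EXPECTED_COUNT)
print(f"P(X >= {EXTRA_MUTATIONS_S26}) = {p:.2e}")
print("below the 1% threshold" if p < 0.01 else "not below the 1% threshold")
```

With a different expected count the probability changes, so the conclusion depends on the value actually used in Paper 2.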
 







Sample 31

Table 2 presents the results for S31. The nearest haplogroup from Supp. Fig. 2 (L1a1) is close to that determined by Family Tree DNA (L0d2a). However, the Supp. Fig. 8 results, T2b and T2b8, are far removed. The nuDNA of this sample matches modern human (Paper 1 at right). Sample 31 is contaminated by another human of T2b haplogroup.



Table 2. S31
Nearest Matches to S31 mtDNA in Ketchum phylotrees

| Accession title | Supp. Fig. | Accession | Mis. vs. S31 | Hap. |
|---|---|---|---|---|
| Homo sapiens haplotype A10L1A2 mitochondrion, complete genome | 2 | AY195777.1 | 2 | L0d2a1 (A10L1A2)* |
| Homo sapiens isolate 157 T2i Tor354 mitochondrion, complete genome | 8 | JQ798131.1 | 100 | T2 or T2b-16362C (T2i)* |
| Homo sapiens isolate 13T mitochondrion, complete genome | 8 | JX081995.1 | 104 | T2b8 |
| From Ketchum Supp. Data 2: | | | | L0d2a |
| S31 from mtDNAable: | | | | L0d2a1 |

* (Haplogroup) taken from Accession




Sample 140

Table 3 presents the results for S140. The nearest haplogroup from Supp. Fig. 3 (D4b2b1) matches that determined by Family Tree DNA for HVR-1 only (D). However, the Supp. Fig. 9 results, both R2'JT, are far removed. The nuDNA of this sample matches a dog (Paper 1 at right). This sample is contaminated by two humans with haplogroups D and R2.

Table 3. S140
Nearest Matches to S140 mtDNA in phylotrees

| Accession title | Supp. Fig. | Accession | Mis. vs. S140 | Hap. |
|---|---|---|---|---|
| Homo sapiens mitochondrial DNA, complete genome, isolate PDsq0023 | 3 | AP008361.1 | No complete sequence available** | D4b2b1 |
| Homo sapiens isolate R1 mitochondrion, complete genome | 9 | JX155264.1 | | R2'JT (R2a1)* |
| Homo sapiens isolate R2 mitochondrion, complete genome | 9 | JX155265.1 | | R2'JT (R2a1)* |
| From Ketchum Supp. Data 2: | | | | D (HVR-1) |
| From Behar et al. (2012): | | | | D (HVR-1) |

* (Haplogroup) taken from Accession

** Oddly, Supp. Figs. 3 and 9 require a full sequence, but Supp. Data 2 contains only HVR-1 mutations



CONCLUSION

For all three samples, using supercontigs resulted in phylotrees whose haplogroups were inconsistent with the haplogroups derived from full sequences.

Samples 26 and 140 are contaminated by two modern humans. Sample 31 is contaminated by one additional human.

Insistence by Dr. Melba Ketchum, DVM, that her samples were not contaminated when analyzed is not warranted. Very likely, some additional Ketchum et al. mtDNA samples are anomalous because of contamination (Paper 2 at right).


NOTES


[1] Haplogroups are unique human mtDNA sequences, represented by their mutations from a standard, either rCRS (revised Cambridge Reference Sequence) or RSRS (Reconstructed Sapiens Reference Sequence). All known haplogroups of modern humans are represented in the phylotree of van Oven (2010) at www.phylotree.org. This tree stems from the root called "Mitochondrial Eve", the most recent common maternal ancestor (MRCA) of all humans. A haplotype is a particular allele (a combination of SNPs, i.e., mutations) within a haplogroup and is designated by a preceding letter and number.

[2] Supp. Figs. 7, 8, and 9 are erroneously referred to in the Ketchum et al. (2013) text as Supp. Figs. 4, 5, and 6 in the last paragraph of the "Next Generation Whole Genome Sequencing" section. Supp. Figs. 4, 5, 6 are actually nuDNA-based phylotrees. See my blog "Melba Ketchum's Experts and Their Mistakes: What's in a Phylotree."

[3] BLAST(TM) is a search/match program which utilizes the National Center for Biotechnology Information (NCBI) GenBank databases (Altschul et al., 1990; Madden, 2003). Its application has been described extensively on this blogsite. See under BLAST Search and Ketchum DNA Study Tabs above.


REFERENCES


Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215 (no. 3): 403-410.

Behar, D. M.; van Oven, M.; Rosset, S.; Metspalu, M.; Loogväli, E.-L.; Silva, N. M.; Kivisild, T.; Torroni, A.; Villems, R. (2012). A "Copernican" reassessment of the human mitochondrial DNA tree from its root. American Journal of Human Genetics, 90 (no. 4): 675-684. http://dx.doi.org/10.1016/j.ajhg.2012.03.002

Cassidy, B. G. (2013). Technical Examination Report DNAS Case Number: 2012-006524. DNA Solutions, Inc. (Oklahoma City).  (See this blog at right)


Ketchum, M. S. et al. (2013). Novel North American hominins: next generation sequencing of three whole genomes and associated studies. DeNovo, 1:1. Online only: http://sasquatchgenomeproject.org/sasquatch_genome_project_002.htm

Khan, T. and White, B. (2012) Final report on the analysis of samples submitted by Tyler Huggins. Wildlife Forensic DNA Laboratory Case File 12-019, Trent University Oshawa (Peterborough, Ontario, Canada).  (See this blog at right.)


Madden, T. (2003). The BLAST sequence analysis tool. The NCBI Handbook; McEntyre, J.; Ostell, J., Eds.; National Center for Biotechnology Information (Bethesda, MD). http://www.ncbi.nlm.nih.gov/books/NBK21097/.

Sykes, B. C.; Mullis, R. A.; Hagenmuller, C.; Melton, T. W.; Sartori, M. (2014). Genetic analysis of hair samples attributed to yeti, bigfoot and other anomalous primates. Proceedings of the Royal Society B, 281: 20140161. https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.0161

van Oven, M. (2010). Revision of the mtDNA tree and corresponding haplogroup nomenclature. Proceedings of the National Academy of Sciences USA, 107 (no. 11): E38-E39. http://dx.doi.org/10.1073/pnas.0915120107

Sunday, September 20, 2015

The Ketchum File IV: A Reassembly of Ketchum Raw Data: Can You Turn a Bear Sow’s Ear into a Sasquatch Silk Purse?



Caution:  Contains Ketchum Koolaid.  Please drink responsibly.


Dr. Melba Ketchum, DVM, posted the following last May on Facebook:

May 4, 2015

New Technology at Washington U Maps Human Genome in Days; Large-scale Studies now Possible. By Michele Munz (St. Louis Post-Dispatch)


http://www.stltoday.com/lifestyles/health-med-fit/medical/new-technology-at-wash-u-maps-human-genome-in-days/article_9ed22975-a385-5b53-897a-ff88cba2442b.html

 “This has been around for a while now because they used a supercomputer on our genomes. Of course they get faster each year. You have to have a supercomputer to analyze genomes. That's why the critics don't know what they're talking about. They don't even have the equipment to analyze the data. We had two bioinformaticists work on the genomes in our paper. We have more working on it now with a new (as of last fall) supercomputer like this one. Of course it takes a lot longer with a novel genome, because it has to be compared to all animals including mammals, birds, reptiles, amphibians plus plants, bacteria, fungi, Ancient DNA and viruses. In other words every DNA sequence ever mapped and not just GenBank but all depositories worldwide. This is what has been being (sic) done. I hope they will let us release the results!!!!!”



*************************************************************



Let’s get a few things straight first. The supercomputer is used to assemble the raw data into a sequence, and that’s something that “critics” like me have not attempted to do with the Ketchum et al. raw data because we do not have access to this raw data, not because we don’t have access to a supercomputer. I for one accepted her three sequences at face value and reinterpreted them by using an Internet connection to the servers at NCBI to align them against their GenBank® of known sequences. This is exactly what Ketchum et al. did through an Internet connection. No need to have a supercomputer in your garage to align assembled, complete sequences.

“GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.” (from the NCBI website)

Melba would have you believe that the search/match process takes a supercomputer a long time to cover all those “mammals, birds, reptiles, amphibians plus plants, bacteria, fungi, Ancient DNA and viruses.” I’ve done many hundreds of searches through the shared NCBI servers, and none took longer than 30 min. The average time has been a minute or two, but it depends on the length of the query sequence and the time of day (since you share the resource with others). “Novel genomes” don’t take any longer to align than known species. Furthermore, as the NCBI website quote above says, GenBank® includes sequences from other important “depositories worldwide.” Anybody wanting their sequence to be available to the widest possible audience would contribute to one of these three databases, which share data with each other. There’s relatively little important, unique species data elsewhere. Access to a supercomputer does not give one access to more data. A real person needs to find the location of the data.

Now for the interesting part of the above Ketchum Facebook post. Melba reveals that her raw data is being reassembled, and my communications with “Jerry” (not his real name), one of her supporters, confirm that. Assembly involves putting together many short sequences in the right order to make the complete sequence. This involves looking for overlaps on the ends (de novo) or aligning with a reference sequence of the same or a closely related species. For complete genomes with billions of base pairs, this can require a supercomputer. But this is assembly, not identification.
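The idea of de novo assembly by end overlaps can be shown with a toy example. This is only a sketch of the principle, using a naive greedy merge on made-up reads; the function names are illustrative, and real assemblers (including whatever pipeline was used on the Ketchum data) use far more elaborate methods that handle sequencing errors, repeats, and reverse complements.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest end overlap; return what remains."""
    contigs = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:                  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up overlapping reads, given out of order:
print(greedy_assemble(["TACCTGA", "ATGGCGT", "GCGTACC"]))
```

Note that no species is assumed anywhere in this procedure, which is the point of de novo assembly. A reference-guided assembly instead places each read where it best matches an existing sequence.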

According to Jerry, “… the assembly done on her raw data was about as appropriate as driving a square peg into a round hole. …She did not know enough to ask for a de Novo assembly. The University (University of Texas Southwestern Medical Center) should have known better though. Their bad, not hers.” Sorry, but the principal author has responsibility for everything in a scientific paper. She should have known better, too. A de novo assembly makes no assumptions about the species. In contrast, Ketchum et al. used a human reference sequence (chromosome 11) to assemble her sequences, because she believed human to be the closest species (her bias about her totally unprovenanced samples).

Jerry continues, “…you have just spent all your time and effort writing you (sic) treatises, trying to make sense of what was an imperfect reassembly. You might as well try comparing apples with oranges. … What was required all along was that the work be redone by a top notch bioinfomatician. Then we might get something meaningful from what I consider to be good raw data.” Not exactly a commendation for Fan Zhang, Melba’s “bioinformaticist,” whose degrees are in mechanical and aeronautical engineering from Harbin University in Manchuria and whom she continues to praise to this day. “Good raw data”? Maybe, but it’s not from a sasquatch, as I proved. Only if sasquatch is a feral human can Sample 31 be from a sasquatch.  Samples 26 and 140 are from a black bear and a dog (wolf, or coyote – all genus Canis) respectively.

And again from Jerry, “The only viable conclusion is that the assembled sequences were not entirely correct and it was quite clear to me that the whole of her nuDNA assembled data was less than reliable, as is any attempt to use that data to draw conclusions. I note that Melba said in May that her data has been reassembled again and she hopes those that did that will allow her to publish the results. I heard a whisper that this data crunching was underway but know absolutely none of the detail. All I can say is that it's a pity it was not done two or three years ago.” Is this bias, or what? May I humbly suggest another possible conclusion? The sequences are basically correct (but possibly incomplete), and Samples 26 and 140 are from a bear and a dog, respectively, as I’ve proven previously.

How about that!! An insider says that the Ketchum et al. assembly and sequencing is majorly flawed, and led to my “waste of time” trying to interpret the results by matching known species. “We can't make a silk purse from a sows (sic) ear,” says Jerry.   But that's exactly what Melba tried to do: make a sasquatch silk purse from a bear sow's ear. 

But who is “Jerry”?  Regular Google searches revealed nothing about him. I tried to find his publications through Google Scholar, but could not find even one. I searched the leadership and recent conference presenters and chairpersons of the major genetics societies in his country (he hails from outside the US according to his Facebook page), but did not find reference to Jerry. So, I cannot, without more information, consider him a peer-acknowledged authority on genetics; but for that matter, neither am I.  However, our Jerry does seem to have an ear to the ground, and he is knowledgeable about genetics, as judged by his several lengthy PMs to me.

However, I strongly disagree with him that a reassembly will change a black bear (Sample 26) sequence into a sasquatch sequence. Or a dog sequence (Sample 140) either, for that matter. Recall that three independent labs all showed S26 to be a black bear (see my blog, November 26, 2014, “Ketchum Sample 26, The Smeja Kill: Independent Lab Reports”). I compared S26 to sequences from five different NCBI databases and data from two literature sources. The results were consistent across all seven sources: Sample 26 is a black bear, with human and primate matches consistently much poorer.

An incorrect assembly would likely not have produced sequence segments up to 2000 bp long matching a bear 99-100%. Such a sequence segment would involve a minimum of 10-20, or likely more, short raw sequences, not counting sequence coverage, which was claimed to be 30X by Ketchum et al.; so, maybe 600 individual raw sequences in total. It’s hard to imagine that software that “was about as appropriate as driving a square peg into a round hole” could consistently produce relatively long sequence matches to a bear, with human and other primates consistently much poorer matches. If the software were that inappropriate, a much more likely scenario would be one in which the resulting sequence had random errors and hence matched no species consistently. How could software be so targeted as to always replace a sasquatch base with a bear base (where they differed) throughout the 2.7 million bases that were sequenced in S26? And why would the same software yield a bear for S26 and a dog for S140 if both were actually a sasquatch? Actually, use of a human reference would, if anything, result in a more human-like sequence, not a less human one. And this is precisely what we found: only conserved genes were assembled, those which matched human about 94-95%. This may also explain why only 2.7 million of the 135 million base pairs in chromosome 11 were sequenced, i.e., much of the rest was not sufficiently conserved to allow an assembly through a human reference. A similar argument is made for S140, the dog, with 2.1 million bases sequenced by Ketchum et al. and conserved genes also matching human 94-95%.
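The read-count estimate above is simple arithmetic, and can be checked with a short sketch. The segment length, the 30X coverage, and the base-pair totals come from the text; the read lengths of 100 and 200 bp are assumptions chosen because they reproduce the "10-20" figure, and are not stated in the post.

```python
import math


def reads_spanning(segment_bp: int, read_bp: int) -> int:
    """Minimum number of non-overlapping reads needed to tile a segment end to end."""
    return math.ceil(segment_bp / read_bp)


def reads_at_coverage(segment_bp: int, read_bp: int, coverage: float) -> int:
    """Approximate read count when each base is covered `coverage` times on average."""
    return math.ceil(segment_bp * coverage / read_bp)


SEGMENT_BP = 2000     # longest bear-matching segments cited in the text
COVERAGE = 30         # coverage claimed by Ketchum et al.

for read_len in (100, 200):   # assumed read lengths (not given in the post)
    tiled = reads_spanning(SEGMENT_BP, read_len)
    total = reads_at_coverage(SEGMENT_BP, read_len, COVERAGE)
    print(f"{read_len} bp reads: {tiled} to tile, about {total} at {COVERAGE}X")

# Fraction of chromosome 11 that was assembled for S26
assembled_fraction = 2.7e6 / 135e6
print(f"assembled fraction of chromosome 11: {assembled_fraction:.1%}")
```

Under these assumptions, 100 bp reads give 20 reads to tile the segment and about 600 at 30X, which is where the "maybe 600" figure comes from. The assembled portion of chromosome 11 works out to about 2%.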

But Jerry has all kinds of excuses: (1) “…you have no idea what you are doing… ” A number of accomplished geneticists have told me otherwise. (2) “…there is an almost endless list of pipeline (software) approaches that can be used to reassemble short reads from Illumina data.” Does he really mean to imply that different software can give fundamentally different results, such as different, unrelated species? Show me, please. (3) “…Genbank is known to have so much rubbish and unverified data in it's (sic) database…” Really? Then how would he plan to identify the newly reassembled sequences? Everybody uses GenBank®. Whatever "rubbish" it contains won’t match a query sequence very well anyway and will not appear in the results, or if it does, only with a lower %ID. Also, “uncultured environmental samples,” which are not identified as to species, are clearly indicated as such. I used enough independent bear data, including polar bear and panda as cross-checks, to eliminate the possibility that one bad database sequence could result in a misidentification. (4) “You made incorrect assumptions.” Only that the published sequences are basically correct (but possibly incomplete)! Other than that I made no assumptions; I simply compared these sequences to GenBank®, as Melba did, but with different results.

However, I do agree with Jerry when he says, “… you (Ketchum et al.) do not force a comparison of raw data against a known (human) genome even if the material is uncontaminated especially for a suspected unknown species.” This is what the Illumina HiSeq™ 2000 system, used by Ketchum et al., does. It is based on a human reference sequence, and is designed to detect small differences (SNPs), e.g. which may result in a disease, not large interspecies differences, e.g. which make a bear walk on four legs and a human on two. By this means, you’ll only map conserved genes, ones that involve common functions.



The bottom-line question is: Can a human-based reference method, as used by Ketchum et al., yield a consistent and basically correct, but incomplete, sequence which is different from human, as for Samples 26 and 140? I think it’s possible; Jerry does not. There are two reasons why I think so: (1) The sequencing of short segments is completely species-blind; the reagents are completely neutral, and none target a specific species. (2) Important genes are highly conserved. We found that human matched the S26 bear 94-95%, so in assembly some shorter sequences would also align with a human reference sequence well enough to assemble them (in the right order), at least for conserved genes. There will be some base mismatches (about 1 in 16 to 20, I found), but by and large the conserved genes will align correctly and will be identified as to species through a BLAST™ search of GenBank®. No, this won’t work for assembling marijuana sequences with a Coho salmon reference, but bears and humans do have many functions in common and therefore have some, but not all, similar genes. We’ll have to see what the new team of hopefully “top notch bioinformaticians” comes up with. Hopefully, they’re sequencing some new and different samples by de novo assembly, not "wasting their time" with Samples 26 and 140.
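The "1 in 16 to 20" mismatch figure follows directly from the 94-95% identity. As a quick check (the function name is illustrative only):

```python
def bases_per_mismatch(percent_identity: float) -> float:
    """Average number of aligned bases per mismatch at a given percent identity."""
    if not 0 <= percent_identity < 100:
        raise ValueError("percent identity must be in the range [0, 100)")
    return 100.0 / (100.0 - percent_identity)


for identity in (94.0, 95.0):
    print(f"{identity:.0f}% identity: about 1 mismatch per {bases_per_mismatch(identity):.1f} bases")
```

At 94% identity this gives about one mismatch per 16.7 bases, and at 95% one per 20 bases.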

It’s been over four months since Melba’s original post and the above St. Louis Post-Dispatch article, which claims that whole genomes can be sequenced in just a few days. And, as we noted above, searching GenBank® takes even less time (minutes). So where are the results? I asked Dr. Richard K. Wilson, the Director of the Washington University McDonnell Genome Institute whether the MGI was reassembling Melba’s raw data, but he did not respond. Similarly, Dr. Wes Warren, Assistant Director, whose field encompasses species identification, was also unresponsive to my inquiry. “Silence is Golden.” I take this as a “yes.” MGI may be involved in the reassembly of the Ketchum et al. raw data and has been sworn to secrecy by one of Melba’s famous NDAs. Of course, if their efforts fail they’d want that to be kept secret to protect their reputation. But wouldn’t it be refreshing for them to report such results, just as Bryan Sykes did with the 30 hair samples that turned out to be from known animals? I’ll make a prediction, too, that they will not find sasquatch or anything like it in Samples 26 and 140, over the published sequence ranges. Of course there are still the 99.9+% of the purported “three whole genomes” which was not assembled or published. I promised Jerry a case of his favorite national beer (largest container size) if MGI finds a sasquatch in just one of these samples. But I don’t think they will. I have the support of three independent laboratories which found S26 to be a black bear. 


Finally, you might wonder, “Why kick a dead horse?” Jerry calls it my “obsession with old news,” but it’s not “old news,” because Dr. Melba Ketchum, DVM, continues to appear on radio and TV shows expounding on her great discovery to mostly sensationalist interviewers. Science is based on a foundation of past work (“old news”), but only good, validated work, which is required to be cited as relevant references in each published scientific paper. Faulty results and conclusions need to be exposed and retracted, like polywater and cold fusion.  Jerry says, “Melba has some well qualified friends who think her work was an important start” (not exactly an endorsement of her conclusions). Actually, I agree to the extent that she and colleagues were the first to have used DNA analysis to attempt to prove the existence of sasquatch. If and when mistakes are admitted, previous results and conclusions retracted, and a reasonable interpretation of new results is presented (if there is one), I will cease and desist in my scientific criticism, but not before. Washington University of St. Louis McDonnell Genome Institute, we’re all waiting.

Note: More information on Illumina sequencing and assembly can be found on their website: illumina.com. It’s a good company. None of the above reflects badly on them in any way. Their machine got the right answers, even when using the wrong method selected by Ketchum et al.