Caution: Contains Ketchum Koolaid. Please drink responsibly.
Dr. Melba Ketchum, DVM, posted the following last May on Facebook:
May 4, 2015
New Technology at Washington U Maps Human Genome in Days; Large-scale Studies now Possible. By Michele Munz (St. Louis Post-Dispatch)
“This has been around for a while now because they used a supercomputer on our genomes. Of course they get faster each year. You have to have a supercomputer to analyze genomes. That's why the critics don't know what they're talking about. They don't even have the equipment to analyze the data. We had two bioinformaticists work on the genomes in our paper. We have more working on it now with a new (as of last fall) supercomputer like this one. Of course it takes a lot longer with a novel genome, because it has to be compared to all animals including mammals, birds, reptiles, amphibians plus plants, bacteria, fungi, Ancient DNA and viruses. In other words every DNA sequence ever mapped and not just GenBank but all depositories worldwide. This is what has been being (sic) done. I hope they will let us release the results!!!!!”
Let’s get a few things straight first. The supercomputer is used to assemble the raw data into a sequence, and that’s something that “critics” like me have not attempted to do with the Ketchum et al. raw data because we do not have access to this raw data, not because we don’t have access to a supercomputer. I for one accepted her three sequences at face value and reinterpreted them by using an Internet connection to the servers at NCBI to align them against the GenBank® database of known sequences. This is exactly what Ketchum et al. did through an Internet connection. No need to have a supercomputer in your garage to align assembled, complete sequences.
“GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.” (from the NCBI website)
Melba would have you believe that the search/match process takes a supercomputer a long time to cover all those “mammals, birds, reptiles, amphibians plus plants, bacteria, fungi, Ancient DNA and viruses.” I’ve done many hundreds of searches through the shared NCBI servers, and none took longer than 30 min. The average time has been a minute or two, but it depends on the length of the query sequence and the time of day (since you share the resource with others). “Novel genomes” don’t take any longer to align than known species. Furthermore, as the NCBI website quote above says, GenBank® includes sequences from other important “depositories worldwide.” Anybody wanting their sequence to be available to the widest possible audience would contribute to one of these three databases, which share data with each other. There’s relatively little important, unique species data elsewhere. Access to a supercomputer does not give one access to more data. A real person needs to find the location of the data.
Now for the interesting part of the above Ketchum Facebook post. Melba reveals that her raw data is being reassembled, and my communications with “Jerry” (not his real name), one of her supporters, confirm that. Assembly involves putting together many short sequences in the right order to make the complete sequence. This involves looking for overlaps on the ends (de novo) or aligning with a reference sequence of the same or a closely related species. For complete genomes with billions of base pairs, this can require a supercomputer. But this is assembly, not identification.
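For readers who want to see the "overlaps on the ends" idea in miniature, here is a toy sketch in Python. It is only an illustration under my own assumptions: the function names, the minimum overlap of 3 bases, and the three made-up reads are hypothetical, and real assemblers use far more sophisticated methods and must cope with sequencing errors and repeats. This is not the software used by Ketchum et al. or by anyone else.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Toy de novo assembly: repeatedly merge the two reads that share
    the longest end overlap, then return the longest resulting contig."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps; contigs stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)


if __name__ == "__main__":
    # Three invented short reads that tile a 13-base sequence.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Note that nothing in this procedure asks what species the reads came from. It only puts pieces in order; deciding what the finished sequence matches is a separate search step.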
According to Jerry, “… the assembly done on her raw data was about as appropriate as driving a square peg into a round hole. …She did not know enough to ask for a de Novo assembly. The University (University of Texas Southwestern Medical Center) should have known better though. Their bad, not hers.” Sorry, but the principal author has responsibility for everything in a scientific paper. She should have known better, too. A de novo assembly makes no assumptions about the species. In contrast, Ketchum et al. used a human reference sequence (chromosome 11) to assemble her sequences, because she believed human to be the closest species (her bias about her totally unprovenanced samples).
Jerry continues, “…you have just spent all your time and effort writing you (sic) treatises, trying to make sense of what was an imperfect reassembly. You might as well try comparing apples with oranges. … What was required all along was that the work be redone by a top notch bioinfomatician. Then we might get something meaningful from what I consider to be good raw data.” Not exactly a commendation for Fan Zhang, Melba’s “bioinformaticist,” whose degrees are in mechanical and aeronautical engineering from Harbin University in Manchuria and whom she continues to praise to this day. “Good raw data”? Maybe, but it’s not from a sasquatch, as I proved. Only if sasquatch is a feral human can Sample 31 be from a sasquatch. Samples 26 and 140 are from a black bear and a dog (wolf, or coyote – all genus Canis) respectively.
And again from Jerry, “The only viable conclusion is that the assembled sequences were not entirely correct and it was quite clear to me that the whole of her nuDNA assembled data was less than reliable, as is any attempt to use that data to draw conclusions. I note that Melba said in May that her data has been reassembled again and she hopes those that did that will allow her to publish the results. I heard a whisper that this data crunching was underway but know absolutely none of the detail. All I can say is that it's a pity it was not done two or three years ago.” Is this bias, or what? May I humbly suggest another possible conclusion? The sequences are basically correct (but possibly incomplete), and Samples 26 and 140 are from a bear and a dog, respectively, as I’ve proven previously.
How about that!! An insider says that the Ketchum et al. assembly and sequencing are majorly flawed, and led to my “waste of time” trying to interpret the results by matching known species. “We can't make a silk purse from a sows (sic) ear,” says Jerry. But that's exactly what Melba tried to do: make a sasquatch silk purse from a bear sow's ear.
But who is “Jerry”? Regular Google searches revealed nothing about him. I tried to find his publications through Google Scholar, but could not find even one. I searched the leadership and recent conference presenters and chairpersons of the major genetics societies in his country (he hails from outside the US according to his Facebook page), but did not find reference to Jerry. So, I cannot, without more information, consider him a peer-acknowledged authority on genetics; but for that matter, neither am I. However, our Jerry does seem to have an ear to the ground, and he is knowledgeable about genetics, as judged by his several lengthy PMs to me.
However, I strongly disagree with him that a reassembly will change a black bear (Sample 26) sequence into a sasquatch sequence. Or a dog sequence (Sample 140) either, for that matter. Recall that three independent labs all showed S26 to be a black bear (See my blog, November 26, 2014, “Ketchum Sample 26, The Smeja Kill: Independent Lab Reports”). I compared S26 to sequences from five different NCBI databases and data from two literature sources. The results were consistent across all seven sources: Sample 26 is a black bear, with human and primate matches consistently much poorer. An incorrect assembly would likely not have produced sequence segments up to 2000 bp long matching a bear 99-100%. Such a sequence segment would involve a minimum of 10-20, and likely more, short raw sequences, not counting sequence coverage, which was claimed to be 30X by Ketchum et al.: so, maybe 600 individual raw sequences total.
It’s hard to imagine that software that “was about as appropriate as driving a square peg into a round hole,” could consistently produce relatively long sequence matches to a bear, with human and other primates consistently much poorer matches. If the software were that inappropriate, a much more likely scenario would be one in which the resulting sequence had random errors and hence matched no species consistently. How could software be so targeted as to always replace a sasquatch base with a bear base (where they differed) throughout the 2.7 million bases that were sequenced in S26? And why would the same software yield a bear for S26 and a dog for S140 if both were actually a sasquatch? Actually, use of a human reference would, if anything, result in a more human-like sequence, not a less human one. And this is precisely what we found: only conserved genes were assembled, those which matched human about 94-95%. This may also explain why only 2.7 million of the 135 million base pairs in chromosome 11 were sequenced, i.e. much of the rest was not sufficiently conserved to allow an assembly through a human reference. A similar argument is made for S140, the dog, with 2.1 million bases sequenced by Ketchum, et al. and conserved genes matching human 94-95% also.
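The "maybe 600 individual raw sequences" figure above is simple arithmetic, sketched here in Python. The read lengths of 100 and 200 bases are my assumption, chosen because they give the 10-20 reads per 2000 bp segment mentioned above; the function name is hypothetical, and the calculation is idealized (real reads overlap rather than tiling end to end).

```python
import math


def reads_needed(segment_bp: int, read_bp: int, coverage: int) -> int:
    """Idealized count of raw reads behind one assembled segment:
    reads laid end to end across the segment, times the coverage depth."""
    tiles = math.ceil(segment_bp / read_bp)
    return tiles * coverage


if __name__ == "__main__":
    # Assumed read lengths of 200 and 100 bases, at the claimed 30X coverage.
    for read_bp in (200, 100):
        print(read_bp, reads_needed(2000, read_bp, 30))
```

With those assumptions the count runs from 300 to 600 reads for a single 2000 bp segment, every one of which would have to be misplaced in a bear-like direction for a faulty assembly to mimic a bear.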
But Jerry has all kinds of excuses: (1) “…you have no idea what you are doing… ” A number of accomplished geneticists have told me otherwise. (2) “…there is an almost endless list of pipeline (software) approaches that can be used to reassemble short reads from Illumina data.” Does he really mean to imply that different software can give fundamentally different results, such as different, unrelated species? Show me please. (3) “…Genbank is known to have so much rubbish and unverified data in it's (sic) database…” Really? Then how would he plan to identify the newly reassembled sequences? Everybody uses GenBank®. Whatever "rubbish" it contains won’t match a query sequence very well anyway and will not appear in the results, or if so with lower %ID. Also, “uncultured environmental samples,” which are not identified as to species, are clearly indicated as such. I used enough independent bear data, including polar bear and panda as cross checks to eliminate the possibility that one bad database sequence could result in a misidentification. (4) “You made incorrect assumptions” Only that the published sequences are basically correct! (but possibly incomplete). Other than that I made no assumptions; I simply compared these sequences to GenBank®, as Melba did, but with different results.
However, I do agree with Jerry when he says, “… you (Ketchum et al.) do not force a comparison of raw data against a known (human) genome even if the material is uncontaminated especially for a suspected unknown species.” This is what the Illumina HiSeq™ 2000 system, used by Ketchum et al., does. It is based on a human reference sequence, and is designed to detect small differences (SNPs), e.g. which may result in a disease, not large interspecies differences, e.g. which make a bear walk on four legs and a human on two. By this means, you’ll only map conserved genes, ones that involve common functions.
The bottom line question is: Can a human-based reference method, used by Ketchum et al., yield a consistent and basically correct, but incomplete, sequence which is different from human, as for Samples 26 and 140? I think it’s possible; Jerry does not. There are two reasons why I think so: (1) The sequencing of short segments is completely species blind; the reagents are completely neutral. None target a specific species. (2) Important genes are highly conserved. We found that human matched the S26 bear 94-95%, so in assembly some shorter sequences would also align with a human reference sequence well enough to assemble them (in the right order), at least for conserved genes. There will be some base mismatches (about 1 in 16 to 20 I found), but by and large the conserved genes will align correctly and will be identified as to species through a BLAST™ search of GenBank®. No, this won’t work for assembling marijuana sequences with a Coho salmon reference, but bears and humans do have many functions in common and therefore have some, but not all, similar genes. We’ll have to see what the new team of hopefully “top notch bioinformaticians” comes up with. Hopefully, they’re sequencing some new and different samples by de novo assembly, not "wasting their time" with Samples 26 and 140.
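To make reason (2) concrete, here is a toy Python sketch of placing a read on a reference while tolerating a few mismatches. Everything in it is hypothetical: the function names, the 90% threshold, and the invented 20-base "reference" are mine, and real aligners handle gaps, quality scores, and much else. It shows only the principle that a read which differs slightly from the reference can still be placed in the right spot with its own bases intact, while a read that differs a lot is simply left out.

```python
from typing import Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def place_read(read: str, reference: str,
               min_identity: float = 90.0) -> Optional[Tuple[int, float]]:
    """Slide the read along the reference (no gaps) and return the
    (offset, identity) of the best position, or None if even the best
    position falls below `min_identity`."""
    best: Optional[Tuple[int, float]] = None
    for off in range(len(reference) - len(read) + 1):
        pid = percent_identity(read, reference[off:off + len(read)])
        if best is None or pid > best[1]:
            best = (off, pid)
    if best is not None and best[1] >= min_identity:
        return best
    return None


if __name__ == "__main__":
    reference = "AAAAACCCCCGGGGGTTTTT"   # invented stand-in for a reference
    conserved = "CCCCCGGGGA"             # differs from the reference at 1 base in 10
    divergent = "TGCATGCATG"             # resembles nothing in the reference
    print(place_read(conserved, reference))
    print(place_read(divergent, reference))
```

The conserved read is placed at the correct offset and keeps its one differing base, which is what a later database search would pick up; the divergent read is dropped, which is why the result would be incomplete. For the record, one mismatch in 16 to 20 bases works out to 93.75% to 95% identity, in line with the 94-95% figure above.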
It’s been over four months since Melba’s original post and the above St. Louis Post-Dispatch article, which claims that whole genomes can be sequenced in just a few days. And, as we noted above, searching GenBank® takes even less time (minutes). So where are the results? I asked Dr. Richard K. Wilson, the Director of the Washington University McDonnell Genome Institute, whether the MGI was reassembling Melba’s raw data, but he did not respond. Similarly, Dr. Wes Warren, Assistant Director, whose field encompasses species identification, was also unresponsive to my inquiry. “Silence is Golden.” I take this as a “yes.” MGI may be involved in the reassembly of the Ketchum et al. raw data and has been sworn to secrecy by one of Melba’s famous NDAs. Of course, if their efforts fail they’d want that to be kept secret to protect their reputation. But wouldn’t it be refreshing for them to report such results, just as Bryan Sykes did with the 30 hair samples that turned out to be from known animals? I’ll make a prediction, too, that they will not find sasquatch or anything like it in Samples 26 and 140, over the published sequence ranges. Of course there are still the 99.9+% of the purported “three whole genomes” which was not assembled or published. I promised Jerry a case of his favorite national beer (largest container size) if MGI finds a sasquatch in just one of these samples. But I don’t think they will. I have the support of three independent laboratories which found S26 to be a black bear.
Finally, you might wonder, “Why kick a dead horse?” Jerry calls it my “obsession with old news,” but it’s not “old news,” because Dr. Melba Ketchum, DVM, continues to appear on radio and TV shows expounding on her great discovery to mostly sensationalist interviewers. Science is based on a foundation of past work (“old news”), but only good, validated work, which is required to be cited as relevant references in each published scientific paper. Faulty results and conclusions need to be exposed and retracted, like polywater and cold fusion. Jerry says, “Melba has some well qualified friends who think her work was an important start” (not exactly an endorsement of her conclusions). Actually, I agree to the extent that she and colleagues were the first to have used DNA analysis to attempt to prove the existence of sasquatch. If and when mistakes are admitted, previous results and conclusions retracted, and a reasonable interpretation of new results is presented (if there is one), I will cease and desist in my scientific criticism, but not before. Washington University in St. Louis McDonnell Genome Institute, we’re all waiting.
Note: More information on Illumina sequencing and assembly can be found on their website: illumina.com. It’s a good company. None of the above reflects badly on them in any way. Their machine got the right answers, even when using the wrong method selected by Ketchum et al.