
Wednesday, September 3, 2014

Searching the NCBI Databases with BLAST™ - Part III


Nobody really believes that a polar bear or a panda roams the wild in California where Ketchum’s Sample 26 was found.  Let’s see if a black bear (Ursus americanus) matches the sequence at all.  Then if it does, we’ll investigate why it hasn’t shown up in previous searches. 

1.  You can skip steps 1. and 2. of Part I.  Complete steps 3. and 4., except with a “Job Title” of Black bear vs. nucleotide.  To the right of “Organism” type black bear and select American black bear (txid: 9643).   Also, this time, select 100 for “Max target sequences” and 28 for “Word size.”

2.  This time the hit list is much shorter, only five matches, but four of them are perfect, 100%ID.  Download the Excel results file and save it as before (step 7., Part I).
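
If you'd rather check the hit list with a script than by eye, here is a minimal Python sketch. It assumes the downloaded hit table is saved as CSV in the standard 12-column BLAST layout (query, subject, %ID, length, mismatches, gap opens, query start/stop, subject start/stop, E-value, bit score). The sample rows are made up for illustration; only the DQ240386.1 accession and its 100%ID over 291 bp come from this post, and the function names are hypothetical.

```python
import csv
import io

# Assumed column order of a BLAST hit-table download (standard 12 columns).
COLUMNS = ["query", "subject", "pct_id", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]

# Illustrative rows only: EXAMPLE_* accessions and all coordinates are invented.
SAMPLE = """\
S26,DQ240386.1,100.00,291,0,0,1001,1291,1,291,1e-150,538
S26,EXAMPLE_B,100.00,150,0,0,2001,2150,1,150,1e-70,278
S26,EXAMPLE_C,100.00,120,0,0,3001,3120,1,120,1e-55,222
S26,EXAMPLE_D,100.00,95,0,0,4001,4095,1,95,1e-40,176
S26,EXAMPLE_E,97.47,79,2,0,5001,5079,1,79,1e-30,135
"""

def load_hits(text):
    """Parse hit-table CSV text into dicts, skipping headers and malformed rows."""
    hits = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(COLUMNS):
            continue  # blank line, comment, or unexpected layout
        hit = dict(zip(COLUMNS, row))
        try:
            hit["pct_id"] = float(hit["pct_id"])
            hit["length"] = int(hit["length"])
            hit["bit_score"] = float(hit["bit_score"])
        except ValueError:
            continue  # e.g. a header row with column names
        hits.append(hit)
    return hits

def perfect_matches(hits):
    """Return only the hits with 100 %ID."""
    return [h for h in hits if h["pct_id"] == 100.0]

def by_score(hits):
    """Sort hits by bit score (Column L in the spreadsheet), best first."""
    return sorted(hits, key=lambda h: h["bit_score"], reverse=True)

hits = load_hits(SAMPLE)
print(len(hits), "hits,", len(perfect_matches(hits)), "at 100%ID")
for h in by_score(hits):
    print(h["subject"], h["pct_id"], h["length"], h["bit_score"])
```

To use it on your own file, read the saved file's text and pass it to load_hits in place of SAMPLE.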

So what do you think?  Is the black bear the real origin of Sample 26?  Let’s look at the composition of these databases.  Maybe, just maybe, there’s not very much black bear data in the databases.  From Part II, we already know there’s no complete black bear genome.

3.  Open up http://www.ncbi.nlm.nih.gov/ and then click on “Nucleotide.”  Search for “American black bear” to open up the list of its entries in the Nucleotide collection.  At the top of the new page, see that there are 39,786 entries.  Seems like a lot, but let’s compare it to the polar bear and the panda.  Replace “American black bear” with “polar bear” at the top of the page, and click “Search.”  The new page indicates that there are 100,078 polar bear sequences in the Nucleotide collection.  Now enter “giant panda” at the top and click “Search.”  There are 184,484 panda entries in the Nucleotide collection.  It’s apparent that the black bear is underrepresented in this database relative to its cousins the polar bear and the giant panda.  That’s why it has been “hibernating” way down the hit lists sorted by score, too far down the first (Part I) list to even notice.  Remember, score increases with the length of the matching sequence.
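
To put numbers on “underrepresented,” here is a small Python sketch that divides each species’ entry count by the black bear’s. The counts are the ones quoted in step 3 (they will have changed since); the function name is just for illustration.

```python
# Nucleotide collection entry counts quoted in step 3 (September 2014).
COUNTS = {
    "American black bear": 39_786,
    "polar bear": 100_078,
    "giant panda": 184_484,
}

def relative_counts(counts, baseline):
    """Return each species' entry count as a multiple of the baseline species."""
    base = counts[baseline]
    return {species: n / base for species, n in counts.items()}

ratios = relative_counts(COUNTS, "American black bear")
for species, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{species:20s} {COUNTS[species]:>8,d}  {ratio:.1f}x")
```

By these counts the polar bear has roughly 2.5 times as many entries as the black bear, and the giant panda roughly 4.6 times as many.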

So where are we now?  My Table 3 in the first paper is based on the Excel file you just saved.  I went further to show that over four of these five sequence ranges, NO other species matched S26 better than the black bear.  I redid the fifth sequence range search just today and found that a polar bear is a 100%ID match, due to new data entered in August 2014.  I now suspect sequencing errors in the black bear data over this relatively short (79 bp) range, or it may have some extra mutations.  Human was always way down the list over each of these five ranges (see my first paper).  With the addition of the new polar bear data, some of the tables in my first paper can be updated.  They’ll look even better for a bear being the source of S26.

Fini! Case closed.  But just for fun, let’s search the “Reference Genomic Sequences” database specifically for primates (including Homo sapiens).

You’re a pro by now, but just to refresh: change your “Job Title,” enter and select primates for “Organism,” select 1000 for “Max target sequences,” and 64 for “Word size.”  Then click BLAST.  We’ll now really see for sure if a human-primate hybrid is possible as concluded by Melba.  This one will take a while, so get some refreshment.  Incidentally, if you can’t stand to wait for BLAST™ results, work at night or on weekends to greatly reduce search time (assuming no server, software, or database maintenance is going on).

Hope you like monkeys and apes, because you just got a whole barrel of them. Some near the top of the list are: Tarsius syrichta (Philippine tarsier), Pan troglodytes (chimpanzee), Pan paniscus (pygmy chimpanzee), Pongo abelii (orangutan), and Nomascus leucogenys (white-cheeked gibbon). I think you can figure out Gorilla gorilla gorilla. Notice the %ID for the highest scores is only 93%, much less than for the two bears and the walrus in Part II (96.58 – 98.83%). The scores are not as high either: the top was 3136 vs. 3788 for the polar bear in Part II. And where is Melba’s pet “lemur” (Otolemur garnettii)? In 28th place by score (2662) at 92.01%ID. Definitely not a player.

Download and save the Excel file once again. Sort by Column L (score). Notice that the best hits are in the very same query sequence range (189,026 – 191,141) as were the best hits for the bears and the walrus. And so is Melba’s 28th-place lemur. Actually, Otolemur is not a lemur but a galago, a bush baby: the small-eared galago. Now let’s expand the Table from Part II. We’ve added three rows at the bottom: the best primate match (PT), the best human match (H), and the best “lemur”/galago match (SG).

Accession        Species  %ID    Length  Mis.  Gaps  Q-start  Q-stop  Score

Nucleotide Collection
XM_004394587.1   PW       96.58  2136    45    2     189026   191136  3515

Reference Genomic Sequences
NW_007929448.1   PB       98.83  2139    0     1     189028   191141  3788
NW_003218202.1   GP       97.15  2141    33    2     189026   191141  3591
NW_007256002.1   PT       93.32  2142    114   10    189026   191141  3136
NC_000011.10     H        93.28  2142    115   11    189026   191141  3131
NW_003852486.1   SG       92.01  1903    140   7     189026   190926  2662

PW = Pacific walrus, PB = polar bear, GP = giant panda, PT = Philippine tarsier, H = human, SG = small-eared galago
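
For anyone who wants to check the ordering without a spreadsheet, here is a minimal Python sketch that ranks the six rows of the table by score and reports how far each candidate trails the leader. The values are copied from the table above; the tuple layout and function names are my own choices for illustration.

```python
# (accession, species code, %ID, alignment length, score), copied from the table.
ROWS = [
    ("XM_004394587.1", "PW", 96.58, 2136, 3515),  # Pacific walrus
    ("NW_007929448.1", "PB", 98.83, 2139, 3788),  # polar bear
    ("NW_003218202.1", "GP", 97.15, 2141, 3591),  # giant panda
    ("NW_007256002.1", "PT", 93.32, 2142, 3136),  # Philippine tarsier
    ("NC_000011.10",   "H",  93.28, 2142, 3131),  # human
    ("NW_003852486.1", "SG", 92.01, 1903, 2662),  # small-eared galago
]

def rank_by_score(rows):
    """Return the rows sorted by score, best match first."""
    return sorted(rows, key=lambda r: r[4], reverse=True)

def score_gap(rows, code_a, code_b):
    """Return score(code_a) - score(code_b) for two species codes."""
    scores = {r[1]: r[4] for r in rows}
    return scores[code_a] - scores[code_b]

ranked = rank_by_score(ROWS)
best_score = ranked[0][4]
for accession, code, pct_id, length, score in ranked:
    print(f"{code:3s} {accession:16s} {pct_id:6.2f}%  score {score}  "
          f"({best_score - score} behind the top hit)")
```

Sorted this way, the two bears and the walrus take the top three places and every primate falls below them.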


How can anybody look at this table and claim Sample 26 matches a primate, a human or a lemur/galago best? Only Melba could, because that’s what she wants it to do. The best match is the polar bear, a stand-in for the black bear, the more likely origin. Even the walrus is a better match than any of Melba’s candidates, and by a genetic long shot. Primates of any kind are out of the question.

As a footnote to history, Scott Carpenter’s blog (on right) posts the leaked peer reviews for Melba’s submissions to JAMEZ and Nature. Referee A of JAMEZ said:

“6. The bioinformatics should include gene sequences from expected outlier species that may also be capable of contributing contaminating nucleic acids. For example, a BLASTN search using Sample 26 does turn up some exceptionally strong homology with a gene from Ursus americanus (DQ240386.1). This would support the idea that the consensus sequence may have been affected by contaminant sequences.”

Dead on. We have done exactly what this reviewer called for. The accession number (DQ240386.1) is the same one you found at the top of your Excel file, produced in steps 1. and 2. above. It was a 100%ID match to S26 over 291 bp. But what was Melba’s response?

“There will always be some homology with other species when short random sequences are chosen, however, your example of bear contamination can be completely ruled out considering none of the laboratories handling the samples have bear samples.” 


100%ID is an exact match, not just “some homology,” and not to be ruled out so easily for a 291 bp sequence. They kind of both missed the boat, didn’t they? The bear is the sample, not any contamination. It’s clear that neither Melba nor the referee did the kind of exhaustive searching that you have. I’ll be addressing Melba’s peer reviews and her responses in a subsequent blog. They are full of uninformed, or perhaps purposefully misleading, responses like the one quoted above. No wonder she didn’t get published.

That should do it for now. Next time we’ll do some more comparisons with the three Excel result files you produced, so don’t lose them. The above table only addresses one Sample 26 sequence range: 189,026 – 191,141. What about other good matches over the remaining 2.5M bases? Do the species line up over these ranges as they do in Table 2? Stay tuned. You’re almost there.