
Saturday, September 27, 2014

Q30 Scores: Arbitrary Relative Numbers - Not A Measure of Contamination

Melba Ketchum makes much of Q30 scores in claiming her samples were uncontaminated, and therefore her highly variable results are real. But "Real what?" is the question.  Ladies and children may be reading this, so I won't answer directly.  Let's start with the basics and demystify this over-touted "statistic." It's not even a real probability, I will show.

You will not learn too much from an advertising "Tech Note" from Illumina(R) like "Quality scores from Next-Generation Sequencing" (http://res.illumina.com/documents/products/technotes/technote_q-scores.pdf ); but read this first; it's short, has some basic graphs, and has one important equation:

Q = -10 log P

That's log base 10.  (Those who paid attention in my general chemistry classes will notice that it has a form like the definition of pH, except for the leading "10.").  Q is the quality "score" for a single base in a sequence - NOT a "probability." P is also  NOT a true "probability" in the statistical sense.  What is P then?  The Devil's in the details.
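For readers who want to put numbers on the equation above, here is a minimal Python sketch of the conversion in both directions. The function names are mine, not Illumina's or Phred's, and P is simply treated as a number between 0 and 1.

```python
import math

def q_from_p(p):
    """Q = -10 * log10(P), with P taken as a number between 0 and 1."""
    return -10.0 * math.log10(p)

def p_from_q(q):
    """Inverse of the above: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# Each step of 10 in Q changes P by a factor of ten.
for q in (10, 20, 30, 40):
    print(q, p_from_q(q))
```

A Q of 30 corresponds to P = 0.001, and a Q of 20 to P = 0.01, which is why a ten-point difference in Q is a bigger deal than it looks.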

Admittedly I was shocked to learn that statistical probability of sequencing error is NOT involved in P or Q.  From Ewing and Green (http://genome.cshlp.org/content/8/3/186.full.pdf+html ):


"The following four parameters were found to be particularly effective at discriminating errors from correct base-calls. In each case, smaller parameter values correspond to higher quality (more accurate sequence).

[NOTE:  Probabilities work oppositely.  The HIGHER the number, the more likely the event, AND the MAXIMUM value is one. HVH]

1. Peak spacing. The ratio of the largest peak-to-peak spacing, in a window of seven peaks centered on the current one, to the smallest peak-to-peak spacing. The minimum possible value of one corresponds to evenly spaced peaks.

2. Uncalled/called ratio. The ratio of the amplitude of the largest uncalled peak, in a window of seven peaks around the current one, to the smallest called peak; if there is no uncalled peak, the largest of the three uncalled trace array values at the location of the called base peak is used instead. [An uncalled peak is a peak in the signal that was not assigned to a predicted location by phred (Ewing et al. 1998) and thus does not result in a base call.] If the called base is an N, Phred assigns a large value of 100.0. Note that this is not what is sometimes called the signal to noise ratio, as uncalled peaks may be true peaks missed by the base-calling program rather than noise in the conventional sense. The minimum parameter value is 0 for traces with no uncalled peaks.

3. Same as 2, but using a window of three peaks.

4. Peak resolution. The number of bases between the current base and the nearest unresolved base, times -1 (to force the parameter to have the right direction). (A base is unresolved if it is called as N or if for at least one of its neighboring bases, there is no point between the two corresponding peaks at which the signal is less than the signal at each peak). The minimum possible parameter value is half the number of bases in the trace, times -1, and the maximum value is 0."

[NOTE: Phred is a computer program.]

Don't be alarmed if this sounds like gibberish.  My point here is that these four parameters ARE ARBITRARY, not fundamental, and the calculations which follow are CURVE-FITTING.  The final result is a Q for each base.  Q30 is the percentage of individual base Q's which exceed 30 by the above equation.  It is not a probability, but a relative number.
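To make the Q30 definition concrete, here is a short Python sketch, using made-up per-base scores. I count bases with Q of 30 or more, which is the usual convention; change `>=` to `>` if you prefer a strict reading of "exceed." The function name and the example numbers are my own inventions.

```python
def percent_q30(q_scores, threshold=30):
    """Percentage of per-base Q values at or above the threshold."""
    if not q_scores:
        raise ValueError("no Q scores given")
    passing = sum(1 for q in q_scores if q >= threshold)
    return 100.0 * passing / len(q_scores)

# Made-up per-base scores for illustration only (not from any real run).
example_qs = [38, 35, 31, 30, 29, 12, 40, 33, 27, 36]
print(percent_q30(example_qs))
```

Seven of these ten invented bases pass, so the "Q30" for this toy read is 70%. Notice that nothing in the calculation knows or cares which species the bases came from.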

What is the effect of a sample containing DNA from multiple species?  Is the Q30 for the major genome affected by the presence of the other genomes?  NOT NECESSARILY.  The contaminant species must be in significant concentrations and must react with the primer, e.g. a universal or non-specific primer.  The electropherogram peak shapes may then be affected, which could affect the four parameters above and could lower Qs.  Statements that impurities lower Q30 to 40-50% are overgeneralizations.  Melba's Q30 of 85% isn't all that great.  Pure, single-species Q30 values are usually >95%.  (See first reference above.)  And remember, this is a logarithmic scale.


Another important fact is that mtDNA is present in the cell at about 1000 X the concentration of nDNA.  This means that mtDNA sequencing requires only about 1/1000 as much sample as nDNA.   Therefore, a contaminant may show up in a mtDNA test and not show up in the nDNA sequencing.  This is the most likely explanation for some of the mixed results in the Ketchum study.

So what's the bottom line?  Melba's whining about Q30 scores is largely irrelevant to her conclusions.  She must have mentioned the number dozens of times in her appeals to the editors of Nature.  They weren't impressed.  Neither am I, and neither should you be.

Next, I'll examine the Nature peer reviews and Melba's responses.  Incredibly, the reviewers totally missed the boat in some cases, but their "arm-chair"* assessments were generally on target.

*Definition (mine): "arm-chair", adj.  Describes an opinion which is not based on any in-depth investigation, but rather on overall impressions and/or generalizations from past experience.





 


Just How Good was Sykes' Mitochondrial 12S rRNA Gene at Distinguishing Species?

In his recent paper (B. C. Sykes, R. A. Mullis, C. Hagenmuller, T. W. Melton, and M. Sartori. Genetic analysis of hair samples attributed to yeti, bigfoot and other anomalous primates. Proc. R. Soc. B. 2014: 281, (1789), p. 20140161.  Available free at http://rspb.royalsocietypublishing.org and at right) Prof. Bryan Sykes et al. sequenced and identified 30 hair samples from around the world, all purportedly from sasquatch, yeti, almasti, orang pendek, etc.  Every one could be attributed to a known extant mammal or closely related mammals, and only one was human.  The technique, known as mitochondrial 12S rRNA (T. Melton & C. Holland.  J Forensic Sci, 2007: Vol. 52, No. 6, pp. 1305-07. Available free at http://www.mitotyping.com/page/37), uses a sequence of only 104 mtDNA bp to compare to a database of known animals, the NCBI GenBank, using BLAST(TM) search software.  How can such a small sequence distinguish among species such as three different bears (American black, polar, brown), horse, dog, cow, deer, serow, sheep, raccoon, tapir, and porcupine?  How certain is the identification?

To answer these questions we took each of Sykes' 30 sequences and searched the nucleotide database at NCBI with BLAST(TM).  We found four relatively minor ambiguities:

(1) The wolf, dog, and coyote (all genus Canis). These take nuclear DNA sequences to distinguish.

(2) The domestic sheep and the Himalayan tahr (both sheep family).


(3) Some brown bears have the same sequence as most polar bears, and Ursus thibetanus japonicus, i.e. the Japanese black bear, matches the majority of brown bears. These three species are all genus Ursus.

(4) The white-tailed and mule deer (both genus Odocoileus).  These deer species are known to hybridize.


Cases (1), (3) and (4) are genus specific: Canis, Ursus, and Odocoileus respectively.  Case (2) is only family specific (sheep). All other animals mentioned above gave a 100% ID match of all 104 bases to only one species, with related species giving consistently poorer matches. The lone human sample did not match any other primates nearly as well.
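The logic behind these ambiguities can be sketched in a few lines of Python. This is only an illustration of the idea, not the BLAST(TM) algorithm: it compares a query to already-aligned reference sequences position by position and reports every species that ties for the best match. The 12-base "sequences" are toy strings I invented; the real ones are 104 bp.

```python
def percent_identity(a, b):
    """Percent of positions that agree between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def best_matches(query, references):
    """Return (top %ID, sorted list of names sharing that top %ID)."""
    scores = {name: percent_identity(query, seq) for name, seq in references.items()}
    top = max(scores.values())
    return top, sorted(name for name, s in scores.items() if s == top)

# Toy sequences, invented for illustration only.
references = {
    "Canis lupus": "ACGTACGTACGT",
    "Canis latrans": "ACGTACGTACGT",
    "Equus caballus": "ACGTTCGTACGA",
}
top, names = best_matches("ACGTACGTACGT", references)
print(top, names)
```

When more than one name comes back at 100%, as with the two toy Canis entries here, the short sequence cannot separate them and the identification holds only at the genus (or family) level.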


The case of bears is an interesting one.  Recent work (C. Lindqvist et al.  Proc. Natl. Acad. Sci. 2010: 107, (11), pp. 5053–5057.  Available free at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841953/) shows that they underwent relatively recent (Pleistocene) divergence, and may, in fact, still be responding to drastic climate changes, especially the brown and polar bears.  These two species are known to hybridize, a fact recognized by Sykes et al.

In their paper Sykes et al. recognize the Cases (1) and (4) ambiguities but not the Cases (2) and (3). Two samples were, however, described as intermediate between brown and polar bears. I asked Prof. Sykes about (3) but got no response. New polar and brown bear data have recently been added to GenBank, making a more detailed analysis of Case (3) possible for some Ursidae expert (not me). 
  
The conclusion (shared by Sykes et al.) is that some species have unique mitochondrial 12S rRNA sequences, and others do not.  Those that do not can be identified at the genus level (as recognized by Sykes et al.) and occasionally only the family level (e.g. the sheep above).

This would have been a useful technique for Melba Ketchum to have used on her bear (S26) and dog (S140) samples.  No way would they show human or primate mitochondrial 12S rRNA sequences.  Come on, Melba, try it.

Thursday, September 25, 2014

The Scott Carpenter File II: More on the Sykes paper

Scott Carpenter concludes his recent commentary (see at right) on the recent Sykes paper (see at right) by saying:

"This paper would have never stood up to the peer review process that the Ketchum DNA Study was subjected too (sic) by the Journal Nature. In short this paper in my opinion was no more than an attempt to discredit the American Bigfoot Community, muddy the water, and discredit the Kethcum DNA Study. I say MAJOR FAIL Dr. Sykes….."

REALLY, Scott?

Just for background on The Royal Society (see : https://royalsociety.org/about-us/), which published the Sykes paper in its Proceedings:

 
“The Royal Society is a Fellowship of the world's most eminent scientists and is the oldest scientific academy in continuous existence.”

“The origins of the Royal Society lie in an 'invisible college' of natural philosophers who began meeting in the mid-1640s to discuss the new philosophy of promoting knowledge of the natural world through observation and experiment, which we now call science.

“Its official foundation date is 28 November 1660, when a group of 12 met at Gresham College after a lecture by Christopher Wren, then the Gresham Professor of Astronomy, and decided to found 'a Colledge for the Promoting of Physico-Mathematicall Experimentall Learning'. This group included Wren himself, Robert Boyle, John Wilkins, Sir Robert Moray, and William, Viscount Brouncker.”  (And, Scott, the spelling above is period 1660s, so don't get to blogging.)

"There are approximately 1,600 Fellows and Foreign Members, including around 80 Nobel Laureates. "

"We do all we can to ensure the peer-review process is fair and we aim to minimize bias.
  • All papers submitted to Royal Society research journals are peer-reviewed in a single-blind fashion (Author names are not concealed, but Referee names are)."

Scott, the two journals, Nature and Proceedings of the Royal Society, have comparable peer review processes, and the Royal Society are no fools.  I wish I could say the same for you.

Furthermore, Prof. Sykes acknowledged in his paper the assistance of  "American Bigfoot Community" individuals such as Justin Smeja, Bart Cutino, Derek Randles, Jeff Meldrum, Maxwell David, Loren Coleman, among others.  Your assertion that Sykes attempted to "discredit" them is outrageously stupid.  They were part of his project and provided samples to him.  Unlike you, Scott, these honest individuals, though they may have participated in the Ketchum study, are more interested in exploring all avenues to finding the truth than covering for Melba.  I commend them for their open mindedness.  You should follow their example.

Since you gave a grade to Prof. Sykes, I give you one, Scott:  "Expelled permanently for perpetrating falsehoods."  Please collect your belongings, proceed to Security, turn in your badge, and leave the (bigfoot) premises. 

In a coming blog I will analyze the peer reviewers' comments on the Ketchum paper as submitted to Nature in the light of new discoveries as well as Scott Carpenter's assessment of them.  Thanks, Scott, for providing me with so much material - your misinformation and uninformed, biased viewpoints.   I'll never be wanting.  I wish I could write faster. 





Tuesday, September 23, 2014

A Good Forum to Check Out

I have been a regular visitor to the Bigfoot Forums site: http://bigfootforums.com (also see link on right).  This is a good site for all bigfoot topics.  My favorite is "The Ketchum Report (Part 3)" under "General Bigfoot Discussion."  Parts 1 and 2 are good too but outdated by now.  It's a closed group, so you must sign up to post or comment.  The rules are good and reasonable.

Wednesday, September 3, 2014

Otolemur garnettii is No Lemur and We're Not a Fish, a Chicken, or a Mouse

"The phylogeny trees clearly indicate relationships with primate sequences, including lemurs, chimpanzee, macaques, gibbons and marmosets and close relationship with humans."  (Ketchup DNA Study, Link on Right)
 
You'll recall that Dr. Melba Ketchum talks a lot about a lemur ancestral line leading to sasquatch.  She got this idea from a phylotree that was generated from her sequence data for a sample (she doesn't say which sample), shown as Supplementary Figure 4 in her paper (Link on right).  From Samples 26 and 140 she also generated her own phylotree for the primates as her Figure 16 (Link on right).  The anthropologists (and quite a number of us laymen) were appalled.  Figure 16 is not a tree - it has no trunk.  It's a star.  Forgiving the graphics for the moment, it's very far from the established primate phylogeny, but I'll let the experts go to town on that.  My question is: Where's the lemur which Melba still refers to, as recently as last Sunday on the "Coast to Coast" radio show with George Knapp?

Answer: THERE IS NO LEMUR.  The closest animal on her Supplementary Figure 4 and Figure 16 is Otolemur garnettii, but it's not a lemur, rather a galago or bush baby.  See the NCBI taxonomy page on the right.  How could this happen?  I suspect that she didn't check this one.  The other primates were described by their common names in the text of her paper, but this one wasn't - it is simply missing.

As if this isn't enough of a taxonomic scandal, it gets worse on Supplementary Figures 5 and 6 (Link on right).  Figure 5 shows a phylotree with only a chicken (Gallus gallus), the mouse (Mus musculus), and 29 species of fish. (See my table below.)  Are these the closest relatives of sasquatch, a purported human-primate hybrid?  I think not.

Supplementary Figure 6 shows a phylotree with only mouse relatives.  Preposterous!  Again, we don't know which of the samples 26, 31, or 140 this represents, but does it really matter?

I call to account those coauthors or consultants who produced these phylotrees.  Did anybody even look up the common names of these species?

Melba always has a simple, usually totally uninformed, explanation for these things.  She says the anomalous results are because her sasquatch samples are from an unknown species which confuses the NCBI software.  Come on, those folks are way smarter than that.  A primate-human hybrid will not show as closest relatives a chicken, a mouse, or any fish.  It should be something like Supplementary Figure 4, which I guess is probably from the human Sample 31.  Remember, "They're people just like us," says Melba.

These are "rookie mistakes," and should be recognized by all as such.  Melba shouldn't be surprised by my comments; I sent them to her early last year. Oh, but hey, I'm not qualified to review her paper.  She said so on FB.

Meet your new relatives (pictures courtesy of Wikipedia).  Now you'll already know their names at your next family reunion.


[Pictures: Cyprinus carpio, Gallus gallus, Otolemur garnettii, Mus musculus]

Fish in Ketchum Supplementary Figure 5:
Cyprinus carpio – common carp
Fenerbahce devosi – dwarf killifish
Nothobranchius furzeri – turquoise killifish
Aphyosemion pascheni – an African lyretail
Epiplatys sexfasciatus – a killifish
Epiplatys bifasciatus – a killifish
Siniperca chuatsi – Chinese perch
Jordanella floridae – American flag fish
Nimbapanchax viridis – an African rivuline
Nimbapanchax jeanpoli – Jeanpol’s killifish
Nimbapanchax leucopterygius – an African rivuline
Nimbapanchax melanopterygius – an African rivuline
Misgurnus fossilis – European weather loach
Latimeria chalumnae – West Indian Ocean coelacanth
Anguilla anguilla – European eel
Anguilla rostrata – American eel
Lepidosiren paradoxa – South American lungfish
Dalatias licha – kitefin shark
Squatina californica – Pacific Angelshark
Centroscymnus owstonii – roughskin dogfish
Squalus acanthias – spiny dogfish
Deania sp – deepwater dogfish shark
Alopias pelagicus – pelagic thresher shark
Carcharias taurus – sand tiger shark
Mitsukurina owstoni – goblin shark
Triakis semifasciata – leopard shark
Apristurus profundorum – deepwater catshark
Hydrolagus colliei – spotted ratfish
Rhinobatos productus – shovelnose guitarfish
       

Searching the NCBI Databases with BLAST™ - Part III


Nobody really believes that a polar bear or a panda roams the wild in California where Ketchum’s Sample 26 was found.  Let’s see if a black bear (Ursus americanus) matches the sequence at all.  Then if it does, we’ll investigate why it hasn’t shown up in previous searches. 

1.  You can skip steps 1. and 2. of Part I.  Complete steps 3. and 4., except with a “Job Title” of Black bear vs. nucleotide.  To the right of “Organism” type black bear and select American black bear (txid: 9643).   Also, this time, select 100 for “Max target sequences” and 28 for “Word size.”

2.  This time the hit list is much shorter, only five matches, but four of them are perfect, 100%ID.  Download the Excel results file and save it as before (step 7, Part I).

So what do you think?  Is the black bear the real origin of Sample 26?  Let’s look at the composition of these databases.  Maybe, just maybe, there’s not very much black bear data in the databases.  We already know there’s not a complete genome from Part II.

3.  Open up http://www.ncbi.nlm.nih.gov/ and then click on “Nucleotide” to open up the list of entries in the Nucleotide collection.  At the top of the new page, see that there are 39,786 entries.  Seems like a lot, but let’s compare it to the polar bear and the panda.  Replace “American black bear” with “polar bear” at the top of the page, and click “Search.”  The new page indicates that there are 100,078 polar bear sequences in the Nucleotide Collection.  Now enter “giant panda” at the top and click “Search.”  There are 184,484 panda entries in the Nucleotide collection.  It’s apparent that the black bear is underrepresented in this database relative to its cousins the polar bear and the giant panda.  That’s why it has been “hibernating” way down the hit lists sorted by score, too far down the first (Part I) list to even notice.  Remember, score increases with the length of the matching sequence.
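Just to put the three counts from step 3 side by side, here is a tiny Python sketch of the arithmetic. The counts are the ones read off the NCBI pages above; they will have changed by the time you read this.

```python
# Entry counts as read off the Nucleotide pages in step 3 above.
entries = {
    "American black bear": 39786,
    "polar bear": 100078,
    "giant panda": 184484,
}

baseline = entries["American black bear"]
for name, count in sorted(entries.items(), key=lambda item: item[1]):
    print(f"{name}: {count} entries, {count / baseline:.1f}x the black bear")
```

The polar bear has roughly two and a half times as many entries as the black bear, and the giant panda more than four and a half times as many.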

So where are we now?  My Table 3 in the first paper is based on the Excel file you just saved.  I went further to show that over four of these five sequence ranges, NO  other species matched S26 better than the black bear.  I redid the fifth sequence range search just today and found that a polar bear is a 100%ID match, due to new data entered in August, 2014.  I now suspect sequencing errors in the black bear data over this relatively short (79 bp) range, or it may have some extra mutations.  Human was always way down the list over each of these five ranges (See my first paper).  With the addition of the new polar bear data, some of my tables in my first paper can be updated.  They’ll look even better for a bear being the source of S26.

Fini! Case closed.  But just for fun, let’s search the “Reference Genomic Sequences” database specifically for primates (including Homo sapiens).

You’re a pro by now, but just to refresh, change your “Job Title;” enter and select primates for “Organism;” 1000 for “Max target sequences;” and 64 for “Word size.”  Then click BLAST.  We’ll now really see for sure if a human-primate hybrid is possible as concluded by Melba.  This one will take a while, so get some refreshment.  Incidentally, if you can’t stand to wait for BLAST™ results, work at night or on weekends to greatly reduce search time (assuming no server, software, or database maintenance is going on).
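If you would rather script the settings than click them, the same choices can be written down as form fields for NCBI's BLAST URL interface. The sketch below only builds the request and prints it; it does not submit anything. The field names follow my reading of NCBI's URL API, and the database name `refseq_genomic`, the `MEGABLAST` switch, and the placeholder query are my assumptions, so check them against NCBI's current documentation before relying on them. The helper name `build_put_params` is hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def build_put_params(query_fasta, database, taxid, max_targets, word_size):
    """Assemble the form fields for a blastn (megablast) submission."""
    return {
        "CMD": "Put",
        "PROGRAM": "blastn",
        "MEGABLAST": "on",
        "DATABASE": database,
        "QUERY": query_fasta,
        # Restrict the search to one taxon and everything below it.
        "ENTREZ_QUERY": f"txid{taxid}[Organism:exp]",
        "HITLIST_SIZE": str(max_targets),
        "WORD_SIZE": str(word_size),
    }

params = build_put_params(
    query_fasta=">S26_placeholder\nACGT",  # placeholder, not the real S26 sequence
    database="refseq_genomic",             # assumed name for the reference genomic database
    taxid=9443,                            # Primates
    max_targets=1000,
    word_size=64,
)
print(BLAST_URL + "?" + urlencode(params))
```

Swapping the taxid for 9632 (Ursidae) or 9643 (American black bear) and adjusting the other two numbers gives the bear searches from the earlier steps.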
Hope you like monkeys and apes, because you just got a whole barrel of them. Some near the top of the list are: Tarsius syrichta (Philippine tarsier), Pan troglodytes (chimpanzee), Pan paniscus (pygmy chimpanzee), Pongo abelii (orangutan), Nomascus leucogenys (white-cheeked gibbon). I think you can figure out Gorilla gorilla gorilla. Notice the %ID for the highest scores is only 93%, much less than for the two bears and the walrus in Part II (96.58 – 98.83%). The scores are not as high either: the top was 3136 vs. 3788 for the polar bear in Part II. And where is Melba’s pet “lemur” (Otolemur garnettii)? In 28th place by score (2662) at 92.01%ID. Definitely not a player. Download and save the Excel file once again. Sort by Column L (score). Notice that the best hits are in the very same query sequence range (189,026 – 191,141) as were the best hits for the bears and the walrus. And so is Melba’s 28th-place lemur – actually Otolemur is not a lemur, but a galago, a bush-baby, the small-eared galago. Now let’s expand the Table from Part II. We’ve added three rows at the bottom: the best primate match (PT), the best human match (H), and the best “lemur”/galago match (SG).

Accession        Species  %ID    LENGTH  MIS.  GAPS  Q-Start  Q-stop  SCORE
Nucleotide Collection
XM_004394587.1   PW       96.58  2136    45    2     189026   191136  3515
Reference Genomic Sequences
NW_007929448.1   PB       98.83  2139    0     1     189028   191141  3788
NW_003218202.1   GP       97.15  2141    33    2     189026   191141  3591
NW_007256002.1   PT       93.32  2142    114   10    189026   191141  3136
NC_000011.10     H        93.28  2142    115   11    189026   191141  3131
NW_003852486.1   SG       92.01  1903    140   7     189026   190926  2662

PW = Pacific walrus, PB = polar bear, GP = giant panda, PT = Philippine tarsier, H = human, SG = small-eared galago


How can anybody look at this table and claim Sample 26 matches a primate, a human or a lemur/galago best? Only Melba could, because that’s what she wants it to do. The best match is the polar bear, a stand in for the black bear, the more likely origin. Even the walrus is a better match than any of Melba’s candidates – and by a genetic long shot. Primates of any kind are out of the question.
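If you don't trust your eyes, let the computer do the ranking. Here is a small Python sketch that sorts the six rows of the table above by score; the numbers are copied straight from the table, and only the variable names are mine.

```python
# (accession, species code, %ID, length, score) copied from the table above.
hits = [
    ("XM_004394587.1", "PW", 96.58, 2136, 3515),
    ("NW_007929448.1", "PB", 98.83, 2139, 3788),
    ("NW_003218202.1", "GP", 97.15, 2141, 3591),
    ("NW_007256002.1", "PT", 93.32, 2142, 3136),
    ("NC_000011.10", "H", 93.28, 2142, 3131),
    ("NW_003852486.1", "SG", 92.01, 1903, 2662),
]

# Highest score first.
ranked = sorted(hits, key=lambda row: row[4], reverse=True)
for rank, (acc, species, pct_id, length, score) in enumerate(ranked, start=1):
    print(rank, species, pct_id, score)
```

The order comes out polar bear, giant panda, Pacific walrus, and only then the tarsier, the human, and the galago.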

As a footnote to history, Scott Carpenter’s blog (on right) posts the leaked peer reviews for Melba’s submissions to JAMEZ and Nature. Referee A of JAMEZ said:

 "6. The bioinformatics should include gene sequences from expected outlier species that may also be capable of contributing contaminating nucleic acids. For example, a BLASTN search using Sample 26 does turn up some exceptionally strong homology with a gene from Ursus americanus (DQ240386.1). This would support the idea that the consensus sequence may have been affected by contaminant sequences.”

Dead on. We have done exactly what this reviewer called for. The accession number (DQ240386.1) is the same one you found at the top of your Excel file, produced in steps 1. and 2. above. It was a 100 %ID match to S26 over 291 bp. But what was Melba’s response?

“There will always be some homology with other species when short random sequences are chosen, however, your example of bear contamination can be completely ruled out considering none of the laboratories handling the samples have bear samples.” 


100 %ID is an exact match, not just “some homology” and not to be ruled out so easily for a 291 bp sequence. They kind of both missed the boat, didn’t they? The bear is the sample – not any contamination. It’s clear that neither Melba nor the referee did the kind of exhaustive searching that you have. I’ll be addressing Melba’s peer reviews and her responses in a subsequent blog. They are full of the kind of uninformed, or perhaps purposefully misleading, responses like the quotation above. No wonder she didn’t get published.
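How unlikely is an exact 291 bp match between two truly "random" sequences? Here is a back-of-the-envelope Python sketch. It assumes a deliberately naive model of my own choosing: four equally likely bases, each position independent. Real genomes are nothing like that, since related species share conserved sequence, so take the number only as a rough yardstick for what chance alone could do.

```python
import math

def log10_chance_exact_match(length, alphabet_size=4):
    """log10 of the probability that `length` random bases all match."""
    return -length * math.log10(alphabet_size)

# For the 291 bp match discussed above.
print(log10_chance_exact_match(291))
```

The result is about -175, i.e. odds on the order of 1 in 10 to the 175th power under this toy model. Whatever explains a perfect 291 bp match, it isn't the luck of the draw.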

That should do it for now. Next time we’ll do some more comparisons with the three Excel result files you produced, so don’t lose them. The above table only addresses one Sample 26 sequence range: 189,026 – 191,141. What about other good matches over the remaining 2.5M bases? Do the species line up over these ranges as they do in Table 2? Stay tuned. You’re almost there.
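For those who would rather not sort spreadsheets by hand, here is a minimal Python sketch for reading a saved hit table and listing the top scores. I am assuming the file is comma-separated with no header row and the usual twelve columns, with score in the last one (Column L); the column names are my own labels. In the two sample rows, the %ID, length, mismatch, gap, query range and score values come from the table in this post, while the subject coordinates and e-values are placeholders.

```python
import csv
import io

# Assumed column order of the hit table (columns A to L in a spreadsheet).
COLUMNS = ["query", "subject", "pct_id", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "score"]

def load_hits(handle):
    """Read a headerless, comma-separated hit table into a list of dicts."""
    hits = []
    for row in csv.reader(handle):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        hit["pct_id"] = float(hit["pct_id"])
        hit["score"] = float(hit["score"])
        hits.append(hit)
    return hits

def top_hits(hits, n=3):
    """Return the n hits with the highest score, best first."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:n]

# Two sample rows; subject coordinates and e-values are placeholders.
sample = io.StringIO(
    "S26,NW_003218202.1,97.15,2141,33,2,189026,191141,1,2141,0.0,3591\n"
    "S26,NW_007929448.1,98.83,2139,0,1,189028,191141,1,2139,0.0,3788\n"
)
for hit in top_hits(load_hits(sample)):
    print(hit["subject"], hit["pct_id"], hit["score"])
```

To use it on one of your own files, replace the `io.StringIO(...)` sample with `open("your_file.csv", newline="")`, where the file name is whatever you saved it as.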

Tuesday, September 2, 2014

Searching the NCBI Databases with BLAST™ - Part II


I hope that you are enjoying this exercise.  Your host has been the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) of the National Institutes of Health (NIH).   Researchers from around the world submit their DOCUMENTED DNA samples from CONTROLLED EXPERIMENTS to the databases in this library, for all of us to search.  And just when you may have thought that your federal government never does anything FOR YOU.  I wonder why anybody would think that.

I have maintained all along that Ketchum et al. did not search other databases than the “Nucleotide Collection.”  If I’m wrong she should present some dated output proving me so; I'd really like to see what they found.  We searched the Nucleotide Collection in Part I.  I believe that is all Melba ever did.  In order to construct an all-inclusive table of BEST matches like my Table 1 in my first paper, one needs to do multiple searches.  Additionally, whereas we found the polar bear in our first search, no polar bear data was in the Nucleotide Collection when I did my initial work and certainly not when Melba did hers prior to mine.  Let’s do another search.  If you are still logged on from Part I, go directly to your BLAST™  input page, and skip to step 2. below.

1.  You can omit steps 1. and 2. of Part I, now that you have a FASTA file of the S26 sequence, saved where you can find it.  Proceed immediately through steps 3. to 4a.

2.  For 4b. enter a “Job Title” as “S26 vs. reference genomic sequence.”

3.  Now on the “Database” dropdown menu select Reference genomic sequences (ref_seq genomic).  This is a database of COMPLETE nuclear genomes.  It has far fewer species than the Nucleotide Collection, but, importantly, it has more data on any given species.  This will prove to be critical to finding the best match for S26.  We didn’t search it first in Part I because the number of species it contains is VERY much less than the Nucleotide Collection.

4.   Now in the “Organism” field type Ursidae and when it comes up click on it.  The “taxid: 9632” locates all bears in the “Taxonomy” database mentioned in Part I.  This will limit our search to bears only, which will also greatly reduce search time.

5. Then complete steps 4.d through 4.f in Part I, except select 500 for “Max target sequences.”  We won’t need as many output sequences this time, because we are only searching for bears. “Word size” 64 is still important.

6.  The BLAST™ results screen will eventually open up before your eyes.  This time notice only two species of bear on the hit list, the polar bear (Ursus maritimus) and the giant panda (Ailuropoda melanoleuca).  The black bear (Ursus americanus) has not yet shown up.  Glance at the “Ident” column to see that these are VERY GOOD matches to our S26 sequence: in retrospect, genus level match for the polar bear and family level match for the panda.

7.  Now let’s take a little break from searches to investigate the taxonomy/phylogeny of bears. Do not delete your hit results page; we’ll come back to it in a moment.  Open the NCBI webpage: http://www.ncbi.nlm.nih.gov/ and click “Taxonomy” on the left side under “Databases”.  In the new page click Browser under “Taxonomy Tools.”  Now in the new input page go to “Search for” in the upper left and enter bears.

8.  A phylogenetic tree for bears is seen.  Use a little imagination: branches to other species are off the page to the left; each line is a twig; indentations indicate levels of the branches and twigs. Under the family Ursidae, indented one level, are the several genera; then under each genus, indented another level, are the species; finally, indented one more level, are the subspecies.  The key takeaway points are: 1) the giant panda is in a genus by itself (Ailuropoda), 2) the black bear and the polar bear are in the same genus, Ursus, and are therefore more closely related than either is to the panda.  Keep this in mind as we proceed to examine our new search results.

9.  Back to the BLAST™ results page.  Follow steps 7.a. to 7.e from Part I.  You may want to copy the column headings from your first Excel file in Part I.  Let’s look at this file.  Also, please open up the Excel file from the Nucleotide Collection search in Part I for comparison.  We have a new champion “best of show” match (highest %ID and score) at the top of the new Excel file.    Check by clicking the accession number in the BLAST™ output to see that it’s a polar bear.     

10.  Let’s focus on three lines of data:  The first line (Pacific walrus) in the Nucleotide Collection from the Excel results file (Part I) and the first two lines (a polar bear, and a panda) of the new Reference Genomic Sequence Excel results file produced above.  A summary of the important data follows, abbreviated from these two source files (four columns were omitted as presently irrelevant).  I added “Species” for your convenience.

 

Accession        Species  %ID    LENGTH  MIS.  GAPS  Q-Start  Q-stop  SCORE
Nucleotide Collection
XM_004394587.1   PW       96.58  2136    45    2     189026   191136  3515
Reference Genomic Sequences
NW_007929448.1   PB       98.83  2139    0     1     189028   191141  3788
NW_003218202.1   GP       97.15  2141    33    2     189026   191141  3591

PW = Pacific walrus, PB = polar bear, GP = giant panda

 

Can you see that over this query sequence range (Q-start to Q-stop) the order of match is (best to worst):

1.  Polar Bear: best match, highest %ID, fewest mismatches, fewest gaps, and highest score. 

2.  Giant Panda: next best, intermediate.

3.  Pacific walrus: worst match, lowest %ID, most mismatching bases, tied for most gaps, lowest score?

Notice how close these %IDs and scores are, yet who would confuse a walrus, a polar bear and a panda by sight alone?  This is because of conservation of genes (those invisible little segments of your chromosomes in the nucleus of all your cells), namely that important genes are passed down through evolution from ancestor to progeny with minimal mutations.  This is why I invented the concept of moments to compare matches.  See my first paper.  It uses %ID - 95:  i.e. 1.58, 3.83, and 2.15, respectively, top to bottom above.  These numbers are much more different in relative magnitude than are 96.58, 98.83, and 97.15, respectively.  Important phylogeny is not as likely to be lost in rounding off or “eyeballing” numbers.
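Here is the %ID - 95 arithmetic as a short Python sketch, for anyone who wants to check it. It covers only the subtraction quoted above, not the full method from my first paper, and the ratio at the end is just my quick way of showing the difference in relative spread.

```python
# %ID values from the table above, in table order.
pct_id = {"Pacific walrus": 96.58, "polar bear": 98.83, "giant panda": 97.15}

BASELINE = 95.0  # the offset used in the text

excess = {name: round(value - BASELINE, 2) for name, value in pct_id.items()}
print(excess)

# Relative spread (largest / smallest) before and after subtracting the baseline.
raw_ratio = max(pct_id.values()) / min(pct_id.values())
excess_ratio = max(excess.values()) / min(excess.values())
print(round(raw_ratio, 3), round(excess_ratio, 2))
```

The raw %IDs differ by only about 2% from largest to smallest, while the baseline-subtracted values differ by a factor of more than two.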

Next time in Part III we’ll look for the black bear (drilling down) and discover why it’s still hibernating from us as well as look harder for Homo sapiens, other primates, and all other animals (stepping back).   We want to get this right before telling Melba.  As it is, she’s going to be very upset with us.  She may even unfriend us in FB and call our work “nastiness.”

Save your BLAST™ results page as a Webpage (*.html) file.  Really enterprising students may wish to attempt the above searches with Samples 31 and 140, downloading their sequences from Melba’s Sasquatch Genome Project website (on the right).  But, not to worry, I’ll give you some hints later if you don’t feel up to this just yet.  

You’ve been a really good class.  (The clapping and cheering is deafening, even on the Internet.)

 

P.S.  Has anyone found a lemur yet?  Hope not.