
Tuesday, September 2, 2014

Searching the NCBI Databases with BLAST™ - Part II


I hope that you are enjoying this exercise.  Your host has been the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH).  Researchers from around the world submit their DOCUMENTED DNA samples from CONTROLLED EXPERIMENTS to the databases in this library, for all of us to search.  And just when you may have thought that your federal government never does anything FOR YOU.  I wonder why anybody would think that.

I have maintained all along that Ketchum et al. did not search any database other than the “Nucleotide Collection.”  If I’m wrong, she should present some dated output proving it; I'd really like to see what they found.  We searched the Nucleotide Collection in Part I.  I believe that is all Melba ever did.  In order to construct an all-inclusive table of BEST matches like Table 1 in my first paper, one needs to do multiple searches.  Additionally, whereas we found the polar bear in our first search, no polar bear data was in the Nucleotide Collection when I did my initial work, and certainly not when Melba did hers prior to mine.  Let’s do another search.  If you are still logged on from Part I, go directly to your BLAST™ input page, and skip to step 2. below.

1.  You can omit steps 1. and 2. of Part I, now that you have a FASTA file of the S26 sequence, saved where you can find it.  Proceed immediately through steps 3. to 4.a.

2.  For 4.b. enter a “Job Title” as “S26 vs. reference genomic sequence.”

3.  Now on the “Database” dropdown menu select Reference genomic sequences (refseq_genomic).  This is a database of COMPLETE nuclear genomes.  It has far fewer species than the Nucleotide Collection, but, importantly, it has more data on any given species.  This will prove to be critical to finding the best match for S26.  We didn’t search it first in Part I because the number of species it contains is VERY much smaller than that of the Nucleotide Collection.

4.   Now in the “Organism” field type Ursidae and when it comes up click on it.  The “taxid: 9632” locates all bears in the “Taxonomy” database mentioned in Part I.  This will limit our search to bears only, which will also greatly reduce search time.

5.  Then complete steps 4.d through 4.f in Part I, except select 500 for “Max target sequences.”  We won’t need as many output sequences this time, because we are only searching for bears.  “Word size” 64 is still important.

6.  The BLAST™ results screen will eventually open up before your eyes.  This time notice only two species of bear on the hit list, the polar bear (Ursus maritimus) and the giant panda (Ailuropoda melanoleuca).  The black bear (Ursus americanus) has not yet shown up.  Glance at the “Ident” column to see that these are VERY GOOD matches to our S26 sequence: in retrospect, genus level match for the polar bear and family level match for the panda.

7.  Now let’s take a little break from searches to investigate the taxonomy/phylogeny of bears. Do not delete your hit results page; we’ll come back to it in a moment.  Open the NCBI webpage: http://www.ncbi.nlm.nih.gov/ and click “Taxonomy” on the left side under “Databases”.  In the new page click Browser under “Taxonomy Tools.”  Now in the new input page go to “Search for” in the upper left and enter bears.

8.  A phylogenetic tree for bears is seen.  Use a little imagination: branches to other species are off the page to the left; each line is a twig; indentations indicate levels of the branches and twigs.  Under the family Ursidae, indented one level, are the several genera; then under each genus, indented another level, are the species; finally, indented one more level, are the subspecies.  The key takeaway points are: 1) the giant panda is in a genus by itself (Ailuropoda), and 2) the black bear and the polar bear are in the same genus, Ursus, and are therefore more closely related to each other than either is to the panda.  Keep this in mind as we proceed to examine our new search results.

9.  Back to the BLAST™ results page.  Follow steps 7.a. to 7.e from Part I.  You may want to copy the column headings from your first Excel file in Part I.  Let’s look at this file.  Also, please open up the Excel file from the Nucleotide Collection search in Part I for comparison.  We have a new champion “best of show” match (highest %ID and score) at the top of the new Excel file.    Check by clicking the accession number in the BLAST™ output to see that it’s a polar bear.     

10.  Let’s focus on three lines of data: the first line (Pacific walrus) of the Nucleotide Collection Excel results file from Part I, and the first two lines (a polar bear and a panda) of the new Reference Genomic Sequences Excel results file produced above.  A summary of the important data follows, abbreviated from these two source files (four columns were omitted as presently irrelevant).  I added “Species” for your convenience.

 

Accession        Species  %ID    LENGTH  MIS.  GAPS  Q-Start  Q-stop  SCORE
Nucleotide Collection
XM_004394587.1   PW       96.58  2136    45    2     189026   191136  3515
Reference Genomic Sequences
NW_007929448.1   PB       98.83  2139    0     1     189028   191141  3788
NW_003218202.1   GP       97.15  2141    33    2     189026   191141  3591

PW = Pacific walrus, PB = polar bear, GP = giant panda

 

Can you see that over this query sequence range (Q-start to Q-stop) the order of match is (best to worst):

1.  Polar Bear: best match, highest %ID, fewest mismatches, fewest gaps, and highest score. 

2.  Giant Panda: next best, intermediate.

3.  Pacific walrus: worst match, lowest %ID, most mismatching bases, tied for most gaps, lowest score?
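If you would rather let the computer do the ranking, here is a minimal Python sketch that sorts the three summary rows from best to worst.  The numbers are copied from the summary above; the `Hit` record, the `rank_hits` function, and the choice of sort key (score first, then %ID, then mismatches) are my own illustrative assumptions, not anything BLAST™ itself produces.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of the summary table (field names are illustrative)."""
    accession: str
    species: str
    pct_id: float
    length: int
    mismatches: int
    gaps: int
    q_start: int
    q_stop: int
    score: int


# Values copied from the three summary rows above.
HITS = [
    Hit("XM_004394587.1", "Pacific walrus", 96.58, 2136, 45, 2, 189026, 191136, 3515),
    Hit("NW_007929448.1", "polar bear", 98.83, 2139, 0, 1, 189028, 191141, 3788),
    Hit("NW_003218202.1", "giant panda", 97.15, 2141, 33, 2, 189026, 191141, 3591),
]


def rank_hits(hits):
    """Return hits best-first: highest score, then highest %ID, then fewest mismatches."""
    return sorted(hits, key=lambda h: (-h.score, -h.pct_id, h.mismatches))


for place, hit in enumerate(rank_hits(HITS), start=1):
    print(f"{place}. {hit.species}: %ID={hit.pct_id}, "
          f"mismatches={hit.mismatches}, gaps={hit.gaps}, score={hit.score}")
```

For these three rows every criterion agrees, so the order of the sort key does not change the outcome; with a longer hit list you would want to think about which column should break ties.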

Notice how close these %IDs and scores are, yet who would confuse a walrus, a polar bear and a panda by sight alone?  This is because of conservation of genes (those invisible little segments of the chromosomes in the nucleus of all your cells), namely that important genes are passed down through evolution from ancestor to progeny with minimal mutations.  This is why I invented the concept of moments to compare matches.  See my first paper.  It uses %ID - 95:  i.e. 1.58, 3.83, and 2.15, respectively, top to bottom above.  These numbers are much more different in relative magnitude than are 96.58, 98.83, and 97.15, respectively.  Important phylogeny is not as likely to be lost in rounding off or “eyeballing” numbers.
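To see the effect of the %ID - 95 shift in numbers, here is a small Python sketch.  It only performs the subtraction and compares the largest-to-smallest ratio before and after; it is not the full moments calculation, which is defined in the first paper and not reproduced here.  The names `offsets` and `spread` are hypothetical helpers made up for this illustration.

```python
# %ID values copied from the summary table.
PCT_ID = {
    "Pacific walrus": 96.58,
    "polar bear": 98.83,
    "giant panda": 97.15,
}
BASELINE = 95.0  # the "95" in "%ID - 95"


def offsets(pct_ids, baseline=BASELINE):
    """Subtract the baseline from each %ID, rounded to two decimals."""
    return {name: round(pct - baseline, 2) for name, pct in pct_ids.items()}


def spread(values):
    """Ratio of the largest value to the smallest (1.0 means no separation)."""
    vals = list(values)
    return max(vals) / min(vals)


shifted = offsets(PCT_ID)
print(shifted)
print("spread of raw %ID:     ", round(spread(PCT_ID.values()), 2))
print("spread of %ID minus 95:", round(spread(shifted.values()), 2))
```

The raw %IDs differ by only a few percent of each other, while the shifted values differ by more than a factor of two, which is the point of the paragraph above.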

Next time in Part III we’ll look for the black bear (drilling down) and discover why it’s still hibernating from us as well as look harder for Homo sapiens, other primates, and all other animals (stepping back).   We want to get this right before telling Melba.  As it is, she’s going to be very upset with us.  She may even unfriend us in FB and call our work “nastiness.”

Save your BLAST™ results page as a Webpage (*.html) file.  Really enterprising students may wish to attempt the above searches with Samples 31 and 140, downloading their sequences from Melba’s Sasquatch Genome Project website (on the right).  But, not to worry, I’ll give you some hints later if you don’t feel up to this just yet.  
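For students who want to repeat the same settings on several samples without re-clicking every menu, the choices made in steps 3. to 5. can be written down once as request parameters.  The sketch below only BUILDS the parameter string; it does not contact NCBI.  Treat the details as assumptions on my part: the parameter names follow my understanding of NCBI's BLAST URL API, the database name `refseq_genomic` and the megablast setting are assumed to correspond to what we picked on the web form, and the short sequence is a made-up placeholder, not the real S26 data.  Check the current NCBI documentation before sending anything.

```python
from urllib.parse import urlencode

# Assumed endpoint for NCBI's BLAST URL API; shown for reference, not called here.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_search_params(fasta_text):
    """Collect the settings from steps 3. to 5. as a dict of request parameters."""
    return {
        "CMD": "Put",                      # submit a new search
        "PROGRAM": "blastn",               # nucleotide query vs. nucleotide database
        "MEGABLAST": "on",                 # assumption: word size 64 implies megablast
        "DATABASE": "refseq_genomic",      # assumption: Reference genomic sequences
        "ENTREZ_QUERY": "txid9632[ORGN]",  # Ursidae only (taxid 9632, step 4.)
        "HITLIST_SIZE": "500",             # "Max target sequences" (step 5.)
        "WORD_SIZE": "64",                 # "Word size" (step 5.)
        "QUERY": fasta_text,               # contents of your saved FASTA file
    }


# Placeholder sequence only; substitute the text of your own FASTA file.
placeholder_fasta = ">S26_placeholder\nACGTACGTACGT"
query_string = urlencode(build_search_params(placeholder_fasta))
print(query_string)
```

Keeping the settings in one function means the only thing that changes between Samples 26, 31 and 140 is the FASTA text you pass in.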

You’ve been a really good class.  (The clapping and cheering is deafening, even on the Internet.)

 

P.S.  Has anyone found a lemur yet?  Hope not.