Monday, September 1, 2014

Searching NCBI Databases with BLAST™ - PART I


NOTE:  Don’t close any of the following windows until you finish the exercise.  You may want to come back to them. 

1.  To download the Sample 26 sequence from the Sasquatch Genome Project website, open:  http://sasquatchgenomeproject.org/view-dna-study/    Click on the “View DNA Study” tab and then click “Supplemental Data 4 Sample 26 consensus text.”  Then do:   File, Save as…,   Text File (*.txt).

2. Open this file and delete the first two rows (second one is blank).  Then: File, Save, in a directory where you can find it.   The new first row should begin with “>.”  You now have a FASTA file in the correct format to do a BLAST™ search.

3.  Open the webpage:  http://blast.ncbi.nlm.nih.gov/Blast.cgi and click nucleotide blast.

4.  This is your input screen.

            a.  To the right of “Or, upload file” click on “Browse,” which opens the “Choose File to Upload” screen.  Find the file you saved in step 2 above, highlight it, and click “Open.” The path to this sequence file should now appear in the box to the left of “Browse.”

            b.  Enter a “Job Title” which will help you identify the output later: “S26 vs. nucleotide.”  Copy this Job Title to your clipboard by highlighting it, then right-clicking and choosing “Copy.”  It will become a useful heading and filename for a spreadsheet later.

            c.   Check to the right of Database to see whether “Nucleotide collection (nr/nt)” appears in the dropdown menu box.  If not, click the down arrow and select this database.

            d.   Go toward the bottom of the page and click “Algorithm parameters.”  To the right of “Max target sequences” click the down arrow and select 5000.  Three lines lower, to the right of “Word size,” click the down arrow and select 64.  This is very important; otherwise the NCBI server cannot handle the sequence of 2.7 million bases.  You have just outsmarted Melba Ketchum and Scott Carpenter’s “expert,” Dr. David Swenson, who said, “My desktop had difficulty with a blast analysis of the consensus sequences,” but then nevertheless went on to ARMCHAIR a pseudoreview of Melba’s paper, concluding “Sasquatch is real, as proven by genetic analysis.”  What?  How proven? You couldn’t even run a sequence search, David.

    Leave all the other parameters in their default values.

            e.   Go to the very bottom of the page and check the box for “Show results in a new window.”  Then just to the left, click on “BLAST.”

f.  Now, stand up, pat yourself on the back for being so smart (Melba didn’t think you could do it – you’re not an “expert”), and go get a favorite beverage from your refrigerator, while the NCBI server does the search.  This may take a while during normal work hours on a work day, when you share NCBI servers with many others.

5.        AHaaaah!  You now see the BLAST™ search results screen.  But Ohhhhh!  This is not what you expected: “A bewildering list of many different species” (H. V. Hart) headed by Odobenus rosmarus divergens (Pacific walrus), then Leptonychotes weddellii, the Weddell seal.  To check my common names, click on the corresponding “Accession” numbers to see the full NCBI database entry, including the common name in parentheses following the Latin.

6.  Notice that the list of hits is sorted by Max score.  Click on the “Ident” heading to see the list sorted by %ID of the best match by score for each accession entry.  Now we see that Ursus maritimus, the polar bear, tops the list in the 99-100 %ID range.  In neither sort was Homo sapiens (human) or any other primate found near the top of the list, so a human-primate hybrid is just not going to happen for this sample.  Are you getting excited yet?  You should be!!  Now you’ve completely outsmarted Melba and all her silent coauthors plus her anonymous consultants plus her diehard supporters such as David Swenson, Scott Carpenter, Chris Noël, Adrian Erickson, David Paulides and others.  They believe (by association with and by supporting Melba and her conclusions) that a list of many species means that the sample is a “completely unknown” new species.  So, you’re not doing too badly for a few minutes’ work.  “But wait….there’s more.”

7.  Now we are going to do something that I doubt Ketchum et al. ever did.  Go to the top of the page and click on “Download,” then under Alignment click on “Hit Table(csv).”  At the bottom a new box opens.  Answer the question by clicking on “Open.”  You now have an Excel file of hits, a.k.a. alignments, a.k.a. matches of your S26 sequence to the database.

8.  This file is 69,401 rows long, so we’ll want to sort it to look at the most relevant entries on the top. 

            a.  Highlight columns A-L.

            b.  At the top of the page, click “Data” then click “Sort.”  This opens up a “Sort” dialog box.  Uncheck “My data has headers,” to include the first row in the sorting.

            c.  Select Sort by: “Column L” (the score); Sort On: “Values”; Order: “Largest to Smallest.”  Then click “Add Level” – a new row of secondary sorts will appear. This time select Column C (%ID), Values, and “Largest to Smallest.” Then click “OK,” and see the table sort by score, then %ID.

d.  You may want to add a new line to your table at the top and Right Click, then “Paste” in cell A1 that “Job Title” that you’ve been saving on your clipboard from the BLAST™ input screen.  If you lost it, just go back to 4.b. above and “Copy” it again, then “Paste.”

e.  Now you may want to provide the column headings in cells B1 – L1.  They are (use your own abbreviations), respectively, ACCESSION, %ID, LENGTH,  MISMATCHES, GAPS, Query Start, Query Stop, Database Start, Database Stop, E-Value, and SCORE.  We’ve already dealt with those in bold.  The others we’ll use later.   Then click File, Save As: Excel Workbook, give it a name (perhaps the one still on your clipboard), and “Save” in a directory where you can find it later.  This file now contains all the search results of the last BLAST™ page, plus some additional useful information.
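If you would rather not do the sort in Excel, the same two-level sort from step 8 can be sketched in a few lines of Python. This is only a rough sketch, assuming the hit table has the column layout described above (column C is %ID, column L is the score). The function names and the sample rows are made up for illustration and are not real search results.

```python
import csv
import io

# 0-based positions of the columns used in step 8 (assumed layout of the hit table)
PCT_ID = 2   # column C, %ID
SCORE = 11   # column L, score


def read_hits(handle):
    """Read hit-table rows, skipping blank lines and '#' comment lines."""
    return [row for row in csv.reader(handle) if row and not row[0].startswith("#")]


def sort_hits(rows):
    """Sort by score, then by %ID, both largest to smallest."""
    return sorted(rows, key=lambda r: (float(r[SCORE]), float(r[PCT_ID])), reverse=True)


# Hypothetical rows, invented only to show the sort order
sample = """\
S26,ACC_A,98.5,1200,18,0,1,1200,500,1699,0.0,2100
S26,ACC_B,99.9,800,1,0,2000,2799,10,809,0.0,1470
S26,ACC_C,97.0,1200,36,0,1,1200,40,1239,0.0,2100
"""

hits = sort_hits(read_hits(io.StringIO(sample)))
for row in hits:
    print(row[1], row[PCT_ID], row[SCORE])
```

To run it on your own download, replace the `io.StringIO(sample)` part with an open file handle for the saved CSV, for example `open("hits.csv", newline="")`, where the filename is whatever you chose when saving.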

We’ll now have to resolve a bit of a dilemma:  Pacific walrus has the highest score, but polar bear has the highest %ID.  Which is closest to our Sample 26 species? Score has a mathematical definition.  Basically, it combines LENGTH, %ID, MISMATCHES, and GAPS in one number which is largest for the best match, successively smaller for poorer matches.
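To make that idea a little more concrete, here is a rough Python sketch of how a score can be built from matches, mismatches, and gaps, and then converted to a bit score with the standard formula S' = (λS − ln K) / ln 2. The reward, penalty, and gap cost defaults, and the λ and K values in the usage lines, are illustrative assumptions, not necessarily the settings your search used; BLAST reports the actual λ and K in its search summary.

```python
import math


def raw_score(matches, mismatches, gap_bases, reward=1, penalty=-2, gap_cost=2.5):
    """Raw alignment score: reward matches, penalize mismatches and gapped bases.

    The default reward, penalty, and gap cost are assumed for illustration.
    """
    return reward * matches + penalty * mismatches - gap_cost * gap_bases


def bit_score(raw, lam, k):
    """Convert a raw score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


# Illustrative lambda and K values (assumed, not taken from the S26 search)
LAM, K = 1.28, 0.46

perfect = bit_score(raw_score(matches=100, mismatches=0, gap_bases=0), LAM, K)
imperfect = bit_score(raw_score(matches=98, mismatches=2, gap_bases=0), LAM, K)
print(round(perfect, 1), round(imperfect, 1))
```

The point of the sketch is the behavior, not the exact numbers: a long alignment with a few mismatches can still outscore a short perfect one, which is how walrus can lead on score while polar bear leads on %ID.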

As a footnote, bears diverged from seals, sea lions, and walruses about 30 MYBP (million years before present).  The giant panda, Ailuropoda melanoleuca, diverged from all other bears (extant and extinct) about 22-24 MYBP and is the sole member of its genus (Ailuropoda). It’s sometimes called a living fossil, and only recently was genetically declared a bear (Ursidae).  Previously it was thought to be in the raccoon family, like the red panda.  See what neat stuff you are led to by doing these BLAST™ searches?  From the NCBI home page, https://www.ncbi.nlm.nih.gov/  select “Taxonomy” on the left side to discover where any species fits into the complete Tree of Life.  This is helpful in interpreting BLAST™ results, especially when the best match is not the true query species, but is closely related. But how close is, for example, the polar bear to the black bear, the only extant bear in California, the origin of S26?  And why hasn’t the black bear shown up?  We’ll find out in Part II.

So as not to make this blog too long, we’ll conclude PART I here.  It might be time to go back for another drink – you earned it!!

Are we having fun yet?