
Exercise 4: Using BLAST to identify a gene (cont. from Exercise 3)

I. INTRODUCTION
Once you have identified a likely reading frame for your DNA sequence, you will want to see if it corresponds to any known protein. Alternatively, if you obtained two reading frames of nearly equal length, you will need to decide which is correct. To accomplish these tasks, you can compare your sequences to all of the known protein sequences in the databases using a search tool known as BLAST. BLAST comes in several variants, depending on whether your query is a DNA sequence or an amino acid sequence and on whether you are searching a nucleotide or a protein database. (BLAST Background)
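The choice of BLAST variant can be summarized as a small lookup. The sketch below is only an illustration: the function name and the "nucleotide"/"protein" labels are made up for this example, while the program names are the standard BLAST programs.

```python
# Standard BLAST programs keyed by (query type, database type).
# The labels "nucleotide" and "protein" are illustrative choices for this sketch.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in all six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in all six frames
}
# A fifth program, tblastx, compares a translated nucleotide query against a
# translated nucleotide database; it is left out to keep the lookup unambiguous.


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for a query type and database type."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for {key}")
    return BLAST_PROGRAMS[key]


# In this exercise the query is a translated (amino acid) sequence and the
# database holds protein sequences.
print(choose_blast_program("protein", "protein"))
```

This is why the exercise uses BLASTP: both the query and the database are protein sequences.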

We are going to do this exercise twice. First, we will take the longest open reading frame and use it as a query sequence with BLASTP. After saving those results, we will then take the next longest amino acid sequence and use it as our query sequence.

II. IDENTIFYING YOUR SEQUENCE
1. Copy and paste your longest translated sequence into the first box below. Copy and paste the next longest sequence into the second box below. (Or call up the translated sequences you submitted previously.)

Longest Amino Acid Sequence

Second Longest Sequence (do not include the stop signals)

2. Open a new window with the BLAST search engines list by opening the link to BLAST

3. From this page, under Protein select the standard Protein-Protein BLAST (BLASTP)

4. Copy the sequence from the first box above and paste it into the Search box on the protein-protein blast page

5. Scroll down this page to the Format section. In this section, use the pull-down menus to change the Descriptions to 10 and the Alignments to 10. Change the Layout to One Window. We will leave the Options section settings on the default values and will address these choices in a more advanced exercise.

6. Click on the Blast button at the bottom or top of the screen. A new window will appear that gives an estimate of how long the search will take and lists conserved domains in your query sequence. You may want to copy your request ID number, but usually this isn't necessary. After the indicated time has passed, press the Format button to see your results.

If similarity to any known protein has been found, you will see a color window (which may or may not print) showing the degree and range of similarity. Perfect matches show up as red, the next best as purple, mediocre matches as green, poor matches as blue, and very poor or no match as black. If you scroll down you will see the best 10 alignments (make sure you have limited this to 10!). If the DNA sequence has already been identified, it should show up as a perfect match: the score is generally between 200 and 400, but could be lower depending on the size of the peptide analyzed, and the E value will be around 10^-50 to 10^-100. The E value is the number of matches with a score at least this high that you would expect to find by chance in a database of this size, so the smaller the E value, the more significant the match. (For more information on the meaning of E values)

7. Copy the line below the color alignment window which shows the sequence producing the best alignment. This will give you the identifiers (gi number and other identifying numbers) you will need to download the full protein from the database for characterization in Project 5. Save this information.
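The same search can in principle be scripted rather than clicked through. The sketch below only builds the request parameters and sends nothing over the network. The parameter names (CMD, PROGRAM, DATABASE, QUERY, HITLIST_SIZE, RID, FORMAT_TYPE, DESCRIPTIONS, ALIGNMENTS) follow NCBI's BLAST URL API as we understand it and should be checked against the current NCBI documentation. The `nr` database, the example sequence, the request ID and the helper names are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def clean_protein(seq):
    """Keep only residue letters, dropping whitespace and stop symbols such as '*'."""
    return "".join(ch for ch in seq.upper() if ch.isalpha())


def build_submit_params(seq, max_hits=10):
    """Parameters for submitting a blastp search (steps 3 to 6 above)."""
    return {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": "nr",  # assumption: the exercise does not name a database
        "QUERY": clean_protein(seq),
        "HITLIST_SIZE": max_hits,
    }


def build_fetch_params(request_id, max_hits=10):
    """Parameters for retrieving formatted results with the request ID."""
    return {
        "CMD": "Get",
        "RID": request_id,
        "FORMAT_TYPE": "Text",
        "DESCRIPTIONS": max_hits,  # mirrors "Descriptions to 10"
        "ALIGNMENTS": max_hits,    # mirrors "Alignments to 10"
    }


# Hypothetical query with a trailing stop symbol, and a made-up request ID.
submit = build_submit_params("MKTAYIAKQR*")
fetch = build_fetch_params("EXAMPLE123")
print(BLAST_URL + "?" + urlencode(submit))
print(BLAST_URL + "?" + urlencode(fetch))
```

The request ID in the second call plays the same role as the request ID number shown on the waiting page in step 6.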

III. CONCLUSION
Congratulations! You now know what your sequence corresponds to. If you are working with an unusual organism you may find only partial similarity and you may need to exercise some caution in deciding if your gene really does correspond to a known protein. We can get a sense of what a significant match is by comparing this result to a result using a random sequence as our query sequence. In this case, we will use the next longest amino acid sequence we got out of the other reading frames as a "random" query.
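To get a feel for why a high score is significant and a low one is not, it helps to see how quickly the expected number of chance hits falls as the score rises. The sketch below uses the standard bit-score relation E = m × n × 2^(-S), where m is the query length, n is the total database length and S is the bit score. The query length and database size are made-up illustrative numbers, and treating the scores in this exercise as bit scores is an assumption. Real BLAST output applies corrections (effective lengths, composition adjustments), so reported E values will differ from these.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S): expected number of chance hits scoring at least S bits."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative numbers only: a 300-residue query against a database
# containing one billion residues in total.
QUERY_LENGTH = 300
DATABASE_LENGTH = 1_000_000_000

for score in (25, 50, 100, 250):
    e_value = expected_chance_hits(score, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"bit score {score:>3}: E = {e_value:.2e}")
```

Under these assumptions a score in the 20s is expected thousands of times by chance alone, while a score in the hundreds is essentially never expected by chance.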

IV. EXAMINING AN INCORRECT READING FRAME
What might a mistaken sequence show when compared against the database? We will look at this question by using our "incorrect" reading frame.

1. Copy the next longest translated sequence from the second box above.

2. Open a new window with the BLAST search engines

3. Again select the "Standard protein-protein BLAST (blastp)" and paste your new sequence into the search window. Lower the Descriptions to 10 and the Alignments to 10.

4. Carry out a blastp search as you did above

5. Note your results! (Print them out)

You will probably get scores down in the 20-30 range. (If you get high scores and perfect matches over an extended sequence, it is an interesting result. What do you think it might mean?) Look at the alignments. Your sequence (the "Query" sequence) is on top, the subject sequence is below, and between them is a line that indicates exact matches (shown as the letter) and conserved residues (shown as "+"; e.g. a hydrophobic leucine in one sequence might be matched by a hydrophobic valine in the other sequence, which is called a conservative substitution). Generally you will find that matches are quite small (2-3 amino acids in a row) and cover only a very limited region.
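The middle line of an alignment can also be summarized by counting. The sketch below takes the middle line as a string and counts exact matches (letters), conserved positions ("+"), and the longest run of consecutive exact matches. The function name and the example middle line are made up for illustration and do not come from a real search.

```python
def summarize_midline(midline):
    """Count identities, positives and the longest run of identities in a BLAST middle line."""
    identities = sum(1 for ch in midline if ch.isalpha())
    # BLAST's "Positives" figure includes identities plus conservative substitutions.
    positives = identities + midline.count("+")

    longest_run = current_run = 0
    for ch in midline:
        if ch.isalpha():
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0

    return {
        "identities": identities,
        "positives": positives,
        "longest_identical_run": longest_run,
    }


# Made-up middle line of the kind an incorrect reading frame tends to give:
# scattered single matches and no long identical stretch.
example = "L  +V  KA +   G"
print(summarize_midline(example))
```

A short longest run, as in this made-up example, is the pattern described above for a chance match, whereas a genuine match shows long unbroken stretches of letters.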

The National Library of Medicine provides a BLAST Tutorial and a General Guide to BLAST.

These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (last modified Aug 2005)