Exercise 4: Using BLAST to identify a gene (cont. from Exercise 3)
I. INTRODUCTION
Once you have identified a likely reading frame for your DNA sequence, you will
want to see if it corresponds to any known protein. Alternatively, if you obtained
two reading frames of nearly equal length, you will need to decide which is
correct. To accomplish these tasks, you can compare your sequences to all of
the known protein sequences in the databases using a search tool known as BLAST.
BLAST comes in a variety of formats depending on whether you are using a DNA
sequence or an amino acid sequence and depending on whether you are searching
through nucleotide or protein databases. (BLAST Background)
We are going to do this exercise twice. First, we will take the longest open
reading frame and use it as a query sequence with BLASTP. After saving those
results, we will then take the next longest amino acid sequence and use it as
our query sequence.
II. IDENTIFYING YOUR SEQUENCE
1. Copy and paste your longest translated sequence into the first box below.
Copy and paste the next longest sequence into the second box below (or call
up the translated sequences you submitted previously).
Longest Amino Acid Sequence
Second Longest Sequence (do not include the stop signals)
2. Open a new window with the BLAST search engines list by opening the link
to BLAST
3. From this page, under Protein select the standard Protein-Protein BLAST (BLASTP)
4. Copy the sequence from the first box above and paste it into the Search box
on the protein-protein blast page
5. Scroll down this page to the Format section. In this section, use the
pull-down menus to change the Descriptions to 10 and the Alignments to 10.
Change the Layout to One Window. We will leave the Options section settings on
the Default values and will address these choices in a more advanced exercise.
6. Click on the Blast button at the bottom or top of the screen. A new window
will appear that gives an estimate of how long the search will take and lists
conserved domains in your query sequence. You may want to copy your request id
number, but usually this isn't necessary. After the indicated time has passed,
press the Format button to see your results.
If similarity to any known protein has been found, you will see a color window
(which may or may not print) showing the degree of similarity and the range of
similarity. Perfect matches show up as red, next best as purple, mediocre as
green, poor matches as blue and very poor or no match as black. If you scroll
down you will see the best 10 alignments (make sure you have limited this to
10!). If the DNA sequence has already been identified, it should show up as a
perfect match. The score will generally be between 200 and 400, but could be
lower depending on the size of the peptide analyzed, and the E value will be
down around 10^-50 to 10^-100. The E value tells you how many unrelated
sequences in a database of this size would be expected to give a score at least
this high purely by chance; for very small values it is approximately the
probability of one such chance match. (For more information on the meaning of
E values)
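As a rough illustration of how the score and the E value relate, the sketch
below uses the standard relation E = m x n x 2^(-S'), where S' is the bit
score, m is the query length and n is the total length of the database. The
function name and the lengths used in the example (a 300-residue query and a
database of 10^9 residues) are assumptions chosen only for illustration. BLAST
itself uses adjusted "effective" lengths, so the values it reports will differ
somewhat.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths in residues; BLAST uses
    slightly smaller "effective" lengths, so this is only an estimate.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Illustrative numbers only: a 300-residue query, a 1e9-residue database.
    for score in (30, 200, 400):
        print(score, expect_value(score, 300, 1e9))
```

With these assumed lengths, a bit score of 30 gives an E value well above 1
(many chance matches expected), while a bit score of 200 gives an E value far
below 1, which is why high scores and tiny E values go together.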
7. Copy the line below the color alignment window which shows the sequence
producing the best alignment. This will give you the identifiers (gi number
and other identifying numbers) you will need to download the full protein from
the database for characterization in Project 5. Save this information.
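If you would rather pull the identifiers out of the saved line with a script
than by eye, a minimal sketch follows. It assumes the line begins with a
pipe-delimited identifier block in which database tags and values alternate
(for example "gi", then a number, then "ref", then an accession). The function
name and the sample line are hypothetical; the identifiers in it are
placeholders, not real database entries.

```python
def parse_identifiers(hit_line):
    """Return {tag: value} pairs from the identifier block of a hit line.

    Assumes the first whitespace-separated token looks like
    'gi|<number>|<db>|<accession>|' with tags and values alternating.
    A trailing field without a partner (such as an entry name) is ignored.
    """
    token = hit_line.split()[0]            # identifier block comes first
    fields = token.strip("|").split("|")   # drop empty edge fields
    return dict(zip(fields[0::2], fields[1::2]))


if __name__ == "__main__":
    # Placeholder line for illustration only.
    line = "gi|0000000|ref|XX_000000.0| hypothetical protein [Example organism]"
    print(parse_identifiers(line))
```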
III. CONCLUSION
Congratulations! You now know what your sequence corresponds to. If you are
working with an unusual organism you may find only partial similarity and you
may need to exercise some caution in deciding if your gene really does correspond
to a known protein. We can get a sense of what a significant match is by comparing
this result to a result using a random sequence as our query sequence. In this
case, we will use the next longest amino acid sequence we got out of the other
reading frames as a "random" query.
IV. EXAMINING AN INCORRECT READING FRAME
What might a mistaken sequence show when compared against the database? We will
look at this question by using our "incorrect" reading frame.
1. Copy the next longest translated sequence from the second box above.
2. Open a new window with the BLAST search engines
3. Again select the "Standard protein-protein BLAST (blastp)" and paste your
new sequence into the search window. Lower descriptions to 10 and alignments
to 10.
4. Carry out a blastp search as you did above
5. Note your results! (Print them out)
You will probably get scores down in the 20-30 range. (If you get high scores
and perfect matches over an extended sequence, it is an interesting result:
what do you think it might mean?) Look at the alignments. Your sequence (the
"Query" sequence) is on top, the subject sequence is below, and between them is
a line that indicates exact matches (a letter) and conserved residues (a "+";
for example, a hydrophobic leucine in one sequence might be matched by a
hydrophobic valine in the other sequence, which is called a conserved
substitution). Generally you will find that matches are quite small (2-3 amino
acids in a row) and only over a very limited region.
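To put a number on "how many amino acids in a row" match, the sketch below
counts identical positions and the longest unbroken run of identities between
two aligned sequences. It assumes you have copied the Query and Sbjct lines of
one alignment as equal-length strings, with "-" marking gaps. The function
names and the example sequences are made up for illustration, and conserved
substitutions are not counted, only exact matches.

```python
def count_identities(query, subject):
    """Number of aligned positions where both sequences have the same residue."""
    return sum(1 for q, s in zip(query, subject) if q == s and q != "-")


def longest_identical_run(query, subject):
    """Length of the longest stretch of consecutive identical residues."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    best = run = 0
    for q, s in zip(query, subject):
        if q == s and q != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0  # a mismatch or gap ends the current run
    return best


if __name__ == "__main__":
    # Made-up aligned fragments, for illustration only.
    query = "MKLVAT-GH"
    sbjct = "MKIVSTQGH"
    print(count_identities(query, sbjct), longest_identical_run(query, sbjct))
```

For a chance match from an incorrect reading frame you would expect the longest
run to be short, whereas a true match typically shows long runs across most of
the alignment.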
The National Library of Medicine provides a BLAST TUTORIAL and a General Guide
to BLAST.
These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (last modified Aug 2005)