One of the most basic exercises in bioinformatics is determining whether a nucleic
acid sequence actually codes for a protein. This is complicated by the fact
that we generally know neither which strand is the coding strand (i.e. whether
the sequence itself or its complementary strand will be transcribed into mRNA)
nor the correct reading frame (whether the sequence should be read three bases
at a time starting with the first nucleotide, the second or the third). We resolve
both these questions by translating both strands in all three reading frames
and looking for the one that gives the longest amino acid sequence before a
stop codon is encountered. Since there are 64 codons and three of these code
for no amino acid (i.e. are stop signals), we expect a stop codon to appear
on average about once every 20 codons if we are reading a sequence in the incorrect
frame. However, things are not always that clear cut, and it is possible for
an out-of-frame translation to extend to over 100 amino acids before a stop
codon is reached.
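The arithmetic behind this estimate can be sketched in a few lines of Python. This is only a rough model: it assumes every codon is equally likely, which real DNA is not, and the names used here are illustrative.

```python
# Rough estimate, ASSUMING all 64 codons are equally likely (real DNA is not random).
TOTAL_CODONS = 64
STOP_CODONS = 3

# Chance that any one randomly chosen codon is a stop codon.
p_stop = STOP_CODONS / TOTAL_CODONS

# Average number of codons read before a stop appears (geometric distribution).
mean_codons_to_stop = 1 / p_stop


def prob_open_run(n_codons):
    """Chance that n_codons consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


print(f"Average codons per stop: {mean_codons_to_stop:.1f}")
print(f"Chance of 100 codons with no stop: {prob_open_run(100):.4f}")
```

Under this assumption the average spacing is 64/3, or about 21 codons, close to the figure of 20 quoted above, and a run of 100 codons with no stop occurs a little under 1% of the time. That is rare but not negligible, which is why a long open stretch alone does not prove a frame is correct.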
In the exercise below you will be given an unknown DNA sequence and asked to
use a web tool to translate it into an amino acid sequence and, with luck, identify the proper reading frame. You will then save this
amino acid sequence in a word processor document (or e-mail it to yourself)
if you want to use it in the next exercise.
Obtaining your sequence
In the lab, a sequence like this might be obtained by sequencing a clone from a cDNA library
or by isolating a DNA fragment amplified by PCR. Often,
when we sequence such a product we find we have an unexpected fragment of
DNA which we need to analyze. Here we will provide a partial sequence chosen at random
from our database of sequences. A partial nucleotide sequence will appear
in the window below after you click on the Get Gene Sequence button.
Translating the Sequence
Several sites on the web perform a translation of an input sequence. Clicking
on the Expasy link below will open a new window giving you access to a
translation tool. Translating the DNA sequence is done by reading the nucleotide
sequence three bases at a time and then looking at a table of the genetic
code to arrive at an amino acid sequence. This program examines the input
sequence in all six possible frames (i.e. reading the input strand and its
reverse complement, each from 5' to 3', starting with nt 1, nt 2 and nt 3). What we typically look
for in identifying the proper translation is the frame that gives the longest
amino acid sequence before a stop codon is encountered. (Since there are 64
codons and three code for nonsense, we expect a stop codon to appear on average
about once every 20 codons if we simply read a sequence "out of frame". However,
"on average" is just that, and it is possible to have an incorrect reading
frame give an extended sequence with no stop codons. The next exercise will
address that problem.)
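For readers curious about what such a tool does internally, here is a minimal Python sketch of a six-frame translation. It is not the Expasy implementation: the function names and frame labels are made up for illustration, it assumes the standard genetic code and an input containing only A, C, G and T, and it marks stop codons with a hyphen to match the compact output described in this exercise.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "-" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate both strands starting at nt 1, nt 2 and nt 3 (labels are illustrative)."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("forward", dna), ("reverse", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label} frame {offset + 1}"] = translate(strand[offset:])
    return frames


# Made-up example sequence, not taken from the tutorial's database.
for name, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(name, protein)
```

Note that the three "reverse" frames are obtained by first taking the reverse complement and then reading it 5' to 3', exactly as the forward strand is read.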
We will use the Expasy tools for translation. Clicking on the link will
open a new window so you can return to this window for instructions and to
copy your sequence.
1. Select the sequence, copy it and then paste it into the
translate sequence window on the Expasy page.
2. Under Output format select "Compact". This gives the amino acid sequence as one-letter codes with stop codons indicated by a hyphen. (The "Verbose" output shows start codons (ATG) in
bold as Met and writes stop codons out in full, so it is an easy way to scan the outputs. However, you cannot use this output for a BLAST search (Exercise 4).)
3. Click on Translate Sequence.
4. Often only one reading frame will give you a translation with no stop codons, but this is not always the case. If you get multiple possible reading frames, one way to determine which is most likely the true frame is to use the BLAST program to determine if the sequence corresponds to any known protein sequence (Exercise 4).
5. Using the "Compact" output to get one-letter sequences, copy the one-letter sequence of the best reading frame (i.e. one with no stop codons) and paste it into the window below labelled "Best Guess".
6. Copy the longest amino acid sequence (i.e. the longest stretch with no hyphens) from one of the other reading frames
to the window below labelled "Second Best". If you have two reading frames without a stop codon, simply copy each to the boxes below.
7. Copy and save each sequence to a word processor for use in Exercise 4.
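Steps 5 and 6 amount to finding, for each frame, the longest stretch with no hyphen and then ranking the frames by that length. The Python sketch below shows one way this could be done; the function names and the example translations are invented for illustration and are not output from the translation tool.

```python
def longest_open_stretch(protein):
    """Return the longest run of amino acids between stop codons ("-")."""
    return max(protein.split("-"), key=len)


def rank_frames(translations):
    """Sort frames by the length of their longest stop-free stretch, longest first."""
    stretches = {
        name: longest_open_stretch(protein)
        for name, protein in translations.items()
    }
    return sorted(stretches.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical compact-format translations for three frames.
example = {
    "Frame 1": "MKV-LL",
    "Frame 2": "A-G-C",
    "Frame 3": "MKTAYIAK",
}

ranked = rank_frames(example)
best_guess = ranked[0]
second_best = ranked[1]
print("Best Guess:", best_guess)
print("Second Best:", second_best)
```

Ranking by length alone is only a first guess; as noted in step 4, a database comparison is the better test when several frames look plausible.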
Best reading Frame
Amino acid sequence from next best Frame (don't include the stop)
You have now been introduced to the use of a translation program to identify
the most probable reading frame and to translate an unknown sequence. What
if none of the six possible reading frames gives an extended a.a. sequence?
This could be due to errors in your sequence (you need to sequence both strands to ensure an accurate sequence). Or you may have isolated
a non-coding region of DNA (e.g. we know that the 5' and 3' ends of most genes
do not code for protein, but serve regulatory functions). There are many
untranslated regions of DNA (introns, pseudogenes, etc). We can now take the two amino acid sequences and determine if either matches any known sequences in the huge protein sequence database (Exercise 4).
These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (last modified Aug 2005)