Exercise 3: Translating an Unknown DNA Sequence

Introduction
One of the most basic exercises in bioinformatics is determining whether a nucleic acid sequence actually codes for a protein. This is complicated by the fact that we generally know neither which strand is the coding strand (i.e. whether the sequence itself or its complementary strand matches the mRNA) nor the correct reading frame (whether the sequence should be read three bases at a time starting with the first nucleotide, the second or the third). We resolve both questions by translating both strands in all three reading frames and looking for the frame that gives the longest amino acid sequence before a stop codon is encountered. Since there are 64 codons and three of these code for no amino acid (i.e. they are stop signals), we expect a stop codon to appear on average about once every 21 codons (64/3) if we are reading a sequence in the incorrect frame. However, things are not always that clear-cut, and it is possible for an out-of-frame translation to extend to over 100 amino acids before a stop codon is reached.
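The stop codon estimate above can be checked with a short calculation. The sketch below is only an idealization: it assumes every out-of-frame codon is an independent, equally likely draw from the 64 codons, which real DNA does not strictly obey. The function names are made up for this illustration.

```python
# Idealized model (an assumption, not a property of real genomes):
# each out-of-frame codon is an independent, uniform draw from 64 codons,
# 3 of which are stop codons.
P_STOP = 3 / 64


def mean_codons_to_stop(p_stop=P_STOP):
    """Expected number of codons read up to and including the first stop."""
    return 1 / p_stop


def prob_open_run(n_codons, p_stop=P_STOP):
    """Probability that n_codons consecutive codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


print(f"mean codons to first stop: {mean_codons_to_stop():.1f}")
print(f"chance of 100 stop-free codons: {prob_open_run(100):.4f}")
```

Under this model a stop-free run of 100 codons in a wrong frame comes out at a little under 1%, so it is uncommon but far from impossible. This is why the longest open frame is a good first guess rather than proof.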

In the exercise below you will be given an unknown DNA sequence and asked to use a web tool to translate it into an amino acid sequence and, hopefully, to identify the proper reading frame. You can then save the amino acid sequence in a word processor (or e-mail it to yourself) for use in the next exercise.

Obtaining your sequence
In the lab, a sequence like this might be obtained by sequencing a clone from a cDNA library or by isolating a DNA fragment from a PCR amplification. Often, when we sequence such a product, we find an unexpected fragment of DNA that we need to analyze. Here we provide a partial sequence chosen at random from our database of sequences. It will appear in the window below after you click on the Get Gene Sequence button.

Nucleotide Sequence

 

Translating the Sequence
Several sites on the web perform a translation of an input sequence. Clicking on the ExPASy link below will open a new window giving you access to a translation tool. Translating the DNA sequence is done by reading the nucleotide sequence three bases at a time and looking up each triplet in a table of the genetic code to arrive at an amino acid sequence. The program examines the input sequence in all six possible frames: three on the strand as given, read 5' to 3', and three on the complementary strand, starting with nt 1, nt 2 and nt 3 in each case. What we typically look for in identifying the proper translation is the frame that gives the longest amino acid sequence before a stop codon is encountered. Since there are 64 codons and three code for nonsense, we expect a stop codon to appear on average about once every 21 codons if we simply read a sequence "out of frame". However, "on average" is just that, and it is possible for an incorrect reading frame to give an extended sequence with no stop codons. The next exercise will address that problem.
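For readers who want to see the idea in code, here is a minimal sketch of a six-frame translation using the standard genetic code. It is not the ExPASy implementation; the function names and the example sequence are made up for illustration, stop codons are written as a hyphen to mirror the hyphen notation used in this exercise, and any codon containing a character other than A, C, G or T is shown as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
# Stops are written "-" here (an illustrative choice).
BASES = "TCAG"
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Map (strand, frame number) to the translation of that frame."""
    dna = dna.upper()
    strands = (("forward", dna), ("reverse", reverse_complement(dna)))
    return {
        (name, offset + 1): translate(strand[offset:])
        for name, strand in strands
        for offset in range(3)
    }


# Made-up example sequence, not taken from the exercise database.
for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Note that the three "reverse" frames are read from the reverse complement, so every frame is still translated in the 5' to 3' direction of the strand it belongs to.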

We will use the ExPASy tools for translation. The link opens in a new window, so you can return to this window for instructions and to copy your sequence.

1. Select the sequence, copy it and then paste it into the translate sequence window at the ExPASy link.
2. Under Output format select "Compact". This gives the amino acid sequence as one-letter codes, with stop codons indicated by a hyphen. (The "Verbose" output shows start codons (ATG) in bold as Met and writes out stop codons, so it is an easy way to scan the outputs. However, you cannot use that output for a BLAST search in Exercise 4.)
3. Click on Translate Sequence.
4. Often only one reading frame will give you a translation with no stop codons, but this is not always the case. If you get multiple possible reading frames, one way to determine which is most likely the true frame is to use the BLAST program to determine whether the sequence corresponds to any known protein sequence (Exercise 4).
5. Using the "Compact" output to get one-letter sequences, copy the one-letter sequence of the best reading frame (i.e. one with no stop codons) and paste it into the window below labelled "Best Guess".
6. Copy the longest amino acid sequence (i.e. the longest stretch with no hyphens) from one of the other reading frames to the window below labelled "Second Best". If you have two reading frames without a stop codon, simply copy each to the boxes below.
7. Copy and save each sequence to a word processor for use in Exercise 4.
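Steps 4 to 6 amount to ranking the frames by their longest stretch without a hyphen. The sketch below shows one way this could be done by hand or in code, assuming you have pasted the Compact output for each frame into a dictionary. The frame labels and amino acid strings are invented for illustration, and the function names are hypothetical.

```python
def longest_open_run(protein, stop="-"):
    """Return the longest stretch of amino acids containing no stop symbol."""
    return max(protein.split(stop), key=len)


def rank_frames(frames):
    """Sort frames by the length of their longest stop-free stretch.

    frames maps a frame label to its one-letter translation, with stops
    written as "-". Returns (label, longest stretch) pairs, best first.
    """
    runs = {label: longest_open_run(seq) for label, seq in frames.items()}
    return sorted(runs.items(), key=lambda item: len(item[1]), reverse=True)


# Invented translations standing in for pasted Compact output.
example_frames = {
    "frame A": "MK-LLPQR-A",
    "frame B": "MKTAYIAKQR",
    "frame C": "--M-",
}

ranked = rank_frames(example_frames)
best_guess, second_best = ranked[0], ranked[1]
print("Best Guess:", best_guess)
print("Second Best:", second_best)
```

The longest stretch is only a first guess. As step 4 notes, a BLAST search against known proteins is the better test when several frames look plausible.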

 

Best reading Frame

Amino acid sequence from next best Frame (don't include the stop)


Conclusion

You have now been introduced to the use of a translation program to identify the most probable reading frame and to translate an unknown sequence. What if none of the six possible reading frames gives an extended amino acid sequence? This could be due to errors in your sequence (you need to sequence both strands to ensure an accurate sequence). Or you may have isolated a non-coding region of DNA: the 5' and 3' ends of most genes, for example, do not code for protein but serve regulatory functions, and there are many other untranslated regions of DNA (introns, pseudogenes, etc.). We can now take the two amino acid sequences and determine whether either matches any known sequence in the huge protein sequence database (Exercise 4).

 

These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (last modified Aug 2005)