An introduction to searching the scientific literature
Finding the Nucleotide Sequence for a Gene
Determining the correct reading frame for an unknown nucleotide sequence
Using BLAST to identify a gene (cont from Exercise 2)
Searching for sequence motifs in a given protein
Finding homologs of a human gene in other organisms

Exercise 2: Finding the Nucleotide Sequence for a Known Gene

I. INTRODUCTION
In Exercise 1, you learned how to use the Entrez Browser to find articles in the scientific literature efficiently. In this exercise, you will use Entrez to find entries for the coding sequence of a gene of interest. We will use glucokinase as an initial example. Glucokinase is the enzyme that catalyzes the initial step of glycolysis (the phosphorylation of glucose to glucose-6-phosphate) in liver and several other cell types.

II. THE GENBANK NUCLEOTIDE DATABASE

1) Open the Entrez Browser (http://www3.ncbi.nlm.nih.gov/gquery/gquery.fcgi)
2) In the left column, select Nucleotide sequence databank
3) In the top search box, type "glucokinase" (without the quotes) and click on the Go button
Note: You will get about 1000 entries, listed on roughly 50 pages of 20 entries each. This is an unwieldy number, so you will have to narrow your search. In general there are two ways to narrow a search: the Limits menu within Entrez, or Boolean operators (AND, OR, NOT).

III. USING LIMITS AND BOOLEAN OPERATORS TO NARROW A SEARCH

The search above picks up every entry in the database that has the word glucokinase ANYWHERE in the entry (e.g. an entry that contains a line stating "Gene X has nothing to do with glucokinase" will come up as a hit). We can eliminate some of these entries by typing NOT similar NOT hypothetical after glucokinase in the search box, so that the box reads: glucokinase NOT similar NOT hypothetical. This removes entries that are listed only because they are described as similar to glucokinase or as hypothetical proteins.
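The effect of an all-fields match and of the NOT operator can be illustrated with a small Python sketch. This is only a toy model: the three record lines are invented for illustration, and the simple substring test stands in for Entrez's real indexing, which is more sophisticated.

```python
def matches(record_text, include, exclude=()):
    """Return True if `include` occurs anywhere in the record text
    and none of the `exclude` words occur (a toy stand-in for NOT)."""
    text = record_text.lower()
    if include.lower() not in text:
        return False
    return not any(word.lower() in text for word in exclude)


# Invented example records (not real GenBank entries).
records = [
    "Homo sapiens glucokinase (GCK), mRNA",
    "Gene X has nothing to do with glucokinase",
    "Hypothetical protein similar to glucokinase",
]

# Plain search: the word may appear anywhere in the entry.
all_hits = [r for r in records if matches(r, "glucokinase")]

# glucokinase NOT similar NOT hypothetical
narrowed = [
    r for r in records
    if matches(r, "glucokinase", exclude=("similar", "hypothetical"))
]

print(len(all_hits), "hits before NOT;", len(narrowed), "hits after NOT")
```

Note that the "Gene X" record survives the NOT terms, which is why further narrowing is still useful.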

We can apply additional filters to our search by using the Limits tab just below the search box. Click on the Limits tab.

1) If we are interested only in the coding regions of glucokinase genes (i.e. DNA sequences obtained from mRNA for glucokinase), we can eliminate genomic sequences with their large introns. In the "Molecule" pull-down menu, select mRNA (this eliminates entries from chromosomal DNA) and click on the Go button. Note how many hits are now listed.

2) You still have entries that are not glucokinase. To narrow your search further, click on the Limits tab one more time. In the top left pull-down menu, change All Fields to Title. This limits the search to entries that have glucokinase in their title line. Even so, your entries will include not only glucokinase but also glucokinase regulatory proteins and other entries that have the term glucokinase in the title.
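The same limits can also be written directly into a query string. The sketch below builds such a string and a matching search URL; it does not contact NCBI. It assumes things this tutorial does not cover: that the Title limit corresponds to the `[Title]` field tag, that the mRNA limit corresponds to the `biomol_mrna[PROP]` filter, and that the search is sent to NCBI's E-utilities `esearch` service rather than typed into the web page. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities search service (not part of this tutorial).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(term, field=None, mrna_only=False, exclude=()):
    """Combine a search term with a field limit, an mRNA limit and NOT terms."""
    query = f"{term}[{field}]" if field else term
    if mrna_only:
        # Assumed equivalent of choosing mRNA in the "Molecule" menu.
        query += " AND biomol_mrna[PROP]"
    for word in exclude:
        query += f" NOT {word}"
    return query


def esearch_url(query, retmax=20):
    """Build (but do not fetch) a search URL for the nucleotide database."""
    params = {"db": "nucleotide", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = build_query(
    "glucokinase",
    field="Title",
    mrna_only=True,
    exclude=("similar", "hypothetical"),
)
url = esearch_url(query)
print(query)
print(url)
```

Building the query in one place makes it easy to see which limit removed which hits: drop one argument at a time and compare the counts.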

IV. USING THE CLIPBOARD
To save a subset of these hits, we make use of the Clipboard. Scroll down your list and select three glucokinase entries from different species. Check the box to the left of each entry you wish to save, and then, at the top or bottom of the page, use the Send To pull-down menu to send the selected items to the Clipboard. When you have the three entries on the Clipboard, click on the Clipboard tab (very pale, at the top of the page) to bring up a screen with only your desired entries. You may want to write down the GenBank accession numbers for these mRNAs. In a later exercise (to be developed) we will compare these three sequences. The Clipboard contents are kept for only eight hours.

V. GENBANK INFORMATION
Clicking on the accession number for one of your entries will bring up the full nucleotide sequence record. Most of the information in an entry is self-explanatory, but if you scroll down to the Features section you should find a CDS entry. This specifies the part of the nucleotide sequence below that actually codes for a protein (you will often find untranslated regions at both the 5' and 3' ends of a sequence). In addition, the translated sequence is given in the one-letter amino acid shorthand just above the full nucleotide sequence. A sample GenBank record with information on each field is given at the Sample Record link.
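How a CDS range relates to the translated sequence can be sketched in Python. The short sequence and the CDS coordinates 5..16 are made up for illustration (they are not from a real glucokinase record), and the sketch assumes the standard genetic code and a simple start..end range with no joins.

```python
# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_cds(sequence, start, end):
    """Return the coding region for a CDS range written as start..end.
    Positions are 1-based and inclusive, as in a GenBank Features table."""
    return sequence[start - 1:end]


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented mRNA: 4 bases of 5' UTR, a 12-base CDS, 3 bases of 3' UTR.
mrna = "GGCCATGGCTAAGTGATTT"
cds = extract_cds(mrna, 5, 16)
print(cds, "->", translate(cds))
```

The untranslated bases on either side of the CDS never reach the protein, which is why the translation shown in a record is shorter than one third of the full nucleotide sequence.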

To obtain the sequence in a form that can be analyzed by a variety of gene analysis software, select FASTA from the Display pull-down menu. The browser will give you a page that shows the sequence without any line numbers or spaces. Save the sequence by selecting the material from the > through the last nucleotide (be sure to leave out anything above the > line and below the last nucleotide) and copying it to a word processor program. The > line is treated as a description line, not as sequence, by most analysis software. You can change the font to Courier 10 point (a fixed-width font) so that the characters line up properly.
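The FASTA layout described above (one description line beginning with >, followed by lines of sequence) is simple enough to read with a few lines of Python. This is a minimal sketch for a single record; the accession and sequence in the example are invented placeholders.

```python
def parse_fasta(text):
    """Split a single FASTA record into (description, sequence).
    The description is the '>' line without the '>' character;
    the sequence lines are joined into one string."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line[1:]
        else:
            if header is None:
                raise ValueError("sequence found before the '>' line")
            chunks.append(line)
    if header is None:
        raise ValueError("no '>' description line found")
    return header, "".join(chunks)


# Invented example text, as it might look after copying from the browser.
example = """>XX000000.1 example glucokinase mRNA
GGCCATGGCT
AAGTGATTT
"""

description, sequence = parse_fasta(example)
print(description)
print(len(sequence), "nucleotides")
```

A check like this catches the most common copying mistake, which is picking up extra text above the > line.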

VI. OBTAINING A PROTEIN FASTA ENTRY

To compare protein sequences, you will want to obtain the protein FASTA output.

To obtain this, change the Display menu back to the GenBank display and scroll down until you reach the CDS information. Click on the link in the line that begins /protein_id="xxx1234" (i.e. whatever the assigned protein id number is).

This will change the display to GenPept and bring up a page that shows some of the same information but is limited to the amino acid sequence. On this page, change the Display menu to FASTA to obtain an output similar to the nucleotide FASTA output (an index line that begins with > followed by an amino acid sequence). You can copy the index line and sequence to a word processor for later use (once you are in the word processor, again change the text to Courier 10 point so that the characters line up).
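For readers who prefer a script to the Display menu, the request for a protein FASTA record can be written as a URL. The sketch below only builds the URL and does not download anything. It assumes NCBI's E-utilities `efetch` service, which this tutorial does not otherwise use, and the id `xxx1234` is the same placeholder as in the text above, not a real protein id.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities fetch service (not part of this tutorial).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def protein_fasta_url(protein_id):
    """Build (but do not fetch) a URL asking for one protein record as FASTA text."""
    params = {
        "db": "protein",      # protein database, as shown in GenPept
        "id": protein_id,     # value from the /protein_id= line of the CDS entry
        "rettype": "fasta",   # same choice as FASTA in the Display menu
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


# Placeholder id; substitute the protein_id from your own entry.
url = protein_fasta_url("xxx1234")
print(url)
```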

SAVE THE THREE PROTEIN FASTA OUTPUTS (glucokinase from three mammal species of your choice) to a word processor program. We will compare the sequences of these three proteins in a future exercise (to be developed).

These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (Last modified Aug 2005)