Exercise 2: Finding the Nucleotide Sequence for a Known Gene
I. INTRODUCTION
In Exercise 1, you learned how to use the Entrez Browser to efficiently find articles in the scientific literature. In this exercise, you will use Entrez to find entries for the coding sequence of a gene of interest. We will use glucokinase as an initial example (glucokinase is the enzyme that catalyzes the initial step of glycolysis in liver and several other cell types).
II. THE GENBANK NUCLEOTIDE DATABASE
1) Open Entrez Browser (http://www3.ncbi.nlm.nih.gov/qguery/gquery.fcgi)
2) In the left column, select Nucleotide sequence databank
3) In the top search box type in "glucokinase" (without the quotes) and click on the Go Button
Note: You will get about 1000 entries listed on more than 50 pages of 20 entries each. This is an unwieldy number, so you will have to figure out a way to narrow your search. In general, there are two ways to narrow a search: the Limits menu within Entrez or Boolean operators (AND, OR, NOT).
III. USING LIMITS AND BOOLEAN OPERATORS TO NARROW A SEARCH
The search above picks up all entries in the database that have the word glucokinase ANYWHERE in the entry (e.g. an entry that contains a line stating "Gene X has nothing to do with glucokinase" will come up as a hit in this search). We can eliminate some entries by adding NOT similar NOT hypothetical after glucokinase in the search box. This will eliminate entries listed only because they are noted to be similar to glucokinase.
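The effect of the NOT operators can be illustrated offline with a small Python sketch. The three entry titles are made up for illustration, and the word matching here is a simplification that does not reproduce how Entrez indexes records.

```python
def matches(entry_text, required, excluded):
    """Return True if the entry contains `required` and none of the `excluded` words."""
    # Simplified tokenizer: lowercase, drop commas and periods, split on spaces.
    words = set(entry_text.lower().replace(",", " ").replace(".", " ").split())
    return required in words and not any(term in words for term in excluded)


# Made-up entry titles, used only to illustrate the filter.
entries = [
    "Homo sapiens glucokinase mRNA, complete cds",
    "Hypothetical protein similar to glucokinase",
    "Gene X has nothing to do with glucokinase",
]

# Equivalent of typing: glucokinase NOT similar NOT hypothetical
hits = [e for e in entries if matches(e, "glucokinase", ["similar", "hypothetical"])]

for title in hits:
    print(title)
```

Note that the "Gene X" entry still passes the filter, which is why the NOT terms remove only some of the unwanted hits.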
We can apply additional filters to our search by using the Limits tab just below the search box. Click on the Limits tab.
1) If we are interested only in the coding regions of glucokinase genes (i.e. DNA sequences obtained from mRNA for glucokinase), we can eliminate genomic sequences with their large introns. In the "Molecule" pull-down menu select mRNA (eliminates entries from chromosomal DNA) and click on the "GO" button. Note how many hits are now listed.
2) You still have entries that are not glucokinase. To further narrow your search, click on the Limits tab one more time. In the top left drop-down menu, change from All Fields to Title. This will limit the search to those entries that have glucokinase in their title line. Still, you will note that your entries include not only glucokinase but also glucokinase regulatory proteins and other entries that have the term glucokinase in the title.
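The same limits can also be written directly into a query string. The sketch below is not part of the original exercise: it assumes NCBI's E-utilities `esearch` endpoint, the `nucleotide` database name, and the field tags `[Title]` and `biomol_mrna[PROP]` as rough equivalents of the Title and mRNA limits. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned in the exercise itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build (but do not send) an esearch request URL for the given query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Assumed field tags standing in for the Limits settings used above:
#   [Title]           -> search the title line only
#   biomol_mrna[PROP] -> restrict to mRNA entries
term = "glucokinase[Title] AND biomol_mrna[PROP] NOT similar NOT hypothetical"
url = esearch_url(term)
print(url)
```

Pasting the printed URL into a browser should return an XML list of matching record IDs, provided the assumed tags are still valid.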
IV. USING THE CLIPBOARD
To save a subset of these hits, we make use of the Clipboard. Scroll down your list and select three entries of glucokinase from different species. Check the box to the left of each entry you wish to save, and then, at the top or bottom of the page, use the Send To pull-down menu to send the selected items to the Clipboard. When you have the three entries on the Clipboard, click on the Clipboard tab (very pale at top of page) to bring up a screen with only your desired entries. You may want to write down the GenBank accession numbers for these mRNAs. In a later exercise (to be developed) we will compare these three sequences. The Clipboard will only be saved for eight hours.
V. GENBANK INFORMATION
Clicking on the accession number for one of your entries will bring up the full nucleotide sequence information. Most of the information in an entry is self-explanatory, but if you scroll down to the Features section you should find a CDS entry. This specifies the part of the nucleotide sequence below that actually codes for a protein (often you will find untranslated regions at both the 5' and 3' ends of a sequence). In addition, the translated sequence is given in the one-letter amino acid shorthand just above the full nucleotide sequence. A sample GenBank record with information on each field is given at the Sample Record link.
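To make the CDS coordinates concrete, here is a minimal Python sketch that cuts the coding region out of a sequence. It assumes a simple CDS entry of the form start..end with 1-based, inclusive positions. The toy sequence and the coordinates 7..18 are invented for illustration and are not a real glucokinase record.

```python
def extract_cds(sequence, start, end):
    """Return the coding region for 1-based, inclusive CDS coordinates (start..end)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("CDS coordinates fall outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Toy mRNA: 6 nt of 5' untranslated region, a 12 nt coding region, 6 nt of 3' UTR.
mrna = "ggcacc" + "atggctgactaa" + "ttcaaa"

cds = extract_cds(mrna, 7, 18)   # hypothetical CDS entry "7..18"
five_prime_utr = mrna[:7 - 1]    # everything before the CDS
three_prime_utr = mrna[18:]      # everything after the CDS

print(five_prime_utr, cds, three_prime_utr)
```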
To obtain the sequence in a form which can be analyzed by a variety of gene analysis software, select FASTA from the Display pull-down menu. The browser will give you a page which has the sequence without any line numbers or breaks. Save the sequence by selecting the material beginning with the > and going up to the last nucleotide (be sure to avoid the line above the > and the line below the last nucleotide) and copying this to a word processor program. The > line is treated as a description (comment) line, not as sequence, by analysis software that reads FASTA. You can change the font to Courier 10 point to obtain the proper spacing and lines.
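The structure of a FASTA entry (one > line followed by sequence lines) can be checked with a short Python sketch. The record below is a made-up placeholder, not a real GenBank entry, and the parser handles a single record only.

```python
def parse_fasta(text):
    """Split a single FASTA record into (description, sequence)."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines copied along with the record
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line[1:]  # drop the leading '>'
        elif header is None:
            raise ValueError("text before the '>' line; copy from the '>' onward")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no '>' line found")
    return header, "".join(chunks)


# Placeholder record; the identifier and sequence are invented for illustration.
record = """>example_id hypothetical glucokinase mRNA
atggctgac
taattcaaa
"""

description, sequence = parse_fasta(record)
print(description)
print(len(sequence), "nucleotides")
```

If the copied text includes the line above the >, the function raises an error, which is a quick way to confirm that the selection was made correctly.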
VI. OBTAINING A PROTEIN FASTA ENTRY
To compare protein sequences, you will want to obtain the protein FASTA output. To obtain this, change the Display menu back to the GenBank display and scroll down until you reach the CDS information. Click on the link in the line that begins /protein_id="xxx1234" (i.e. whatever the assigned protein id number is). This will change the display to GenPept and bring up a page which shows some of the same information, but is limited to the amino acid sequence. On this page, change the Display menu to FASTA to obtain an output similar to the nucleotide FASTA output (an index line which begins with > and an amino acid sequence). You can copy the index line and sequence to a word processor for use later (once you are in the word processor, again change the text to Courier 10 pt to retain line spacing).
SAVE THE THREE PROTEIN FASTA OUTPUTS (glucokinase from three mammal species of your choice) to a word processor program. We will compare the sequences of these three proteins in a future exercise (to be developed).
These tutorials were developed by Dr. Ross S. Feldberg, Dept of Biology, Tufts University, Medford, MA 02155 with the assistance of a Teaching with Technology grant from the Academic Computing Department at Tufts. Thanks to Anoop Kumar, Abhra Verma and Scott Cordeiro for help in developing this resource. Suggestions, corrections and comments should be sent to Ross.Feldberg@Tufts.edu. (Last modified Aug 2005)