1. Gene Prediction Manual
A. Gene annotation
Step 1. Accessing the EMBL database to retrieve the gene
Go to the EMBL database.
Select Nucleotide sequences.
Type the sequence entry name HS307871.
Press the Go button.
Click on the EmblEntry link.
Have a look at the different entry fields: detect the mRNA and CDS exons.
Click on the Text Entry link to see the plain-text formatted output.
This is the sequence in FASTA format.
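The mRNA and CDS fields of an EMBL entry list their exons as a location string of the form join(a..b,c..d). As a minimal sketch, such a string can be turned into a list of exon intervals. The function name and the coordinates in the example are illustrative assumptions and are not taken from the HS307871 entry.

```python
import re

def parse_join_location(location):
    """Return [(start, end), ...] for an EMBL location such as 'join(10..50,120..180)'.

    Only simple forward-strand locations are handled; complement(), partial
    markers (< and >) and remote references are outside this sketch.
    """
    exons = []
    for start, end in re.findall(r"(\d+)\.\.(\d+)", location):
        exons.append((int(start), int(end)))
    return exons

# Hypothetical coordinates, for illustration only.
exons = parse_join_location("join(10..50,120..180,300..420)")
print(exons)
```

The resulting list of 1-based, inclusive intervals is a convenient form for comparing the annotation with the predictions obtained in the following steps.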
B. Exploring ab initio gene prediction
Step 2. Running geneid
Connect to the geneid server.
Paste the FASTA sequence.
Choose the geneid output format.
Run geneid with different parameters:
1. Searching signals: Select acceptors, donors, start and stop codons. Look for them in the real annotation of the sequence.
2. Searching exons: Select All exons and try to find the real ones.
3. Finding genes: You do not need to select any option (default behaviour). Compare the predicted gene with the real gene.
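To get a feeling for how many candidate signals a sequence contains, the consensus patterns can be enumerated directly. The following is only a naive sketch: geneid scores each candidate with statistical models, whereas this code merely lists every GT, AG, ATG and stop codon on the forward strand. The function name and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_signals(seq):
    """Return 1-based positions of candidate signals on the forward strand."""
    seq = seq.upper()
    signals = {"donor": [], "acceptor": [], "start": [], "stop": []}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair == "GT":
            signals["donor"].append(i + 1)      # G of GT: first base of the intron
        elif pair == "AG":
            signals["acceptor"].append(i + 2)   # G of AG: last base of the intron
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon == "ATG":
            signals["start"].append(i + 1)      # first base of the codon
        elif codon in STOP_CODONS:
            signals["stop"].append(i + 1)       # first base of the codon
    return signals

# Toy sequence, not part of HS307871.
print(find_signals("CCATGGTAAGTCCAGGTTGA"))
```

Even on a 20-base toy sequence several candidates of each kind appear, which is why most of the signals reported by the server do not correspond to the real annotation.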
Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871
Step 3. Running other genefinders
Because several alternative programs can analyze a DNA sequence, we can run each of them and look at what their predictions have in common.
1. GENSCAN:
   Connect to the GENSCAN server.
   Paste the DNA sequence.
   Press the Run Genscan button.
   Compare annotations and predictions.
2. FGENESH:
   Connect to the Softberry homepage.
   On the left frame, select GENE FINDING in Eukaryota.
   Select the program FGENESH.
   Paste the DNA sequence.
   Press the Search button.
   Compare annotations and predictions.
3. GRAIL:
   Connect to the GrailEXP homepage.
   Activate the Perceval Exon Candidates box.
   Paste the DNA sequence.
   Press the Go! button.
   Check the results.
   Compare annotations and predicted exons.
4. NOTE: The first exon is missed in every prediction, and the programs have trouble detecting the donor site of exon 5. Detection of start codons is a serious weakness of current gene-finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.
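Once the exon coordinates reported by each program have been collected, the parts shared by all predictions can be found with a set intersection. This is a minimal sketch under the assumption that exons are compared by exact coordinates; the function name and all coordinates are hypothetical and must be replaced with the values returned by each server.

```python
def common_exons(predictions):
    """Return the exons (start, end) predicted identically by every program."""
    exon_sets = [set(exons) for exons in predictions.values()]
    return sorted(set.intersection(*exon_sets)) if exon_sets else []

# Hypothetical coordinates; replace with the exons reported by each server.
predictions = {
    "geneid":  [(120, 200), (350, 480), (700, 820)],
    "GENSCAN": [(120, 200), (350, 480), (690, 820)],
    "FGENESH": [(350, 480), (700, 820)],
}
print(common_exons(predictions))
```

Requiring identical boundaries is strict: an exon whose acceptor differs by a few bases between programs, such as the third one in this example, is not counted as shared.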
Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence HS307871
C. Using EST/cDNA homology information
Step 4. Using GrailEXP
Connect to the GrailEXP homepage.
Activate the Galahad EST/mRNA/cDNA Alignments box.
Select the GrailEXP database (RefSeq/HTDB/dbEST/EGAD/Riken).
Activate exon assembly: Gawain Gene Models.
Paste the DNA sequence.
Press the Go! button.
Check the results: predictions and supporting information.
Compare the annotations, the ab initio GRAIL prediction and the five predicted alternatively spliced variants.
Figure 3. Comparison between the EMBL annotation, the genes predicted ab initio by Grail and the five alternative predictions supported by EST information in the sequence HS307871
Step 5. Using other gene finding programs + alignment of
transcripts
Using blastn, we can search the database est_human for ESTs that support further predictions. Filter this output to select the non-overlapping ESTs that could together form a complete cDNA sequence (see Figure 4). Moreover, ESTs that are not split into two or more pieces on the genomic sequence (that is, ESTs that do not span a pair of splice sites) should be rejected.
Connect to the FGENESH-C server (on the Gene finding with similarity menu).
Paste the sequence HS307871.
Paste the cDNA sequence or EST you have selected.
Press the search button.
Notice that the predicted gene will necessarily be supported by homology information, so it will likely be mapped only in the genomic region overlapping your EST query.
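The EST filtering described in this step can be sketched as follows. The code assumes that each blastn hit has already been summarized as a list of aligned blocks on the genomic sequence; the data layout, the function name, the greedy best-score-first strategy and all values are assumptions made for illustration, not the output of any particular server.

```python
def select_ests(hits):
    """Keep spliced ESTs and pick a non-overlapping subset, best score first.

    Each hit is a dict with 'name', 'score' and 'blocks', where blocks is a
    list of (start, end) aligned segments on the genomic sequence.
    """
    # An EST aligned in a single block spans no intron and is rejected.
    spliced = [h for h in hits if len(h["blocks"]) >= 2]
    chosen = []
    for hit in sorted(spliced, key=lambda h: h["score"], reverse=True):
        start = min(b[0] for b in hit["blocks"])
        end = max(b[1] for b in hit["blocks"])
        # Accept the hit only if its span overlaps none of the accepted ones.
        if all(end < c_start or start > c_end for _, c_start, c_end in chosen):
            chosen.append((hit["name"], start, end))
    return sorted(chosen, key=lambda c: c[1])

# Hypothetical hits, for illustration only.
hits = [
    {"name": "est_A", "score": 900, "blocks": [(100, 200), (400, 500)]},
    {"name": "est_B", "score": 850, "blocks": [(450, 600)]},
    {"name": "est_C", "score": 800, "blocks": [(450, 520), (700, 800)]},
    {"name": "est_D", "score": 700, "blocks": [(900, 1000), (1200, 1300)]},
]
print(select_ests(hits))
```

Here est_B is discarded because it is unspliced and est_C because it overlaps the better-scoring est_A. A greedy choice is only one possible strategy; inspecting the alignment by eye, as in Figure 4, may lead to a different selection.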
Figure 4. Best human ESTs in the alignment mapped on the genomic sequence HS307871
D. Using protein homology information
Step 6. Spliced alignment
Spliced alignment is very useful when we have additional information (a putative homologous protein sequence) about the content of the sequence. Gene prediction is then guided by fitting the protein sequence onto the best splice sites predicted in the genomic sequence.
Open the NCBI blast server.
Choose the blastx program (genomic query versus protein database).
Paste the genomic sequence and press the Blast! and Format! buttons.
Select the first protein and display its FASTA sequence. Obviously, it is the real protein annotated in the genomic sequence.
Open the genewise web server to use this protein to predict the best gene structure.
Paste both the protein and genomic sequences and run the program.
Compare the predicted gene (end of the file) with the annotations: look for the splice sites within the introns to check that the exon boundaries are correct.
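The splice-site check in the last instruction can be sketched in a few lines. The code assumes forward-strand exons with 1-based inclusive coordinates and tests only the canonical GT...AG rule; a minority of real introns use other boundaries. The function name and the toy sequence are illustrative assumptions.

```python
def check_introns(genomic, exons):
    """Report whether each intron between consecutive exons is GT...AG.

    exons: sorted list of 1-based inclusive (start, end) on the forward strand.
    Returns a list of (intron_start, intron_end, is_canonical).
    """
    genomic = genomic.upper()
    report = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # Bases prev_end+1 .. next_start-1 in 1-based coordinates.
        intron = genomic[prev_end:next_start - 1]
        canonical = intron.startswith("GT") and intron.endswith("AG")
        report.append((prev_end + 1, next_start - 1, canonical))
    return report

# Toy sequence: two 6-base exons around a 12-base intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(check_introns(genomic, [(1, 6), (19, 24)]))
```

An exon boundary shifted by even one base usually breaks the GT...AG pattern, which makes this a quick sanity check on a predicted structure.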
Figure 5. Best HSPs, obtained using blastx, representing homologous proteins similar to the genomic sequence HS307871
Step 7. Spliced alignment using homologous proteins
From the blastx output, choose several homologous genes and run genewise again for each one separately. Observe how the accuracy improves as the homologue gets closer to the original human protein:
Homo sapiens
Ovis aries
Mus musculus
Rattus norvegicus
Danio rerio
Drosophila melanogaster
Drosophila virilis
Saccharomyces cerevisiae
Schizosaccharomyces pombe
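To put a number on the gain of accuracy, each genewise prediction can be compared with the annotation at nucleotide level. The sketch below uses the definitions common in gene-finding evaluations: sensitivity is the fraction of annotated coding bases that were predicted, and specificity is the fraction of predicted coding bases that are annotated. The function names and the coordinates are hypothetical.

```python
def coding_positions(exons):
    """Expand 1-based inclusive exons into the set of covered positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}

def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at nucleotide level."""
    real = coding_positions(annotated)
    pred = coding_positions(predicted)
    shared = len(real & pred)
    sensitivity = shared / len(real) if real else 0.0
    specificity = shared / len(pred) if pred else 0.0
    return sensitivity, specificity

# Hypothetical coordinates, for illustration only.
annotated = [(101, 200), (301, 400)]
predicted = [(101, 200), (311, 420)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Computing both values for each homologue in the list gives a simple table in which the effect of evolutionary distance can be read directly.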
Figure 6. Graphical comparison of the real gene annotation and different genewise predictions using different homologous proteins for the gene uroporphyrinogen decarboxylase (URO-D)
Step 8. Using protein homology information: GenomeScan
Protein homology information can also be used to enhance the ab initio predicted exons that are supported by blastx HSPs, as GenomeScan and geneid do, thereby improving the final prediction.
GenomeScan:
Connect to the GenomeScan web server.
Retrieve the protein from the previous blast search.
Paste both the genomic and protein sequences.
Press the GenomeScan button.
Check the results. The first exon does not seem to be detected even when homology information is used. This is probably because blast programs require a minimum word length, so a very short coding segment may not produce a hit.
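The effect of a minimum word length can be illustrated with a small sketch. BLAST-style searches start from short words of a fixed length, and a translated segment shorter than that length contains no word at all. The word size of 3 used below is an assumption chosen for illustration (check the value used by the server), and the peptides and the function name are hypothetical.

```python
def seed_words(peptide, word_size):
    """List every overlapping word of length word_size in a peptide."""
    return [peptide[i:i + word_size]
            for i in range(len(peptide) - word_size + 1)]

WORD_SIZE = 3  # assumed value, for illustration only

# Hypothetical peptides: a longer exon translation and a very short one.
print(seed_words("MSEAK", WORD_SIZE))
print(seed_words("MS", WORD_SIZE))
```

The second call returns an empty list: with no word to start from, the segment cannot be reported as an HSP, however well it matches. In a real search the extended match must also score above a threshold, which penalizes short segments further.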
Figure 6. GenomeScan output: the first exon is not correctly predicted, probably due to blast length restrictions
E. Using a genome annotation browser
Step 9. Golden path archive:
Open the UCSC Genome Bioinformatics Site.
Select the blat link to locate the genomic coordinates of our sequence.
Paste the DNA sequence in FASTA format (HS307871).
Submit the file.
Click on the first hit (browser link).
Compare the graphical annotation with the EMBL entry of the gene.
Analyze these different sets of output options: Genes and Gene Prediction Tracks, mRNA and EST Tracks.
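When comparing the browser display with the EMBL entry, positions given relative to the submitted sequence have to be translated into chromosome coordinates. The following is a minimal sketch that assumes the whole query aligns without gaps between the start and end of the hit; the function name and the coordinates are hypothetical and do not correspond to the real location of HS307871.

```python
def to_genome(position, hit_start, hit_end, strand):
    """Map a 1-based position in the query to a 1-based chromosome coordinate.

    Assumes the whole query aligns without gaps between hit_start and hit_end.
    """
    if strand == "+":
        return hit_start + position - 1
    if strand == "-":
        return hit_end - position + 1
    raise ValueError("strand must be '+' or '-'")

# Hypothetical hit: a 1000-base query aligned to positions 1000..1999.
print(to_genome(1, 1000, 1999, "+"))
print(to_genome(10, 1000, 1999, "-"))
```

If the alignment contains gaps, the offset has to be computed block by block instead of from the ends of the hit.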
Figure 7. (a) UCSC genome browser representation of the region containing the gene uroporphyrinogen decarboxylase (URO-D). (b) UCSC genome browser representation of the context (100 kb) region around the gene uroporphyrinogen decarboxylase (URO-D).
F. Results
Here you can find the solutions to every exercise:
EMBL annotation
EMBL annotation (plain text)
FASTA sequence
geneid results: signals
geneid results: exons
geneid results: genes
GENSCAN results
FGENESH results
GRAIL results
GrailEXP results
Blastn + human ESTs results
Blastx + protein results
Genewise (human protein)
Genewise (ovis protein)
Genewise (mouse protein)
Genewise (rat protein)
Genewise (Danio rerio protein)
Genewise (Drosophila melanogaster protein)
Genewise (Drosophila virilis protein)
Genewise (yeast protein)
Genewise (fission yeast protein)
GenomeScan results
G. Bibliography
1. J.F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics 16:743-744 (2000).
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215:403-410 (1990).
3. Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94 (1997).
4. E. Blanco, G. Parra and R. Guigó. Using geneid to Identify Genes. In A. D. Baxevanis and D. B. Davison, chief editors: Current Protocols in Bioinformatics. Volume 1, Unit 4.3. John Wiley & Sons Inc., New York. ISBN: 0-471-25093-7 (2002).
5. G. Parra, E. Blanco, and R. Guigó. Geneid in Drosophila. Genome Research 10:511-515 (2000).
6. Asaf A. Salamov and Victor V. Solovyev. Ab initio Gene Finding in Drosophila Genomic DNA. Genome Res. 10:516-522 (2000).
7. Yeh, R.-F., Lim, L. P. and Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res. 11:803-816 (2001).
8. D. Hyatt, J. Snoddy, D. Schmoyer, G. Chen, K. Fischer, M. Parang, I. Vokler, S. Petrov, P. Locascio, V. Olman, Miriam Land, M. Shah, and E. Uberbacher. Improved Analysis and Annotation Tools for Whole-Genome Computational Annotation and Analysis: GRAIL-EXP Genome Analysis Toolkit and Related Analysis Tools. Genome Sequencing & Biology Meeting (2000).
9. Ewan Birney and Richard Durbin. Using GeneWise in the Drosophila Annotation Experiment. Genome Res. 10:547-548 (2000).