Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar
Feb 02, 2016
Gene Prediction
• Introduction
• Protein-coding gene prediction
• RNA gene prediction
• Modification and finishing
• Project schema
Introduction
Why gene prediction?
Experimental way?
Exponential growth of sequences
Metagenomics: only ~1% of microbes grow in the lab
New sequencing technology
How to do it?
It is a complicated task, let’s break it into parts
Genome
Protein-coding gene prediction
Phillip Lee & Divya Anjan Kumar
Homology Search
ab initio approach
Nadeem Bulsara & Neha Gupta
RNA gene prediction
Amanda McCook & Chengwei Luo
tRNA
rRNA
sRNA
Protein-coding gene prediction
Homology Search
Strategy
Open reading frame (ORF)
How/Why find ORF?
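A minimal sketch of how ORFs might be located in all six reading frames. It assumes an uppercase or lowercase DNA string over A, C, G, T, treats ATG, GTG and TTG as start codons, and reports only ORFs that run from a start to an in-frame stop. The function names and the `min_codons` cutoff are illustrative, not taken from any tool named in these slides.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of start codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 2) -> List[Tuple[int, int, int]]:
    """Find start-to-stop ORFs in the three frames of one strand.

    Returns (frame, start, end) with 0-based, end-exclusive coordinates;
    the end includes the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return found


def six_frame_orfs(seq: str, min_codons: int = 2) -> List[Tuple[str, int, int, int]]:
    """Scan both strands; reverse-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    forward = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    reverse = [("-", f, s, e)
               for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return forward + reverse
```

The resulting candidates could then be passed to protein database or domain searches; in practice a much larger `min_codons` would be used to suppress spurious short ORFs.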
Protein Database Searches
Domain searches
Limits of Extrinsic Prediction
ab initio Prediction
Homology Search is not Enough!
Biased and incomplete databases
Sequenced genomes are not evenly distributed across the tree of life, so they do not reflect its diversity.
Number of sequenced genomes clustered here
ab initio Gene Prediction
Features
ORFs (6 frames)
Codon Statistics
Features (Contd.)
Probabilistic View
Supervised Techniques
Unsupervised Techniques
Usually Used Tools
GeneMark
GLIMMER
EasyGene
PRODIGAL
GeneMark
• Developed in 1993 at the Georgia Institute of Technology as one of the first gene-finding tools.
• Uses Markov chains, built from dicodon statistics, to model coding and noncoding reading frames.
Shortcomings
Inability to find exact gene boundaries
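The Markov-chain idea can be illustrated with a small log-odds score. This is a simplified sketch, not GeneMark's model: it uses one homogeneous second-order chain per class instead of frame-dependent dicodon statistics, add-one smoothing, and assumes uppercase A/C/G/T input. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def train_markov(seqs, order=2, pseudo=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx in ("".join(p) for p in product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudo
        model[ctx] = {b: (counts[ctx][b] + pseudo) / total for b in "ACGT"}
    return model


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[seq[i - order:i]][seq[i]])
               for i in range(order, len(seq)))


def coding_log_odds(seq, coding_model, noncoding_model, order=2):
    """Positive values favour the coding model, negative the noncoding one."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))
```

A score of this kind says whether a window looks coding, but not where the gene starts or ends, which is consistent with the shortcoming listed above.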
GeneMark.hmm
• For a sequence S = b1, b2, …, bL, the probability that the functional sequence X = x1, x2, …, xL underlies it is P(X|S) = P(x1, x2, …, xL | b1, b2, …, bL).
• The Viterbi algorithm then finds the functional sequence X* for which P(X*|S) is the largest among all possible values of X.
• A ribosome binding site (RBS) model was also added to improve the prediction of translational start sites.
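The Viterbi step can be sketched on a toy HMM. The example below assumes a non-empty observation string and a generic HMM given as plain dictionaries; the two states and all probabilities in the usage note are made up for illustration and are far simpler than the GeneMark.hmm state architecture. Maximising the joint probability P(X, S), as done here, yields the same X* as maximising P(X|S) because P(S) does not depend on X.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `obs` (computed in log space)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

For example, with states `("N", "C")` for noncoding and coding, a 0.9 probability of staying in the same state, and emissions that favour A/T in `N` and G/C in `C`, the call labels an AT-rich prefix as `N` and a GC-rich suffix as `C`.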
GeneMark
• The RBS feature addresses this problem by defining a position-specific nucleotide frequency (%) matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated.
• Uses the consensus sequence AGGAG to search upstream of alternative start codons for genes predicted by the HMM.
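A rough sketch of the consensus search: for a candidate start codon, scan a short upstream window and report the position that best matches AGGAG. The window length, the minimum spacer between motif and start codon, and the function name are assumptions made for illustration; the slides do not specify them, and a real implementation would score against the position matrix, not count matches.

```python
def best_rbs_match(seq, start, motif="AGGAG", window=20, min_spacer=3):
    """Best match to `motif` upstream of the start codon at index `start`.

    Returns (position, matching_bases), or None if the window is too short.
    `window` and `min_spacer` are illustrative defaults.
    """
    last = start - min_spacer - len(motif)  # last allowed motif start
    first = max(0, start - window)
    best = None
    for i in range(first, last + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
        if best is None or matches > best[1]:
            best = (i, matches)
    return best
```

Calling this for each alternative start codon of a predicted gene and preferring the start with the strongest upstream match is one simple way to use the result.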
GeneMarkS
• Widely regarded as one of the best-performing prokaryotic gene prediction tools.
• Based on unsupervised learning (self-training on the input genome).
Even in prokaryotic genomes, gene overlaps are quite common
GeneMarkS
GLIMMER
• First tool to use IMMs (interpolated Markov models).
• Predictions are based on variable context (oligomers of variable lengths).
• More flexible than fixed-order Markov models.
Principle
An IMM combines probabilities conditioned on the 0, 1, …, k previous bases; here k = 8. The highest orders are used only for oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used instead.
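The interpolation principle can be sketched as follows. This is a deliberately simplified scheme: each order's weight is just the context count divided by a threshold (capped at 1), so frequent contexts rely on their own statistics and rare ones fall back to shorter contexts. The weighting rule, the class name, and the threshold value are assumptions for illustration; the slides do not describe how Glimmer sets its weights.

```python
from collections import Counter


class SimpleIMM:
    """Toy interpolated Markov model over the alphabet A, C, G, T."""

    def __init__(self, max_order=8, threshold=400):
        self.max_order = max_order
        self.threshold = threshold  # illustrative count at which a context is fully trusted
        self.counts = Counter()     # (context, base) -> occurrences
        self.totals = Counter()     # context -> occurrences

    def train(self, seqs):
        for s in seqs:
            for i in range(len(s)):
                # Record the base under every context length from 0 to max_order
                for k in range(min(i, self.max_order) + 1):
                    ctx = s[i - k:i]
                    self.counts[ctx, s[i]] += 1
                    self.totals[ctx] += 1

    def prob(self, base, context):
        """Interpolated P(base | context), built up from order 0 to the longest seen context."""
        context = context[-self.max_order:] if self.max_order else ""
        p = 0.25  # uniform fallback below order 0
        for k in range(len(context) + 1):
            ctx = context[len(context) - k:]
            n = self.totals[ctx]
            if n == 0:
                break  # longer contexts were never observed
            lam = min(1.0, n / self.threshold)
            p = lam * self.counts[ctx, base] / n + (1 - lam) * p
        return p
```

Because every step is a convex combination of two probability distributions, the four base probabilities for any context still sum to 1.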
Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park
Glimmer development
Glimmer 2 (1999)
• Increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model)
Glimmer 3 (2007)
• Overcomes the shortcomings of previous versions by summing three scores: the RBS score, the IMM coding potential, and a start-codon score that depends on the relative frequency of each possible start codon in the same training set used for RBS determination.
• Scores every ORF (open reading frame) with the IMM in reverse, from the stop codon back to the start codon.
• The score is the sum of the log-likelihoods of the bases contained in the ORF.
Glimmer3.02
PRODIGAL: Prokaryotic Dynamic Programming Gene Finding Algorithm
Developed at Oak Ridge National Laboratory and the University of Tennessee
PRODIGAL-Features
EasyGene
Developed at University of Copenhagen
Statistical significance is the measure for gene prediction.
• A high-quality data set, based on similarity to Swiss-Prot entries, is extracted from the genome.
• The data set is used to estimate the HMM; statistical significance is then calculated from the ORF score and length.
Problem:
• No standalone version available
Comparison of Different Tools
RNA Gene Prediction
Why Predict RNA?
Regulatory sRNA
sRNA Challenges
Fundamental Methodology
RFAM
What Is Covariance?
Fig: Christian Weile et al. BMC Genomics (2007) 8:244
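One common way to quantify covariation between two alignment columns is their mutual information: columns that change together (for example, compensating base-pair substitutions in a conserved stem) score high, while fully conserved or independent columns score near zero. The sketch below assumes an ungapped alignment given as equal-length strings; the function name is hypothetical and this is not necessarily the measure used by the tools discussed here.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Compare the observed pair frequency with what independence would predict
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi
```

In a four-sequence alignment where the first and last columns always form a complementary pair (G-C, C-G, A-U, U-A), those two columns carry 2 bits of mutual information, whereas two invariant columns carry 0.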
Noncomparative Prediction
Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612
Noncomparative Prediction
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Comparative+Noncomparative
Effective sRNA prediction in V. cholerae
• Non-enterobacteria
• sRNAPredict2
• 32 novel sRNAs predicted
• 9 tested
• 6 confirmed
Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096
Software
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Eva K. Freyhult et al. Genome Res. (2007) 17:117
Modification & Finishing
• Consensus strategy to integrate ab initio results
• Broken gene recruiting
• TIS (translation initiation site) correcting
• IS (insertion sequence) calling
• Operon annotating
• Gene presence/absence analysis
Modification & Finishing: Consensus strategy
pass
pass
fail
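One possible reading of the consensus strategy is a simple vote: a gene passes if enough ab initio tools predict it. The sketch below assumes that interpretation. It identifies a gene by strand and stop coordinate, because predictors tend to disagree on the start site more than on the stop; the `min_votes` threshold, the data layout, and the function name are all hypothetical.

```python
from collections import Counter


def consensus_genes(predictions, min_votes=2):
    """Keep genes predicted by at least `min_votes` tools.

    `predictions` maps a tool name to a list of (strand, stop_coordinate) tuples.
    """
    votes = Counter()
    for genes in predictions.values():
        for key in set(genes):  # count each tool at most once per gene
            votes[key] += 1
    return sorted(key for key, n in votes.items() if n >= min_votes)
```

Genes that fail the vote need not be discarded outright; they could instead be treated as candidates for the homology-based checks in the following steps.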
Broken gene recruiting
ab initio results
homology search
candidate fragments
Modification & Finishing: TIS correcting
Start codon redundancy: ATG, GTG, TTG, CTG
Markov iteration, experimentally verified data
Leaderless genes
Modification & Finishing: IS calling, operon annotating
ISfinder DB
Modification & Finishing: Gene presence/absence analysis
Project schema
Schema (proposed)
assembly group