Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar
Gene Prediction

Feb 02, 2016

Transcript
Page 1: Gene Prediction

Gene Prediction

Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Page 2: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 3: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 4: Gene Prediction

Why gene prediction?

Experimental way?

Page 5: Gene Prediction

Why gene prediction?

Exponential growth of sequences

Metagenomics: only ~1% of microbes grow in the lab

New sequencing technology

Page 6: Gene Prediction

How to do it?

Page 7: Gene Prediction

How to do it?

It is a complicated task; let's break it into parts.

Page 8: Gene Prediction

How to do it?

It is a complicated task; let's break it into parts.

Genome

Page 9: Gene Prediction

How to do it?

It is a complicated task; let's break it into parts.

Genome

Page 10: Gene Prediction

How to do it?

Protein-coding gene prediction

Phillip Lee & Divya Anjan Kumar

Homology Search

ab initio approach

Nadeem Bulsara & Neha Gupta

Page 11: Gene Prediction

How to do it?

RNA gene prediction

Amanda McCook & Chengwei Luo

tRNA

rRNA

sRNA

Page 12: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 13: Gene Prediction

Homology Search

Page 14: Gene Prediction

Homology Search

Page 15: Gene Prediction

Strategy

Page 16: Gene Prediction

Open reading frame (ORF)

Page 17: Gene Prediction

How/Why find ORF?

Page 18: Gene Prediction

How/Why find ORF?

Page 19: Gene Prediction

How/Why find ORF?

Page 20: Gene Prediction

Protein Database Searches

Page 21: Gene Prediction

Domain searches

Page 22: Gene Prediction

Limits of Extrinsic Prediction

Page 23: Gene Prediction

ab initio Prediction

Page 24: Gene Prediction

Homology Search is not Enough!

Biased and incomplete Database

Sequenced genomes are not evenly distributed across the tree of life, so they do not reflect its diversity.

Number of sequenced genomes clustered here

Page 25: Gene Prediction

ab initio Gene Prediction

Page 26: Gene Prediction

Features

Page 27: Gene Prediction

ORFs (6 frames)
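
As a rough illustration of the six-frame scan named on this slide, the sketch below walks the three frames of each strand and reports every stretch that runs from the first start codon to the next in-frame stop. The start-codon set, the `min_codons` cutoff, and the function names are assumptions made for the example, not values taken from the slides.

```python
from typing import Dict, List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start set for this sketch
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) per ORF; end is exclusive and includes the stop."""
    found = []
    for frame in range(3):
        start = None  # first start codon seen since the last in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif codon in STOPS:
                # Length is counted in codons, excluding the stop codon.
                if start is not None and (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None
    return found


def six_frame_orfs(seq: str, min_codons: int = 30) -> Dict[str, list]:
    """Scan both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    return {
        "+": orfs_in_strand(seq, min_codons),
        "-": orfs_in_strand(reverse_complement(seq), min_codons),
    }
```

Coordinates on the minus strand are left relative to the reverse-complemented sequence to keep the sketch short; a real pipeline would map them back to the forward strand.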

Page 28: Gene Prediction

Codon Statistics

Page 29: Gene Prediction

Features (Contd.)

Page 30: Gene Prediction

Probabilistic View

Page 31: Gene Prediction

Supervised Techniques

Page 32: Gene Prediction

Unsupervised Techniques

Page 33: Gene Prediction

Usually Used Tools

GeneMark

GLIMMER

EasyGene

PRODIGAL

Page 34: Gene Prediction

GeneMark

• Developed in 1993 at the Georgia Institute of Technology as the first gene finding tool.

• Uses Markov chain models built from dicodon statistics to represent coding and noncoding reading frames.

Shortcomings

Inability to find exact gene boundaries
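
The Markov chain idea on this slide can be sketched as a log-odds test between a coding model and a noncoding model. This is a deliberately simplified, homogeneous fixed-order chain, not GeneMark's actual frame-dependent model; the function names, the `order` and `pseudocount` parameters, and the uniform fallback for unseen contexts are all assumptions of the example. Input is assumed to contain only A, C, G, and T.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from example sequences."""
    counts = {}
    for s in seqs:
        for i in range(order, len(s)):
            ctx = s[i - order:i]
            row = counts.setdefault(ctx, dict.fromkeys(ALPHABET, pseudocount))
            row[s[i]] += 1
    return {
        ctx: {b: n / sum(row.values()) for b, n in row.items()}
        for ctx, row in counts.items()
    }


def log_likelihood(seq, model, order=2):
    """Sum of log P(base | context); unseen contexts fall back to uniform 0.25."""
    total = 0.0
    for i in range(order, len(seq)):
        row = model.get(seq[i - order:i])
        total += math.log(row[seq[i]] if row else 0.25)
    return total


def coding_log_odds(seq, coding, noncoding, order=2):
    """Positive values favour the coding model, negative the noncoding one."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)
```

A window-based score like this says whether a region looks coding but not where the gene starts or ends, which is consistent with the boundary shortcoming listed above.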

Page 35: Gene Prediction

GeneMark.hmm

Page 36: Gene Prediction

GeneMark.hmm

• The probability that a functional sequence X underlies the nucleotide sequence S is calculated as P(X|S) = P(x1, x2, …, xL | b1, b2, …, bL).

• Viterbi algorithm then calculates the functional sequence X* such that P(X*|S) is the largest among all possible values of X.

• Ribosome binding site model was also added to augment accuracy in the prediction of translational start sites.
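
The Viterbi step described above can be illustrated with a toy hidden Markov model. The sketch below is a generic log-space Viterbi decoder; the two states and all probabilities in the usage example are invented for illustration and are far simpler than the GeneMark.hmm state architecture.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation string."""
    # scores[t][s] = best log-probability of any path ending in state s at time t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: N = noncoding (AT-rich), C = coding (GC-rich); values are made up.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
print("".join(viterbi("AATTGGCCGG", states, start_p, trans_p, emit_p)))
```

Working in log space avoids numerical underflow on genome-length sequences, which is why the products of probabilities are written as sums here.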

Page 37: Gene Prediction

GeneMark

• The RBS feature overcomes this problem by defining a positional nucleotide frequency (%) matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated.

• Uses a consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by HMM.
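
A minimal sketch of the consensus search in the bullet above: scan a window upstream of a candidate start codon and keep the site that best matches AGGAG. The window size, the minimum spacer between site and start codon, and the function name are assumptions for illustration; the slides do not give these values.

```python
def best_rbs_match(seq, start, consensus="AGGAG", window=20, min_spacer=4):
    """Return (matches, position) of the best upstream consensus hit, or None.

    `start` is the index of the first base of the candidate start codon.
    `window` and `min_spacer` are illustrative defaults, not published values.
    """
    lo = max(0, start - window)
    hi = start - min_spacer - len(consensus)  # last allowed site position
    best = None
    for i in range(lo, hi + 1):
        site = seq[i:i + len(consensus)]
        matches = sum(a == b for a, b in zip(site, consensus))
        if best is None or matches > best[0]:
            best = (matches, i)
    return best
```

Counting exact matches is the crudest possible score; replacing it with the log-odds from a positional frequency matrix would follow the same loop structure.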

GeneMarkS

• Considered the best gene prediction tool.

• Based on unsupervised learning.

Even in prokaryotic genomes, gene overlaps are quite common.

GeneMarkS

Page 38: Gene Prediction

GLIMMER

• Used IMM (Interpolated Markov Models) for the first time.

• Predictions based on variable context (oligomers of variable lengths).

• More flexible than the fixed order Markov models.

Principle

An IMM combines probabilities based on the 0, 1, …, k previous bases; in this case k = 8 is used. The highest orders are used for oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used instead.

Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park
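
The interpolation principle can be sketched as follows: estimates from longer contexts are trusted in proportion to how often that context was seen in training, and otherwise fall back on shorter contexts. The weighting rule `n / (n + strength)` is a simple stand-in chosen for this example; it is not Glimmer's actual weighting scheme, and the class and parameter names are hypothetical.

```python
from collections import Counter


class InterpolatedMarkovModel:
    """Toy IMM: blends order-0..k estimates, weighting each by context count."""

    def __init__(self, max_order=8, strength=10.0):
        self.max_order = max_order
        self.strength = strength          # assumed smoothing constant
        self.counts = Counter()           # (context, base) -> count
        self.context_totals = Counter()   # context -> count

    def train(self, seqs):
        for s in seqs:
            for i in range(len(s)):
                for k in range(min(i, self.max_order) + 1):
                    ctx = s[i - k:i]
                    self.counts[(ctx, s[i])] += 1
                    self.context_totals[ctx] += 1

    def prob(self, context, base):
        """P(base | context), interpolated from the shortest context upward."""
        context = context[-self.max_order:] if self.max_order else ""
        p = 0.25  # uniform prior over A, C, G, T
        for k in range(len(context) + 1):
            ctx = context[len(context) - k:]
            n = self.context_totals[ctx]
            if n == 0:
                break  # longer contexts were never seen; keep the shorter estimate
            lam = n / (n + self.strength)
            p = lam * self.counts[(ctx, base)] / n + (1 - lam) * p
        return p
```

With this rule a context seen many times dominates the estimate, while a rare 8-mer contributes little and the prediction effectively comes from lower orders, which mirrors the behaviour described on the slide.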

Page 39: Gene Prediction

Glimmer development

Glimmer 2 (1999)

• Increased the sensitivity of prediction by adding the concept of an ICM (Interpolated Context Model).

Glimmer 3 (2007)

• Overcomes the shortcomings of previous versions by summing an RBS score, the IMM coding potential, and a start-codon score that depends on the relative frequency of each possible start codon in the same training set used for RBS determination.

• Scores every ORF (open reading frame) with the IMM in reverse, from the stop codon back to the start codon.

• The score is the sum of the log-likelihoods of the bases contained in the ORF.

Page 40: Gene Prediction

Glimmer3.02

Page 41: Gene Prediction

PRODIGAL

Prokaryotic Dynamic Programming Gene Finding Algorithm

Developed at Oak Ridge National Laboratory and the University of Tennessee

Page 42: Gene Prediction

PRODIGAL-Features

Page 43: Gene Prediction

PRODIGAL-Features

Page 44: Gene Prediction

EasyGene

Developed at University of Copenhagen

Statistical significance is the measure for gene prediction.

• A high-quality data set, based on similarity to Swiss-Prot, is extracted from the genome.

• The data set is used to estimate the HMM, and statistical significance is calculated from ORF score and length.

Problem:

• No standalone version available

Page 45: Gene Prediction

Comparison of Different Tools

Page 46: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 47: Gene Prediction

RNA Gene Prediction

Page 48: Gene Prediction

Why Predict RNA?

Page 49: Gene Prediction

Regulatory sRNA

Page 50: Gene Prediction

sRNA Challenges

Page 51: Gene Prediction

Fundamental Methodology

Page 52: Gene Prediction

RFAM

Page 53: Gene Prediction

What Is Covariance?

Fig: Christian Weile et al. BMC Genomics (2007) 8:244
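
One common way to quantify covariance between two columns of an RNA alignment is their mutual information: columns that change together to preserve base pairing score high, while fully conserved or independent columns score near zero. The sketch below is a generic calculation under that assumption and is not taken from the cited figure; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two equal-length alignment columns."""
    n = len(col_i)
    fi, fj = Counter(col_i), Counter(col_j)
    fij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in fij.items():
        # (c/n) / ((fi/n) * (fj/n)) simplifies to c*n / (fi*fj)
        mi += (c / n) * math.log2(c * n / (fi[a] * fj[b]))
    return mi


# Four sequences in which the two columns always form a Watson-Crick pair.
print(mutual_information("GCAU", "CGUA"))
# Two perfectly conserved columns carry no covariation signal.
print(mutual_information("GGGG", "CCCC"))
```

This is why covariance-based methods need diverse homologs: a pair of columns that never varies gives no evidence for or against a base pair.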

Page 54: Gene Prediction

Noncomparative Prediction

Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612

Page 55: Gene Prediction

Noncomparative Prediction

*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Page 56: Gene Prediction

Comparative+Noncomparative

Effective sRNA prediction in V. cholerae

• Non-enterobacteria

• sRNAPredict2

• 32 novel sRNAs predicted

• 9 tested

• 6 confirmed

Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

Page 57: Gene Prediction

Software

*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Eva K. Freyhult et al. Genome Res. (2007) 17:117

Page 58: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 59: Gene Prediction

Modification & Finishing

• Consensus strategy to integrate ab initio results

• Broken gene recruiting

• TIS (translation initiation site) correcting

• IS (insertion sequence) calling

• Operon annotating

• Gene presence/absence analysis

Page 60: Gene Prediction

Modification & Finishing

Consensus strategy

[Figure labels: pass, pass, fail]

Broken gene recruiting

[Figure labels: ab initio results, homology search, candidate fragments]
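
One plausible reading of the consensus strategy is a vote across the ab initio tools: a gene call passes if enough tools agree on it, and calls that fail are handed to the homology search as candidate fragments. The sketch below assumes that reading. Keying genes on strand and stop coordinate, the `min_votes` threshold, and the tool names in the test data are all assumptions for illustration.

```python
from collections import defaultdict


def consensus_calls(predictions, min_votes=2):
    """Split gene calls into (passed, failed) by how many tools support them.

    `predictions` maps a tool name to a list of (start, stop, strand) tuples.
    Calls are matched on (stop, strand) because tools tend to disagree on the
    start codon more often than on the stop.
    """
    votes = defaultdict(set)
    for tool, genes in predictions.items():
        for _start, stop, strand in genes:
            votes[(stop, strand)].add(tool)
    passed = sorted(key for key, tools in votes.items() if len(tools) >= min_votes)
    failed = sorted(key for key, tools in votes.items() if len(tools) < min_votes)
    return passed, failed
```

The failed list is the natural input for the broken gene recruiting step, where a homology search decides whether a single-tool call is a real but fragmented gene.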

Page 61: Gene Prediction

Modification & Finishing

TIS correcting

Start codon redundancy: ATG, GTG, TTG, CTG

Markov iteration, experimentally verified data

Leaderless genes
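
Start codon redundancy means a single ORF usually contains several possible translation initiation sites. As a small illustration, the sketch below lists every in-frame candidate start upstream of a given stop codon, stopping at the previous in-frame stop. It only enumerates candidates; choosing among them (for example with an RBS or Markov score) is outside the sketch, and the function name is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG", "CTG")
STOP_CODONS = ("TAA", "TAG", "TGA")


def candidate_starts(seq, stop_pos):
    """List in-frame start codon positions upstream of the stop at `stop_pos`.

    `stop_pos` is the index of the first base of the stop codon. The walk ends
    at the previous in-frame stop codon or at the beginning of the sequence.
    """
    candidates = []
    i = stop_pos - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(i)
        i -= 3
    return sorted(candidates)
```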

Page 62: Gene Prediction

Modification & Finishing

IS calling

Operon annotating

IS Finder DB

Page 63: Gene Prediction

Modification & Finishing

Gene presence/absence analysis

Page 64: Gene Prediction

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Page 65: Gene Prediction

Schema (proposed)

Page 66: Gene Prediction

Schema (proposed)

assembly group

Page 67: Gene Prediction

Schema (proposed)

assembly group