Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman. Partially modified by Benny Chor.

Dec 21, 2015

Page 1

Gene prediction and HMM

Computational Genomics 2005/6

Lecture 9b

Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman. Partially modified by Benny Chor.

Page 2

Annotation of Genomic Sequence

Given the sequence of an organism’s genome, we would like to be able to identify:
– Genes
– Exon boundaries & splice sites
– Beginning and end of translation
– Alternative splicings
– Regulatory elements (e.g. promoters)

The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.

primary goals

secondary goals

Page 3

Prokaryotic Gene Structure

[Figure: genomic DNA with Promoter, CDS and Terminator regions; transcription produces the mRNA.]

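Most bacterial promoters contain the Pribnow box, a signal at about position -10 with the consensus sequence 5'-TATAAT-3'. (The Shine-Dalgarno signal is a different element: the ribosome binding site just upstream of the start codon.)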

The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA

The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.

Page 4

Pieces of a (Eukaryotic) Gene (on the genome)

[Figure: both DNA strands (5’ to 3’ and 3’ to 5’) drawn at two scales, ~1-100 Mbp and ~1-1000 kbp, with the following pieces marked.]

• exons (cds & utr): ~10^2-10^3 bp
• introns: ~10^2-10^5 bp
• polyadenylation site
• promoter: ~10^3 bp
• enhancers: ~10^1-10^2 bp
• other regulatory sequences: ~10^1-10^2 bp

Page 5

What is it about genes that we can measure (and model)?

• Most of our knowledge is biased towards protein-coding characteristics

– ORF (Open Reading Frame): a sequence defined by in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.

– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)

• Other phenomena

– Nucleotide frequencies and correlations: value and structure

– Functional sites: splice sites, promoters, UTRs, polyadenylation sites
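The ORF definition above (an in-frame start codon followed by the first in-frame stop codon) can be sketched in a few lines. This is a minimal illustration, not a gene finder: it scans only the forward strand, works on DNA (so AUG appears as ATG), and the function name `find_orfs` and its `min_codons` parameter are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return sorted (start, end) pairs of ORFs on the forward strand.

    An ORF runs from an in-frame ATG up to and including the first
    in-frame stop codon; `end` is exclusive. `min_codons` is the minimum
    number of codons before the stop codon (the ATG counts as one).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Example: one ORF, ATG AAA TAG, starting at index 2.
print(find_orfs("CCATGAAATAGCC"))
```

Raising `min_codons` is the simplest way to apply an ORF-length filter; a real annotation pipeline would also scan the reverse complement.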

Page 6

A simple measure: ORF length

Comparison of Annotation and Spurious ORFs in S. cerevisiae

Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771

Page 7

Codon Adaptation Index (CAI)

• Parameters are empirically determined by examining a “large” set of example genes

• This is not perfect
– Genes sometimes have unusual codons for a reason
– The predictive power is dependent on length of sequence

$$\mathrm{CAI} = \prod_{i \in \text{codons}} \frac{f_{\text{codon}_i}}{f_{\text{codon}_i}^{\max}}$$
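A small sketch of how a CAI-style score could be computed. Each codon is weighted by its frequency relative to the most frequent synonymous codon in a reference gene set. The sketch assumes the usual normalization by a geometric mean over the gene's codons, which the formula above may not show. The names `relative_adaptiveness`, `cai` and the toy synonym families are hypothetical; zero-weight codons are simply skipped, whereas real implementations often use a pseudocount.

```python
import math
from collections import Counter

def relative_adaptiveness(reference_codons, synonym_families):
    """w[c] = count(c) / max count among codons synonymous with c."""
    counts = Counter(reference_codons)
    w = {}
    for family in synonym_families:   # one tuple of codons per amino acid
        best = max(counts[c] for c in family)
        if best > 0:
            for c in family:
                w[c] = counts[c] / best
    return w

def cai(codons, w):
    """Geometric mean of w over the codons of one gene."""
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Toy reference set: GAA is preferred 3:1 over GAG, AAA and AAG are equal.
families = [("GAA", "GAG"), ("AAA", "AAG")]
reference = ["GAA"] * 3 + ["GAG"] + ["AAA"] * 2 + ["AAG"] * 2
weights = relative_adaptiveness(reference, families)
print(cai(["GAA", "AAA"], weights), cai(["GAG", "GAG"], weights))
```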

Page 8

Splice signals (mice): GT , AG

Page 9

General Things to Remember about (Protein-coding) Gene Prediction Software

• It is, in general, organism-specific

• It works best on genes that are reasonably similar to something seen previously

• It finds protein coding regions far better than non-coding regions

• In the absence of external (direct) information, alternative forms will not be identified

• It is imperfect! (It’s biology, after all…)

Page 10

Simple HMM : Prokaryotes

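[Figure: HMM state diagram with its transition matrix and emission matrix H. Transition probabilities shown include 0.5, 0.001, 0.002, 0.996 and 0.998; the two emission distributions shown are (0.32, 0.18, 0.18, 0.32) and (0.25, 0.25, 0.22, 0.28).]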

$x_m(i)$ = probability of being in state $m$ at position $i$;

$H(m, y_i)$ = probability of emitting character $y_i$ in state $m$;

mk = probability of transition from state k to m.
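These three quantities combine into the usual forward recursion: the probability of being in state m at position i is the emission probability of the observed character in m, times the sum over previous states k of the k-to-m transition probability times the probability of having been in k. The sketch below normalizes at every position. The two state names, the transition values, and the assignment of the slide's emission numbers to states and to the letter order A, C, G, T are all assumptions made for illustration.

```python
def forward(seq, states, start, trans, emit):
    """Normalized forward pass.

    trans[m][k] = probability of a transition from state k to state m.
    emit[m][c]  = probability of emitting character c in state m.
    Returns x with x[i][m] = Pr(state m at position i | seq[:i + 1]).
    """
    x, prev = [], None
    for c in seq:
        if prev is None:
            cur = {m: start[m] * emit[m][c] for m in states}
        else:
            cur = {m: emit[m][c] * sum(trans[m][k] * prev[k] for k in states)
                   for m in states}
        total = sum(cur.values())
        cur = {m: p / total for m, p in cur.items()}
        x.append(cur)
        prev = cur
    return x

# Hypothetical two-state model (all values are illustrative assumptions).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {  # TRANS[to][from]
    "coding": {"coding": 0.996, "noncoding": 0.002},
    "noncoding": {"coding": 0.004, "noncoding": 0.998},
}
EMIT = {
    "coding": {"A": 0.25, "C": 0.25, "G": 0.22, "T": 0.28},
    "noncoding": {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32},
}
for column in forward("ATGC", STATES, START, TRANS, EMIT):
    print(column)
```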

Page 11

Outline: Rest of Lecture

• Eukaryotic gene structure

• Modeling gene structure

• Using the model to make predictions

• Improving the model topology

• Modeling fixed-length signals

Page 12

A eukaryotic gene

• This is the human p53 tumor suppressor gene on chromosome 17.

• Genscan is one of the most popular gene prediction algorithms.

Page 13

A eukaryotic gene

3’ untranslated region

Final exon

Initial exon

Introns

Internal exons

This particular gene lies on the reverse strand.

Page 14

An Intron

[Figure labels: 3’ splice site, 5’ splice site]

GT: signals start of intron

AG: signals end of intron

revcomp(AC)=GT

revcomp(CT)=AG

Page 15

Signals vs contents

• In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.

• Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites

• Examples of contents: exons, introns, UTRs, promoter regions

Page 16

Prior knowledge

• We want to build a probabilistic model of a gene that incorporates our prior knowledge.

• E.g., the translated region must have a length that is a multiple of 3.

Page 17

Prior knowledge

• The translated region must have a length that is a multiple of 3.

• Some codons are more common than others.

• Exons are usually shorter than introns.

• The translated region begins with a start signal and ends with a stop codon.

• 5’ splice sites (exon to intron) are usually GT;

• 3’ splice sites (intron to exon) are usually AG.

• The distribution of nucleotides and dinucleotides is usually different in introns and exons.

Page 18

A simple gene model

[Figure: gene model diagram with boxes Transcription start, Start, End and Transcription stop, a Gene arrow, and four Intergenic arrows.]

Page 19

A probabilistic gene model

[Figure: the gene model diagram (boxes Transcription start, Start, End, Transcription stop; arrows Gene and Intergenic), annotated with the transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75.]

Every box stores transition probabilities for outgoing arrows.

Every arrow stores emission probabilities for emitted nucleotides.

Pr(TACAGTAGATATGA) = 0.0001 Pr(AACAGT) = 0.001

Pr(AACAGTAC) = 0.002…

Page 20

Parse

• For a given sequence, a parse is an assignment of gene structure to that sequence.

• In a parse, every base is labeled according to the content it is predicted to belong to.

• In our simple model, the parse contains only “I” (intergenic) and “G” (gene).

• A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.

S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG

S (continued) = TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P (continued) = GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
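Since a parse is just one label per base, it can be stored as a string and collapsed into labeled intervals. The sketch below does this with run-length grouping; the function name `parse_segments` and the half-open (start, end) convention are choices made for this example.

```python
from itertools import groupby

def parse_segments(labels):
    """Collapse a per-base label string into (label, start, end) runs.

    Coordinates are 0-based and `end` is exclusive.
    """
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments

# A parse shaped like the one above: 39 intergenic, 15 gene, 33 intergenic.
parse = "I" * 39 + "G" * 15 + "I" * 33
print(parse_segments(parse))
```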

Page 21

The probability of a parse

[Figure: the probabilistic gene model (boxes Transcription start, Start, End, Transcription stop; arrows Gene and Intergenic) with transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75.]

Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543

Pr(ATGCGTATGTTTTGA) = 0.00000000142

Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789

Pr(parse P | sequence S, model M) = 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789 = 3.057 × 10^-18

S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
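The product above alternates transition probabilities (0.67, 1.00, 0.75) with the emission probabilities of the three segments. Products like this underflow quickly on real sequences, so a common approach is to sum logarithms instead. The sketch below reproduces the slide's number that way; the helper name `parse_log10_prob` is hypothetical.

```python
import math

def parse_log10_prob(factors):
    """Sum of log10 of the transition and emission factors of a parse."""
    return sum(math.log10(f) for f in factors)

# Factors in the order they appear along the parse on the slide.
factors = [0.67, 0.0000543, 1.00, 0.00000000142, 0.75, 0.0000789]
log_p = parse_log10_prob(factors)
prob = 10 ** log_p
print(log_p, prob)
```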

Page 22

Finding the best parse

• For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P.

• We want to find the parse P* that receives the highest probability.

$$P^* = \arg\max_{P} \Pr(P \mid S, M)$$
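For a model with a handful of states, the best parse can be found by dynamic programming rather than by enumerating every parse. The sketch below is a plain log-space Viterbi decoder for a two-label model ("I" for intergenic, "G" for gene). All probabilities in the example are invented for illustration and must be strictly positive, since the code takes their logarithms.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Most probable label string; trans[k][m] = Pr(next label m | current k)."""
    score = {m: math.log(start[m]) + math.log(emit[m][seq[0]]) for m in states}
    back = []  # back[t][m] = best predecessor of label m at position t + 1
    for c in seq[1:]:
        new_score, pointers = {}, {}
        for m in states:
            best = max(states, key=lambda k: score[k] + math.log(trans[k][m]))
            pointers[m] = best
            new_score[m] = (score[best] + math.log(trans[best][m])
                            + math.log(emit[m][c]))
        back.append(pointers)
        score = new_score
    path = [max(states, key=lambda m: score[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Hypothetical model: intergenic is AT-rich, gene is GC-rich.
LABELS = ("I", "G")
START_P = {"I": 0.5, "G": 0.5}
TRANS_P = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT_P = {
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
print(viterbi("ATATGCGCGCATAT", LABELS, START_P, TRANS_P, EMIT_P))
```

Running time is proportional to the sequence length times the square of the number of states.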

Page 23

Beyond Simplest Model

• Improving the gene model topology
• Fixed-length signals
– PSSMs
– Dependencies between positions
• Variable-length contents
– Using HMMs
– Semi-Markov models
• Parsing algorithms
– Viterbi
– Posterior decoding
• Including other types of data
– Expressed sequence tags
– Orthology

Page 24

Improved model topology

• Draw a model that includes introns

[Figure: the simple gene model, with boxes Transcription start, Start, End and Transcription stop, and arrows Gene, Intergenic 1, Intergenic 2, Intergenic 3 and Intergenic 4.]

Page 25

Improved model topology

[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop.]

Page 26

Improved model topology

[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop.]

4 intergenics, 1 intron, 4 exons

Page 27

Improved model topology

[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop, and arrows labeled Single exon, Initial exon, Intron, Internal exon and Final exon.]

Page 28

Modeling the 5’ splice site

• Most introns begin with the letters “GT.”

• We can add this signal to the model.

[Figure: 5’ splice site, the letters GT, the Intron, and the 3’ splice site.]

Page 29

Modeling the 5’ splice site

• Most introns begin with the letters “GT.”

• We can add this signal to the model.

• Indeed, we can model each nucleotide with its own arrow.

[Figure: 5’ splice site, an arrow emitting G, an arrow emitting T, the Intron, and the 3’ splice site.]

G arrow: Pr(A)=0, Pr(C)=0, Pr(G)=1, Pr(T)=0

T arrow: Pr(A)=0, Pr(C)=0, Pr(G)=0, Pr(T)=1

Page 30

Modeling the 5’ splice site

• Like most biological phenomena, the splice site signal admits exceptions.

• The resulting model of the 5’ splice site is a length-2 PSSM.

[Figure: 5’ splice site, an arrow emitting G, an arrow emitting T, the Intron, and the 3’ splice site.]

G arrow: Pr(A)=0.01, Pr(C)=0.01, Pr(G)=0.97, Pr(T)=0.01

T arrow: Pr(A)=0.01, Pr(C)=0.01, Pr(G)=0.01, Pr(T)=0.97

Page 31

Real splice sites

• Real splice sites show some conservation at positions beyond the first two.

• We can add additional arrows to model these states.

weblogo.berkeley.edu

Page 32

Modeling the 5’ splice site

[Figure: 5’ splice site, Intron and 3’ splice site.]

Page 33

Adding signals

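[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop, and arrows labeled Single exon, Initial exon, Intron, Internal exon and Final exon.]

Red ellipses correspond to signal models like this: [inset figure: a signal model]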

Page 34

Positional Independence

Pr(“ACTT”|M)
= Pr(“A” at position 1 and “C” at position 2 and “T” at position 3 and “T” at position 4|M)
= Pr(“A” at position 1|M) × Pr(“C” at position 2|M) × Pr(“T” at position 3|M) × Pr(“T” at position 4|M)

• In general, probabilities of independent events get multiplied.

• A PSSM assumes independence among nucleotides at different positions.
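Under the independence assumption, scoring a candidate site is a single product over positions. The sketch below applies this to the length-2 model of the 5’ splice site, using the 0.97/0.01 probabilities given earlier in the lecture; the function name `pssm_probability` and the list-of-dicts layout are choices made for this example.

```python
def pssm_probability(seq, pssm):
    """Pr(seq | PSSM) as a product of per-position probabilities.

    pssm[j][c] = probability of letter c at position j (0-based).
    """
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match PSSM length")
    p = 1.0
    for column, letter in zip(pssm, seq):
        p *= column[letter]
    return p

# Length-2 PSSM for the 5' splice site (G, then T).
splice5 = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]
print(pssm_probability("GT", splice5), pssm_probability("GC", splice5))
```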

Page 35

Positional dependence

• In this data, every time a “G” appears in position 1, an “A” appears in position 3.

• Conversely, an “A” in position 1 always occurs with a “T” in position 3.

ACTG

ACTT

GCAC

ACTT

ACTA

GCAT

ACTA

ACTT
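The dependence between positions 1 and 3 can be checked by tabulating which letter pairs co-occur in the eight example sequences. This is only a counting sketch; the helper name `cooccurrence` is hypothetical and positions are 1-based to match the slide.

```python
from collections import Counter

def cooccurrence(seqs, i, j):
    """Count (letter at position i, letter at position j) pairs, 1-based."""
    return Counter((s[i - 1], s[j - 1]) for s in seqs)

sites = ["ACTG", "ACTT", "GCAC", "ACTT", "ACTA", "GCAT", "ACTA", "ACTT"]
print(cooccurrence(sites, 1, 3))
```

Only the pairs (A, T) and (G, A) occur, so position 3 is fully determined by position 1 in this data, which a PSSM with independent columns cannot represent.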

Page 36

nth-order PSSM

• Normally, PSSM entry (i,j) gives the score for observing the ith letter in position j.

• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.

• The entries A|A, C|A, G|A and T|A should sum to 1.

2nd-order PSSM

       1     2     3     4
A|A  0.25  0.45  0.12  0.21
A|C  0.29  0.20  0.24  0.15
A|G  0.33  0.13  0.41  0.33
A|T  0.13  0.22  0.23  0.31
C|A  0.34  0.35  0.09  0.10
T|T  0.19  0.24  0.25  0.31

Page 37

nth-order PSSM

• Normally, PSSM entry (i,j) gives the score for observing the ith letter in position j.

• In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence.

• How many rows are in a 3rd-order PSSM for nucleotides? nth-order?

2nd-order PSSM

       1     2     3     4
A|A  0.25  0.45  0.12  0.21
A|C  0.29  0.20  0.24  0.15
A|G  0.33  0.13  0.41  0.33
A|T  0.13  0.22  0.23  0.31
C|A  0.34  0.35  0.09  0.10
T|T  0.19  0.24  0.25  0.31

Callout: The probability of observing an “A” in position 3, given that we already observed a “C” in position 2.

Page 38

Conditional probability

• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?

GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Page 39

Conditional probability

• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?

• Answer: total number of CA’s divided by total number of C’s in position 1.

• 3/11 = 27%

• Probability of observing CA = 3/18 = 17%.

GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Page 40

Conditional probability

• The conditional probability

$$\Pr(x \mid y) = \frac{\text{Number of occurrences of } y{:}x}{\text{Number of occurrences of } y{:}*}$$

where * is any letter.

GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG
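The counting rule can be checked directly on the example data. The sketch assumes the 54 letters shown are 18 aligned sequences of length 3, which is what the slide's denominators (18 sequences, 11 with a “C” in position 1) imply. The function name `conditional_probability` is hypothetical and positions are 1-based.

```python
def conditional_probability(seqs, pos, x, y):
    """Pr(x at position pos | y at position pos - 1), positions 1-based."""
    if pos < 2:
        raise ValueError("pos must be at least 2")
    with_y = [s for s in seqs if s[pos - 2] == y]
    if not with_y:
        raise ValueError("conditioning letter never occurs at that position")
    return sum(1 for s in with_y if s[pos - 1] == x) / len(with_y)

data = "GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG"
# Assumption: the string is 18 aligned sequences of length 3.
triplets = [data[i:i + 3] for i in range(0, len(data), 3)]

print(conditional_probability(triplets, 2, "A", "C"))  # A at 2 given C at 1
print(conditional_probability(triplets, 3, "G", "C"))  # G at 3 given C at 2
```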

Page 41

Conditional probability

• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?

GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Page 42

Conditional probability

• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?

• Answer: 9/12 = 75%.

GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Page 43

Modeling signals

[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop, and arrows labeled Single exon, Initial exon, Intron, Internal exon and Final exon.]

Red ellipses may correspond to nth-order PSSMs.

Page 44

Modeling variable-length regions

Exon length

Page 45

Modeling variable-length regions

1. The easy way, using standard HMMs.

2. And why that’s not so great.

How are variable-length insertions modeled in protein HMMs?

Page 46

The HMM solution

[Figure: two versions of the 5’ splice site, Intron, 3’ splice site model. The splice sites are fixed-length signals; the intron is variable-length content.]

Page 47

Codons

[Figure: two versions of the single exon model between start translation and end translation, with states labeled 0, 1 and 2.]

Page 48

The complete model

[Figure: model with boxes Transcription start, Start, 5’ splice site, 3’ splice site, End and Transcription stop, and arrows labeled Single exon, Initial exon, Intron, Internal exon and Final exon.]

Red ellipses correspond to nth-order PSSMs.

Every arrow contains an invisible box with a self-loop.

Page 49

A small problem

• Say that each blue arrow emits one letter.

• What is the probability that the intron will be exactly 2 letters long?

• 3 letters long?

• 4 letters long?

[Figure: 5’ splice site, Intron and 3’ splice site, with probabilities 0.1 and 0.9 on the intron's arrows.]

Page 50

A small problem

• Say that each blue arrow emits one letter.

• What is the probability that the intron will be exactly 2 letters long? 10%

• 3 letters long? 9%

• 4 letters long? 8.1%

[Figure: 5’ splice site, Intron and 3’ splice site, with probabilities 0.1 and 0.9 on the intron's arrows.]
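The three answers follow one pattern: each extra letter costs another factor of 0.9 before the 0.1 exit is taken. The sketch below assumes the reading of the diagram that reproduces those answers, namely two mandatory letters plus a self-loop that emits one letter with probability 0.9. The function name and parameters are hypothetical.

```python
def intron_length_probability(length, p_stay=0.9, min_length=2):
    """Pr(intron is exactly `length` letters) under a self-loop model."""
    if length < min_length:
        return 0.0
    return (p_stay ** (length - min_length)) * (1.0 - p_stay)

for n in (2, 3, 4):
    print(n, intron_length_probability(n))
```

The probabilities decay by a constant factor per letter, which is exactly a geometric distribution over lengths.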

Page 51

A small problem

HMMs tend to produce geometric distributions.

Real contents are not necessarily geometric.

Page 52

Building an HMM

• Input: annotated gene sequences

• Output: HMM parameters
– Emission distributions within each content
– Length distributions of contents
– Transition distributions between contents

Page 53

A more realistic (and complex) HMM model for Gene Prediction (Genie)

Kulp, D., PhD Thesis, UCSC 2003

Page 54

Assessing performance: Sensitivity and Specificity

•Testing of predictions is performed on sequences where the gene structure is known

•Sensitivity is the fraction of known genes (or bases or exons) correctly predicted

–“Am I finding the things that I’m supposed to find”

•Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes

–“What fraction of my predictions are true”?

•In general, increasing one decreases the other

Page 55

Graphic View of Specificity and Sensitivity

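$$Sn = \frac{\text{TruePositive}}{\text{AllTrue}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$

$$Sp = \frac{\text{TruePositive}}{\text{AllPositive}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$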

Page 56

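Quantifying the tradeoff: Correlation Coefficient

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$$

$$PP = TP + FP;\quad PN = TN + FN;\quad AP = TP + FN;\quad AN = TN + FP$$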
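Sensitivity, specificity (in the gene-finding sense used in this lecture, TP over all predictions) and the correlation coefficient can all be computed from the four confusion counts. The sketch below follows those definitions: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC = (TP·TN − FP·FN) / sqrt(PP·PN·AP·AN). The function name `prediction_metrics` is hypothetical, and it does not guard against empty classes (zero denominators).

```python
import math

def prediction_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) from confusion counts."""
    sn = tp / (tp + fn)            # fraction of true items that were found
    sp = tp / (tp + fp)            # fraction of predictions that are true
    pp, pn = tp + fp, tn + fn      # predicted positives / negatives
    ap, an = tp + fn, tn + fp      # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc

print(prediction_metrics(tp=40, tn=40, fp=10, fn=10))
```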

Page 57

Specificity/Sensitivity Tradeoffs

•Ideal Distribution of Scores

•More Realistically…

[Figure: two histograms, one per heading above, of count (0 to 1200) against score in arbitrary units (0 to 50), each comparing random sequence with true sites.]

Page 58

Bayesian Statistics

•Bayes’ Rule

•M: the model, D: data or evidence

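$$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$$

Here $P(M \mid D)$ is the posterior, $P(D \mid M)$ the likelihood, $P(M)$ the prior, and $P(D)$ the marginal:

$$P(D) = \begin{cases} \sum_{M} P(D \mid M)\,P(M) & \text{discrete} \\ \int P(D \mid M)\,P(M)\,dM & \text{continuous} \end{cases}$$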

Page 59

Basic Bayesian Statistics

•Bayes’ Rule is at the heart of much predictive software

•In the simplest example, we can simply compare two models, and reduce it to a log-odds ratio

$$\log\frac{P(M_1 \mid \text{data})}{P(M_2 \mid \text{data})} = \log\frac{P(\text{data} \mid M_1)}{P(\text{data} \mid M_2)} + \log\frac{P(M_1)}{P(M_2)}$$
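A short numerical sketch of the two-model comparison: the log posterior odds equal the log likelihood ratio plus the log prior ratio, because the marginal P(data) cancels. The function names and the example likelihoods are made up for illustration.

```python
import math

def posterior(likelihoods, priors):
    """Discrete Bayes' rule over a list of competing models."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    marginal = sum(joint)
    return [j / marginal for j in joint]

def log_odds(lik1, lik2, prior1=0.5, prior2=0.5):
    """Log posterior odds of model 1 versus model 2."""
    return math.log(lik1 / lik2) + math.log(prior1 / prior2)

# Hypothetical likelihoods of the same data under two models.
post = posterior([0.2, 0.1], [0.5, 0.5])
print(post, log_odds(0.2, 0.1))
```

A positive log-odds value favors the first model; with equal priors the score reduces to the log likelihood ratio.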

Page 60

Prokaryotes HMMs: Taking Overlaps on Two Strands into Account

[Figure: HMM state diagram with states Genetic +, Genetic -, Initiation +, Initiation -, Termination +, Termination -, short +, short -, intergenic, overlap 0, overlap 1, overlap 2 and overlap 3.]

Page 61

[Figure: the same HMM state diagram (Genetic +/-, Initiation +/-, Termination +/-, short +/-, overlap 0 to 3), with the coding region (genes) and the intergenic state labeled.]

Page 62

Coding region (genes)

Model of all possible 64 codons: AAA, AAC, AAG, …, TTT.

Transition from any codon to any other.

Page 63

 

Model Design (3)

Intergenic regions and overlap regions:

Two consecutive genes either overlap each other or are separated by an intergenic region.

The overlapping segment or the intergenic region is bordered in one of 4 possible ways: Tail–Head, Head–Tail, Tail–Tail, Head–Head.

[Figure: pairs of genes drawn on the two strands (5' to 3' and 3' to 5') for each of the four arrangements, shown both for an intermediate intergenic region and for an overlapping region, with the Tail and Head of a gene marked.]

Page 64

Example 1

Transition between two genes on the same strand.

[Figure: the HMM state diagram (Genetic +/-, Initiation +/-, Termination +/-, short +/-, intergenic, overlap 0 to 3) next to the two DNA strands, 5' to 3' and 3' to 5'.]

Page 65

Example 2

Two genes on the opposite strands.

[Figure: the HMM state diagram (Genetic +/-, Initiation +/-, Termination +/-, short +/-, intergenic, overlap 0 to 3) next to the two DNA strands, 5' to 3' and 3' to 5'.]

Page 66

Transitions between genes

[Figure: the HMM state diagram (Genetic +/-, Initiation +/-, Termination +/-, short +/-, intergenic, overlap 0 to 3) next to the two DNA strands, 5' to 3' and 3' to 5'.]

Page 67

Intergenic Regions

Intergenic regions are modeled by profile HMMs.

We model two different types of intergenic regions:

1. Short intergenic sequences: 9 bases long. They model situations where two same-strand genes are close together. This situation is common in polycistronic operons.

2. Long intergenic sequences are the more common case. They are modeled by the following 2 profile HMMs:
– Transcription termination signal: 18 bases long.
– Promoter region including the Shine-Dalgarno signal: 25 bases long.

Page 68

Overlap Regions (1)

Overlap regions of 1 or 4 bases:

Weight matrix models (WMM) are used to represent overlapping regions of 1 or 4 bases, consisting of the stop codon of the previous gene and the start codon of the next one.

1 base overlap of stop codon TAA or TGA, with init codon ATG:

bases: T A A GT

WMM format:
ACGT----
A--CG--T
A----CGT
ACG----T
ACGT----

4 bases overlap: first gene terminated by TGA, second gene starts with [AG]TG:

bases: N N A T G A N N

WMM format:
A-C-G-T-
A-C-G-T-
A---CG-T
ACG----T
ACGT----
A----CGT
A-C-G-T-
A-C-G-T-

Page 69

Overlap Regions (2)

Overlap regions of 6 or more bases:

For each one of the 4 possible paths described (head-head, head-tail, tail-tail, tail-head), all possible frame differences are allowed.

For example: a tail-head transition allows a shift of the reading frame by 1 or 2 bases.

[Figure: two overlapping genes on the two strands (5' to 3' and 3' to 5'), with init codons, stop codons and reading frames (Frame 1, Frame 2, Frame 3, Frame 2/3) marked.]