Gene Predict

8/13/2019 Gene Predict


Gene prediction and HMM

Computational Genomics



Annotation of Genomic Sequence

Given the sequence of an organism’s genome, we would like to be able to identify:
– Genes
– Exon boundaries & splice sites
– Beginning and end of translation
– Alternative splicings
– Regulatory elements (e.g. promoters)

The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.

primary goals

secondary goals



Prokaryotic Gene Structure

Promoter CDS Terminator

transcription

Genomic DNA

mRNA

Most bacterial promoters contain the Pribnow box, a signal at about -10 with the consensus sequence 5'-TATAAT-3'. (The Shine-Dalgarno signal is a separate element: the ribosome-binding site just upstream of the start codon.)

The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA.

The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The amino acids (AAs) are the building blocks of proteins.



Pieces of a (Eukaryotic) Gene (on the genome)

[Figure: double-stranded genomic DNA (5’ to 3’ over 3’ to 5’) shown at two scales: a region of ~1-100 Mbp and, within it, a single gene of ~1-1000 kbp.]

Labeled pieces:
– exons (cds & utr): ~10^2-10^3 bp
– introns: ~10^2-10^5 bp
– polyadenylation site
– promoter: ~10^3 bp
– enhancers: ~10^1-10^2 bp
– other regulatory sequences: ~10^1-10^2 bp



What is it about genes that we can measure (and model)?

• Most of our knowledge is biased towards protein-coding characteristics
– ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena
– Nucleotide frequencies and correlations: value and structure
– Functional sites: splice sites, promoters, UTRs, polyadenylation sites



A simple measure: ORF length

Comparison of Annotation and Spurious ORFs in S. cerevisiae

Basrai MA, Hieter P, and Boeke J. Genome Research 1997 7:768-771
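As a rough illustration of ORF length as a measure, the sketch below scans the three forward reading frames of a DNA string for in-frame ATG ... stop stretches. The function name `find_orfs`, the `min_codons` cutoff, and the restriction to the forward strand are assumptions made for this example, not part of the slides.

```python
# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each ATG-to-stop ORF on the forward strand.

    start is the index of the A of ATG; end is one past the stop codon.
    Only the first ATG after the previous stop opens an ORF in each frame.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# The 15-base gene used later in the parse example.
orfs = find_orfs("ATGCGTATGTTTTGA")
```

Raising `min_codons` is the simplest way to discard short, likely spurious ORFs; a real annotation pipeline would also scan the reverse complement.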



Codon Adaptation Index (CAI)

• Parameters are empirically determined by examining a “large” set of example genes
• This is not perfect
– Genes sometimes have unusual codons for a reason
– The predictive power is dependent on length of sequence

$$CAI = \prod_{i \in codons} \frac{f_{codon_i}}{f_{codon_i}^{max}}$$
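A minimal sketch of a CAI calculation follows. It assumes the commonly used formulation in which each codon's weight is its frequency in a reference gene set divided by the frequency of the most used synonymous codon, and the weights along a gene are combined as a geometric mean (the length normalization is an assumption here). The toy reference set and the two codon families are made up for illustration.

```python
import math
from collections import Counter

def codon_weights(reference_genes, synonymous_families):
    """Weight of a codon = its count / count of the most used synonymous codon."""
    counts = Counter()
    for gene in reference_genes:
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    weights = {}
    for family in synonymous_families:
        top = max(counts[c] for c in family)
        for c in family:
            if top > 0 and counts[c] > 0:
                weights[c] = counts[c] / top
    return weights

def cai(gene, weights):
    """Geometric mean of the weights of the gene's codons that have a weight."""
    logs = [math.log(weights[gene[i:i + 3]])
            for i in range(0, len(gene) - 2, 3)
            if gene[i:i + 3] in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Hypothetical reference set: CTG is used three times as often as CTA.
families = [("CTG", "CTA"), ("AAA", "AAG")]   # leucine and lysine codons
w = codon_weights(["CTGCTGCTGCTAAAAAAG"], families)
score_common = cai("CTGCTG", w)   # only the preferred codon
score_rare = cai("CTGCTA", w)     # one preferred, one rare codon
```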



Splice signals (mice): GT , AG



General Things to Remember about (Protein-coding) Gene Prediction Software

• It is, in general, organism-specific
• It works best on genes that are reasonably similar to something seen previously
• It finds protein coding regions far better than non-coding regions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)



Simple HMM : Prokaryotes

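[Figure: HMM state diagram with its transition matrix and emission matrix H. The matrix layout is not recoverable; the legible emission rows contain the values 0.25, 0.25, 0.22, 0.28 and 0.32, 0.18, 0.18, 0.32, plus one all-zero row.]

$x_m(i)$ = probability of being in state $m$ at position $i$;

$H(m, y_i)$ = probability of emitting character $y_i$ in state $m$;

transition-matrix entry $mk$ = probability of transition from state $k$ to state $m$.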
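One way these three quantities fit together is the standard forward recursion, in which the state probabilities at position i are computed from those at position i-1. The sketch below assumes that recursion; the matrix name `T`, the two-state layout and all numbers are hypothetical and are not the values from the slide.

```python
NUC = "ACGT"

def forward(seq, T, H, x0):
    """Normalized forward pass.

    T[m][k] = P(transition from state k to state m)
    H[m][c] = P(emitting NUC[c] in state m)
    x0[m]   = prior probability of state m before the first character
    Returns x, where x[i][m] = P(state m at position i | seq[0..i]).
    """
    n = len(x0)
    x, prev = [], x0
    for i, y in enumerate(seq):
        c = NUC.index(y)
        if i == 0:
            cur = [H[m][c] * x0[m] for m in range(n)]
        else:
            cur = [H[m][c] * sum(T[m][k] * prev[k] for k in range(n))
                   for m in range(n)]
        total = sum(cur)
        cur = [v / total for v in cur]  # rescale to avoid underflow
        x.append(cur)
        prev = cur
    return x

# Hypothetical model: state 0 = background, state 1 = GC-rich coding.
T = [[0.99, 0.02],
     [0.01, 0.98]]                      # each column sums to 1
H = [[0.25, 0.25, 0.25, 0.25],
     [0.10, 0.40, 0.40, 0.10]]
x = forward("GCGC", T, H, [0.5, 0.5])
```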



Outline: Rest of Lecture

• Eukaryotic gene structure
• Modeling gene structure
• Using the model to make predictions
• Improving the model topology
• Modeling fixed-length signals



A eukaryotic gene

• This is the human p53 tumor suppressor gene on chromosome 17.

• Genscan is one of the most popular gene prediction algorithms.



A eukaryotic gene

3’ untranslated region

Final exon

Initial exon

Introns

Internal exons

This particular gene lies on the reverse strand.



An Intron

3’ splice site, 5’ splice site

revcomp(CT) = AG
revcomp(AC) = GT

GT: signals start of intron
AG: signals end of intron



Signals vs contents

• In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.

• Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites

• Examples of contents: exons, introns, UTRs, promoter regions



Prior knowledge

• We want to build a probabilistic model of a gene that incorporates our prior knowledge.

• E.g., the translated region must have a length that is a multiple of 3.



Prior knowledge

• The translated region must have a length that is a multiple of 3.
• Some codons are more common than others.
• Exons are usually shorter than introns.
• The translated region begins with a start signal and ends with a stop codon.
• 5’ splice sites (exon to intron) are usually GT;
• 3’ splice sites (intron to exon) are usually AG.
• The distribution of nucleotides and dinucleotides is usually different in introns and exons.



A simple gene model

[Figure: state diagram with boxes Start, Transcription start, Transcription stop and End, connected by one “Gene” arrow and four “Intergenic” arrows.]



A probabilistic gene model

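[Figure: the same state diagram (Start, Transcription start, Transcription stop, End; Gene and Intergenic arrows) with the transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75 on the outgoing arrows.]

Every box stores transition probabilities for outgoing arrows.
Every arrow stores emission probabilities for emitted nucleotides.

Example emission probabilities stored on an arrow:
Pr(TACAGTAGATATGA) = 0.0001
Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002
…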



Parse

• For a given sequence, a parse is an assignment of gene structure to that sequence.

• In a parse, every base is labeled with the content it belongs to (or is predicted to belong to).

• In our simple model, the parse contains only “I” (intergenic) and “G” (gene).

• A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.

S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT ATGCGTATGTTTTGA ACTGACTATGCGATCTACGACTCGACTAGCTAC

P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII GGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII



The probability of a parse

[Figure: the probabilistic gene model (Start, Transcription start, Transcription stop, End; Gene and Intergenic arrows) with transition probabilities 0.67, 0.33, 1.00, 0.25 and 0.75.]

Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543

Pr(ATGCGTATGTTTTGA) = 0.00000000142

Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789

$$\Pr(\text{parse } P \mid \text{sequence } S, \text{model } M) = 0.67 \times 0.0000543 \times 1.00 \times 0.00000000142 \times 0.75 \times 0.0000789 = 3.057 \times 10^{-18}$$

S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT ATGCGTATGTTTTGA ACTGACTATGCGATCTACGACTCGACTAGCTAC

P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII GGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
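The arithmetic behind this parse probability can be checked directly: it is the product of the three transition probabilities along the path and the three emission probabilities of the segments. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative. The log-space version is included because products of many small probabilities underflow quickly on longer sequences.

```python
import math

# Transition probabilities along the path Start -> gene -> End.
transitions = [0.67, 1.00, 0.75]
# Emission probabilities of the intergenic, gene and intergenic segments.
emissions = [0.0000543, 0.00000000142, 0.0000789]

# Direct product.
parse_prob = math.prod(transitions) * math.prod(emissions)

# Same quantity as a sum of logs, the form used in practice.
log_parse_prob = sum(math.log(p) for p in transitions + emissions)
```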



Finding the best parse

• For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P.
• We want to find the parse P* that receives the highest probability.

$$P^{*} = \arg\max_{p} \Pr(p \mid S, M)$$
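For an HMM, the best parse is usually found with the Viterbi dynamic program rather than by enumerating all parses. The sketch below is a generic log-space Viterbi on a two-state I/G model; the slides do not name the algorithm here, and every parameter value is hypothetical, chosen only so that a GC-rich stretch is labeled as gene.

```python
import math

def viterbi(seq, states, log_init, log_trans, log_emit):
    """Return the most probable state path for seq as a string of state labels."""
    V = [{s: log_init[s] + log_emit[s][seq[0]] for s in states}]
    back = []
    for ch in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor r for landing in state s at this position.
            best = max(states, key=lambda r: V[-1][r] + log_trans[r][s])
            col[s] = V[-1][best] + log_trans[best][s] + log_emit[s][ch]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):          # trace the pointers backwards
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Hypothetical parameters: "I" emits uniformly, "G" prefers C and G.
states = "IG"
log_init = {"I": math.log(0.5), "G": math.log(0.5)}
log_trans = {"I": {"I": math.log(0.8), "G": math.log(0.2)},
             "G": {"G": math.log(0.8), "I": math.log(0.2)}}
log_emit = {"I": {b: math.log(0.25) for b in "ACGT"},
            "G": {"A": math.log(0.1), "C": math.log(0.4),
                  "G": math.log(0.4), "T": math.log(0.1)}}

seq = "ATATAT" + "GCGCGCGCGCGC" + "ATATAT"
best_parse = viterbi(seq, states, log_init, log_trans, log_emit)
```

The run time is proportional to the sequence length times the square of the number of states, which is what makes the argmax over all parses tractable.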





Improved model topology

• Draw a model that includes introns

[Figure: the simple gene model with boxes Start, Transcription start, Transcription stop and End, and arrows Gene, Intergenic 1, Intergenic 2, Intergenic 3 and Intergenic 4.]




[Figure, built up over three slides: the gene model with boxes for the 5’ splice site and the 3’ splice site added to Start, Transcription start, Transcription stop and End. The extended model has 4 intergenics, 1 intron and 4 exons: single exon, initial exon, internal exon and final exon.]



Modeling the 5’ splice site

• Most introns begin with the letters “GT.”

• We can add this signal to the model.

[Figure: a “GT” arrow inserted after the 5’ splice site box, followed by the Intron arrow leading to the 3’ splice site box.]




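• Most introns begin with the letters “GT.”
• We can add this signal to the model.
• Indeed, we can model each nucleotide with its own arrow.

[Figure: separate “G” and “T” arrows after the 5’ splice site box, followed by the Intron arrow leading to the 3’ splice site box.]

Arrow “G”: Pr(A)=0, Pr(C)=0, Pr(G)=1, Pr(T)=0

Arrow “T”: Pr(A)=0, Pr(C)=0, Pr(G)=0, Pr(T)=1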




• Like most biological phenomena, the splice site signal admits exceptions.
• The resulting model of the 5’ splice site is a length-2 PSSM.

[Figure: separate “G” and “T” arrows after the 5’ splice site box, followed by the Intron arrow leading to the 3’ splice site box.]

Arrow “G”: Pr(A)=0.01, Pr(C)=0.01, Pr(G)=0.97, Pr(T)=0.01

Arrow “T”: Pr(A)=0.01, Pr(C)=0.01, Pr(G)=0.01, Pr(T)=0.97
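The length-2 PSSM can be written down as one probability column per position and scored by multiplying the per-position probabilities. The sketch below uses the slide's values; the function name `pssm_prob` is made up for this example.

```python
# One column per position of the 5' splice site signal.
pssm = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # position 1: usually G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # position 2: usually T
]

def pssm_prob(site, pssm):
    """Probability of `site` under the PSSM, treating positions as independent."""
    p = 1.0
    for letter, column in zip(site, pssm):
        p *= column[letter]
    return p

p_canonical = pssm_prob("GT", pssm)   # the consensus signal
p_exception = pssm_prob("GC", pssm)   # a rare non-consensus site
```

The consensus GT scores 0.97 × 0.97, while GC scores 0.97 × 0.01, so exceptions are penalized but not impossible.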



Real splice sites

• Real splice sites show some conservation at positions beyond the first two.

• We can add additional arrows to model these states.

[Figure: sequence logo of real splice sites, from weblogo.berkeley.edu]





Adding signals

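[Figure: the gene model with boxes Start, Transcription start, 5’ splice site, 3’ splice site, Transcription stop and End, and arrows including Intron, Internal exon and Final exon. Red ellipses mark the signals.]

Red ellipses correspond to signal models like the splice site model.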



Positional Independence

Pr(“ACTT”|M)
= Pr(“A” at position 1 and “C” at position 2 and “T” at position 3 and “T” at position 4 | M)
= Pr(“A” at position 1|M) × Pr(“C” at position 2|M) × Pr(“T” at position 3|M) × Pr(“T” at position 4|M)

• In general, probabilities of independent events get multiplied.
• A PSSM assumes independence among nucleotides at different positions.



Positional dependence

• In this data, every time a “G” appears in position 1, an “A” appears in position 3.
• Conversely, an “A” in position 1 always occurs with a “T” in position 3.

A C T G

A C T T

G C A C

A C T T

A C T A

G C A T

A C T A

A C T T



n th-order PSSM

• Normally, PSSM entry (i,j) gives the score for observing the i th letter in position j.

• In an n th-order PSSM, each score is conditioned on the preceding letters in the sequence.

• The entries A|A, C|A, G|A and T|A should sum to 1.

2nd-order PSSM (columns are positions):

         1     2     3     4
A|A    0.25  0.45  0.12  0.21
A|C    0.29  0.20  0.24  0.15
A|G    0.33  0.13  0.41  0.33
A|T    0.13  0.22  0.23  0.31
C|A    0.34  0.35  0.09  0.10
…
T|T    0.19  0.24  0.25  0.31



• How many rows are in a 3rd-order PSSM for nucleotides? In an n th-order PSSM?

• Example entry: row A|C, column 3 (0.24) is the probability of observing an “A” in position 3, given that we already observed a “C” in position 2.



Conditional probability

• What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?

GCG

CAG

CCG

GCG

CCG

CCG

GCG

CCT

CCG

GGG

CGG

GCG

AGG

CAG

CCT

CAT

CCT

GCG




• Answer: total number of CA’s divided by total number of C’s in position 1.
• 3/11 = 27%
• Probability of observing CA = 3/18 = 17%.





• The conditional probability

$$\Pr(x \mid y) = \frac{\text{number of occurrences of } y{:}x}{\text{number of occurrences of } y{:}*}$$

where * is any letter.





• What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?





• Answer: 9/12 = 75%.
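Both worked answers can be reproduced by counting over the 18 example sequences. The sketch below assumes that the run “AGGCAG” in the data is two sequences, AGG and CAG, which is the reading consistent with the totals of 18 sequences and 3 CA's; the function name `cond_counts` is made up for this example.

```python
# The 18 aligned three-letter sequences from the slide.
data = ("GCG CAG CCG GCG CCG CCG GCG CCT CCG "
        "GGG CGG GCG AGG CAG CCT CAT CCT GCG").split()

def cond_counts(seqs, pos, x, y):
    """Counts for Pr(x at 1-based position pos | y at position pos - 1).

    Returns (number of y:x occurrences, number of y:* occurrences).
    """
    given = [s for s in seqs if s[pos - 2] == y]
    hits = [s for s in given if s[pos - 1] == x]
    return len(hits), len(given)

a_given_c = cond_counts(data, 2, "A", "C")   # (3, 11), i.e. 27%
g_given_c = cond_counts(data, 3, "G", "C")   # (9, 12), i.e. 75%
joint_ca = a_given_c[0] / len(data)          # 3/18, about 17%
```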




Modeling signals

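[Figure: the gene model with boxes Start, Transcription start, 5’ splice site, 3’ splice site, Transcription stop and End, and arrows including Intron, Internal exon and Final exon. Red ellipses mark the signals.]

Red ellipses may correspond to n th-order PSSMs.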



Modeling variable-length regions

[Figure: distribution of exon length]

1. The easy way, using standard HMMs.
2. And why that’s not so great.

How are variable-length insertions modeled in protein HMMs?



The HMM solution

[Figure: 5’ splice site, Intron, 3’ splice site. The splice sites are labeled “Fixed-length signals” and the intron “Variable-length content”.]




Codons

[Figure: the “Single exon” arrow between start translation and end translation, shown twice; in the second version it is expanded into three states, 0, 1 and 2, one per codon position.]



The complete model

[Figure: the complete gene model with boxes Start, Transcription start, 5’ splice site, 3’ splice site, Transcription stop and End, and arrows including Intron, Internal exon and Final exon.]

Red ellipses correspond to n th-order PSSMs.
Every arrow contains an invisible box with a self-loop.



A small problem

• Say that each blue arrow emits one letter.
• What is the probability that the intron will be exactly 2 letters long?
• 3 letters long?
• 4 letters long?

[Figure: intron model after the 5’ splice site, with a self-loop of probability 0.9 and an exit of probability 0.1.]



Answers:
• Exactly 2 letters long: 10%
• 3 letters long: 9%
• 4 letters long: 8.1%



A small problem

HMMs tend to produce geometric distributions.

Real contents are not necessarily geometric.
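The geometric shape follows directly from the self-loop: each extra letter costs another factor of the stay probability. The sketch below reproduces the 10%, 9% and 8.1% figures, assuming the model from the example (a minimum length of 2, stay probability 0.9, exit probability 0.1); the function name is illustrative.

```python
def intron_length_prob(length, p_stay=0.9, min_length=2):
    """P(intron is exactly `length` letters) for a single self-looping state.

    Every letter beyond the minimum requires one more self-loop, so the
    probability decays geometrically with length.
    """
    if length < min_length:
        return 0.0
    return p_stay ** (length - min_length) * (1 - p_stay)

probs = [intron_length_prob(n) for n in (2, 3, 4)]
# The distribution is always highest at the minimum length and its mean is
# min_length + p_stay / (1 - p_stay), here 2 + 9 = 11 letters.
mean_length = 2 + 0.9 / (1 - 0.9)
```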



Building an HMM

• Input: annotated gene sequences
• Output: HMM parameters
– Emission distributions within each content
– Length distributions of contents
– Transition distributions between contents
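The three kinds of parameters listed above can all be estimated by counting over an annotated sequence. The following sketch does this for a sequence and its parse string, using the S/P example from the parse slides as input; the function name, the return layout and the use of plain relative frequencies without pseudocounts are assumptions for illustration.

```python
from collections import Counter, defaultdict

def estimate_parameters(seq, labels):
    """Count-based estimates from one annotated sequence.

    Returns (emissions, transitions, lengths):
      emissions[c][base] = relative frequency of base within content c
      transitions[c][d]  = relative frequency of content c being followed by d
      lengths[c]         = observed lengths of each run of content c
    """
    emit, trans, lengths = defaultdict(Counter), defaultdict(Counter), defaultdict(list)
    run = 0
    for i, (base, lab) in enumerate(zip(seq, labels)):
        emit[lab][base] += 1
        run += 1
        last = i == len(labels) - 1
        if last or labels[i + 1] != lab:      # a run of one content ends here
            lengths[lab].append(run)
            run = 0
            if not last:
                trans[lab][labels[i + 1]] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return ({c: normalize(v) for c, v in emit.items()},
            {c: normalize(v) for c, v in trans.items()},
            dict(lengths))

s1 = "ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT"
s2 = "ATGCGTATGTTTTGA"
s3 = "ACTGACTATGCGATCTACGACTCGACTAGCTAC"
labels = "I" * len(s1) + "G" * len(s2) + "I" * len(s3)
emissions, transitions, lengths = estimate_parameters(s1 + s2 + s3, labels)
```

With real training data the counts would be pooled over many annotated genes, and small pseudocounts are commonly added so that unseen events do not get probability zero.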




A more realistic (and complex) HMM model for Gene Prediction (Genie)

Kulp, D., PhD Thesis, UCSC 2003



Assessing performance: Sensitivity and Specificity

•Testing of predictions is performed on sequences where the gene structure is known

•Sensitivity is the fraction of known genes (or bases or exons) correctly predicted
–“Am I finding the things that I’m supposed to find?”

•Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes
–“What fraction of my predictions are true?”

•In general, increasing one decreases the other



Graphic View of Specificity and Sensitivity

$$Sn = \frac{TruePositive}{AllTrue} = \frac{TruePositive}{TruePositive + FalseNegative}$$

$$Sp = \frac{TruePositive}{AllPositive} = \frac{TruePositive}{TruePositive + FalsePositive}$$



Quantifying the tradeoff: Correlation Coefficient

$$CC = \frac{(TP \cdot TN) - (FN \cdot FP)}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$$

$$PP = TP + FP; \quad PN = TN + FN; \quad AP = TP + FN; \quad AN = TN + FP$$
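At the base level, these measures can be computed by comparing a predicted parse with the known one, label by label. The sketch below does so for a made-up pair of parse strings; the function name and the example strings are illustrative, and the code assumes none of the four totals PP, PN, AP, AN is zero.

```python
import math

def accuracy(true_labels, pred_labels, positive="G"):
    """Base-level Sn, Sp and CC for one true/predicted pair of parses."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    pp, pn, ap, an = tp + fp, tn + fn, tp + fn, tn + fp
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "Sn": tp / (tp + fn),                 # fraction of true bases found
        "Sp": tp / (tp + fp),                 # fraction of predictions that are true
        "CC": (tp * tn - fn * fp) / math.sqrt(pp * pn * ap * an),
    }

true_parse = "IIII" + "GGGGGG" + "IIII"
pred_parse = "III" + "GGGGG" + "IIIIII"     # starts one base early, ends two early
result = accuracy(true_parse, pred_parse)
```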



Specificity/Sensitivity Tradeoffs

•Ideal Distribution of Scores

[Figure: histogram of count (0 to 1200) against score (arb units, 0 to 50) for random sequence and true sites.]

•More Realistically…

[Figure: histogram of count (0 to 1200) against score (arb units, 0 to 50) for random sequence and true sites.]



Bayesian Statistics

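•Bayes’ Rule

•M: the model, D: data or evidence

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$$

Here $P(M \mid D)$ is the posterior, $P(D \mid M)$ the likelihood, $P(M)$ the prior and $P(D)$ the marginal:

$$P(D) = \begin{cases} \sum_{M} P(D \mid M)\, P(M) & \text{discrete} \\ \int P(D \mid M)\, P(M)\, dM & \text{continuous} \end{cases}$$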



Basic Bayesian Statistics

•Bayes’ Rule is at the heart of much predictive software

•In the simplest example, we can simply compare two models, and reduce it to a log-odds ratio

$$\log\frac{P(M_1 \mid data)}{P(M_2 \mid data)} = \log\frac{P(data \mid M_1)}{P(data \mid M_2)} + \log\frac{P(M_1)}{P(M_2)}$$
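As a small worked instance of the log-odds comparison, the sketch below scores a sequence under two models that emit each nucleotide independently, then adds the log prior ratio. The two emission tables, the priors and the function name are hypothetical and chosen only for illustration.

```python
import math

def log_odds(seq, emit1, emit2, prior1=0.5, prior2=0.5):
    """log P(M1 | seq) / P(M2 | seq) for two independent-nucleotide models.

    The marginal P(seq) cancels in the ratio, leaving the log-likelihood
    ratio plus the log prior ratio.
    """
    log_likelihood_ratio = sum(math.log(emit1[b] / emit2[b]) for b in seq)
    return log_likelihood_ratio + math.log(prior1 / prior2)

# Hypothetical models: M1 is GC-rich, M2 is uniform background.
m1 = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
m2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

score_gc = log_odds("GCGC", m1, m2)   # positive: favors M1
score_at = log_odds("ATAT", m1, m2)   # negative: favors M2
```

A positive score favors the first model and a negative score the second; changing the priors only shifts every score by a constant.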

Prokaryotes HMMs: Taking Overlaps on Two Strands into Account

[Figure: HMM architecture with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Initiation -, Termination +, Termination -, and overlap 0 to overlap 3.]



Coding region (genes)

[Figure: the HMM architecture (Genetic +/-, short +/-, Initiation +, Termination +, overlap 0 to overlap 3, intergenic), annotated “Coding region (genes)”.]

Model of all possible 64 codons (AAA, AAC, AAG, …, TTT), with a transition from any codon to any other.



Model Design (3): Intergenic regions and overlap regions

Two consecutive genes either overlap each other or are separated by an intergenic region.

The overlapping segment or the intergenic region is bordered in one of 4 possible ways:
– Tail – Head
– Head – Tail
– Tail – Tail
– Head – Head

[Figure: for each of the four cases, two double-stranded diagrams (5' to 3' over 3' to 5'), one with an intermediate intergenic region and one with an overlapping region. Legend: Tail, Head; 5', 3'.]



Example 1

[Figure: the HMM architecture (Genetic +/-, short +/-, Initiation +, Termination +, overlap 0 to overlap 3, intergenic) with the path for this example marked, above a double-stranded diagram (5' to 3' over 3' to 5').]

Transition between two genes on the same strand.



Example 2

[Figure: the HMM architecture (Genetic +/-, short +/-, Initiation +, Termination +, overlap 0 to overlap 3, intergenic) with the path for this example marked, above a double-stranded diagram (5' to 3' over 3' to 5').]

Transition between two genes on opposite strands.



Transitions between genes

[Figure: the HMM architecture with states Genetic +, Genetic -, short +, short -, intergenic, Initiation +, Termination + and overlap 0 to overlap 3, showing the transitions between genes.]



Intergenic Regions

Intergenic regions are modeled by profile HMMs.

We model two different types of intergenic regions:

1. Short intergenic sequences: 9 bases long. They model situations where two same-strand genes are close together. This situation is common in polycistronic operons.

2. Long intergenic sequences are the more common case. They are modeled by the following 2 profile HMMs:
– Transcription termination signal: 18 bases long.
– Promoter region including the Shine-Dalgarno signal: 25 bases long.



Overlap Regions (1)

Overlap regions of 1 or 4 bases:

Weight matrix models [i] (WMM) are used to represent overlapping regions of 1 or 4 bases, consisting of the stop codon of the previous gene and the start codon of the next one.

1 base overlap of stop codon TAA or TGA, with init codon ATG:
– bases: T [AG] A T G
– WMM format: position 1 is T; position 2 is A or G with equal weight; position 3 is A; position 4 is T; position 5 is G.

4 bases overlap: first gene terminated by TGA, second gene starts with [AG]TG:
– bases: N N [AG] T G A N N
– WMM format: positions 1, 2, 7 and 8 are uniform over A, C, G, T; position 3 is mostly A, sometimes G; position 4 is T; position 5 is G; position 6 is A.



Overlap Regions (2)

Overlap regions of 6 or more bases:

For each one of the 4 possible paths described (head-head, head-tail, tail-tail, tail-head), all possible frame differences are allowed.

For example: a tail-head transition allows a shift of the reading frame by 1 or 2 bases.

[Figure: a gene in Frame 1 ending at its stop codon, overlapped by a gene in Frame 2 or Frame 3 (Frame 2/3) beginning at its init codon.]
