Hidden Markov Models in Bioinformatics 14.11 60 min
Definition
Three Key Algorithms
• Summing over Unknown States
• Most Probable Unknown States
• Marginalizing Unknown States
Key Bioinformatic Applications
• Pedigree Analysis
• Isochores in Genomes (CG-rich regions)
• Profile HMM Alignment
• Fast/Slowly Evolving States
• Secondary Structure Elements in Proteins
• Gene Finding
• Statistical Alignment
Hidden Markov Models
[Diagram: a hidden Markov chain with states H1, H2, H3 emitting the observed sequence O1–O10]
(O1,H1), (O2,H2), …, (On,Hn) is a sequence of stochastic variables with two components - one that is observed (Oi) and one that is hidden (Hi).
The marginal distribution of the Hi's is described by a homogeneous Markov chain:
• Let p_{i,j} = P(H_{k+1} = j | H_k = i) be the transition probabilities.
• Let π_i = P(H_1 = i) - often π is the equilibrium distribution of the Markov chain.
• Conditional on the Hk (all k), the Ok are independent.
• The distribution of Ok depends only on the value of Hk and is called the emission function:
e(i, j) = P(O_k = i | H_k = j)
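To make the definition concrete, a toy parameterization can be written down directly. The following two-state DNA-style example is a hypothetical sketch - all numbers and names are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical two-state HMM over DNA symbols, loosely in the spirit of the
# "CG-rich isochore" application: state 0 = CG-rich, state 1 = CG-poor.
# Observations are indexed A=0, C=1, G=2, T=3.  All numbers are made up.
p = np.array([[0.9, 0.1],            # p[i, j] = P(H_{k+1} = j | H_k = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])            # pi[i] = P(H_1 = i)
e = np.array([[0.1, 0.4, 0.4, 0.1],  # e[j, o] = P(O_k = o | H_k = j)
              [0.3, 0.2, 0.2, 0.3]])

# Each row of p and e is a probability distribution, so rows must sum to one.
assert np.allclose(p.sum(axis=1), 1.0)
assert np.allclose(e.sum(axis=1), 1.0)
```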
What is the probability of the data?
For example:

P(O_1,…,O_5, H_5 = 2) = P(O_5 = i | H_5 = 2) · Σ_j P(O_1,…,O_4, H_4 = j) · p_{j,2}
The probability of the observed sequence is

P(O) = Σ_H P(O | H) · P(H),

which can be hard to calculate directly, since the sum runs over all possible hidden-state sequences H. However, these calculations can be considerably accelerated. Let P_k^j be the joint probability of the observations (O_1,…,O_k) and H_k = j. The following recursion will be obeyed:

i. P_k^j = P(O_k | H_k = j) · Σ_i P_{k-1}^i · p_{i,j}

ii. P_1^j = P(O_1 | H_1 = j) · π_j (initial condition)

iii. P(O) = Σ_j P_n^j
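Recursions i.–iii. are the forward algorithm. A minimal Python sketch, assuming arrays indexed as p[i, j] = p_{i,j}, pi[j] = π_j and e[state, symbol]; the toy two-state numbers are illustrative only:

```python
import numpy as np

def forward(obs, p, pi, e):
    """Return P(O) and the forward table (row k-1 holds P_k^j, 0-based).

    obs -- observation indices O_1..O_n
    p   -- p[i, j] = P(H_{k+1} = j | H_k = i)
    pi  -- pi[j]   = P(H_1 = j)
    e   -- e[j, o] = P(O_k = o | H_k = j)
    """
    n, S = len(obs), len(pi)
    P = np.zeros((n, S))
    P[0] = pi * e[:, obs[0]]                   # ii. initial condition
    for k in range(1, n):
        P[k] = e[:, obs[k]] * (P[k - 1] @ p)   # i.  recursion
    return P[-1].sum(), P                      # iii. P(O) = sum_j P_n^j

# Hypothetical two-state, two-symbol example:
p = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
e = np.array([[0.8, 0.2], [0.3, 0.7]])
prob, P = forward([0, 1, 0], p, pi, e)         # prob is P(O) for O = (0, 1, 0)
```

The table P is filled in O(n·S²) time, versus the S^n terms of the naive sum over all hidden paths.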
What is the most probable "hidden" configuration?
For example:

H_6^1 = max_j { H_5^j · p_{j,1} } · e(O_6, 1)
Let H* be the sequence of hidden states that maximizes the probability of the observed sequence O, i.e. ArgMax_H P(O, H). Let H_k^j be the probability of the most probable path up to k ending in hidden state j. Again, recursions can be found:
i. H_k^j = max_i { H_{k-1}^i · p_{i,j} } · e(O_k, j)

ii. H_1^j = π_j · e(O_1, j)

iii. H_{k-1}^* = { i : H_{k-1}^i · p_{i,j} · e(O_k, j) = H_k^j, where j = H_k^* }
The actual sequence of hidden states can be found recursively by
starting from H_n^* = ArgMax_j H_n^j, e.g.:

H_5^* = { i : H_5^i · p_{i,1} · e(O_6, 1) = H_6^1 } (assuming H_6^* = 1)
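The Viterbi recursion above can be sketched the same way. This is a hypothetical illustration with assumed names; the toy parameters are the same style of made-up two-state example:

```python
import numpy as np

def viterbi(obs, p, pi, e):
    """Return the most probable hidden path H* and its probability.

    Row k-1 of H holds H_k^j, the probability of the most probable
    path up to step k ending in hidden state j (0-based rows)."""
    n, S = len(obs), len(pi)
    H = np.zeros((n, S))
    back = np.zeros((n, S), dtype=int)             # best-predecessor pointers
    H[0] = pi * e[:, obs[0]]                       # ii. H_1^j = pi_j * e(O_1, j)
    for k in range(1, n):
        scores = H[k - 1][:, None] * p             # scores[i, j] = H_{k-1}^i * p_{i,j}
        back[k] = scores.argmax(axis=0)
        H[k] = scores.max(axis=0) * e[:, obs[k]]   # i. recursion
    path = [int(H[-1].argmax())]                   # start from H_n^* = argmax_j H_n^j
    for k in range(n - 1, 0, -1):                  # iii. backtrack via the pointers
        path.append(int(back[k, path[-1]]))
    path.reverse()
    return path, float(H[-1].max())

# Hypothetical two-state, two-symbol example:
p = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
e = np.array([[0.8, 0.2], [0.3, 0.7]])
path, best = viterbi([0, 1, 0], p, pi, e)
```

For long sequences the products underflow, so practical implementations run the same recursion on log probabilities (products become sums; the max is unchanged).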
What is the probability of a specific "hidden" state?

Let Q_k^j be the probability of the observations from k+1 to n given H_k = j. These will also obey recursions:

Q_k^j = Σ_i p_{j,i} · P(O_{k+1} | H_{k+1} = i) · Q_{k+1}^i

The probability of the observations and a specific hidden state can be found as:

P(O, H_k = j) = P_k^j · Q_k^j

so that

P(H_k = j | O) = P_k^j · Q_k^j / P(O), for example P(H_5 = 2 | O) = P_5^2 · Q_5^2 / P(O).
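Combining the backward table Q_k^j with the forward table P_k^j gives the posterior just described. A minimal sketch, assuming arrays p, pi, e indexed as in the definitions (all names and numbers are illustrative assumptions):

```python
import numpy as np

def backward(obs, p, e):
    """Return the backward table (row k-1 holds Q_k^j, 0-based):
    Q_k^j = P(O_{k+1}..O_n | H_k = j)."""
    n, S = len(obs), p.shape[0]
    Q = np.ones((n, S))                        # Q_n^j = 1: no future observations
    for k in range(n - 2, -1, -1):
        # Q_k^j = sum_i p[j, i] * P(O_{k+1} | H_{k+1} = i) * Q_{k+1}^i
        Q[k] = p @ (e[:, obs[k + 1]] * Q[k + 1])
    return Q

def posterior(obs, p, pi, e):
    """Return P(H_k = j | O) for all k, j, via P_k^j * Q_k^j / P(O)."""
    n, S = len(obs), len(pi)
    P = np.zeros((n, S))                       # forward table
    P[0] = pi * e[:, obs[0]]
    for k in range(1, n):
        P[k] = e[:, obs[k]] * (P[k - 1] @ p)
    Q = backward(obs, p, e)
    return P * Q / P[-1].sum()                 # P[-1].sum() = P(O)

# Hypothetical two-state, two-symbol example:
p = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
e = np.array([[0.8, 0.2], [0.3, 0.7]])
post = posterior([0, 1, 0], p, pi, e)
```

With 0-based rows, post[k-1, j] is the slide's P(H_k = j | O); each row is a distribution over hidden states and sums to one.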