Transcript
11/1/2007
1
Hidden Markov Models & Their Applications in Bioinformatics
Bioinformatics Group, Electrical and Computer Engineering Department
University of Tehran, 2008
By: Mahdi Pakdaman
Pakdaman@gmail.com, 1387/7/29
- Markov models
- Hidden Markov models
  - Definition
  - Three basic problems
    - Forward/Backward algorithm
    - Viterbi algorithm
    - Baum-Welch estimation algorithm
  - Issues
  - Applications in Bioinformatics
Forecast the weather state, given the current weather variables.
Urn and Ball Model
- N urns containing colored balls
- M distinct colors of balls
- Each urn has a (possibly) different distribution of colors

Sequence generation algorithm:
1. Pick an initial urn according to some random process.
2. Randomly pick a ball from the urn and then replace it.
3. Select another urn according to a random selection process associated with the current urn.
4. Repeat steps 2 and 3.
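The four steps above can be sketched directly in Python; the two-urn, two-color setup and all probability values below are illustrative assumptions, not from the slides:

```python
import random

def generate_sequence(pi, A, B, colors, length, seed=0):
    """Simulate the urn-and-ball process: pick an initial urn from pi,
    draw (and replace) a ball from the current urn's color distribution B,
    then move to the next urn according to the transition matrix A."""
    rng = random.Random(seed)
    urns = list(range(len(pi)))
    q = rng.choices(urns, weights=pi)[0]             # step 1: initial urn
    observations = []
    for _ in range(length):
        ball = rng.choices(colors, weights=B[q])[0]  # step 2: draw a ball, replace it
        observations.append(ball)
        q = rng.choices(urns, weights=A[q])[0]       # step 3: move to the next urn
    return observations                              # step 4: repeated above

# Two urns, two colors: urn 0 favors red, urn 1 favors blue (made-up values).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
print(generate_sequence(pi, A, B, ["red", "blue"], 5))
```

Note that only the ball colors are observed; the sequence of urns (states) stays hidden, which is exactly the situation an HMM models.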
Elements of Hidden Markov Models
- N: the number of hidden states; Q: the set of states, Q = {1, 2, ..., N}
- M: the number of symbols; V: the set of symbols, V = {1, 2, ..., M}
- A: the state-transition probability matrix: $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, $1 \le i, j \le N$
- B: the observation probability distribution: $b_j(k) = P(o_t = v_k \mid q_t = j)$, $1 \le j \le N$, $1 \le k \le M$
- π: the initial state distribution: $\pi_i = P(q_1 = i)$, $1 \le i \le N$
- λ = (A, B, π): the entire model
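As a rough sketch, λ = (A, B, π) can be held as plain Python lists, with the constraints above (each distribution sums to 1) checked explicitly; the 2-state, 2-symbol numbers are illustrative:

```python
def validate_hmm(pi, A, B):
    """Check that lambda = (A, B, pi) is a valid HMM parameterisation:
    pi is a distribution over the N states, each row of A is a
    distribution over next states, and each row of B is a distribution
    over the M symbols."""
    ok = abs(sum(pi) - 1.0) < 1e-9
    ok &= all(abs(sum(row) - 1.0) < 1e-9 for row in A)
    ok &= all(abs(sum(row) - 1.0) < 1e-9 for row in B)
    return ok

# N = 2 hidden states, M = 2 symbols (illustrative values).
pi = [0.6, 0.4]                    # initial state distribution
A  = [[0.7, 0.3], [0.4, 0.6]]      # state-transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]      # per-state observation distributions
print(validate_hmm(pi, A, B))      # True
```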
Three Basic Problems
1. EVALUATION: given observation O = (o_1, o_2, ..., o_T) and model λ = (A, B, π), efficiently compute P(O | λ).
   - Hidden states complicate the evaluation.
   - Given two models λ_1 and λ_2, this can be used to choose the better one.
2. DECODING: given observation O = (o_1, o_2, ..., o_T) and model λ, find the optimal state sequence q = (q_1, q_2, ..., q_T).
   - An optimality criterion has to be decided (e.g., maximum likelihood).
3. LEARNING: given O = (o_1, o_2, ..., o_T), estimate the model parameters λ = (A, B, π) that maximize P(O | λ).
Problem: Compute P(o_1, o_2, ..., o_T | λ).
Algorithm: Let q = (q_1, q_2, ..., q_T) be a state sequence.
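Summing P(O, q | λ) over all N^T state sequences q is intractable; the forward algorithm computes the same quantity in O(N²T) time. A minimal unscaled sketch:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha_t(i) = P(o_1..o_t, q_t = i | lambda).
    Returns P(O | lambda) by dynamic programming in O(N^2 T) time
    instead of summing over all N^T state sequences."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]              # initialisation
    for o in obs[1:]:                                             # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                             # termination

# Toy model: always start in state 0, states emit their own symbol.
p = forward([1.0, 0.0],
            [[0.5, 0.5], [0.5, 0.5]],
            [[1.0, 0.0], [0.0, 1.0]],
            [0, 1])
print(p)  # 0.5
```

On long sequences the alpha values underflow; practical implementations rescale alpha at each step or work in log space.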
Baum-Welch: Update Rules
- $\bar\pi_i = \gamma_1(i)$: expected frequency in state i at time t = 1.
- $\bar a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$: (expected number of transitions from state i to state j) / (expected number of transitions from state i).
- $\bar b_j(k) = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$: (expected number of times in state j observing symbol $v_k$) / (expected number of times in state j).
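These update rules can be sketched in plain Python for a single observation sequence. This unscaled version is for illustration only; practical implementations rescale alpha and beta to avoid underflow on long sequences:

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step: compute gamma_t(i) and
    xi_t(i,j) via the forward/backward variables, then apply the
    three update rules."""
    N, T = len(pi), len(obs)
    # forward: alpha[t][i] = P(o_1..o_t, q_t = i | lambda)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # backward: beta[t][i] = P(o_{t+1}..o_T | q_t = i, lambda)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    p_obs = sum(alpha[T-1][i] for i in range(N))
    # gamma_t(i) = P(q_t = i | O); xi_t(i,j) = P(q_t = i, q_{t+1} = j | O)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # the three update rules
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1))
              for j in range(N)] for i in range(N)]
    M = len(B[0])
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B
```

Iterating this step is guaranteed to not decrease P(O | λ), which is why it converges, possibly to a local maximum (see the issues below).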
Some Issues
- Limitations imposed by the Markov chain
- Scalability
- Learning:
  - Initialisation
  - Model order
  - Local maxima
  - Weighting training sequences
HMM Applications
- Classification (e.g., profile HMMs): build an HMM for each class, then classify a sequence using Bayes' rule.
- Multiple sequence alignment: build an HMM based on a set of sequences, then decode each sequence to find a multiple alignment.
- Segmentation (e.g., gene finding): use different states to model different regions, then decode a sequence to reveal the region boundaries.
HMMs for Classification

$p(C \mid X) = \dfrac{p(X \mid C)\, p(C)}{p(X)}$

$C^* = \arg\max_{C \in \{C_1, \ldots, C_k\}} p(X \mid C)\, p(C)$
p(X|C) is modeled by a profile HMM built specifically for C
Assuming example sequences are available for C
E.g., Protein families
Assign a family to X
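The decision rule can be sketched as follows; the family names and log-likelihood scores are made-up illustrations, and in practice log p(X|C) would come from running each family's profile HMM (e.g., the forward algorithm) on X:

```python
import math

def classify(log_likelihoods, log_priors):
    """Bayes-rule classification: pick the class C maximizing
    p(X|C) * p(C), equivalently log p(X|C) + log p(C).
    log_likelihoods: dict class -> log p(X|C) from that class's profile HMM
    log_priors:      dict class -> log p(C)"""
    return max(log_likelihoods, key=lambda c: log_likelihoods[c] + log_priors[c])

# Hypothetical scores for one query sequence against two protein-family HMMs.
ll = {"globin": -120.5, "kinase": -130.2}
prior = {"globin": math.log(0.5), "kinase": math.log(0.5)}
print(classify(ll, prior))  # globin
```

Working in log space avoids the underflow that multiplying many small probabilities would cause.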
HMMs for Motif Finding
- Given a set of sequences S = {X_1, ..., X_k}
- Design an HMM with two kinds of states:
  - Background states: for outside a motif
  - Motif states: for modeling a motif
- Train the HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S).
- The "motif part" of the HMM gives a motif model (e.g., a PWM).
- The HMM can be used to scan any sequence (including X_i) to figure out where the motif is.
- We may also decode each sequence X_i to obtain a set of subsequences matched by the motif (e.g., a multiset of k-mers).
HMMs for Multiple Alignment
- Given a set of sequences S = {X_1, ..., X_k}
- Train an HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S).
- Decode each sequence X_i.
- Assemble the Viterbi paths to form a multiple alignment: symbols belonging to the same state are aligned to each other.
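Decoding each X_i means running the Viterbi algorithm. A minimal sketch, using integer state indices rather than profile-HMM match/insert/delete states for brevity:

```python
def viterbi(pi, A, B, obs):
    """Viterbi algorithm: find the single most likely state path
    q* = argmax_q P(q, O | lambda) by dynamic programming."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]  # best score ending in each state
    back = []                                         # backpointers per time step
    for o in obs[1:]:
        prev = [max(range(N), key=lambda i: delta[i] * A[i][j]) for j in range(N)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(N)]
        back.append(prev)
    path = [max(range(N), key=lambda i: delta[i])]    # best final state
    for prev in reversed(back):                       # trace backpointers
        path.append(prev[path[-1]])
    return path[::-1]

# Toy model: each state deterministically emits its own symbol.
print(viterbi([0.5, 0.5],
              [[0.9, 0.1], [0.1, 0.9]],
              [[1.0, 0.0], [0.0, 1.0]],
              [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

Like the forward algorithm, a production version works in log space to avoid underflow.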
HMM-based Gene Finding
- Design two types of states: "within gene" states and "outside gene" states.
- Use known genes to estimate the HMM.
- Decode a new sequence to reveal which parts are genes.
- Example software: VEIL.

[VEIL architecture diagram: exit via a 5' splice site or one of the three stop codons (taa, tag, tga).]
Solutions to the Local Maxima Problem
- Repeat with different initializations.
- Start with the most reasonable initial model.
- Simulated annealing (slow down the convergence speed).
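The first remedy can be sketched generically; `train`, `score`, and `make_random_model` are placeholders (assumptions, not from the slides) standing in for a Baum-Welch trainer, a likelihood function, and a random initialiser:

```python
import random

def best_of_restarts(train, score, make_random_model, n_restarts=10, seed=0):
    """Mitigate local maxima by running training from several random
    initialisations and keeping the model with the highest score
    (e.g., log-likelihood on the training set)."""
    rng = random.Random(seed)
    best_model, best_score = None, float("-inf")
    for _ in range(n_restarts):
        model = train(make_random_model(rng))  # e.g., Baum-Welch to convergence
        s = score(model)                       # e.g., log P(O | model)
        if s > best_score:
            best_model, best_score = model, s
    return best_model, best_score
```

This does not guarantee finding the global maximum; it only raises the chance that at least one starting point lies in the global maximum's basin of attraction.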
Local Maxima: Illustration
[Figure: a likelihood surface with a global maximum and local maxima; a good starting point converges to the global maximum, a bad starting point to a local one.]
Optimal Model Construction

$p(HMM \mid X) = \dfrac{p(X \mid HMM)\, p(HMM)}{p(X)}$

$HMM^* = \arg\max_{HMM} p(HMM \mid X) = \arg\max_{HMM} p(X \mid HMM)\, p(HMM)$

Bayesian model selection:
- p(HMM) should prefer simpler models (i.e., more constrained, fewer states, fewer transitions).
- p(HMM) could reflect our prior on the parameters.
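One simple way to realise this preference is an explicit complexity penalty in log p(HMM). The linear penalty and the weight `alpha` below are illustrative assumptions, not from the slides:

```python
def model_selection_score(log_likelihood, n_states, n_transitions, alpha=1.0):
    """log p(X|HMM) + log p(HMM), with a prior that penalises
    complexity: more states / more transitions => lower prior.
    alpha controls how strongly simpler models are preferred."""
    log_prior = -alpha * (n_states + n_transitions)  # assumed penalty form
    return log_likelihood + log_prior

# A larger model must gain enough likelihood to beat a smaller one.
small = model_selection_score(-105.0, n_states=3, n_transitions=6)
large = model_selection_score(-100.0, n_states=6, n_transitions=20)
print(small > large)  # True: -114.0 vs -126.0
```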
Sequence Weighting
- Avoid over-counting similar sequences from the same organisms.
- Typically compute a weight for a sequence based on an evolutionary tree.
- Many ways to incorporate the weights, e.g.:
  - Unequal likelihood
  - Unequal weight contribution in parameter estimation
Toolkits for HMM
- Hidden Markov Model Toolkit (HTK): http://htk.eng.cam.ac.uk/
- Hidden Markov Model (HMM) Toolbox for Matlab: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
- Training HMM for ASR: http://cslu.cse.ogi.edu/tutordemos/nnet_training/tutorial.html#1.1_Setup
Reference: L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, pp. 4-16, Jan. 1986.