Lecture 8: Hidden Markov Models (HMMs)
Transcript
Page 1:

Lecture 8: Hidden Markov Models (HMMs)

Prepared by Michael Gutkin and Shlomi Haba

Originally presented at Yaakov Stein’s DSPCSP Seminar, spring 2002

Modified by Benny Chor, using also some slides of Nir Friedman (Hebrew Univ.), for the Computational Genomics Course, Tel-Aviv Univ., Dec. 2002

Page 2:


Outline

- Discrete Markov Models
- Hidden Markov Models
- Three major questions:
  Q1. Computing the probability of a given observation.
  A1. The Forward-Backward (Baum-Welch) DP algorithm.
  Q2. Computing the most probable sequence of states, given an observation.
  A2. The Viterbi DP algorithm.
  Q3. Given an observation, learning the best model.
  A3. Expectation Maximization (EM): a heuristic.

Page 3:


Markov Models

A discrete (finite) system:
- N distinct states.
- Begins (at time t=1) in some initial state.
- At each time step (t=1,2,…) the system moves from the current state to the next state (possibly the same as the current state) according to transition probabilities associated with the current state.

This kind of system is called a Discrete Markov Model.

Page 4:


Discrete Markov Model

Example: a Discrete Markov Model with 5 states.
- Each aij represents the probability of moving from state i to state j.
- The aij are given in a matrix A = {aij}.
- The probability of starting in a given state i is π_i; the vector π represents these start probabilities.
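To make the parameterization concrete, here is a small sketch in Python of a 5-state model and how it can be simulated. The numeric values of A and π below are invented for illustration; they are not taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition matrix A = {a_ij}: row i holds the probabilities of moving
# from state i to each state j. Values invented for illustration.
A = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.4, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.4, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.4, 0.2],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])
# Start probabilities pi_i: probability of beginning in state i.
pi = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# Each row of A, and pi itself, must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

def sample_states(A, pi, T):
    """Sample a state sequence of length T from the Markov model (A, pi)."""
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(len(pi), p=A[states[-1]]))
    return states

print(sample_states(A, pi, T=10))
```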

Page 5:


Types of Models

Ergodic model: strongly connected, i.e., there is a directed path with positive transition probabilities from each state i to each state j (but not necessarily a complete directed graph).

Page 6:


Types of Models (cont.)

Left-to-Right (LR) model: the index of the state is non-decreasing with time.

Page 7:

Discrete Markov Model – Example

States – Rainy: 1, Cloudy: 2, Sunny: 3

Matrix A – the 3×3 transition probability matrix A = {aij} (its numeric values appear only in the slide image).

Problem – given that the weather on day 1 (t=1) is sunny (3), what is the probability of the observation sequence O shown on the slide?

Page 8:


Discrete Markov Model – Example (cont.)

The answer is the product of the transition probabilities along the observed sequence: P(O | M, Q1 = 3) = a_{3,O2} · a_{O2,O3} · … · a_{O(T-1),OT} (the numeric sequence and result appear only in the slide image).
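A small sketch of this computation; the transition values and example sequence below are invented for illustration, since the slide’s actual numbers are not in the transcript:

```python
import numpy as np

# Transition matrix for Rainy(0), Cloudy(1), Sunny(2); values invented.
A = np.array([
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

def sequence_probability(A, states):
    """P(O | M, Q1 fixed): product of transition probabilities along the
    observed state sequence (the first state is given, not sampled)."""
    p = 1.0
    for i, j in zip(states, states[1:]):
        p *= A[i, j]
    return p

# Example: sunny, sunny, rainy, rainy, sunny, cloudy, sunny (0-indexed states).
O = [2, 2, 0, 0, 2, 1, 2]
print(sequence_probability(A, O))
```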

Page 9:

Hidden Markov Models (probabilistic finite state automata)

Often we face scenarios where states cannot be directly observed.

We need an extension: Hidden Markov Models.

[Figure: a 4-state HMM with self-transitions a11, a22, a33, a44, forward transitions a12, a23, a34, and output probabilities b11, b12, b13, b14 linking the states to the observed phenomenon.]

aij are the state transition probabilities.

bik are the observation (output) probabilities.

b11 + b12 + b13 + b14 = 1, b21 + b22 + b23 + b24 = 1, etc.
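A minimal sketch of such a model as data: a hypothetical 4-state, 4-symbol left-to-right HMM. The numbers are invented, and the row-sum checks encode exactly the constraints just stated.

```python
import numpy as np

N, K = 4, 4  # number of states, number of observation symbols

# State transition probabilities a_ij (invented left-to-right example).
A = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
# Observation (output) probabilities b_ik = P(symbol k | state i).
B = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
pi = np.array([1.0, 0.0, 0.0, 0.0])  # always start in state 1

# Every row is a probability distribution: b_i1 + ... + b_iK = 1, etc.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```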

Page 10:

Example: Dishonest Casino

Actually, what is hidden in this model? (In the standard dishonest-casino example, the casino switches between a fair die and a loaded die: we observe only the sequence of rolls, while the identity of the die in use at each roll, i.e. the state, is hidden.)

Page 11:

Biological Example: CpG Islands

- In the human genome, CpG dinucleotides are relatively rare.
- CpG pairs undergo a process called methylation that modifies the C nucleotide.
- A methylated C can (with relatively high probability) mutate to a T.
- Promoter regions are CpG rich.
- These regions are not methylated, and thus mutate less often. These are called CpG islands.

Page 12:


CpG Islands

We construct two Markov chains: one for CpG-rich regions and one for CpG-poor regions. Using observations from 60K nucleotides, we get two models, + and −.
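A sketch of how the two chains are typically used. The log-odds scoring below is the standard approach from the CpG-island literature and is an assumption here, not something stated on the slide; the transition values are invented, where real ones would be estimated from labeled training sequence.

```python
import numpy as np

NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# Invented transition matrices for the CpG-rich (+) and CpG-poor (-) chains.
A_plus = np.full((4, 4), 0.25)
A_plus[NUC['C'], NUC['G']] = 0.27   # C->G transitions more likely in + model
A_plus[NUC['C']] /= A_plus[NUC['C']].sum()

A_minus = np.full((4, 4), 0.25)
A_minus[NUC['C'], NUC['G']] = 0.08  # C->G transitions rare in - model
A_minus[NUC['C']] /= A_minus[NUC['C']].sum()

def log_odds(seq):
    """Sum of log(a+_{xy} / a-_{xy}) over consecutive nucleotide pairs;
    positive scores favor the CpG-island (+) model."""
    score = 0.0
    for x, y in zip(seq, seq[1:]):
        i, j = NUC[x], NUC[y]
        score += np.log(A_plus[i, j] / A_minus[i, j])
    return score

print(log_odds("ACGCGCGT"), log_odds("ATATTTGA"))
```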

Page 13:


HMMs – Question I

Given an observation sequence O = (O1 O2 O3 … OT) and a model M = {A, B, π}, how do we efficiently compute P(O|M), the probability that the given model M produces the observation O in a run of length T?

This probability can be viewed as a measure of the quality of the model M. Viewed this way, it enables discrimination/selection among alternative models.

Page 14:

HMM – Question II (Harder)

Given an observation sequence O = (O1 O2 O3 … OT) and a model M = {A, B, π}, how do we efficiently compute the most probable sequence(s) of states, Q?

That is, the sequence of states Q = (Q1 Q2 Q3 … QT) which maximizes P(O,Q|M), the probability that the given model M goes through the specific sequence of states Q and produces the given observation O.

Recall that given a model M, a sequence of observations O, and a sequence of states Q, we can efficiently compute P(O,Q|M) = P(O|Q,M) · P(Q|M) (watching out for numeric underflow).
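A sketch of that computation in log space, the usual guard against the underflow just mentioned. The function name is ours, and the model arrays are as in the earlier sketches:

```python
import numpy as np

def log_joint(A, B, pi, states, obs):
    """log P(O, Q | M) = log P(Q|M) + log P(O|Q,M), computed as a sum of
    logs so that long sequences do not underflow to 0."""
    logp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for t in range(1, len(obs)):
        logp += np.log(A[states[t - 1], states[t]])  # transition Q_{t-1} -> Q_t
        logp += np.log(B[states[t], obs[t]])         # emission of O_t from Q_t
    return logp
```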

Page 15:

HMM – Question III (Hardest)

Given an observation sequence O = (O1 O2 O3 … OT) and a class of models, each of the form M = {A, B, π}, which specific model “best” explains the observations?

A solution to Question I enables the efficient computation of P(O|M) (the probability that a specific model M produces the observation O).

Question III can be viewed as a learning problem: we want to use the sequence of observations in order to “train” an HMM and learn the optimal underlying model parameters (transition and output probabilities).

Page 16:

HMM Recognition (Question I)

For a given model M = {A, B, π} and a given state sequence Q1 Q2 Q3 … QT, the probability of an observation sequence O1 O2 O3 … OT is

P(O|Q,M) = b_{Q1O1} · b_{Q2O2} · b_{Q3O3} · … · b_{QTOT}

For a given hidden Markov model M = {A, B, π}, the probability of the state sequence Q1 Q2 Q3 … QT is (the initial probability of Q1 is taken to be π_{Q1})

P(Q|M) = π_{Q1} · a_{Q1Q2} · a_{Q2Q3} · a_{Q3Q4} · … · a_{QT-1QT}

So, for a given hidden Markov model M, the probability of an observation sequence O1 O2 O3 … OT is obtained by summing over all possible state sequences.

Page 17:


HMM – Recognition (cont.)

P(O|M) = Σ_Q P(O|Q,M) · P(Q|M)
       = Σ_Q π_{Q1} · b_{Q1O1} · a_{Q1Q2} · b_{Q2O2} · a_{Q2Q3} · b_{Q3O3} · …

This requires summing over exponentially many paths, but it can be made more efficient.
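To make the definition concrete, here is the naive summation written out directly. This is a sketch (the function name and structure are ours, with model arrays as in the earlier sketches), feasible only for tiny N and T:

```python
import itertools
import numpy as np

def prob_obs_brute_force(A, B, pi, obs):
    """P(O|M) by explicit summation over all N**T state sequences.
    Exponential in T; shown only to make the definition concrete."""
    N, T = len(pi), len(obs)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0], obs[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], obs[t]]
        total += p
    return total
```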

Page 18:


HMM – Recognition (cont.)

Why isn’t this efficient? It costs O(2T · Q^T):

For a given state sequence of length T we have about 2T multiplications:

P(Q|M) = π_{Q1} · a_{Q1Q2} · a_{Q2Q3} · a_{Q3Q4} · … · a_{QT-1QT}

P(O|Q) = b_{Q1O1} · b_{Q2O2} · b_{Q3O3} · … · b_{QTOT}

There are Q^T possible state sequences. So, if Q = 5 and T = 100, the algorithm requires about 2 · 100 · 5^100 ≈ 1.6 · 10^72 computations.

Instead, we can use the forward-backward (F-B) algorithm.

Page 19:


The F-B Algorithm

Some definitions:
1. Legal final state – a state at which a path through the model may end.
2. α_t(i) – a “forward-going” probability: the probability of emitting O1 O2 … Ot and being in state i at time t.
3. β_t(i) – a “backward-going” probability: the probability of emitting Ot+1 … OT, given state i at time t.
4. a(j|i) = aij ; b(O|i) = biO
5. O_1^t – the observations O1 O2 … Ot at times 1, 2, …, t (O1 at t=1, O2 at t=2, etc.)

Page 20:


The F-B Algorithm (cont.)

α_t(i) can be recursively calculated.

Stopping condition (base case): α_1(i) = π_i · b(O1|i).

Moving from state i to state j contributes α_t(i) · a(j|i) · b(Ot+1|j). But we can enter state j from all other states, so

α_{t+1}(j) = [ Σ_i α_t(i) · a(j|i) ] · b(Ot+1|j)

Page 21:


The F-B Algorithm (cont.)

Now we can work sequentially: compute α_1, then α_2, and so on up to α_T.

And at time t=T we get what we wanted:

P(O|M) = Σ_i α_T(i), summing over the legal final states i.

Page 22:


The F-B Algorithm (cont.)

The full algorithm – Run Demo
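The demo itself is not in the transcript; in its place, here is a minimal sketch of the full forward pass just described, under the assumption that every state is a legal final state:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass. alpha[t, i] (0-indexed t) is the slide's alpha_{t+1}(i):
    the probability of emitting O_1..O_{t+1} and being in state i then.
    Returns P(O|M), summing over all states at the last step (i.e. assuming
    every state is a legal final state)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # base case: pi_i * b(O1|i)
    for t in range(1, T):
        # enter state j from all states i, then emit O_{t+1}
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()
```

Each step costs O(N²), so the whole computation is O(T·N²) rather than exponential in T.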

Page 23:


The F-B Algorithm (cont.)

The likelihood measured this way uses every sequence of states of length T; this is known as the “Any Path” method.

Alternatively, we can choose an HMM by the probability generated using the single best possible sequence of states; we’ll refer to this as the “Best Path” method.

Page 24:


Most Probable State Sequence (Question II)

Idea: if we know the value of Qi, then the most probable sequence of states at times i+1, …, T does not depend on observations before time i.

Let V_l(i) be the probability of the best sequence Q1, …, Qi such that Qi = l.
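This observation gives the standard Viterbi recursion, stated here in the slide’s notation (the recursion is not spelled out in the transcript, so this is a reconstruction):

```latex
V_l(1) = \pi_l \, b(O_1 \mid l), \qquad
V_l(i+1) = b(O_{i+1} \mid l) \cdot \max_{k} \left[ V_k(i) \, a(l \mid k) \right]
```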

Page 25:


Viterbi Algorithm

A DP problem.

Grid:
- X – frame index t (time)
- Q – state index i

Constraints:
- Every path must advance in time by one, and only one, time step for each path segment.
- Final grid points on any path must be of the form (T, i_f), where i_f is a legal final state in the model.

Page 26:


Viterbi Algorithm (cont.)

Cost:
- Node (t, i) – the probability of emitting the observation y(t) in state i: b_{i,y(t)}
- Transition from (t-1, i) to (t, j) – the probability of changing state from i to j: a_{ij}
- The total cost associated with a path is given by the product of its costs (type B).
- Initial transition cost: a_{0i} = π_i

Goal: the best path is the one of maximum cost.

Page 27:


Viterbi Algorithm (cont.)

We can use the trick of taking negative logarithms: multiplications of probabilities are expensive and numerically problematic, while sums of numerically stable numbers are simpler. The problem is thus turned into a minimal-cost path search.
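A minimal sketch of Viterbi in negative-log space, as the slide suggests (assuming every state is a legal final state; the function name and array layout are ours):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable state sequence via minimal-cost path search over
    negative log probabilities (products become sums)."""
    T, N = len(obs), len(pi)
    with np.errstate(divide="ignore"):           # log(0) -> -inf is fine here
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    cost = np.zeros((T, N))                      # cost[t, i]: best -log prob ending in i
    back = np.zeros((T, N), dtype=int)           # backpointers
    cost[0] = -(logpi + logB[:, obs[0]])
    for t in range(1, T):
        # cand[i, j]: cost of being in i at t-1 and moving to j
        cand = cost[t - 1][:, None] - logA
        back[t] = cand.argmin(axis=0)
        cost[t] = cand.min(axis=0) - logB[:, obs[t]]
    # best final state, then trace the backpointers
    q = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        q.append(int(back[t, q[-1]]))
    return q[::-1]
```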

Page 28:


Viterbi Algorithm (cont.)

Run Demo

Page 29:


HMM – EM Training

Training uses the Baum-Welch algorithm, which is an EM algorithm:
- Estimate – approximate the result.
- Maximize – and, if needed, re-estimate.

The estimation step is based on the DP algorithms above (F-B and Viterbi).

Page 30:


HMM – EM Training (cont.)

Initialize: begin with an arbitrary model M.

Estimate: evaluate the likelihood P(O|M); along the way, keep track of some tallies, and recalculate the matrices A and B, e.g.

aij = (number of transitions from i to j) / (number of transitions exiting state i)

Maximize: if the re-estimated model M' satisfies P(O|M') − P(O|M) ≥ ε, re-estimate again with M = M' (see the sketch below for one re-estimation pass).

Use several initial models to find a favorable local maximum of P(O|M).
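A hedged sketch of one Baum-Welch re-estimation pass, in a standard single-sequence formulation that assumes every state may be final; the function name and layout are ours, not a transcription of the original demo:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) re-estimation of (A, B, pi) from one observation
    sequence, using forward/backward tallies. No rescaling is done here, so
    this sketch will underflow on long sequences."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)

    # Forward and backward passes.
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                 # P(O|M)

    gamma = alpha * beta / likelihood            # gamma[t, i] = P(Q_t = i | O, M)
    # xi[t, i, j] = P(Q_t = i, Q_{t+1} = j | O, M)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

    new_pi = gamma[0]
    # a_ij = (expected transitions i -> j) / (expected transitions exiting i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_ik = (expected emissions of symbol k in state i) / (expected visits to i)
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, likelihood
```

Iterating this step until P(O|M') − P(O|M) falls below ε implements the loop above; a practical version rescales alpha and beta (or works in log space) to avoid the underflow noted earlier.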

Page 31:


HMM – Training (cont.)

Why a local maximum? EM hill-climbs the likelihood: each re-estimation step is guaranteed not to decrease P(O|M), so the iteration converges to a local, not necessarily global, maximum. This is why several initial models are tried.

Page 32:


Auxiliary

Physiology Model

Page 33:


Auxiliary cont.

Articulation

Page 34:


Auxiliary cont.

Spectrogram

Peterson-Barney Diagram

Mapping by the formants