
Stats 318: Lecture # 18 - Stanford University

Jan 19, 2022

Page 1: Stats 318: Lecture # 18 - Stanford University

Stats 318: Lecture # 18

Agenda: Hidden Markov Models

I Hidden Markov models

I Forward algorithm for state estimation (filtering)

I Forward-backward algorithm for state estimation (smoothing)

I Viterbi algorithm for most likely explanation

I EM algorithm (Baum Welch) for parameter estimation

Page 2

Hidden Markov model

I {Ht} & {Yt} discrete time stochastic processes

I {Ht} Markov chain and not directly observable ("hidden")

I {Yt} directly observable

P(Yt | H1:T ) = P(Yt | H1:t) = P(Yt | Ht)

I Terminology : Transition probabilities P(Ht+1 | Ht)

Emission probabilities P(Yt | Ht)

Homogeneous if the above probabilities are time independent (assumed henceforth)
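A homogeneous HMM with these ingredients can be simulated directly; a minimal sketch in Python, where the transition and emission numbers are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy homogeneous HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # distribution of H_1
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # transition probabilities A[i, j] = P(H_{t+1}=j | H_t=i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # emission probabilities B[i, y] = P(Y_t=y | H_t=i)

def sample_hmm(pi, A, B, T):
    """Draw (h_{1:T}, y_{1:T}); only y would be observed in practice."""
    h = np.zeros(T, dtype=int)
    y = np.zeros(T, dtype=int)
    h[0] = rng.choice(len(pi), p=pi)
    y[0] = rng.choice(B.shape[1], p=B[h[0]])
    for t in range(1, T):
        h[t] = rng.choice(A.shape[1], p=A[h[t-1]])   # hidden Markov step
        y[t] = rng.choice(B.shape[1], p=B[h[t]])     # emission given h_t
    return h, y

h, y = sample_hmm(pi, A, B, 10)
```

The same toy numbers are convenient for checking the inference algorithms later in the lecture.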

Page 3

Examples

I Speech recognition

I Finance forecasting

I DNA motif discovery

I . . .

Page 4

Speech recognition

Page 5

Copy number variations

How many duplications do we have? Do we have deletions?

Page 6

Inference problems

1. What is the prob/likelihood of an observed sequence Y1:T ?

2. What is the prob/likelihood of the latent variable given Y ?

3. What is the most likely value of a latent variable? ('decoding problem')

4. Given one or several observed sequences, how would we estimate model parameters?

i.e. transition & emission probabilities (and distribution of initial latent variable)

Page 7

Probability of an observed sequence

P(y_{1:T}) = Σ_{h_{1:T}} P(y_{1:T}, h_{1:T}) = Σ_{h_{1:T}} P(h_1) ∏_{t=2}^{T} P(h_t | h_{t−1}) ∏_{t=1}^{T} P(y_t | h_t)   (∗)

(∗) = Σ_{h_T} P(y_T | h_T) Σ_{h_{1:T−1}} P(h_T | h_{T−1}) P(h_1) ∏_{t=2}^{T−1} P(h_t | h_{t−1}) ∏_{t=1}^{T−1} P(y_t | h_t)

Suggests dynamic programming solution
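The naive sum above can be sketched directly, enumerating all n^T hidden sequences; the toy HMM numbers below are made up for illustration:

```python
import itertools
import numpy as np

# Toy HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # P(h_1)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # A[i, j] = P(h_{t+1}=j | h_t=i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # B[i, y] = P(y_t=y | h_t=i)
y  = [0, 1, 0]                  # observed sequence y_{1:T}

def brute_force_likelihood(pi, A, B, y):
    """Sum P(y_{1:T}, h_{1:T}) over all n^T hidden sequences -- O(T n^T)."""
    n, T = len(pi), len(y)
    total = 0.0
    for h in itertools.product(range(n), repeat=T):
        p = pi[h[0]] * B[h[0], y[0]]             # P(h_1) P(y_1 | h_1)
        for t in range(1, T):
            p *= A[h[t-1], h[t]] * B[h[t], y[t]]
        total += p
    return total

print(brute_force_likelihood(pi, A, B, y))       # ≈ 0.10893
```

The exponential cost in T is exactly what the dynamic-programming rearrangement avoids.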

Page 8

Forward algorithm

I Forward probabilities : α_t(h_t) = P(y_{1:t}, h_t)   [y_{1:T} is given throughout]

I Recursion: α_1(h_1) = P(y_1, h_1) = P(h_1) P(y_1 | h_1), and

α_{t+1}(h_{t+1}) = Σ_{h_t} P(y_{t+1} | h_{t+1}, h_t, y_{1:t}) P(h_{t+1} | h_t, y_{1:t}) α_t(h_t)

= P(y_{t+1} | h_{t+1}) Σ_{h_t} P(h_{t+1} | h_t) α_t(h_t)

One matrix-vector product per time step!

I Likelihood of an observed sequence is: P(y_{1:T}) = Σ_h α_T(h)
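The recursion above is one matrix-vector product per step and can be sketched as follows; the toy HMM numbers are made up for illustration:

```python
import numpy as np

# Toy HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # P(h_1)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # A[i, j] = P(h_{t+1}=j | h_t=i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # B[i, y] = P(y_t=y | h_t=i)
y  = [0, 1, 0]

def forward(pi, A, B, y):
    """alpha[t, h] = P(y_{1:t+1}, h_{t+1}=h), with 0-based t."""
    T, n = len(y), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, y[0]]                 # alpha_1(h) = P(h_1) P(y_1 | h_1)
    for t in range(1, T):
        # one matrix-vector product per time step
        alpha[t] = B[:, y[t]] * (alpha[t-1] @ A)
    return alpha

alpha = forward(pi, A, B, y)
likelihood = alpha[-1].sum()    # P(y_{1:T}) = sum_h alpha_T(h)
filtered = alpha[-1] / likelihood              # filtering: P(h_T | y_{1:T})
```

On this tiny example the brute-force sum over all hidden sequences gives the same likelihood, at a fraction of the cost for long sequences.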

Page 9

Probability of latent variables

Interested in conditional distribution of latent variables:

(a) Filtering : P(ht | y1:t) [forward algorithm]

(b) Smoothing (hindsight) : P(ht | y1:T ) [forward-backward algorithm]

(c) Most likely explanation : arg max_{h_{1:T}} P(h_{1:T} | y_{1:T}) [Viterbi algorithm]

Page 10

Filtering : P(ht | y1:t)

Forward probabilities αt(ht) = P(ht, y1:t)

P(h_t | y_{1:t}) = α_t(h_t) / Σ_h α_t(h)

Page 11

Smoothing : P(ht | y1:T )

Conditional independence of y_{t+1:T} and y_{1:t} given h_t, together with Bayes' rule, gives

P(h_t | y_{1:T}) = P(h_t | y_{1:t}, y_{t+1:T}) ∝ P(y_{t+1:T} | h_t) · P(h_t | y_{1:t})
                                                 [backward prob]    [∼ forward prob]

Backward probabilities :

β_T(h_T) = 1,   β_t(h_t) = P(y_{t+1:T} | h_t)

Recursion

β_t(h_t) = Σ_{h_{t+1}} P(y_{t+1:T}, h_{t+1} | h_t) = Σ_{h_{t+1}} P(h_{t+1} | h_t) P(y_{t+1} | h_{t+1}) β_{t+1}(h_{t+1})

One matrix-vector product per time step!

Page 12

Forward-backward algorithm

I Compute forward probabilities : α_t(h_t) = P(y_{1:t}, h_t)

I Compute backward probabilities : β_t(h_t) = P(y_{t+1:T} | h_t)

P(h_t | y_{1:T}) = P(y_{1:t}, h_t) P(y_{t+1:T} | h_t) / P(y_{1:T}) = α_t(h_t) β_t(h_t) / Σ_h α_T(h) =: γ_t(h_t)

Complexity of the algorithm is O(Tn²), where n is the number of hidden states
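Putting the two passes together, a minimal forward-backward sketch (the toy HMM numbers are made up for illustration):

```python
import numpy as np

# Toy HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # P(h_1)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # A[i, j] = P(h_{t+1}=j | h_t=i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # B[i, y] = P(y_t=y | h_t=i)
y  = [0, 1, 0]

def forward_backward(pi, A, B, y):
    """Return gamma[t, h] = P(h_{t+1}=h | y_{1:T}), with 0-based t."""
    T, n = len(y), len(pi)
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))                             # beta_T(h) = 1
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (alpha[t-1] @ A)       # forward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t+1]] * beta[t+1])       # backward pass
    return alpha * beta / alpha[-1].sum()              # gamma_t = alpha_t beta_t / P(y)

gamma = forward_backward(pi, A, B, y)
```

Each row of `gamma` sums to one, and the last row coincides with the filtered distribution, since smoothing and filtering agree at t = T.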

Page 13

Most likely explanation : arg max_{h_{1:T}} P(h_{1:T} | y_{1:T})

Value function : V_1(j) = P(y_1, h_1 = j) = P(y_1 | h_1 = j) P(h_1 = j)

V_t(j) = max_{h_{1:t−1}} P(h_{1:t−1}, y_{1:t}, h_t = j)

Recursion

V_{t+1}(j) = max_{h_{1:t}} P(h_{1:t−1}, y_{1:t}, h_t, y_{t+1}, h_{t+1} = j)

= max_{h_{1:t−1}} max_i P(h_{1:t−1}, y_{1:t}, h_t = i) P(h_{t+1} = j | h_t = i) P(y_{t+1} | h_{t+1} = j)

= max_i V_t(i) P(h_{t+1} = j | h_t = i) P(y_{t+1} | h_{t+1} = j)   (∗)

Keeping track of the optimal state : ψ_t(j) = arg max_i in (∗),   ψ_T(·) = arg max_j V_T(j)

Page 14

Viterbi algorithm

I Compute V_t for t = 1, . . . , T

I Record ψ_t along the way

I Backtrack for the most likely sequence :

H_T = ψ_T(·)

For t = T − 1, T − 2, . . . , 1 : H_t = ψ_t(H_{t+1})
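The recursion and backtrack can be sketched as follows; the toy HMM numbers are made up for illustration:

```python
import numpy as np

# Toy HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # P(h_1)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # A[i, j] = P(h_{t+1}=j | h_t=i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # B[i, y] = P(y_t=y | h_t=i)
y  = [0, 1, 0]

def viterbi(pi, A, B, y):
    """Most likely hidden sequence and its joint probability with y."""
    T, n = len(y), len(pi)
    V = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)   # psi[t, j]: best predecessor of state j
    V[0] = pi * B[:, y[0]]              # V_1(j) = P(y_1 | h_1=j) P(h_1=j)
    for t in range(1, T):
        scores = V[t-1][:, None] * A    # scores[i, j] = V_t(i) P(h_{t+1}=j | h_t=i)
        psi[t] = scores.argmax(axis=0)
        V[t] = B[:, y[t]] * scores.max(axis=0)
    h = np.zeros(T, dtype=int)          # backtrack
    h[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        h[t] = psi[t+1, h[t+1]]
    return h, V[-1].max()

path, p = viterbi(pi, A, B, y)          # path = [0, 1, 0], p = 0.046656
```

Unlike smoothing, which maximizes each marginal P(h_t | y_{1:T}) separately, Viterbi maximizes the joint posterior over the whole sequence.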

Page 15

Parameter estimation (learning)

Given observed sequence(s) Y1:T , how can we estimate the model parameters?

Parameters (collectively denoted by θ; hidden states now written X_t)

I π : distribution of X_1

I T (x, x′) = P(X_{t+1} = x′ | X_t = x) transition probabilities

I E(x, y) = P(Y_t = y | X_t = x) emission probabilities

Solution via maximum likelihood & Baum-Welch (EM) algorithm

Page 16

Baum-Welch algorithm

I Complete likelihood

P(X_{1:T}, Y_{1:T}) = P(X_1) ∏_{t=2}^{T} P(X_t | X_{t−1}) ∏_{t=1}^{T} P(Y_t | X_t)

I Parameter estimation from complete likelihood is easy (why?)

I EM iteration from current parameter value θ_k:

E-step: Compute Q(θ || θ_k) = E_X[log P_θ(X, Y) | Y, θ_k]

M-step: θ_{k+1} = arg max_θ Q(θ || θ_k)

Page 17

E-step

I γ_t(x) = P(X_t = x | y_{1:T}, θ_k) [forward-backward algorithm]

I ξ_t(x, x′) = P(X_t = x, X_{t+1} = x′ | y_{1:T}, θ_k) ∝ α_t(x) T^{(k)}(x, x′) E^{(k)}(x′, y_{t+1}) β_{t+1}(x′)

The relation above is not hard to show... The probabilities use the current guesses of the model parameters:

T^{(k)}(x, x′) = P(X_{t+1} = x′ | X_t = x, θ_k)   E^{(k)}(x, y) = P(Y_t = y | X_t = x, θ_k)

I Log-likelihood:

log P(X, Y) = log π(X_1) + Σ_{t=1}^{T−1} log T (X_t, X_{t+1}) + Σ_{t=1}^{T} log E(X_t, y_t)

E-step

Q(θ || θ_k) = Σ_x γ_1(x) log π(x) + Σ_{t=1}^{T−1} Σ_{x,x′} ξ_t(x, x′) log T (x, x′) + Σ_{t=1}^{T} Σ_x γ_t(x) log E(x, y_t)

Page 18

M-step

Q(θ || θ_k) = Σ_x γ_1(x) log π(x) + Σ_{x,x′} Σ_{t=1}^{T−1} ξ_t(x, x′) log T (x, x′) + Σ_x Σ_{t=1}^{T} γ_t(x) log E(x, y_t)

M-step: update parameters π(·), T (·, ·), E(·, ·)

I π_+(x) = γ_1(x)

I T_+(x, x′) = Σ_{t=1}^{T−1} ξ_t(x, x′) / Σ_z Σ_{t=1}^{T−1} ξ_t(x, z) = Σ_{t=1}^{T−1} ξ_t(x, x′) / Σ_{t=1}^{T−1} γ_t(x)

I E_+(x, v) = Σ_{t=1}^{T} 1{y_t = v} γ_t(x) / Σ_{t=1}^{T} γ_t(x)
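One full EM iteration, combining the E-step quantities γ_t, ξ_t with the updates above, might be sketched as follows (the toy HMM numbers are made up for illustration):

```python
import numpy as np

# Toy HMM (illustrative numbers): 2 hidden states, 2 output symbols
pi = np.array([0.6, 0.4])       # current pi(x)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # current T(x, x')
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])     # current E(x, y)
y  = np.array([0, 1, 0, 0, 1])  # observed sequence

def likelihood(pi, A, B, y):
    """P(y_{1:T}) via the forward algorithm."""
    alpha = pi * B[:, y[0]]
    for t in range(1, len(y)):
        alpha = B[:, y[t]] * (alpha @ A)
    return alpha.sum()

def baum_welch_step(pi, A, B, y):
    """E-step (gamma, xi via forward-backward) followed by the M-step updates."""
    T, n = len(y), len(pi)
    alpha = np.zeros((T, n)); beta = np.ones((T, n))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (alpha[t-1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t+1]] * beta[t+1])
    Py = alpha[-1].sum()
    gamma = alpha * beta / Py                          # gamma_t(x)
    # xi_t(x, x') = alpha_t(x) T(x, x') E(x', y_{t+1}) beta_{t+1}(x') / P(y)
    xi = alpha[:-1, :, None] * A[None] * (B[:, y[1:]].T * beta[1:])[:, None, :] / Py
    pi_new = gamma[0]                                  # pi_+(x) = gamma_1(x)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[y == v].sum(axis=0) for v in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

pi2, A2, B2 = baum_welch_step(pi, A, B, y)
```

As EM guarantees, the observed-data likelihood does not decrease after the update.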

Page 19

That’s all folks! Enjoy the summer!