ECE8813 Statistical Natural Language Processing
Lectures 12-13: Hidden Markov Models

Chin-Hui Lee
School of ECE, Georgia Tech
Center of Signal and Image Processing, Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]

ECE8813, Spring 2009

Markov Assumptions

• Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sN}, the state space. If X possesses the following properties, then X is a Markov chain:

1. Limited Horizon: P(Xt+1 = sk | X1, ..., Xt) = P(Xt+1 = sk | Xt), i.e., a word's state depends only on the previous state

2. Time Invariant (Stationary): P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time

Definition of a Markov Chain

• A is a stochastic N x N matrix of transition probabilities with:
  a_ij = Pr{transition from s_i to s_j} = Pr(X_{t+1} = s_j | X_t = s_i)
  ∀i,j: a_ij ≥ 0, and ∀i: Σ_{j=1}^{N} a_ij = 1

• Π is a vector of N elements representing the initial state probability distribution with:
  π_i = Pr{the initial state is s_i} = Pr(X_1 = s_i), ∀i, and Σ_{i=1}^{N} π_i = 1
  – Can avoid this separate vector by creating a special start state s_0
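As a concrete companion to these definitions (not part of the original slides), here is a minimal NumPy sketch of a Markov chain: it checks the stochastic constraints on A and Π and samples a state sequence. The transition values and the helper name sample_chain are invented for illustration.

```python
import numpy as np

# Illustrative 3-state Markov chain (states s1, s2, s3); the numbers are made up.
pi = np.array([1.0, 0.0, 0.0])              # initial state distribution, sums to 1
A = np.array([[0.6, 0.3, 0.1],              # a_ij = Pr(X_{t+1} = s_j | X_t = s_i)
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Stochastic constraints from the slide: rows of A and pi each sum to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

def sample_chain(pi, A, T, rng=np.random.default_rng(0)):
    """Draw a length-T state sequence X_1..X_T from the Markov chain (pi, A)."""
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(len(pi), p=A[states[-1]]))
    return states

print(sample_chain(pi, A, 10))
```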


Markov Chain: An Example

Markov Process & Language Models

• Bayes formula (chain rule):
  P(W) = P(w1,w2,...,wT) = Π_{i=1..T} p(wi | w1, w2, ..., w_{i-n+1}, ..., w_{i-1})

• n-gram language models:
  Markov process (chain) of order n-1:
  P(W) = P(w1,w2,...,wT) = Π_{i=1..T} p(wi | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})

• Using just one distribution (e.g., trigram model: p(wi | w_{i-2}, w_{i-1})):
  Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
  Words:     My car broke down , and within hours Bob 's car broke down ,  too .
  p(, | broke down) = p(w5 | w3, w4) = p(w14 | w12, w13)
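A small illustrative sketch of the trigram case, assuming plain maximum-likelihood counts on the example sentence above; the helper name p and the whitespace tokenization are my own simplifications, not something prescribed by the slides.

```python
from collections import Counter

words = "My car broke down , and within hours Bob 's car broke down , too .".split()

tri = Counter(zip(words, words[1:], words[2:]))   # counts of (w_{i-2}, w_{i-1}, w_i)
bi = Counter(zip(words, words[1:]))               # counts of (w_{i-2}, w_{i-1})

def p(w, u, v):
    """MLE trigram probability p(w | u, v) = c(u, v, w) / c(u, v)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(p(",", "broke", "down"))   # both occurrences of "broke down" are followed by "," -> 1.0
```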

Another Example with a Markov Chain

[Figure: state-transition diagram over the states t, e, h, a, p, i with a Start state; transition probabilities in the diagram include 0.3, 0.4, 0.6, and 1.0.]

Probability of a Sequence of States

P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12

In general, for a state sequence (X1, ..., XT):

  P(X1, X2, ..., XT) = P(X1) P(X2 | X1) P(X3 | X1, X2) ... P(XT | X1, ..., X_{T-1})
                     = P(X1) P(X2 | X1) P(X3 | X2) ... P(XT | X_{T-1})
                     = π_{X1} Π_{t=1}^{T-1} a_{X_t X_{t+1}}
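The product form above is easy to check in code. The sketch below is illustrative only: the 4x4 transition matrix is invented so that state indices 0..3 (standing in for t, a, p, p) reproduce the slide's arithmetic; it is not the chain from the missing figure.

```python
import numpy as np

def sequence_probability(states, pi, A):
    """P(X1..XT) = pi[X1] * prod_t A[X_t, X_{t+1}] for a fully observed state sequence."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

# Toy numbers chosen only to reproduce 1.0 * 0.3 * 0.4 * 1.0 = 0.12.
pi = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[0.0, 0.3, 0.7, 0.0],
              [0.0, 0.0, 0.4, 0.6],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(sequence_probability([0, 1, 2, 2], pi, A))   # 0.12 (up to float rounding)
```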

Hidden Markov Models (HMMs)

• Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs in each state with some probability distribution

• An HMM is an appropriate model for cases when you don't know the state sequence that the model passes through, but only some probabilistic function of it. For example:
  – Word recognition (discrete utterance or within continuous speech)
  – Phoneme recognition
  – Part-of-speech tagging
  – Linear interpolation

What is an HMM?

• Green circles are hidden states; each state depends only on the previous state
• Purple circles are observed events; each observation depends only on its emitting state

HMM: An Occasionally Dishonest Casino

• Assume a casino occasionally switches to a biased die to increase its winning odds
• Can we model it with an HMM?
• How do we prove it cheats?
• Can we estimate the HMM?
• How many samples are needed?
• Which die was used at what time?

  Fair die:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Biased die:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

  Transitions (from the figure): Fair -> Fair 0.95, Fair -> Biased 0.05, Biased -> Biased 0.9, Biased -> Fair 0.1
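A hedged simulation sketch of the casino HMM described above, using the fair/biased emission tables from the slide; the initial distribution pi = [0.5, 0.5] and the random seed are assumptions of mine, since the slide does not give a start distribution.

```python
import numpy as np

# Occasionally-dishonest-casino HMM (state-emission form).
states = ["Fair", "Biased"]
A = np.array([[0.95, 0.05],                      # Fair -> Fair / Biased
              [0.10, 0.90]])                     # Biased -> Fair / Biased
B = np.array([[1/6] * 6,                         # fair die: uniform over faces 1..6
              [1/10] * 5 + [1/2]])               # biased die: a 6 comes up half the time
pi = np.array([0.5, 0.5])                        # assumed start distribution (not on the slide)

def roll(T, rng=np.random.default_rng(1)):
    """Generate T (hidden state, observed face) pairs from the casino HMM."""
    s = rng.choice(2, p=pi)
    out = []
    for _ in range(T):
        face = rng.choice(6, p=B[s]) + 1         # faces 1..6
        out.append((states[s], face))
        s = rng.choice(2, p=A[s])
    return out

print(roll(10))
```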

Estimation: More vs. Less Data

[Figure: casino HMM parameters estimated from 300 rolls vs. 30000 rolls. With 300 rolls the estimated emission probabilities (e.g., 1: 0.19, 2: 0.19, 3: 0.23, 4: 0.08, 5: 0.23, 6: 0.08 for one state and 1: 0.07, 2: 0.10, 3: 0.10, 4: 0.17, 5: 0.05, 6: 0.52 for the other) and transition probabilities (0.27/0.73 and 0.29/0.71) are far from the true model. With 30000 rolls the estimates (near-uniform faces in the "fair" state, 6: 0.48 in the "biased" state, self-loop probabilities 0.93 and 0.88) are much closer to the true values.]

Observations and Hidden States

• An HMM is also called a probabilistic function of a Markov chain
  – State transitions follow a Markov chain
  – In each state, it generates observation symbols according to a probability function; each state has its own probability function
  – An HMM is a doubly embedded stochastic process

• In an HMM,
  – the state is not directly observable (hidden states)
  – we only observe the observation symbols generated from the states

  Example:
  S = ω1, ω4, ω2, ω2, ω1, ω4   (hidden)
  O = v4, v1, v1, v4, v2, v3   (observed)

An HMM Example: Urn & Ball

  Urn 1:   Pr(RED) = b1(1),    Pr(BLUE) = b1(2),    Pr(GREEN) = b1(3)
  Urn 2:   Pr(RED) = b2(1),    Pr(BLUE) = b2(2),    Pr(GREEN) = b2(3)
  ...
  Urn N-1: Pr(RED) = bN-1(1),  Pr(BLUE) = bN-1(2),  Pr(GREEN) = bN-1(3)
  Urn N:   Pr(RED) = bN(1),    Pr(BLUE) = bN(2),    Pr(GREEN) = bN(3)

  Observation: O = {GREEN, GREEN, BLUE, RED, RED, ..., BLUE}

Elements of an HMM

• HMM (the most general case): a five-tuple (S, K, Π, A, B), where:
  – S = {s1, s2, ..., sN} is the set of states
  – K = {k1, k2, ..., kM} is the output alphabet
  – Π = {πi}, i ∈ S, the initial state probabilities
  – A = {aij}, i, j ∈ S, the state transition probabilities
  – B = {bijk}, i, j ∈ S, k ∈ K (arc emission), or B = {bik}, i ∈ S, k ∈ K (state emission)

• State sequence: X = (x1, x2, ..., xT+1), xt : S → {1, 2, ..., N}
• Observation sequence: O = (o1, ..., oT), ot ∈ K

HMM as a Generating Model

• Given an HMM, denoted as Λ = {A, B, π}, and an observation sequence O = {O1, O2, ..., OT}
• The HMM can be viewed as a generator that produces O as follows:
  1. Choose an initial state q1 = Si according to the initial probability distribution π
  2. Set t = 1
  3. Choose an observation Ot according to the symbol observation probability distribution in state Si, i.e., bi(k)
  4. Transit to a new state qt+1 = Sj according to the state transition probability distribution, i.e., aij
  5. Set t = t+1; return to step 3 if t ≤ T
  6. Otherwise, terminate the procedure
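The generating procedure can be written directly as code. This sketch assumes the state-emission form b_i(k) described in the steps above; the casino numbers reused at the bottom are just an example model, and the function name generate is my own.

```python
import numpy as np

def generate(pi, A, B, T, rng=np.random.default_rng(0)):
    """Run the generating procedure above for a state-emission HMM.

    pi: (N,) initial distribution, A: (N, N) transitions, B: (N, M) emission table b_i(k).
    Returns the hidden state sequence and the observation sequence as index lists.
    """
    q = rng.choice(len(pi), p=pi)                    # step 1: initial state
    states, obs = [], []
    for _ in range(T):                               # steps 3-5, repeated T times
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))   # emit o_t ~ b_q(.)
        q = rng.choice(len(pi), p=A[q])              # transit q_{t+1} ~ a_q.
    return states, obs

# Example with the casino parameters used earlier in these notes.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
print(generate(pi, A, B, 12))
```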

Assumptions in HMM

• Markov Assumption:
  – State transitions follow a 1st-order Markov chain
  – This assumption implies the duration in each state follows a geometric distribution:
    p_j(d) = (a_jj)^{d-1} (1 - a_jj)

• Output Independence Assumption: the probability that a particular observation symbol is emitted from the HMM at time t depends only on the current state st and is conditionally independent of the past and future observations

• The two assumptions limit the memory of an HMM and may lead to model deficiency. But they significantly simplify HMM computation and greatly reduce the number of free parameters to be estimated in practice
  – Some research in the literature relaxes these assumptions to enhance HMMs in modeling speech signals

Types of HMMs (I)

• Different transition matrices:

  – Ergodic HMM topology: every state can reach every other state, e.g., for N = 3

        A = | a11 a12 a13 |
            | a21 a22 a23 |
            | a31 a32 a33 |

  – Left-to-right HMM: states proceed from left to right (self-loops a11, a22, a33, a44 and forward transitions a12, a23, a34), e.g., for N = 4

        A = | a11 a12  0   0  |
            |  0  a22 a23  0  |
            |  0   0  a33 a34 |
            |  0   0   0  a44 |

Types of HMMs (II)

• Different observation symbols: discrete vs. continuous

  – Discrete density HMM (DDHMM): each observation is discrete, one of a finite set. The observation function is a discrete probability distribution, i.e., a table. For example, in state j:

        k:       v1   v2   v3   v4
        bj(k):  0.1  0.4  0.3  0.2

  – Continuous density HMM (CDHMM): the observation x is continuous in an observation space, and the state observation density is a probability density function (p.d.f.). Common functional forms:

    • Multivariate Gaussian density:
        N(x; μ, C) = (2π)^{-n/2} |C|^{-1/2} exp( -(x-μ)^t C^{-1} (x-μ) / 2 ),   -∞ < x < ∞

    • Gaussian mixture density:
        MG(x) = Σ_{m=1}^{M} ω_m N(x; μ_m, σ_m^2),   with Σ_{m=1}^{M} ω_m = 1, 0 ≤ ω_m ≤ 1, σ_m^2 > 0

HMM for Data Modeling

• The HMM is a powerful statistical model for sequential and temporal observation data
• HMMs are theoretically (mathematically) sound; relatively simple learning and decoding algorithms exist
• HMMs are widely used in pattern recognition, machine learning, etc.
  – Speech recognition: modeling speech signals
  – Statistical language processing: modeling language (word/semantic sequences)
  – OCR (optical character recognition): modeling 2-d character images
  – Gene finding: modeling DNA sequences

Three Fundamental Problems in HMM

• How do we use an HMM to model sequential data?
  – The entire data sequence is viewed as one data sample O
  – The HMM is characterized by its parameters Λ = {A, B, π}

• Learning Problem: the HMM parameters Λ must be estimated from a data sample set {O1, O2, ..., OL}
  – The HMM parameters are set so as to best explain the known data

• Evaluation Problem: for an unknown data sample Ox, calculate the probability of the data sample given the model, p(Ox | Λ)

• Decoding Problem: uncover the hidden information; for an observation sequence O = {o1, o2, ..., oT}, decode the best state sequence S = {q1, q2, ..., qT} that is optimal in explaining O

HMM Formalism

• {S, K, Π, A, B}
  – Π = {πi} are the initial state probabilities
  – A = {aij} are the state transition probabilities
  – B = {bik} are the observation probabilities

[Figure: trellis of hidden states from S connected over time through the transition matrix A, each state emitting an observation from K through the emission matrix B]

Example of an Arc Emit HMM

[Figure: a four-state arc-emission HMM ("enter here" marks the start state). Transition probabilities in the diagram include 0.6, 0.4, 1, 0.12, and 0.88, and each arc carries its own output distribution over the symbols t, o, e, e.g., p(t)=.5, p(o)=.2, p(e)=.3 on one arc and p(t)=.8, p(o)=.1, p(e)=.1 on another.]

HMM Properties

• N states in the model: a state has some measurable, distinctive behavior
• At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition)
• After each transition, you output an observation symbol according to a probability distribution which depends on the current state or the current arc

• Goal: from the observations, determine which model generated the output, e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations)

Example HMM

• N states corresponding to urns: q1, q2, ..., qN
• M colors for balls found in the urns
• B: N x M matrix; for each urn, a probability distribution over colors
  – bij = P(COLOR = zj | URN = qi)
• A: N x N matrix; transition probability distribution (between urns)
• Π: N-vector; initial state probability distribution (where do we start?)

An HMM Evolution Example

• Process:
  – According to π, pick a state qi; select a ball from urn i (the state is hidden)
  – Observe the color (observable)
  – Replace the ball
  – According to A, transition to the next urn from which to pick a ball (the state is hidden)

• Design: choose N and M. Specify a model Λ = (A, B, Π) from the training data. Adjust the model parameters to maximize P(O | Λ)

• Use: given O and Λ = (A, B, Π), what is P(O | Λ)? If we compute P(O | Λ) for all models Λ, then we can determine which model most likely generated O

Simulating a Markov Process

  t := 1
  Start in state si with probability πi (i.e., X1 = i)
  Forever do:
      Move from state si to state sj with probability aij (i.e., Xt+1 = j)
      Emit observation symbol ot = k with probability bijk
      t := t + 1
  End
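A runnable version of the arc-emission simulation above, under my own conventions: B is stored as an N x N x M array with B[i, j, k] = b_ijk, and all the example numbers are placeholders rather than values from the slides.

```python
import numpy as np

def simulate_arc_hmm(pi, A, B, T, rng=np.random.default_rng(0)):
    """Arc-emission simulation: each transition (i -> j) emits a symbol with probability b_ijk.

    pi: (N,), A: (N, N), B: (N, N, M) with B[i, j, k] = b_ijk. Returns observation indices.
    """
    i = rng.choice(len(pi), p=pi)                      # start state X_1
    obs = []
    for _ in range(T):
        j = rng.choice(len(pi), p=A[i])                # move i -> j with probability a_ij
        obs.append(rng.choice(B.shape[2], p=B[i, j]))  # emit o_t = k with probability b_ijk
        i = j
    return obs

# Tiny invented 2-state, 2-symbol example.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.6, 0.4]]])
print(simulate_arc_hmm(pi, A, B, 8))
```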

Why Use Hidden Markov Models?

• HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: PoS tagging
• HMMs can be efficiently trained using the EM algorithm
• Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models
• Assuming that some set of data was generated by an HMM, the HMM is useful for calculating the probabilities of possible underlying state sequences

Fundamental Questions for HMMs

• Given a model Λ = (A, B, Π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O | Λ)?
• Given an observation sequence O and a model Λ, how do we choose a state sequence (X1, ..., XT+1) that best explains the observations?
• Given an observation sequence O, and a space of possible models found by varying the model parameters Λ = (A, B, π), how do we find the model that best explains the observed data?

Probability of an Observation

• Given the observation sequence O = (o1, ..., oT) and a model Λ = (A, B, Π), we wish to know how to efficiently compute P(O | Λ)
• Summing over all state sequences S = (q1, ..., qT+1), we find P(O | Λ) = ΣS P(O | S, Λ) P(S | Λ), which is simply the probability of an observation sequence given the model
• Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently

Probability of an Observation

Given an observation sequence O = (o1, ..., oT) and a model μ = (A, B, Π), compute the probability of the observation sequence, P(O | μ).

[Figure: observation sequence o1, ..., ot-1, ot, ot+1, ..., oT]

Probability of an Observation (Cont.)

For a state sequence S = (q1, ..., qT):

  P(O | S, μ) = b_{q1 o1} b_{q2 o2} ... b_{qT oT}
  P(S | μ) = π_{q1} a_{q1 q2} a_{q2 q3} ... a_{q_{T-1} q_T}
  P(O, S | μ) = P(O | S, μ) P(S | μ)
  P(O | μ) = Σ_S P(O | S, μ) P(S | μ)

Probability of an Observation with an Arc Emit HMM

[Figure: the four-state arc-emission HMM from the earlier slide, with per-arc output distributions over t, o, e]

  p(toe) = .6 × .8 × .88 × .7 × 1 × .6
         + .4 × .5 × 1 × 1 × .88 × .2
         + .4 × .5 × 1 × 1 × .12 × 1
         ≅ .237

Making Computation Efficient

• To avoid this computational complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem
• Use an array of states versus time to compute the probability of being at each state at time t+1 in terms of the probabilities of being in each state at time t
• A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths
• A forward probability, αi(t) = P(o1 o2 ... o_{t-1}, Xt = i | μ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t


The Trellis Structure

Forward Procedure

• The special structure gives us an efficient solution using dynamic programming
• Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences (so don't recompute it!)
• Define:

  α_i(t) = P(o_1 ... o_{t-1}, q_t = i | μ)

Forward Procedure (Cont.)

  α_j(t+1) = P(o_1 ... o_t, q_{t+1} = j)
           = P(o_1 ... o_t | q_{t+1} = j) P(q_{t+1} = j)
           = P(o_1 ... o_{t-1} | q_{t+1} = j) P(o_t | q_{t+1} = j) P(q_{t+1} = j)
           = P(o_1 ... o_{t-1}, q_{t+1} = j) P(o_t | q_{t+1} = j)

Forward Procedure (Cont.)

           = Σ_{i=1..N} P(o_1 ... o_{t-1}, q_t = i, q_{t+1} = j) P(o_t | q_{t+1} = j)
           = Σ_{i=1..N} P(o_1 ... o_{t-1}, q_{t+1} = j | q_t = i) P(q_t = i) P(o_t | q_{t+1} = j)
           = Σ_{i=1..N} P(o_1 ... o_{t-1}, q_t = i) P(q_{t+1} = j | q_t = i) P(o_t | q_{t+1} = j)
           = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}

The Forward Algorithm

• Forward variables are calculated as follows:
  – Initialization: αi(1) = πi, 1 ≤ i ≤ N
  – Induction: αj(t+1) = Σ_{i=1..N} αi(t) aij b_{ij o_t}, 1 ≤ t ≤ T, 1 ≤ j ≤ N
  – Total: P(O | μ) = Σ_{i=1..N} αi(T+1)

• This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1) N^{T+1})
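A minimal sketch of the forward recursion. Note it uses the simpler state-emission convention b_i(o_t) rather than the slide's arc-emission b_{ij o_t}, and it omits the scaling usually needed for long sequences; the casino numbers are the ones used earlier in these notes.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm, state-emission form: alpha[t, i] = P(o_1..o_t, q_t = i | model).

    pi: (N,), A: (N, N), B: (N, M), obs: list of observation indices o_1..o_T.
    Returns (alpha, P(O | model)).
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):                              # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                      # total probability

# Casino-style example numbers.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
obs = [5, 5, 0, 3, 5]                                  # die faces 6, 6, 1, 4, 6 as 0-based indices
alpha, likelihood = forward(pi, A, B, obs)
print(likelihood)
```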

The Backward Procedure

• We could also compute these probabilities by working backward through time
• The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state si at time t
• βi(t) = P(ot ... oT | qt = i, Λ) is a backward variable (probability)
• Backward variables are useful for the problem of parameter reestimation

Backward Procedure (Cont.)

  β_i(t) = P(o_t ... o_T | q_t = i, μ)
  β_i(T+1) = 1
  β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1)

The Backward Algorithm

• Backward variables can be calculated working backward through the trellis as follows:
  – Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
  – Induction: βi(t) = Σ_{j=1..N} aij b_{ij o_t} βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
  – Total: P(O | μ) = Σ_{i=1..N} πi βi(1)

• Backward variables can also be combined with forward variables:
  P(O | μ) = Σ_{i=1..N} αi(t) βi(t), 1 ≤ t ≤ T+1
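A matching sketch of the backward recursion, again in the state-emission convention (so the combination identity looks slightly different from the arc-emission version on the slide); the final line recovers P(O | model) from the backward pass, which agrees with the forward-pass likelihood for the same example numbers.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward recursion, state-emission form: beta[t, i] = P(o_{t+1}..o_T | q_t = i, model)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                         # initialization
    for t in range(T - 2, -1, -1):                         # induction, backward in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
obs = [5, 5, 0, 3, 5]

beta = backward(pi, A, B, obs)
p_backward = np.sum(pi * B[:, obs[0]] * beta[0])           # total probability from the backward pass
print(p_backward)                                          # matches the forward-pass likelihood
```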

Summary: Probability of an Observation

  Forward procedure:    P(O | μ) = Σ_{i=1..N} αi(T+1)
  Backward procedure:   P(O | μ) = Σ_{i=1..N} πi βi(1)
  Combination:          P(O | μ) = Σ_{i=1..N} αi(t) βi(t), for any t

Finding the Best State Sequence

• One method consists of optimizing over the states individually
• For each t, 1 ≤ t ≤ T+1, we would like to find the Xt that maximizes P(Xt | O, Λ)
• Let γi(t) = P(Xt = i | O, Λ) = P(Xt = i, O | Λ) / P(O | Λ) = αi(t) βi(t) / Σ_{j=1..N} αj(t) βj(t)
• The individually most likely state is X̂t = argmax_{1≤i≤N} γi(t), 1 ≤ t ≤ T+1
• This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence
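A sketch of the individually-most-likely-state computation: gamma is formed from a forward and a backward pass (state-emission convention, no scaling), then normalized at each time step; the example numbers are the casino model again, and the helper name posterior_states is my own.

```python
import numpy as np

def posterior_states(pi, A, B, obs):
    """gamma[t, i] = P(q_t = i | O, model) via forward-backward (state-emission form)."""
    N, T = len(pi), len(obs)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize by P(O | model) at each t

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
gamma = posterior_states(pi, A, B, [5, 5, 0, 3, 5])
print(gamma.argmax(axis=1))    # individually most likely state at each time step
```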

The Viterbi Algorithm

• The Viterbi algorithm efficiently computes the most likely state sequence
• To find the most likely complete path, compute: argmax_S P(S | O, Λ)
• To do this, it is sufficient to maximize for a fixed O: argmax_S P(S, O | Λ)
• We define δj(t) = max_{q1..qt-1} P(q1 ... qt-1, o1 ... ot-1, qt = j | μ), and ψj(t) records the node of the incoming arc that led to this most probable path

Viterbi Algorithm Properties

  δ_j(t) = max_{q_1..q_{t-1}} P(q_1 ... q_{t-1}, o_1 ... o_{t-1}, q_t = j, o_t | μ)

The state sub-sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

Viterbi Algorithm Properties (Cont.)

  δ_j(t) = max_{q_1..q_{t-1}} P(q_1 ... q_{t-1}, o_1 ... o_{t-1}, q_t = j, o_t | μ)

DP recursive computation:

  δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_ij b_{ij o_{t+1}}
  ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_ij b_{ij o_{t+1}}

Finally: The Viterbi Algorithm

The Viterbi algorithm works as follows:
• Initialization: δj(1) = πj, 1 ≤ j ≤ N
• Induction: δj(t+1) = max_{1≤i≤N} δi(t) aij b_{ij o_t}, 1 ≤ j ≤ N
• Store backtrace: ψj(t+1) = argmax_{1≤i≤N} δi(t) aij b_{ij o_t}, 1 ≤ j ≤ N
• Termination and path readout: next page
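A compact Viterbi sketch in the same state-emission convention as the earlier code (the slide's arc-emission b_{ij o_t} would only change where the emission term is multiplied in); delta holds the path scores and psi the backpointers used for the path readout, and the casino numbers are reused as an example.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding, state-emission form: returns the most likely state sequence for obs."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))                 # delta[t, j]: best score of any path ending in j at t
    psi = np.zeros((T, N), dtype=int)        # psi[t, j]: backpointer to the best predecessor
    delta[0] = pi * B[:, obs[0]]             # initialization
    for t in range(1, T):                    # induction
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and path readout by backtracking.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
print(viterbi(pi, A, B, [5, 5, 5, 0, 3, 5, 5]))   # 0 = Fair, 1 = Biased
```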

Viterbi Algorithm: Another Look

Compute the most likely state sequence by working backwards:

  q̂_T = argmax_i δ_i(T)
  q̂_t = ψ_{q̂_{t+1}}(t+1)
  P(Ŝ) = max_i δ_i(T)

HMM Parameter Estimation

• Given a certain collection of observation sequences, we want to find the values of the model parameters Λ = (A, B, π) which best explain the observed data
• Use maximum likelihood estimation (MLE) to find values that maximize P(O | Λ), i.e., argmax_Λ P(O_training | Λ)
• There is no known analytic method to choose Λ to maximize P(O | Λ). However, we can locally maximize it by an iterative hill-climbing algorithm known as the Baum-Welch or forward-backward algorithm (a special case of the EM algorithm, which is used in many missing-data problems in statistical inference)

The Forward-Backward Algorithm

• We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model
• Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most
• By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence

Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence
• There is no analytic method
• Given a model and training sequences, update the model parameters to better fit the observations

Some Definitions

• The probability of traversing a certain arc at time t, given the observation sequence O:

  p_t(i, j) = P(q_t = i, q_{t+1} = j | O, μ) = P(q_t = i, q_{t+1} = j, O | μ) / P(O | μ)
            = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

• Let:

  γ_i(t) = Σ_{j=1..N} p_t(i, j)

The Probability of Traversing an Arc

[Figure: trellis fragment around times t-1, t, t+1, t+2 showing the arc from state s_i to s_j weighted by a_ij b_{ij o_t}, with the forward probability α_i(t) to its left and the backward probability β_j(t+1) to its right]

More Definitions

• The expected number of transitions from state i in O:        Σ_{t=1..T} γ_i(t)
• The expected number of transitions from state i to j in O:   Σ_{t=1..T} p_t(i, j)

Reestimation Procedure

• Begin with a model μ, perhaps selected at random
• Run O through the current model to estimate the expectations of each model parameter
• Change the model to maximize the values of the paths that are used a lot, while respecting the stochastic constraints
• Repeat this process (hoping to converge on optimal values for the model parameters μ)

The Reestimation Formula

• The expected frequency in state i at time t = 1:

  π̂_i = γ_i(1)

• Transition reestimate:

  â_ij = (expected # transitions from state i to state j) / (expected # transitions from state i)
       = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} γ_i(t)

• From μ = (A, B, Π), derive μ̂ = (Â, B̂, Π̂). Note that P(O | μ̂) ≥ P(O | μ).

Reestimation Formula (Cont.)

• Emission reestimate (arc emission):

  b̂_ijk = (expected # transitions from state i to state j observing k) / (expected # transitions from state i to state j)
        = Σ_{t: o_t = k, 1 ≤ t ≤ T} p_t(i, j) / Σ_{t=1..T} p_t(i, j)

Parameter Estimation: Summary 1

  Probability of traversing an arc at time t:
  p_t(i, j) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

  Probability of being in state i at time t:
  γ_i(t) = Σ_{j=1..N} p_t(i, j)

Parameter Estimation: Summary 2

Now we can compute the new estimates of the model parameters:

  π̂_i = γ_i(1)
  â_ij = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} γ_i(t)
  b̂_ik = Σ_{t: o_t = k} γ_i(t) / Σ_{t=1..T} γ_i(t)
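To tie the reestimation formulas together, here is one hedged Baum-Welch iteration for a discrete, state-emission HMM: an E-step (forward, backward, gamma, xi) followed by M-step ratios of expected counts, as in the hat formulas above. The training data is synthetic casino-style output, the starting model is arbitrary, and no numerical scaling or convergence test is included, so this is a sketch rather than a production implementation (and it is the state-emission variant, not the arc-emission b_ijk form).

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One forward-backward (Baum-Welch) reestimation step for a state-emission HMM."""
    N, M, T = len(pi), B.shape[1], len(obs)
    obs = np.asarray(obs)

    # E-step: forward and backward passes.
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    # gamma[t, i] = P(q_t = i | O);  xi[t, i, j] = P(q_t = i, q_{t+1} = j | O).
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

    # M-step: expected counts divided by expected totals.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(M)], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Synthetic "occasionally dishonest casino" rolls, then a few reestimation iterations.
rng = np.random.default_rng(0)
true_A = np.array([[0.95, 0.05], [0.10, 0.90]])
true_B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
s, obs = 0, []
for _ in range(200):
    obs.append(rng.choice(6, p=true_B[s]))
    s = rng.choice(2, p=true_A[s])

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[1/6] * 6, [0.15] * 5 + [0.25]])   # slightly asymmetric start so the states can specialize
for _ in range(20):
    pi, A, B = baum_welch_step(pi, A, B, obs)
print(np.round(A, 2), np.round(B, 2), sep="\n")
```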

Baum-Welch Algorithm: DDHMM (I)

E-step: form the EM auxiliary function for the current model Λ^(n) over the training set {O_1, ..., O_L} with hidden state sequences {S_1, ..., S_L}:

  Q(Λ; Λ^(n)) = E[ Σ_{l=1..L} ln p(O_l, S_l | Λ) | O_1, ..., O_L, Λ^(n) ]
              = Σ_{l=1..L} Σ_{S_l} p(S_l | O_l, Λ^(n)) ln p(O_l, S_l | Λ)
              = Σ_{l=1..L} Σ_{S_l} [ ln π_{s_1^l} + Σ_{t=2..T_l} ln a_{s_{t-1}^l s_t^l} + Σ_{t=1..T_l} ln b_{s_t^l}(o_t^l) ] p(S_l | O_l, Λ^(n))
              = Σ_{l=1..L} Σ_{i=1..N} ln π_i · Pr(s_1^l = i | O_l, Λ^(n))
              + Σ_{l=1..L} Σ_{i=1..N} Σ_{j=1..N} Σ_{t=2..T_l} ln a_ij · Pr(s_{t-1}^l = i, s_t^l = j | O_l, Λ^(n))
              + Σ_{l=1..L} Σ_{i=1..N} Σ_{m=1..M} Σ_{t=1..T_l} ln b_i(v_m) · Pr(s_t^l = i, o_t^l = v_m | O_l, Λ^(n))
              = Q_π(π; Λ^(n)) + Q_A(A; Λ^(n)) + Q_B(B; Λ^(n))

Baum-Welch Algorithm: DDHMM (II)

M-step: maximize each term of Q(Λ; Λ^(n)) under its stochastic constraint.

  ∂Q_π/∂π_i = 0  ⇒  π_i^(n+1) = Σ_{l=1..L} Pr(s_1^l = i | O_l, Λ^(n)) / Σ_{l=1..L} Σ_{i=1..N} Pr(s_1^l = i | O_l, Λ^(n))

  ∂Q_A/∂a_ij = 0  ⇒  a_ij^(n+1) = Σ_{l=1..L} Σ_{t=2..T_l} Pr(s_{t-1}^l = i, s_t^l = j | O_l, Λ^(n))
                                   / Σ_{l=1..L} Σ_{t=2..T_l} Σ_{j=1..N} Pr(s_{t-1}^l = i, s_t^l = j | O_l, Λ^(n))
                                 = Σ_{l=1..L} Σ_{t=2..T_l} Pr(s_{t-1}^l = i, s_t^l = j | O_l, Λ^(n))
                                   / Σ_{l=1..L} Σ_{t=2..T_l} Pr(s_{t-1}^l = i | O_l, Λ^(n))

  ∂Q_B/∂b_i(v_m) = 0  ⇒  b_i(v_m)^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i, o_t^l = v_m | O_l, Λ^(n))
                                           / Σ_{l=1..L} Σ_{t=1..T_l} Σ_{m=1..M} Pr(s_t^l = i, o_t^l = v_m | O_l, Λ^(n))
                                         = Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i, o_t^l = v_m | O_l, Λ^(n))
                                           / Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i | O_l, Λ^(n))

Baum-Welch Algorithm: DDHMM (III)

How do we calculate the posterior probabilities of traversing an arc from state i to state j at time t?

[Figure: trellis fragment showing the arc s_i -> s_j between times t and t+1, weighted by a_ij b_j(o_{t+1}), with α_t(i) to its left and β_{t+1}(j) to its right]

  ξ_t(i, j) = Pr(s_t = i, s_{t+1} = j | O, Λ)
            = (prob. of all paths through s_i at t and s_j at t+1) / (prob. of all paths)
            = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Pr(O | Λ)
            = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1..N} α_T(i)

Baum-Welch Algorithm: DDHMM (IV)

Define the state occupancy probability:

  γ_t(i) = P(s_t = i | O, Λ) = P(s_t = i, O | Λ) / P(O | Λ) = α_t(i) β_t(i) / Σ_{j=1..N} α_t(j) β_t(j)

Note that:

  γ_t(i) = Σ_{j=1..N} ξ_t(i, j)
  Σ_{t=1..T-1} γ_t(i)     = expected number of transitions from state i in O
  Σ_{t=1..T-1} ξ_t(i, j)  = expected number of transitions from state i to j in O

Baum-Welch Algorithm: DDHMM (V)

Final results for one iteration, starting from Λ^(n) = {A^(n), B^(n), π^(n)}:

  π_i^(n+1) = Σ_{l=1..L} γ_1^l(i) / Σ_{l=1..L} Σ_{j=1..N} γ_1^l(j)

  a_ij^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l-1} ξ_t^l(i, j) / Σ_{l=1..L} Σ_{t=1..T_l-1} Σ_{j=1..N} ξ_t^l(i, j)
             = Σ_{l=1..L} Σ_{t=1..T_l-1} ξ_t^l(i, j) / Σ_{l=1..L} Σ_{t=1..T_l-1} γ_t^l(i)

  b_i(v_m)^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(i) δ(o_t^l, v_m) / Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(i)

where γ_t^l and ξ_t^l are the occupancy and arc posteriors computed from O_l under Λ^(n), and δ(o, v) = 1 if o = v and 0 otherwise.

Baum-Welch: Gaussian Mixture CDHMM (I)

• Treat both the state sequence S_l and the mixture-component label sequence l_l as missing data
• Only the estimation of B is different
• E-step:

  Q_B(B; B^(n)) = Σ_{l=1..L} Σ_{i=1..N} Σ_{k=1..K} Σ_{t=1..T_l} ln b_ik(X_t^l) · Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))

  with  ln b_ik(X) = ln ω_ik - (n/2) ln 2π - (1/2) ln |Σ_ik| - (1/2) (X - μ_ik)^t Σ_ik^{-1} (X - μ_ik)

• M-step (mean vectors):

  ∂Q_B/∂μ_ik = 0  ⇒  μ_ik^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l} X_t^l · Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))
                                   / Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))

Baum-Welch: Gaussian Mixture CDHMM (II)

• M-step (covariances and mixture weights):

  ∂Q_B/∂Σ_ik = 0  ⇒  Σ_ik^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l} (X_t^l - μ_ik^(n+1)) (X_t^l - μ_ik^(n+1))^t · Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))
                                   / Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))

  ∂Q_B/∂ω_ik = 0  ⇒  ω_ik^(n+1) = Σ_{l=1..L} Σ_{t=1..T_l} Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))
                                   / Σ_{l=1..L} Σ_{t=1..T_l} Σ_{k=1..K} Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))

• The posterior probabilities are calculated as:

  ζ_t^l(i, k) ≡ Pr(s_t^l = i, l_t^l = k | O_l, Λ^(n))
             = γ_t^l(i) · ω_ik^(n) N(X_t^l; μ_ik^(n), Σ_ik^(n)) / Σ_{k=1..K} ω_ik^(n) N(X_t^l; μ_ik^(n), Σ_ik^(n))

  where  γ_t^l(i) = α_t^l(i) β_t^l(i) / Pr(O_l | Λ^(n))

HMM Learning: Summary

• For an HMM model Λ = {A, B, π} and a training data set D = {O1, O2, ..., OL}:
  1. Initialization: Λ^(0) = {A^(0), B^(0), π^(0)}
  2. n = 0
  3. For each observation sequence Ol (l = 1, 2, ..., L):
     • Calculate α_t(i) and β_t(i) based on Λ^(n)
     • Calculate all other posterior probabilities
     • Accumulate a numerator and a denominator for each HMM parameter
  4. HMM parameter update: Λ^(n+1) = the accumulated numerators divided by the denominators
  5. n = n+1; go to step 3 until convergence

HMM Applications

• Parameters for interpolated n-gram models
• Part-of-speech tagging
• Speech recognition
• Machine translation
• Many others

Summary

• Today's and next classes
  – Hidden Markov Models
• Note
  – Project summary due on 2/24; let's start our discussion
  – Project plan finalized on 3/3 (presentation on 4/16 ???)
  – Lab 3 assigned on 2/12 and due on 2/26
  – Midterm on 3/12 (???)
  – Final at 8am on 4/27 (shall we try a take-home ???)
• Reading Assignments
  – Manning and Schutze, Chapter 9