Hidden Markov Models (HMMs)
(Lecture for CS410 Text Information Systems)
April 24, 2009

ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Modeling a Multi-Topic Document

… text mining passage
food nutrition passage
text mining passage
text mining passage
food nutrition passage …

A document with 2 subtopics, thus 2 types of vocabulary.

How do we model such a document?
How do we “generate” such a document?
How do we estimate our model?

We’ve already seen one solution: unigram mixture model + EM.
This lecture is about another (better) solution: HMM + EM.
Simple Unigram Mixture Model

Model/topic 1, p(w|θ1):
text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …

Model/topic 2, p(w|θ2):
food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Mixing weights: λ = 0.7, 1 − λ = 0.3

p(w|θ1, θ2) = λ p(w|θ1) + (1 − λ) p(w|θ2)
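As a quick illustration (my addition, not from the original slides), the mixture formula above can be evaluated directly; the toy probability tables are the numbers shown on this slide, and any word not listed is treated as having probability 0.

```python
# Simple unigram mixture: p(w | theta1, theta2) = lambda*p(w|theta1) + (1-lambda)*p(w|theta2)
p_w_topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
p_w_topic2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
lam = 0.7  # mixing weight for topic 1

def mixture_prob(word):
    """Probability of a word under the two-topic mixture (unlisted words get 0)."""
    return lam * p_w_topic1.get(word, 0.0) + (1 - lam) * p_w_topic2.get(word, 0.0)

print(mixture_prob("text"))  # 0.7 * 0.2 = 0.14
print(mixture_prob("food"))  # 0.7 * 0.00001 + 0.3 * 0.25 = 0.075007
```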
Deficiency of the Simple Mixture Model

• Adjacent words are sampled independently
• Cannot ensure passage cohesion
• Cannot capture the dependency between neighboring words

Example: "We apply the text mining algorithm to the nutrition data to find patterns …"
(Most words should come from Topic 1 = text mining, p(w|θ1), but "nutrition" should come from Topic 2 = health, p(w|θ2); the mixture model cannot tie these choices together.)

Solution = ?
The Hidden Markov Model Solution

• Basic idea:
  – Make the choice of model for a word depend on the choice of model for the previous word
  – Thus we model both the choice of model (“state”) and the words (“observations”)

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1:  1   1    1    1    1      1       1   1     1       1    1   1    1

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S2:  1   1    1    1    1      1       1   1     2       1    1   1    1

P(O,S1) > P(O,S2)?
Another Example: Passage Retrieval

Relevant passages (“T”) vs. background text (“B”)

Strategy I: Build a passage index
  – Needs pre-processing
  – Query-independent boundaries
  – Efficient

Strategy II: Extract relevant passages dynamically
  – No pre-processing
  – Dynamically decide the boundaries
  – Slow

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1: B   T    B   T    T      T       B  B    T       T    B   T    T

O:  … … We apply the text mining … patterns. … …
S2: BB…BB  T   T    T    T     T  …  T  BB… BBBB
A Simple HMM for Recognizing Relevant Passages

Two states: B (Background) and T (Topic), with output distributions:

P(w|B): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, association 0.1, clustering 0.05, …

Parameters

Initial state prob.:  p(B) = 0.5;  p(T) = 0.5
State transition prob.:  p(B→B) = 0.8,  p(B→T) = 0.2,  p(T→B) = 0.5,  p(T→T) = 0.5
Output prob.:  P(“the”|B) = 0.2, …;  P(“text”|T) = 0.1, …
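As a concrete representation (my own sketch, not part of the slides), the B/T HMM above can be written down as plain Python data; the numbers are the ones given on this slide, with the output distributions truncated to the listed words.

```python
# The two-state B/T HMM from the slide as plain Python data.
states = ["B", "T"]

init_prob = {"B": 0.5, "T": 0.5}                      # p(B), p(T)

trans_prob = {                                        # p(next state | current state)
    "B": {"B": 0.8, "T": 0.2},
    "T": {"B": 0.5, "T": 0.5},
}

output_prob = {                                       # p(word | state), truncated
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "to": 0.02, "text": 0.0001, "mining": 0.00005},
    "T": {"text": 0.125, "mining": 0.1, "association": 0.1, "clustering": 0.05},
}
```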
A General Definition of HMM

An HMM is specified by $\lambda = (S, V, \Pi, A, B)$:

N states:  $S = \{s_1, \ldots, s_N\}$
M symbols:  $V = \{v_1, \ldots, v_M\}$

Initial state probability:
$\Pi = \{\pi_i\}, \quad \sum_{i=1}^{N} \pi_i = 1$
$\pi_i$: probability of starting at state $s_i$

State transition probability:
$A = \{a_{ij}\}, \quad 1 \le i, j \le N, \quad \sum_{j=1}^{N} a_{ij} = 1$
$a_{ij}$: probability of going from $s_i$ to $s_j$

Output probability:
$B = \{b_i(v_k)\}, \quad 1 \le i \le N, \ 1 \le k \le M, \quad \sum_{k=1}^{M} b_i(v_k) = 1$
$b_i(v_k)$: probability of generating $v_k$ at $s_i$
How to “Generate” Text?

Same two-state HMM: P(B) = 0.5, P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

To generate text, repeatedly pick a state (B or T) according to the initial/transition probabilities, then emit a word from that state’s output distribution, e.g. “the text mining algorithm is a clustering method …”.

P(BTT…, “the text mining…”) = p(B) p(the|B) · p(T|B) p(text|T) · p(T|T) p(mining|T) · …
                             = 0.5·0.2 · 0.2·0.125 · 0.5·0.1 · …
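A sketch of this generation process in Python (my addition); sample_from draws from a {value: probability} dict and renormalizes, since the toy output distributions above are truncated.

```python
import random

# Toy parameters from the slide (output distributions truncated to the listed words).
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "is": 0.02, "method": 0.0001, "mining": 0.00005},
    "T": {"text": 0.125, "mining": 0.1, "algorithm": 0.1, "clustering": 0.05},
}

def sample_from(dist):
    """Draw one key from a {value: probability} dict (renormalizes, since the dict is truncated)."""
    items = list(dist.items())
    r = random.uniform(0, sum(p for _, p in items))
    for value, p in items:
        r -= p
        if r <= 0:
            return value
    return items[-1][0]

def generate(length):
    """Generate (state sequence, word sequence) by alternating state transitions and emissions."""
    state = sample_from(init_prob)
    state_seq, word_seq = [], []
    for _ in range(length):
        state_seq.append(state)
        word_seq.append(sample_from(output_prob[state]))
        state = sample_from(trans_prob[state])
    return state_seq, word_seq

print(generate(8))
```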
HMM as a Probabilistic Model

Sequential data and random variables/process:

Time/Index:             t1  t2  t3  t4  …
Data:                   o1  o2  o3  o4  …
Observation variable:   O1  O2  O3  O4  …
Hidden state variable:  S1  S2  S3  S4  …

Joint probability (complete likelihood), combining the initial state distribution, state transition probabilities, and output probabilities:

$p(O_1, O_2, \ldots, O_T, S_1, S_2, \ldots, S_T) = p(S_1) p(O_1 \mid S_1)\, p(S_2 \mid S_1) p(O_2 \mid S_2) \cdots p(S_T \mid S_{T-1}) p(O_T \mid S_T)$

State transition probability:

$p(S_1, S_2, \ldots, S_T) = p(S_1)\, p(S_2 \mid S_1) \cdots p(S_T \mid S_{T-1})$

Probability of observations with known state transitions:

$p(O_1, O_2, \ldots, O_T \mid S_1, S_2, \ldots, S_T) = p(O_1 \mid S_1)\, p(O_2 \mid S_2) \cdots p(O_T \mid S_T)$

Probability of observations (incomplete likelihood):

$p(O_1, O_2, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, O_2, \ldots, O_T, S_1, \ldots, S_T)$
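A small sketch (my addition) that evaluates the complete-likelihood formula above for one concrete observation/path pair, using the B/T toy numbers from the earlier slides.

```python
def joint_prob(words, path, init_prob, trans_prob, output_prob):
    """Complete likelihood p(O1..OT, S1..ST) = p(S1)p(O1|S1) * prod_t p(St|St-1)p(Ot|St)."""
    prob = init_prob[path[0]] * output_prob[path[0]].get(words[0], 0.0)
    for t in range(1, len(words)):
        prob *= trans_prob[path[t - 1]][path[t]] * output_prob[path[t]].get(words[t], 0.0)
    return prob

# Toy B/T parameters (same numbers as the earlier slides).
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {"B": {"the": 0.2}, "T": {"text": 0.125, "mining": 0.1}}

# p(B)p(the|B) * p(T|B)p(text|T) * p(T|T)p(mining|T) = 0.5*0.2 * 0.2*0.125 * 0.5*0.1 = 0.000125
print(joint_prob(["the", "text", "mining"], ["B", "T", "T"], init_prob, trans_prob, output_prob))
```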
Three Problems

1. Decoding – finding the most likely path
Given: model, parameters, observations (data)
Find: the most likely state sequence (path)

$(S_1^*, S_2^*, \ldots, S_T^*) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T \mid O) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T, O)$

2. Evaluation – computing the observation likelihood
Given: model, parameters, observations (data)
Find: the likelihood of generating the data

$p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$
Three Problems (cont.)

3. Training – estimating parameters
 – Supervised
   Given: model structure, labeled data (data + state sequence)
   Find: parameter values
 – Unsupervised
   Given: model structure, data (unlabeled)
   Find: parameter values

$\lambda^* = \arg\max_{\lambda} p(O \mid \lambda)$
Problem I: Decoding/Parsing – Finding the Most Likely Path

This is the most common way of using an HMM (e.g., extraction, structure analysis)…
What’s the Most Likely Path?

Same B/T HMM: P(B) = 0.5, P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

Given the observed text “the text mining algorithm is a clustering method …”, which state generated each word?

$(S_1^*, S_2^*, \ldots, S_T^*) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T, O) = \arg\max_{S_1 \ldots S_T} \pi_{S_1} b_{S_1}(o_1) \prod_{i=2}^{T} a_{S_{i-1} S_i}\, b_{S_i}(o_i)$
Viterbi Algorithm: An Example

States B and T, P(B) = P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, …, algorithm 0.0001, mining 0.005, text 0.001, …
P(w|T): the 0.05, text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

Observation: “the text mining algorithm …”

t =       1              2                         3                     4
VP(B):  0.5*0.2 (B)    0.5*0.2*0.8*0.001 (BB)    … *0.5*0.005 (BTB)    … *0.5*0.0001 (BTTB)
VP(T):  0.5*0.05 (T)   0.5*0.2*0.2*0.125 (BT)    … *0.5*0.1 (BTT)      … *0.5*0.1 (BTTT)  ← Winning path!
Viterbi Algorithm

Observation:

$\max_{S_1 \ldots S_T} p(o_1 \ldots o_T, S_1 \ldots S_T) = \max_{s_i} \bigl[ \max_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) \bigr]$

Algorithm: define

$VP_t(i) = \max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$
$q_t(i) = \bigl[ \arg\max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i) \bigr] \cdot (i)$   (the best partial path ending at $s_i$)

1. $VP_1(i) = \pi_i b_i(o_1), \quad q_1(i) = (i), \quad$ for $i = 1, \ldots, N$
2. For $1 < t \le T$:
   $VP_t(i) = \max_{1 \le j \le N} VP_{t-1}(j)\, a_{ji}\, b_i(o_t)$
   $q_t(i) = q_{t-1}(k) \cdot (i), \quad k = \arg\max_{1 \le j \le N} VP_{t-1}(j)\, a_{ji}\, b_i(o_t), \quad$ for $i = 1, \ldots, N$

(Dynamic programming)

The best path is $q_T(i^*)$ with $i^* = \arg\max_i VP_T(i)$.   Complexity: O(TN²)
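A sketch of the Viterbi recursion in Python (my illustration, not from the slides); vp and path mirror VP_t(i) and q_t(i) above, and the numbers are the ones from the example slide, so the expected winner is the path B T T T.

```python
def viterbi(words, states, init_prob, trans_prob, output_prob):
    """Return (best_prob, best_path) maximizing p(o1..oT, S1..ST)."""
    # Initialization: VP_1(i) = pi_i * b_i(o_1), path q_1(i) = (i)
    vp = {s: init_prob[s] * output_prob[s].get(words[0], 0.0) for s in states}
    path = {s: [s] for s in states}
    # Induction: VP_t(i) = max_j VP_{t-1}(j) * a_{ji} * b_i(o_t)
    for word in words[1:]:
        new_vp, new_path = {}, {}
        for i in states:
            best_j = max(states, key=lambda j: vp[j] * trans_prob[j][i])
            new_vp[i] = vp[best_j] * trans_prob[best_j][i] * output_prob[i].get(word, 0.0)
            new_path[i] = path[best_j] + [i]
        vp, path = new_vp, new_path
    best_last = max(states, key=lambda s: vp[s])
    return vp[best_last], path[best_last]

# B/T toy numbers from the example slide.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1, "clustering": 0.05},
}

prob, best = viterbi(["the", "text", "mining", "algorithm"], states, init_prob, trans_prob, output_prob)
print(best, prob)  # expected winning path: ['B', 'T', 'T', 'T']
```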
Problem II: Evaluation – Computing the Data Likelihood

Another use of an HMM, e.g., as a generative model for classification.
Also related to Problem III (parameter estimation).
Data Likelihood: p(O|λ)

p(“the text …” | λ) = p(“the text …” | BB…B) p(BB…B)
                    + p(“the text …” | BT…B) p(BT…B)
                    + …
                    + p(“the text …” | TT…T) p(TT…T)

In general,

$p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$

i.e., enumerate all paths (BB…B, BBT…, BTT…, …, TT…T).

Complexity of a naïve approach?
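To make the naïve approach concrete, here is a brute-force sketch (my addition) that enumerates all N^T state sequences and sums their joint probabilities; it is exponential in T, which is why the forward algorithm on the next slides is needed.

```python
from itertools import product

def likelihood_brute_force(words, states, init_prob, trans_prob, output_prob):
    """p(O) = sum over all N^T paths of p(O, S) -- exponential, for illustration only."""
    total = 0.0
    for path in product(states, repeat=len(words)):
        p = init_prob[path[0]] * output_prob[path[0]].get(words[0], 0.0)
        for t in range(1, len(words)):
            p *= trans_prob[path[t - 1]][path[t]] * output_prob[path[t]].get(words[t], 0.0)
        total += p
    return total

# Same B/T toy numbers as the Viterbi example slide.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
print(likelihood_brute_force(["the", "text", "mining", "algorithm"],
                             states, init_prob, trans_prob, output_prob))
```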
The Forward Algorithm

(Trellis figure: the forward variables α1, α2, … are computed left to right over the B/T state lattice, one column of states per observed word.)
The Forward Algorithm (cont.)

Define the forward variable (probability of generating $o_1 \ldots o_t$ with ending state $s_i$):

$\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$

so that

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) = \sum_{i=1}^{N} \alpha_T(i)$

Factoring out the last transition and emission gives a recursion:

$\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_{t-1}, S_1 \ldots S_{t-1})\, p(S_t = s_i \mid S_{t-1})\, p(o_t \mid S_t = s_i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

Complexity: O(TN²)
Forward Algorithm: Example

Same B/T HMM as in the Viterbi example; observation “the text mining algorithm”.

t = 1, 2, 3, 4, …

α1(B) = 0.5 · p(“the”|B)
α1(T) = 0.5 · p(“the”|T)
α2(B) = [α1(B)·0.8 + α1(T)·0.5] · p(“text”|B)
α2(T) = [α1(B)·0.2 + α1(T)·0.5] · p(“text”|T)
……

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i), \qquad \alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

P(“the text mining algorithm”) = α4(B) + α4(T)
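A sketch of the forward recursion in Python (my illustration); the list alpha uses 0-based indexing, so alpha[t-1][i] corresponds to α_t(i) above, and the toy numbers are the ones from the example.

```python
def forward(words, states, init_prob, trans_prob, output_prob):
    """Return the forward variables alpha_t(i) and the data likelihood p(O)."""
    # alpha_1(i) = pi_i * b_i(o_1)
    alpha = [{s: init_prob[s] * output_prob[s].get(words[0], 0.0) for s in states}]
    # alpha_t(i) = b_i(o_t) * sum_j alpha_{t-1}(j) * a_{ji}
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({
            i: output_prob[i].get(word, 0.0) * sum(prev[j] * trans_prob[j][i] for j in states)
            for i in states
        })
    likelihood = sum(alpha[-1][s] for s in states)  # p(O) = sum_i alpha_T(i)
    return alpha, likelihood

# Same toy B/T parameters as the example above.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
alpha, like = forward(["the", "text", "mining", "algorithm"],
                      states, init_prob, trans_prob, output_prob)
print(like)  # P("the text mining algorithm") = alpha_4(B) + alpha_4(T)
```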
The Backward Algorithm

Define the backward variable (probability of generating $o_{t+1} \ldots o_T$, starting from state $s_i$ at time t, with $o_1 \ldots o_t$ already generated):

$\beta_t(i) = \sum_{S_{t+1} \ldots S_T} p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i)$

Observation:

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 = s_i, S_2 \ldots S_T) = \sum_{i=1}^{N} \pi_i\, b_i(o_1) \sum_{S_2 \ldots S_T} p(o_2 \ldots o_T, S_2 \ldots S_T \mid S_1 = s_i)$

Algorithm: factoring out the first transition and emission after time t gives a (backward) recursion:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1}) \sum_{S_{t+2} \ldots S_T} p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1} = s_j) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

The data likelihood is

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t$

Complexity: O(TN²)
Backward Algorithm: Example

Same B/T HMM; observation “the text mining algorithm”.

t = 1, 2, 3, 4

β4(B) = 1
β4(T) = 1
β3(B) = 0.8·p(“algorithm”|B)·β4(B) + 0.2·p(“algorithm”|T)·β4(T)
β3(T) = 0.5·p(“algorithm”|B)·β4(B) + 0.5·p(“algorithm”|T)·β4(T)
……

P(“the text mining algorithm”) = α1(B)·β1(B) + α1(T)·β1(T) = α2(B)·β2(B) + α2(T)·β2(T)

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t$
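A matching sketch of the backward recursion (my addition); with 0-based indexing beta[t-1][i] corresponds to β_t(i), and the final print checks the identity p(O) = Σ_i π_i b_i(o_1) β_1(i).

```python
def backward(words, states, trans_prob, output_prob):
    """Return the backward variables beta_t(i) for t = 1..T (as a 0-based list)."""
    T = len(words)
    beta = [{s: 1.0 for s in states}]                  # beta_T(i) = 1
    # beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {
            i: sum(trans_prob[i][j] * output_prob[j].get(words[t + 1], 0.0) * nxt[j]
                   for j in states)
            for i in states
        })
    return beta

# Same toy B/T parameters as the forward example.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
words = ["the", "text", "mining", "algorithm"]
beta = backward(words, states, trans_prob, output_prob)
# p(O) = sum_i pi_i * b_i(o_1) * beta_1(i); should match the forward result.
print(sum(init_prob[s] * output_prob[s].get(words[0], 0.0) * beta[0][s] for s in states))
```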
Problem III: Training – Estimating Parameters

Where do we get the probability values for all the parameters?
Supervised vs. unsupervised
Supervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

Estimates by counting:
π1 = 1/1 = 1;  π2 = 0/1 = 0
a11 = 2/4 = 0.5;  a12 = 2/4 = 0.5;  a21 = 1/5 = 0.2;  a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0;  b1(b) = 0/4 = 0;  b2(a) = 1/6 = 0.167;  b2(b) = 5/6 = 0.833

Resulting HMM:
P(s1) = 1, P(s2) = 0
s1→s1 0.5, s1→s2 0.5, s2→s1 0.2, s2→s2 0.8
P(a|s1) = 1, P(b|s1) = 0, P(a|s2) = 0.167, P(b|s2) = 0.833
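A sketch of these supervised estimates as relative-frequency counting (my addition); run on the slide’s O = aaaaabbbbb, S = 1121122222 it reproduces the numbers above.

```python
from collections import Counter

def supervised_mle(observations, state_sequence, states, vocab):
    """Estimate pi, a_ij, b_i(v) by relative-frequency counting from labeled data."""
    trans = Counter(zip(state_sequence, state_sequence[1:]))
    emit = Counter(zip(state_sequence, observations))
    state_count = Counter(state_sequence)

    pi = {s: 1.0 if state_sequence[0] == s else 0.0 for s in states}
    a = {i: {j: trans[(i, j)] / max(sum(trans[(i, k)] for k in states), 1) for j in states}
         for i in states}
    b = {i: {v: emit[(i, v)] / max(state_count[i], 1) for v in vocab} for i in states}
    return pi, a, b

O = list("aaaaabbbbb")
S = list("1121122222")
pi, a, b = supervised_mle(O, S, states=["1", "2"], vocab=["a", "b"])
print(pi)  # {'1': 1.0, '2': 0.0}
print(a)   # a11 = 0.5, a12 = 0.5, a21 = 0.2, a22 = 0.8
print(b)   # b1(a) = 1.0, b1(b) = 0.0, b2(a) ~ 0.167, b2(b) ~ 0.833
```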
Unsupervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
(This time the state sequence is NOT given.)

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?

Maximum Likelihood:  $\lambda^* = \arg\max_{\lambda} p(O \mid \lambda)$
Intuition

Enumerate all K possible state sequences for O = aaaaabbbbb,
q1 = 1111111111, q2 = 1111111221, …, qK = 2222222222,
and weight each by its probability P(O,q1|λ), P(O,q2|λ), …, P(O,qK|λ).
Then re-estimate the parameters as expected relative counts (new λ'):

$\pi_i' = \frac{\sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(1) = i]}{\sum_{k=1}^{K} p(O, q_k \mid \lambda)}$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i,\ q_k(t+1) = j]}{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i]}$

$b_i'(v_j) = \frac{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i,\ o_t = v_j]}{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i]}$

But direct computation of $p(O, q_k \mid \lambda)$ for every sequence is expensive …
Baum-Welch Algorithm

Basic “counters”:

$\gamma_t(i) = p(q_t = s_i \mid O, \lambda)$:  being at state $s_i$ at time t
$\xi_t(i, j) = p(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$:  being at state $s_i$ at time t and at state $s_j$ at time t+1

Computation of counters (from the forward and backward variables):

$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

Complexity: O(N²) per time step
Baum-Welch Algorithm (cont.)

Updating formulas:

$\pi_i' = \gamma_1(i)$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{j'=1}^{N} \sum_{t=1}^{T-1} \xi_t(i, j')}$

$b_i'(v_k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, [o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(i)}$

Overall complexity for each iteration: O(TN²)
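A compact sketch of one Baum-Welch (EM) iteration built from the forward/backward recursions above (my illustration; a tiny two-state a/b example with arbitrary starting parameters, no smoothing and no convergence test).

```python
def forward(obs, S, pi, A, B):
    alpha = [{i: pi[i] * B[i].get(obs[0], 0.0) for i in S}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({i: B[i].get(o, 0.0) * sum(prev[j] * A[j][i] for j in S) for i in S})
    return alpha

def backward(obs, S, A, B):
    beta = [{i: 1.0 for i in S}]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j].get(obs[t + 1], 0.0) * nxt[j] for j in S) for i in S})
    return beta

def baum_welch_step(obs, S, vocab, pi, A, B):
    """One EM iteration: compute the gamma/xi counters, then re-estimate pi, A, B."""
    T = len(obs)
    alpha, beta = forward(obs, S, pi, A, B), backward(obs, S, A, B)
    p_obs = sum(alpha[-1][i] for i in S)  # p(O | lambda)

    # gamma_t(i) = alpha_t(i) beta_t(i) / p(O); xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / p(O)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in S} for t in range(T)]
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j].get(obs[t + 1], 0.0) * beta[t + 1][j] / p_obs
               for j in S} for i in S} for t in range(T - 1)]

    # Re-estimation (the updating formulas from the slide)
    new_pi = {i: gamma[0][i] for i in S}
    new_A = {i: {j: sum(xi[t][i][j] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1)) for j in S} for i in S}
    new_B = {i: {v: sum(gamma[t][i] for t in range(T) if obs[t] == v) /
                    sum(gamma[t][i] for t in range(T)) for v in vocab} for i in S}
    return new_pi, new_A, new_B

obs = list("aaaaabbbbb")
S, vocab = ["1", "2"], ["a", "b"]
pi = {"1": 0.6, "2": 0.4}
A = {"1": {"1": 0.5, "2": 0.5}, "2": {"1": 0.5, "2": 0.5}}
B = {"1": {"a": 0.6, "b": 0.4}, "2": {"a": 0.4, "b": 0.6}}
for _ in range(10):
    pi, A, B = baum_welch_step(obs, S, vocab, pi, A, B)
print(pi, A, B)
```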
An HMM for Information Extraction (Research Paper Headers)
What You Should Know

• Definition of an HMM
• What are the three problems associated with an HMM?
• Know how the following algorithms work
  – Viterbi algorithm
  – Forward & backward algorithms
• Know the basic idea of the Baum-Welch algorithm
Readings

• Read [Rabiner 89], sections I, II, III
• Read the “brief note”