Hidden Markov Models (HMMs)
(Lecture for CS410 Text Information Systems)
April 24, 2009

ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Modeling a Multi-Topic Document

… text mining passage
food nutrition passage
text mining passage
text mining passage
food nutrition passage …

A document with 2 subtopics, thus 2 types of vocabulary.

How do we model such a document?
How do we “generate” such a document?
How do we estimate our model?

We’ve already seen one solution: unigram mixture model + EM.
This lecture is about another (better) solution: HMM + EM.
Simple Unigram Mixture Model

Model/topic 1, p(w|θ1):
text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …

Model/topic 2, p(w|θ2):
food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Mixing weights: λ = 0.7, 1 − λ = 0.3

p(w|θ1, θ2) = λ p(w|θ1) + (1 − λ) p(w|θ2)
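As a quick illustration (my addition, not from the original slides), the mixture formula above can be evaluated directly; the toy probability tables are the numbers shown on this slide, and any word not listed is treated as having probability 0.

```python
# Simple unigram mixture: p(w | theta1, theta2) = lambda*p(w|theta1) + (1-lambda)*p(w|theta2)
p_w_topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
p_w_topic2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
lam = 0.7  # mixing weight for topic 1

def mixture_prob(word):
    """Probability of a word under the two-topic mixture (unlisted words get 0)."""
    return lam * p_w_topic1.get(word, 0.0) + (1 - lam) * p_w_topic2.get(word, 0.0)

print(mixture_prob("text"))  # 0.7 * 0.2 = 0.14
print(mixture_prob("food"))  # 0.7 * 0.00001 + 0.3 * 0.25 = 0.075007
```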
Deficiency of the Simple Mixture Model

• Adjacent words are sampled independently
• Cannot ensure passage cohesion
• Cannot capture the dependency between neighboring words

Example: "We apply the text mining algorithm to the nutrition data to find patterns …"
(Most words should come from Topic 1 = text mining, p(w|θ1), but "nutrition" should come from Topic 2 = health, p(w|θ2); the mixture model cannot tie these choices together.)

Solution = ?
The Hidden Markov Model Solution

• Basic idea:
  – Make the choice of model for a word depend on the choice of model for the previous word
  – Thus we model both the choice of model (“state”) and the words (“observations”)

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1:  1   1    1    1    1      1       1   1     1       1    1   1    1

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S2:  1   1    1    1    1      1       1   1     2       1    1   1    1

P(O,S1) > P(O,S2)?
Another Example: Passage Retrieval

Relevant passages (“T”) vs. background text (“B”)

Strategy I: Build a passage index
  – Needs pre-processing
  – Query-independent boundaries
  – Efficient

Strategy II: Extract relevant passages dynamically
  – No pre-processing
  – Dynamically decide the boundaries
  – Slow

O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1: B   T    B   T    T      T       B  B    T       T    B   T    T

O:  … … We apply the text mining … patterns. … …
S2: BB…BB  T   T    T    T     T  …  T  BB… BBBB
A Simple HMM for Recognizing Relevant Passages

Two states: B (Background) and T (Topic), with output distributions:

P(w|B): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, association 0.1, clustering 0.05, …

Parameters

Initial state prob.:  p(B) = 0.5;  p(T) = 0.5
State transition prob.:  p(B→B) = 0.8,  p(B→T) = 0.2,  p(T→B) = 0.5,  p(T→T) = 0.5
Output prob.:  P(“the”|B) = 0.2, …;  P(“text”|T) = 0.1, …
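As a concrete representation (my own sketch, not part of the slides), the B/T HMM above can be written down as plain Python data; the numbers are the ones given on this slide, with the output distributions truncated to the listed words.

```python
# The two-state B/T HMM from the slide as plain Python data.
states = ["B", "T"]

init_prob = {"B": 0.5, "T": 0.5}                      # p(B), p(T)

trans_prob = {                                        # p(next state | current state)
    "B": {"B": 0.8, "T": 0.2},
    "T": {"B": 0.5, "T": 0.5},
}

output_prob = {                                       # p(word | state), truncated
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "to": 0.02, "text": 0.0001, "mining": 0.00005},
    "T": {"text": 0.125, "mining": 0.1, "association": 0.1, "clustering": 0.05},
}
```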
A General Definition of HMM

An HMM is specified by $\lambda = (S, V, \Pi, A, B)$:

N states:  $S = \{s_1, \ldots, s_N\}$
M symbols:  $V = \{v_1, \ldots, v_M\}$

Initial state probability:
$\Pi = \{\pi_i\}, \quad \sum_{i=1}^{N} \pi_i = 1$
$\pi_i$: probability of starting at state $s_i$

State transition probability:
$A = \{a_{ij}\}, \quad 1 \le i, j \le N, \quad \sum_{j=1}^{N} a_{ij} = 1$
$a_{ij}$: probability of going from $s_i$ to $s_j$

Output probability:
$B = \{b_i(v_k)\}, \quad 1 \le i \le N, \ 1 \le k \le M, \quad \sum_{k=1}^{M} b_i(v_k) = 1$
$b_i(v_k)$: probability of generating $v_k$ at $s_i$
How to “Generate” Text?

Same two-state HMM: P(B) = 0.5, P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

To generate text, repeatedly pick a state (B or T) according to the initial/transition probabilities, then emit a word from that state’s output distribution, e.g. “the text mining algorithm is a clustering method …”.

P(BTT…, “the text mining…”) = p(B) p(the|B) · p(T|B) p(text|T) · p(T|T) p(mining|T) · …
                             = 0.5·0.2 · 0.2·0.125 · 0.5·0.1 · …
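A sketch of this generation process in Python (my addition); sample_from draws from a {value: probability} dict and renormalizes, since the toy output distributions above are truncated.

```python
import random

# Toy parameters from the slide (output distributions truncated to the listed words).
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "is": 0.02, "method": 0.0001, "mining": 0.00005},
    "T": {"text": 0.125, "mining": 0.1, "algorithm": 0.1, "clustering": 0.05},
}

def sample_from(dist):
    """Draw one key from a {value: probability} dict (renormalizes, since the dict is truncated)."""
    items = list(dist.items())
    r = random.uniform(0, sum(p for _, p in items))
    for value, p in items:
        r -= p
        if r <= 0:
            return value
    return items[-1][0]

def generate(length):
    """Generate (state sequence, word sequence) by alternating state transitions and emissions."""
    state = sample_from(init_prob)
    state_seq, word_seq = [], []
    for _ in range(length):
        state_seq.append(state)
        word_seq.append(sample_from(output_prob[state]))
        state = sample_from(trans_prob[state])
    return state_seq, word_seq

print(generate(8))
```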
HMM as a Probabilistic Model

Sequential data and random variables/process:

Time/Index:             t1  t2  t3  t4  …
Data:                   o1  o2  o3  o4  …
Observation variable:   O1  O2  O3  O4  …
Hidden state variable:  S1  S2  S3  S4  …

Joint probability (complete likelihood), combining the initial state distribution, state transition probabilities, and output probabilities:

$p(O_1, O_2, \ldots, O_T, S_1, S_2, \ldots, S_T) = p(S_1) p(O_1 \mid S_1)\, p(S_2 \mid S_1) p(O_2 \mid S_2) \cdots p(S_T \mid S_{T-1}) p(O_T \mid S_T)$

State transition probability:

$p(S_1, S_2, \ldots, S_T) = p(S_1)\, p(S_2 \mid S_1) \cdots p(S_T \mid S_{T-1})$

Probability of observations with known state transitions:

$p(O_1, O_2, \ldots, O_T \mid S_1, S_2, \ldots, S_T) = p(O_1 \mid S_1)\, p(O_2 \mid S_2) \cdots p(O_T \mid S_T)$

Probability of observations (incomplete likelihood):

$p(O_1, O_2, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, O_2, \ldots, O_T, S_1, \ldots, S_T)$
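A small sketch (my addition) that evaluates the complete-likelihood formula above for one concrete observation/path pair, using the B/T toy numbers from the earlier slides.

```python
def joint_prob(words, path, init_prob, trans_prob, output_prob):
    """Complete likelihood p(O1..OT, S1..ST) = p(S1)p(O1|S1) * prod_t p(St|St-1)p(Ot|St)."""
    prob = init_prob[path[0]] * output_prob[path[0]].get(words[0], 0.0)
    for t in range(1, len(words)):
        prob *= trans_prob[path[t - 1]][path[t]] * output_prob[path[t]].get(words[t], 0.0)
    return prob

# Toy B/T parameters (same numbers as the earlier slides).
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {"B": {"the": 0.2}, "T": {"text": 0.125, "mining": 0.1}}

# p(B)p(the|B) * p(T|B)p(text|T) * p(T|T)p(mining|T) = 0.5*0.2 * 0.2*0.125 * 0.5*0.1 = 0.000125
print(joint_prob(["the", "text", "mining"], ["B", "T", "T"], init_prob, trans_prob, output_prob))
```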
Three Problems

1. Decoding – finding the most likely path
Given: model, parameters, observations (data)
Find: the most likely state sequence (path)

$(S_1^*, S_2^*, \ldots, S_T^*) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T \mid O) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T, O)$

2. Evaluation – computing the observation likelihood
Given: model, parameters, observations (data)
Find: the likelihood of generating the data

$p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$
Three Problems (cont.)

3. Training – estimating parameters
 – Supervised
   Given: model structure, labeled data (data + state sequence)
   Find: parameter values
 – Unsupervised
   Given: model structure, data (unlabeled)
   Find: parameter values

$\lambda^* = \arg\max_{\lambda} p(O \mid \lambda)$
Problem I: Decoding/Parsing – Finding the Most Likely Path

This is the most common way of using an HMM (e.g., extraction, structure analysis)…
What’s the Most Likely Path?

Same B/T HMM: P(B) = 0.5, P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

Given the observed text “the text mining algorithm is a clustering method …”, which state generated each word?

$(S_1^*, S_2^*, \ldots, S_T^*) = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T, O) = \arg\max_{S_1 \ldots S_T} \pi_{S_1} b_{S_1}(o_1) \prod_{i=2}^{T} a_{S_{i-1} S_i}\, b_{S_i}(o_i)$
Viterbi Algorithm: An Example

States B and T, P(B) = P(T) = 0.5; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5.
P(w|B): the 0.2, a 0.1, we 0.01, …, algorithm 0.0001, mining 0.005, text 0.001, …
P(w|T): the 0.05, text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

Observation: “the text mining algorithm …”

t =       1              2                         3                     4
VP(B):  0.5*0.2 (B)    0.5*0.2*0.8*0.001 (BB)    … *0.5*0.005 (BTB)    … *0.5*0.0001 (BTTB)
VP(T):  0.5*0.05 (T)   0.5*0.2*0.2*0.125 (BT)    … *0.5*0.1 (BTT)      … *0.5*0.1 (BTTT)  ← Winning path!
Viterbi Algorithm

Observation:

$\max_{S_1 \ldots S_T} p(o_1 \ldots o_T, S_1 \ldots S_T) = \max_{s_i} \bigl[ \max_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) \bigr]$

Algorithm: define

$VP_t(i) = \max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$
$q_t(i) = \bigl[ \arg\max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i) \bigr] \cdot (i)$   (the best partial path ending at $s_i$)

1. $VP_1(i) = \pi_i b_i(o_1), \quad q_1(i) = (i), \quad$ for $i = 1, \ldots, N$
2. For $1 < t \le T$:
   $VP_t(i) = \max_{1 \le j \le N} VP_{t-1}(j)\, a_{ji}\, b_i(o_t)$
   $q_t(i) = q_{t-1}(k) \cdot (i), \quad k = \arg\max_{1 \le j \le N} VP_{t-1}(j)\, a_{ji}\, b_i(o_t), \quad$ for $i = 1, \ldots, N$

(Dynamic programming)

The best path is $q_T(i^*)$ with $i^* = \arg\max_i VP_T(i)$.   Complexity: O(TN²)
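A sketch of the Viterbi recursion in Python (my illustration, not from the slides); vp and path mirror VP_t(i) and q_t(i) above, and the numbers are the ones from the example slide, so the expected winner is the path B T T T.

```python
def viterbi(words, states, init_prob, trans_prob, output_prob):
    """Return (best_prob, best_path) maximizing p(o1..oT, S1..ST)."""
    # Initialization: VP_1(i) = pi_i * b_i(o_1), path q_1(i) = (i)
    vp = {s: init_prob[s] * output_prob[s].get(words[0], 0.0) for s in states}
    path = {s: [s] for s in states}
    # Induction: VP_t(i) = max_j VP_{t-1}(j) * a_{ji} * b_i(o_t)
    for word in words[1:]:
        new_vp, new_path = {}, {}
        for i in states:
            best_j = max(states, key=lambda j: vp[j] * trans_prob[j][i])
            new_vp[i] = vp[best_j] * trans_prob[best_j][i] * output_prob[i].get(word, 0.0)
            new_path[i] = path[best_j] + [i]
        vp, path = new_vp, new_path
    best_last = max(states, key=lambda s: vp[s])
    return vp[best_last], path[best_last]

# B/T toy numbers from the example slide.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "a": 0.1, "we": 0.01, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1, "clustering": 0.05},
}

prob, best = viterbi(["the", "text", "mining", "algorithm"], states, init_prob, trans_prob, output_prob)
print(best, prob)  # expected winning path: ['B', 'T', 'T', 'T']
```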
Problem II: Evaluation – Computing the Data Likelihood

Another use of an HMM, e.g., as a generative model for classification.
Also related to Problem III (parameter estimation).
Data Likelihood: p(O|λ)

p(“the text …” | λ) = p(“the text …” | BB…B) p(BB…B)
                    + p(“the text …” | BT…B) p(BT…B)
                    + …
                    + p(“the text …” | TT…T) p(TT…T)

In general,

$p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$

i.e., enumerate all paths (BB…B, BBT…, BTT…, …, TT…T).

Complexity of a naïve approach?
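To make the naïve approach concrete, here is a brute-force sketch (my addition) that enumerates all N^T state sequences and sums their joint probabilities; it is exponential in T, which is why the forward algorithm on the next slides is needed.

```python
from itertools import product

def likelihood_brute_force(words, states, init_prob, trans_prob, output_prob):
    """p(O) = sum over all N^T paths of p(O, S) -- exponential, for illustration only."""
    total = 0.0
    for path in product(states, repeat=len(words)):
        p = init_prob[path[0]] * output_prob[path[0]].get(words[0], 0.0)
        for t in range(1, len(words)):
            p *= trans_prob[path[t - 1]][path[t]] * output_prob[path[t]].get(words[t], 0.0)
        total += p
    return total

# Same B/T toy numbers as the Viterbi example slide.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
print(likelihood_brute_force(["the", "text", "mining", "algorithm"],
                             states, init_prob, trans_prob, output_prob))
```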
The Forward Algorithm

(Trellis figure: the forward variables α1, α2, … are computed left to right over the B/T state lattice, one column of states per observed word.)
The Forward Algorithm (cont.)

Define the forward variable (probability of generating $o_1 \ldots o_t$ with ending state $s_i$):

$\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$

so that

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) = \sum_{i=1}^{N} \alpha_T(i)$

Factoring out the last transition and emission gives a recursion:

$\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_{t-1}, S_1 \ldots S_{t-1})\, p(S_t = s_i \mid S_{t-1})\, p(o_t \mid S_t = s_i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

Complexity: O(TN²)
Forward Algorithm: Example

Same B/T HMM as in the Viterbi example; observation “the text mining algorithm”.

t = 1, 2, 3, 4, …

α1(B) = 0.5 · p(“the”|B)
α1(T) = 0.5 · p(“the”|T)
α2(B) = [α1(B)·0.8 + α1(T)·0.5] · p(“text”|B)
α2(T) = [α1(B)·0.2 + α1(T)·0.5] · p(“text”|T)
……

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i), \qquad \alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

P(“the text mining algorithm”) = α4(B) + α4(T)
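A sketch of the forward recursion in Python (my illustration); the list alpha uses 0-based indexing, so alpha[t-1][i] corresponds to α_t(i) above, and the toy numbers are the ones from the example.

```python
def forward(words, states, init_prob, trans_prob, output_prob):
    """Return the forward variables alpha_t(i) and the data likelihood p(O)."""
    # alpha_1(i) = pi_i * b_i(o_1)
    alpha = [{s: init_prob[s] * output_prob[s].get(words[0], 0.0) for s in states}]
    # alpha_t(i) = b_i(o_t) * sum_j alpha_{t-1}(j) * a_{ji}
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({
            i: output_prob[i].get(word, 0.0) * sum(prev[j] * trans_prob[j][i] for j in states)
            for i in states
        })
    likelihood = sum(alpha[-1][s] for s in states)  # p(O) = sum_i alpha_T(i)
    return alpha, likelihood

# Same toy B/T parameters as the example above.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
alpha, like = forward(["the", "text", "mining", "algorithm"],
                      states, init_prob, trans_prob, output_prob)
print(like)  # P("the text mining algorithm") = alpha_4(B) + alpha_4(T)
```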
The Backward Algorithm

Define the backward variable (probability of generating $o_{t+1} \ldots o_T$, starting from state $s_i$ at time t, with $o_1 \ldots o_t$ already generated):

$\beta_t(i) = \sum_{S_{t+1} \ldots S_T} p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i)$

Observation:

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 = s_i, S_2 \ldots S_T) = \sum_{i=1}^{N} \pi_i\, b_i(o_1) \sum_{S_2 \ldots S_T} p(o_2 \ldots o_T, S_2 \ldots S_T \mid S_1 = s_i)$

Algorithm: factoring out the first transition and emission after time t gives a (backward) recursion:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1}) \sum_{S_{t+2} \ldots S_T} p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1} = s_j) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

The data likelihood is

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t$

Complexity: O(TN²)
Backward Algorithm: Example

Same B/T HMM; observation “the text mining algorithm”.

t = 1, 2, 3, 4

β4(B) = 1
β4(T) = 1
β3(B) = 0.8·p(“algorithm”|B)·β4(B) + 0.2·p(“algorithm”|T)·β4(T)
β3(T) = 0.5·p(“algorithm”|B)·β4(B) + 0.5·p(“algorithm”|T)·β4(T)
……

P(“the text mining algorithm”) = α1(B)·β1(B) + α1(T)·β1(T) = α2(B)·β2(B) + α2(T)·β2(T)

$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t$
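A matching sketch of the backward recursion (my addition); with 0-based indexing beta[t-1][i] corresponds to β_t(i), and the final print checks the identity p(O) = Σ_i π_i b_i(o_1) β_1(i).

```python
def backward(words, states, trans_prob, output_prob):
    """Return the backward variables beta_t(i) for t = 1..T (as a 0-based list)."""
    T = len(words)
    beta = [{s: 1.0 for s in states}]                  # beta_T(i) = 1
    # beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {
            i: sum(trans_prob[i][j] * output_prob[j].get(words[t + 1], 0.0) * nxt[j]
                   for j in states)
            for i in states
        })
    return beta

# Same toy B/T parameters as the forward example.
states = ["B", "T"]
init_prob = {"B": 0.5, "T": 0.5}
trans_prob = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
output_prob = {
    "B": {"the": 0.2, "text": 0.001, "mining": 0.005, "algorithm": 0.0001},
    "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1},
}
words = ["the", "text", "mining", "algorithm"]
beta = backward(words, states, trans_prob, output_prob)
# p(O) = sum_i pi_i * b_i(o_1) * beta_1(i); should match the forward result.
print(sum(init_prob[s] * output_prob[s].get(words[0], 0.0) * beta[0][s] for s in states))
```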
Problem III: Training – Estimating Parameters

Where do we get the probability values for all the parameters?
Supervised vs. unsupervised
Supervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

Estimates by counting:
π1 = 1/1 = 1;  π2 = 0/1 = 0
a11 = 2/4 = 0.5;  a12 = 2/4 = 0.5;  a21 = 1/5 = 0.2;  a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0;  b1(b) = 0/4 = 0;  b2(a) = 1/6 = 0.167;  b2(b) = 5/6 = 0.833

Resulting HMM:
P(s1) = 1, P(s2) = 0
s1→s1 0.5, s1→s2 0.5, s2→s1 0.2, s2→s2 0.8
P(a|s1) = 1, P(b|s1) = 0, P(a|s2) = 0.167, P(b|s2) = 0.833
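A sketch of these supervised estimates as relative-frequency counting (my addition); run on the slide’s O = aaaaabbbbb, S = 1121122222 it reproduces the numbers above.

```python
from collections import Counter

def supervised_mle(observations, state_sequence, states, vocab):
    """Estimate pi, a_ij, b_i(v) by relative-frequency counting from labeled data."""
    trans = Counter(zip(state_sequence, state_sequence[1:]))
    emit = Counter(zip(state_sequence, observations))
    state_count = Counter(state_sequence)

    pi = {s: 1.0 if state_sequence[0] == s else 0.0 for s in states}
    a = {i: {j: trans[(i, j)] / max(sum(trans[(i, k)] for k in states), 1) for j in states}
         for i in states}
    b = {i: {v: emit[(i, v)] / max(state_count[i], 1) for v in vocab} for i in states}
    return pi, a, b

O = list("aaaaabbbbb")
S = list("1121122222")
pi, a, b = supervised_mle(O, S, states=["1", "2"], vocab=["a", "b"])
print(pi)  # {'1': 1.0, '2': 0.0}
print(a)   # a11 = 0.5, a12 = 0.5, a21 = 0.2, a22 = 0.8
print(b)   # b1(a) = 1.0, b1(b) = 0.0, b2(a) ~ 0.167, b2(b) ~ 0.833
```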
Unsupervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
(This time the state sequence is NOT given.)

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?

Maximum Likelihood:  $\lambda^* = \arg\max_{\lambda} p(O \mid \lambda)$
Intuition

Enumerate all K possible state sequences for O = aaaaabbbbb,
q1 = 1111111111, q2 = 1111111221, …, qK = 2222222222,
and weight each by its probability P(O,q1|λ), P(O,q2|λ), …, P(O,qK|λ).
Then re-estimate the parameters as expected relative counts (new λ'):

$\pi_i' = \frac{\sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(1) = i]}{\sum_{k=1}^{K} p(O, q_k \mid \lambda)}$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i,\ q_k(t+1) = j]}{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i]}$

$b_i'(v_j) = \frac{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i,\ o_t = v_j]}{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, [q_k(t) = i]}$

But direct computation of $p(O, q_k \mid \lambda)$ for every sequence is expensive …
Baum-Welch Algorithm

Basic “counters”:

$\gamma_t(i) = p(q_t = s_i \mid O, \lambda)$:  being at state $s_i$ at time t
$\xi_t(i, j) = p(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$:  being at state $s_i$ at time t and at state $s_j$ at time t+1

Computation of counters (from the forward and backward variables):

$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

Complexity: O(N²) per time step
Baum-Welch Algorithm (cont.)

Updating formulas:

$\pi_i' = \gamma_1(i)$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{j'=1}^{N} \sum_{t=1}^{T-1} \xi_t(i, j')}$

$b_i'(v_k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, [o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(i)}$

Overall complexity for each iteration: O(TN²)
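A compact sketch of one Baum-Welch (EM) iteration built from the forward/backward recursions above (my illustration; a tiny two-state a/b example with arbitrary starting parameters, no smoothing and no convergence test).

```python
def forward(obs, S, pi, A, B):
    alpha = [{i: pi[i] * B[i].get(obs[0], 0.0) for i in S}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({i: B[i].get(o, 0.0) * sum(prev[j] * A[j][i] for j in S) for i in S})
    return alpha

def backward(obs, S, A, B):
    beta = [{i: 1.0 for i in S}]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j].get(obs[t + 1], 0.0) * nxt[j] for j in S) for i in S})
    return beta

def baum_welch_step(obs, S, vocab, pi, A, B):
    """One EM iteration: compute the gamma/xi counters, then re-estimate pi, A, B."""
    T = len(obs)
    alpha, beta = forward(obs, S, pi, A, B), backward(obs, S, A, B)
    p_obs = sum(alpha[-1][i] for i in S)  # p(O | lambda)

    # gamma_t(i) = alpha_t(i) beta_t(i) / p(O); xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / p(O)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in S} for t in range(T)]
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j].get(obs[t + 1], 0.0) * beta[t + 1][j] / p_obs
               for j in S} for i in S} for t in range(T - 1)]

    # Re-estimation (the updating formulas from the slide)
    new_pi = {i: gamma[0][i] for i in S}
    new_A = {i: {j: sum(xi[t][i][j] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1)) for j in S} for i in S}
    new_B = {i: {v: sum(gamma[t][i] for t in range(T) if obs[t] == v) /
                    sum(gamma[t][i] for t in range(T)) for v in vocab} for i in S}
    return new_pi, new_A, new_B

obs = list("aaaaabbbbb")
S, vocab = ["1", "2"], ["a", "b"]
pi = {"1": 0.6, "2": 0.4}
A = {"1": {"1": 0.5, "2": 0.5}, "2": {"1": 0.5, "2": 0.5}}
B = {"1": {"a": 0.6, "b": 0.4}, "2": {"a": 0.4, "b": 0.6}}
for _ in range(10):
    pi, A, B = baum_welch_step(obs, S, vocab, pi, A, B)
print(pi, A, B)
```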
An HMM for Information Extraction (Research Paper Headers)
What You Should Know

• Definition of an HMM
• What are the three problems associated with an HMM?
• Know how the following algorithms work
  – Viterbi algorithm
  – Forward & backward algorithms
• Know the basic idea of the Baum-Welch algorithm
Readings

• Read [Rabiner 89], sections I, II, III
• Read the “brief note”