CS 4705: Hidden Markov Models
10/2/19
Slides adapted from Dan Jurafsky and James Martin
Responses on "well"
• There is an oil well a few miles away from my house. Noun.
• Well! I never thought I would see that! Interjection.
• It is a well designed program. Adv.
• Tears welled in her eyes. Verb.
• The store sells fruit as well as vegetables. Conjunction.
• Are you well? Adj.
• He and his family were well off. Adjectival Phrase (?)
Announcements
• Reading for today: Chapter 7-7.5 (NLP), Ch. 8.4 Speech and Language
• The TAs will be offering tutorials on the math of neural nets and, in particular, backpropagation
Disambiguating "race"
[Figure slides stepping through the example of tagging "race" as NN vs. VB]
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) P(NR|VB) P(race|VB) = .00000027
• P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
• So we (correctly) choose the verb reading.
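As a quick check of the arithmetic above, here is a minimal Python sketch that multiplies out the two tag readings and picks the more likely one; the probability values are the ones listed on the slide.

```python
# Probabilities from the slide ("race" example)
p = {
    ("NN", "TO"): 0.00047,    # P(NN|TO)
    ("VB", "TO"): 0.83,       # P(VB|TO)
    ("race", "NN"): 0.00057,  # P(race|NN)
    ("race", "VB"): 0.00012,  # P(race|VB)
    ("NR", "VB"): 0.0027,     # P(NR|VB)
    ("NR", "NN"): 0.0012,     # P(NR|NN)
}

# Score each reading of "race" after "to" and before an NR word
score_vb = p[("VB", "TO")] * p[("NR", "VB")] * p[("race", "VB")]
score_nn = p[("NN", "TO")] * p[("NR", "NN")] * p[("race", "NN")]

print(f"verb reading: {score_vb:.2e}")   # ~2.7e-07
print(f"noun reading: {score_nn:.2e}")   # ~3.2e-10
print("choose:", "VB" if score_vb > score_nn else "NN")
```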
Definitions
• A weighted finite-state automaton adds probabilities to the arcs
• The probabilities on the arcs leaving any state must sum to one
• A Markov chain is a special case of a WFST
  • The input sequence uniquely determines which states the automaton will go through
• Markov chains can't represent inherently ambiguous problems
  • Assigns probabilities to unambiguous sequences
Markov chain for weather
Markov chain for words
Markov chain = "First-order observable Markov Model"
• A set of states
  • Q = q1, q2 ... qN; the state at time t is qt
• Transition probabilities:
  • A set of probabilities A = a01 a02 ... an1 ... ann
  • Each aij represents the probability of transitioning from state i to state j
  • The set of these is the transition probability matrix A
  • a_{ij} = P(q_t = j | q_{t-1} = i),   1 ≤ i, j ≤ N
  • ∑_{j=1}^{N} a_{ij} = 1,   1 ≤ i ≤ N
• Distinguished start and end states
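For concreteness, here is a small Python/NumPy sketch of a transition matrix for a three-state weather chain; the state names and numbers are illustrative assumptions, not the values from the figure, but the row-sum constraint above is checked explicitly.

```python
import numpy as np

# Hypothetical three-state weather Markov chain (illustrative values only).
states = ["HOT", "COLD", "WARM"]

# A[i, j] = P(q_t = j | q_{t-1} = i)
A = np.array([
    [0.6, 0.1, 0.3],   # from HOT
    [0.1, 0.8, 0.1],   # from COLD
    [0.3, 0.1, 0.6],   # from WARM
])

# Every row of the transition matrix must sum to one.
assert np.allclose(A.sum(axis=1), 1.0)

# Probability of the transition HOT -> COLD
print(A[states.index("HOT"), states.index("COLD")])  # 0.1
```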
Markov chain = "First-order observable Markov Model"
• Current state only depends on previous state
  • P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})
Another representation for start state
• Instead of start state
• Special initial probability vector π
  • An initial distribution over probability of start states
• Constraints:
  • π_i = P(q_1 = i),   1 ≤ i ≤ N
  • ∑_{j=1}^{N} π_j = 1
The weather figure using π
The weather figure: specific example
Markov chain for weather
• What is the probability of 4 consecutive rainy days?
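A minimal sketch of that computation, assuming an initial probability for the rainy state and a rainy-to-rainy transition probability; the .1 and .6 values are placeholders, not the numbers from the weather figure.

```python
# P(rainy, rainy, rainy, rainy) = pi_rainy * a_rr^3, where a_rr is the
# probability of staying in the rainy state. Values below are
# illustrative placeholders, not the ones from the weather figure.
pi_rainy = 0.1   # P(q1 = rainy)
a_rr = 0.6       # P(rainy | rainy)

p_four_rainy_days = pi_rainy * a_rr ** 3
print(p_four_rainy_days)  # 0.0216
```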
Hidden Markov Models
• We don't observe POS tags
  • We infer them from the words we see
• Observed events
• Hidden events
HMM for Ice Cream
• You are a climatologist in the year 2799
  • Studying global warming
• You can't find any records of the weather in New York, NY for the summer of 2007
• But you find Kathy McKeown's diary
  • Which lists how many ice creams Kathy ate every day that summer
• Our job: figure out how hot it was
Hidden Markov Model
• For Markov chains, the output symbols are the same as the states
  • See hot weather: we're in state hot
• But in part-of-speech tagging (and other things)
  • The output symbols are words
  • The hidden states are part-of-speech tags
• So we need an extension!
  • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
  • This means we don't know which state we are in.
Hidden Markov Models
• States Q = q1, q2 ... qN
• Observations O = o1, o2 ... oN
  • Each observation is a symbol from a vocabulary V = {v1, v2, ..., vV}
• Transition probabilities
  • Transition probability matrix A = {a_ij}
  • a_{ij} = P(q_t = j | q_{t-1} = i),   1 ≤ i, j ≤ N
• Observation likelihoods
  • Output probability matrix B = {b_i(k)}
  • b_i(k) = P(X_t = o_k | q_t = i)
• Special initial probability vector π
  • π_i = P(q_1 = i),   1 ≤ i ≤ N
Hidden Markov Models
• Some constraints:
  • π_i = P(q_1 = i),   1 ≤ i ≤ N
  • ∑_{j=1}^{N} π_j = 1
  • ∑_{j=1}^{N} a_{ij} = 1,   1 ≤ i ≤ N
  • ∑_{k=1}^{M} b_i(k) = 1
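To make the notation concrete, here is a small Python/NumPy sketch of an ice-cream HMM with two hidden states (HOT and COLD) and observations of 1-3 ice creams; the probability values are illustrative assumptions, not the ones from the figure, and the constraints above are checked explicitly.

```python
import numpy as np

# Hidden states and observation vocabulary for the ice-cream HMM.
# All probability values below are illustrative assumptions.
states = ["HOT", "COLD"]
vocab = [1, 2, 3]          # number of ice creams eaten in a day

pi = np.array([0.8, 0.2])  # pi_i = P(q1 = i)

# A[i, j] = P(q_t = j | q_{t-1} = i)
A = np.array([
    [0.7, 0.3],            # from HOT
    [0.4, 0.6],            # from COLD
])

# B[i, k] = b_i(k) = P(o_t = vocab[k] | q_t = i)
B = np.array([
    [0.2, 0.4, 0.4],       # HOT: more likely to eat 2 or 3 ice creams
    [0.5, 0.4, 0.1],       # COLD: more likely to eat 1
])

# The constraints from the slide:
assert np.isclose(pi.sum(), 1.0)          # sum_j pi_j = 1
assert np.allclose(A.sum(axis=1), 1.0)    # sum_j a_ij = 1 for each i
assert np.allclose(B.sum(axis=1), 1.0)    # sum_k b_i(k) = 1 for each i
```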
Assumptions
• Markov assumption:
  • P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})
• Output-independence assumption:
  • P(o_t | o_1 ... o_{t-1}, q_1 ... q_t) = P(o_t | q_t)
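Combining the two assumptions, the joint probability of a state sequence Q = q1 ... qT and an observation sequence O = o1 ... oT factors into local terms (a standard consequence of the assumptions above, not stated on the slide itself):

P(O, Q) = π_{q_1} · b_{q_1}(o_1) · ∏_{t=2}^{T} a_{q_{t-1} q_t} · b_{q_t}(o_t)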
McKeown task
• Given
  • Ice Cream Observation Sequence: 2, 1, 3, 2, 2, 2, 3 ...
• Produce:
  • Weather Sequence: H, C, H, H, H, C ...
HMM for ice cream
Different types of HMM structure
Bakis = left-to-right
Ergodic = fully-connected
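As an illustration of the two topologies (the matrices below are made-up examples, not from the slide): in a Bakis (left-to-right) HMM the transition matrix is upper-triangular, so states can only move forward, while an ergodic HMM allows any state-to-state transition.

```python
import numpy as np

# Bakis (left-to-right) HMM: no transitions back to earlier states,
# so the transition matrix is upper-triangular. Example values only.
A_bakis = np.array([
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

# Ergodic (fully-connected) HMM: every transition is allowed.
A_ergodic = np.array([
    [0.4, 0.3, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

assert np.allclose(A_bakis.sum(axis=1), 1.0)
assert np.allclose(A_ergodic.sum(axis=1), 1.0)
assert np.allclose(A_bakis, np.triu(A_bakis))  # left-to-right structure
```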
Transitions between the hidden states of the HMM, showing the A probabilities
B observation likelihoods for POS HMM
Three fundamental problems for HMMs
• Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
• Decoding: Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B. What kind of data would we need to learn the HMM parameters?
Decoding
• The best hidden sequence
  • Weather sequence in the ice cream task
  • POS sequence given an input sentence
• We could use argmax over the probability of each possible hidden state sequence
  • Why not?
• Viterbi algorithm
  • Dynamic programming algorithm
  • Uses a dynamic programming trellis
  • Each trellis cell v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence
Viterbi intuition: we are looking for the best 'path'
[Figure: trellis of candidate POS tags (VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP) for each word of the sentence "promised to back the bill"]
Slide from Dekang Lin
Intuition
• The value in each cell is computed by taking the MAX over all paths that lead to this cell.
• An extension of a path from state i at time t-1 is computed by multiplying:
  • the previous path probability v_{t-1}(i)
  • the transition probability a_{ij}
  • the observation likelihood b_j(o_t)
The Viterbi Algorithm
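Here is a minimal Python sketch of the Viterbi recursion described above, filling a trellis of maximum path probabilities with backpointers. The function name, variable names, and the toy ice-cream parameters in the usage lines are illustrative assumptions, not the textbook pseudocode itself.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for the observations.

    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = P(j | i)
    B:   (N, K) observation likelihoods, B[i, k] = P(symbol k | state i)
    obs: list of observation indices into the columns of B
    """
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))             # v[t, j]: best path probability ending in j at t
    back = np.zeros((T, N), dtype=int)

    v[0] = pi * B[:, obs[0]]         # initialization
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j] * B[j, obs[t]]
            back[t, j] = np.argmax(scores)   # best predecessor state
            v[t, j] = np.max(scores)

    # Follow backpointers from the best final state.
    path = [int(np.argmax(v[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(np.max(v[T - 1]))

# Toy ice-cream example (HOT=0, COLD=1; observations are 1, 2, or 3 ice creams).
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs = [2, 0, 2]                      # 3, 1, 3 ice creams
print(viterbi(pi, A, B, obs))        # ([0, 0, 0], 0.012544): HOT HOT HOT
```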
The A matrix for the POS HMM
What is P(VB|TO)? What is P(NN|TO)? Why does this make sense? What is P(TO|VB)? What is P(TO|NN)? Why does this make sense?
Why does this make sense?
The B matrix for the POS HMM
Look at P(want|VB) and P(want|NN). Give an explanation for the difference in the probabilities.
Viterbi example
[Worked trellis example at t=1: fill the cell for state j = NN reached from the start state i = S. The A matrix gives a transition probability of .041, which is multiplied by the B-matrix observation likelihood for the first word given NN (0 here), so the cell value is 0; the remaining cells at t=1 come out 0, 0, and .025.]
Show the 4 formulas you would use to compute the value at this node and the max.
Computing the likelihood of an observation
• Forward algorithm
• Exactly like the Viterbi algorithm, except:
  • To compute the probability of a state, sum the probabilities from each path
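A minimal sketch of that change, assuming the same pi/A/B layout as the Viterbi sketch above: the recursion is identical except the max over predecessor states becomes a sum.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(O | lambda): sum over all state paths instead of taking the max."""
    alpha = pi * B[:, obs[0]]                    # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]            # sum over predecessors, then emit
    return float(alpha.sum())                    # sum over final states

# Same toy ice-cream parameters as in the Viterbi sketch.
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(forward_likelihood(pi, A, B, [2, 0, 2]))   # P(3, 1, 3 ice creams)
```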
Error Analysis: ESSENTIAL!!!
• Look at a confusion matrix
• See what errors are causing problems
  • Noun (NN) vs Proper Noun (NNP) vs Adj (JJ)
  • Adverb (RB) vs Prep (IN) vs Noun (NN)
  • Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
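A quick way to produce such a confusion matrix from a tagged evaluation set, sketched in plain Python; the gold and predicted tag lists here are hypothetical examples.

```python
from collections import Counter

# Hypothetical gold and predicted tag sequences for an evaluation set.
gold      = ["NN", "VB", "JJ", "NNP", "RB", "IN", "VBD", "VBN"]
predicted = ["NN", "VB", "NN", "NN",  "IN", "IN", "VBN", "VBN"]

# confusion[(gold_tag, predicted_tag)] = count
confusion = Counter(zip(gold, predicted))

# Print the most frequent confusions, worst first.
for (g, p), count in confusion.most_common():
    if g != p:
        print(f"gold {g} tagged as {p}: {count}")
```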