Lecture 14: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ANLP | 30 October 2017
Recap: tagging

• POS tagging is a sequence labelling task.
• We can tackle it with a model (HMM) that uses two sources of information:
  – The word itself
  – The tags assigned to surrounding words
• The second source of information means we can't just tag each word independently.
Local Tagging

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• Choosing the best tag for each word independently, i.e. not considering tag context, gives the wrong answer (<s> CD NN NN </s>).
• Though NN is more frequent for 'bit', tagging it as VBD may yield a better sequence (<s> CD NN VBD </s>)
  – because P(VBD|NN) and P(</s>|VBD) are high.
Recap: HMM

• Elements of HMM:
  – Set of states (tags)
  – Output alphabet (word types)
  – Start state (beginning of sentence)
  – State transition probabilities P(t_i | t_{i−1})
  – Output probabilities from each state P(w_i | t_i)
Recap: HMM

• Given a sentence W = w_1 … w_n with tags T = t_1 … t_n, compute P(W,T) as:

  P(W,T) = ∏_{i=1}^{n} P(t_i | t_{i−1}) P(w_i | t_i)

• But we want to find argmax_T P(T|W) without enumerating all possible tag sequences T:
  – Use a greedy approximation, or
  – Use the Viterbi algorithm to store partial computations.
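For a given tag sequence, the joint probability above is just a left-to-right product. A minimal sketch (not from the lecture): it assumes hypothetical dicts `trans[prev][tag]` and `emit[tag][word]` holding P(t_i | t_{i−1}) and P(w_i | t_i), with "<s>" and "</s>" as start/end pseudo-tags, and it includes the end-of-sentence transition used later in the worked Viterbi example.

```python
def joint_prob(words, tags, trans, emit):
    """P(W,T): multiply transition and emission probs left to right."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * emit[t].get(w, 0.0)
        prev = t
    return p * trans[prev]["</s>"]  # end-of-sentence transition
```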
Greedy Tagging

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• For i = 1 to N: choose the tag that maximizes
  – transition probability P(t_i | t_{i−1}) ×
  – emission probability P(w_i | t_i)
• This uses tag context but is still suboptimal. Why?
  – It commits to a tag before seeing subsequent tags.
  – It could be the case that ALL possible next tags have low transition probabilities. E.g., if a tag is unlikely to occur at the end of the sentence, that is disregarded when going left to right.
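For concreteness, a minimal sketch of this greedy tagger (my code, assuming the same hypothetical `trans`/`emit` dict shapes as before). Each token's tag is fixed the moment it is chosen, which is exactly where the suboptimality comes from.

```python
def greedy_tag(words, tags, trans, emit):
    """Left-to-right greedy tagging: one argmax per token, O(TN) overall."""
    out = []
    prev = "<s>"
    for w in words:
        # commit to the tag maximizing transition * emission probability;
        # later words cannot revise this choice
        best = max(tags, key=lambda t: trans[prev][t] * emit[t].get(w, 0.0))
        out.append(best)
        prev = best
    return out
```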
Greedy vs. Dynamic Programming

• The greedy algorithm is fast: we just have to make one decision per token, and we're done.
  – Runtime complexity?
  – O(TN) with T tags, length-N sentence
• But subsequent words have no effect on each decision, so the result is likely to be suboptimal.
• Dynamic programming search gives an optimal global solution, but requires some bookkeeping (= more computation). It postpones the decision about any tag until we can be sure it's optimal.
Viterbi Tagging: intuition

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• Suppose we have already computed
  a) The best tag sequence for <s> … bit that ends in NN.
  b) The best tag sequence for <s> … bit that ends in VBD.
• Then, the best full sequence would be either
  – sequence (a) extended to include </s>, or
  – sequence (b) extended to include </s>.
Viterbi Tagging: intuition

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• But similarly, to get
  a) The best tag sequence for <s> … bit that ends in NN,
• we could extend one of:
  – The best tag sequence for <s> … dog that ends in NN.
  – The best tag sequence for <s> … dog that ends in VB.
• And so on…
Viterbi: high-level picture

• Want to find argmax_T P(T|W)
• Intuition: the best path of length i ending in state t must include the best path of length i−1 to the previous state. So,
  – Find the best path of length i−1 to each state.
  – Consider extending each of those by 1 step, to state t.
  – Take the best of those options as the best path to state t.
Viterbi algorithm

• Use a chart to store partial results as we go:
  – a T × N table, where v(t,i) is the probability* of the best state sequence for w_1 … w_i that ends in state t.
• Fill in columns from left to right, with

  v(t,i) = max_{t′} v(t′, i−1) · P(t | t′) · P(w_i | t)

  – The max is over each possible previous tag t′.
• Store a backtrace to show, for each cell, which state at i−1 we came from.

*Specifically, v(t,i) stores the max of the joint probability P(w_1 … w_i, t_1 … t_{i−1}, t_i = t | λ).
Transition and Output Probabilities

Transition matrix, P(t_i | t_{i−1}):

         Noun  Verb  Det   Prep  Adv   </s>
  <s>    .3    .1    .3    .2    .1    0
  Noun   .2    .4    .01   .3    .04   .05
  Verb   .3    .05   .3    .2    .1    .05
  Det    .9    .01   .01   .01   .07   0
  Prep   .4    .05   .4    .1    .05   0
  Adv    .1    .5    .1    .1    .1    .1

Emission matrix, P(w_i | t_i):

         a     cat   doctor  in    is    the   very
  Noun   0     .5    .4      0     .1    0     0
  Verb   0     0     .1      0     .9    0     0
  Det    .3    0     0       0     0     .7    0
  Prep   0     0     0       1.0   0     0     0
  Adv    0     0     0       .1    0     0     .9
Example

Suppose W = the doctor is in. Our initially empty table:

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun
  Verb
  Det
  Prep
  Adv
Filling in the first column

  v(Noun, the) = P(Noun | <s>) P(the | Noun) = .3(0) = 0
  …
  v(Det, the) = P(Det | <s>) P(the | Det) = .3(.7) = .21

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0
  Verb   0
  Det    .21
  Prep   0
  Adv    0
The second column

  P(Noun | Det) P(doctor | Noun) = .9(.4) = .36

  v(Noun, doctor) = max_{t′} v(t′, the) · P(Noun | t′) · P(doctor | Noun)
                  = max{0, 0, .21(.36), 0, 0} = .0756

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0       .0756
  Verb   0
  Det    .21
  Prep   0
  Adv    0
The second column

  P(Verb | Det) P(doctor | Verb) = .01(.1) = .001

  v(Verb, doctor) = max_{t′} v(t′, the) · P(Verb | t′) · P(doctor | Verb)
                  = max{0, 0, .21(.001), 0, 0} = .00021

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0       .0756
  Verb   0       .00021
  Det    .21     0
  Prep   0       0
  Adv    0       0
The third column

  P(Noun | Noun) P(is | Noun) = .2(.1) = .02
  P(Noun | Verb) P(is | Noun) = .3(.1) = .03

  v(Noun, is) = max_{t′} v(t′, doctor) · P(Noun | t′) · P(is | Noun)
              = max{.0756(.02), .00021(.03), 0, 0, 0} = .001512

  v      w1=the  w2=doctor  w3=is    w4=in  </s>
  Noun   0       .0756      .001512
  Verb   0       .00021
  Det    .21     0
  Prep   0       0
  Adv    0       0
The third column

  P(Verb | Noun) P(is | Verb) = .4(.9) = .36
  P(Verb | Verb) P(is | Verb) = .05(.9) = .045

  v(Verb, is) = max_{t′} v(t′, doctor) · P(Verb | t′) · P(is | Verb)
              = max{.0756(.36), .00021(.045), 0, 0, 0} = .027216

  v      w1=the  w2=doctor  w3=is    w4=in  </s>
  Noun   0       .0756      .001512
  Verb   0       .00021     .027216
  Det    .21     0          0
  Prep   0       0          0
  Adv    0       0          0
The fourth column

  P(Prep | Noun) P(in | Prep) = .3(1.0) = .3
  P(Prep | Verb) P(in | Prep) = .2(1.0) = .2

  v(Prep, in) = max_{t′} v(t′, is) · P(Prep | t′) · P(in | Prep)
              = max{.001512(.3), .027216(.2), 0, 0, 0} = .005443

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0
The fourth column

  P(Adv | Noun) P(in | Adv) = .04(.1) = .004
  P(Adv | Verb) P(in | Adv) = .1(.1) = .01

  v(Adv, in) = max_{t′} v(t′, is) · P(Adv | t′) · P(in | Adv)
             = max{.001512(.004), .027216(.01), 0, 0, 0} = .000272

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
End of sentence

  P(</s> | Prep) = 0
  P(</s> | Adv) = .1

  v(</s>) = max_{t′} v(t′, in) · P(</s> | t′)
          = max{0, 0, 0, .005443(0), .000272(.1)} = .0000272

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
Completed Viterbi Chart

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
Following the Backtraces

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272

• Starting from v(</s>), follow the backtraces right to left: </s> came from Adv (not Prep, since P(</s> | Prep) = 0), Adv came from Verb, Verb from Noun, and Noun from Det.
• Best tag sequence: Det Noun Verb Adv
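The whole worked example fits in a short program. The following is an illustrative sketch only (dicts rather than a T × N array, names of my choosing), using the transition and emission tables from the earlier slide; running it reproduces the chart's winning path and final probability.

```python
TRANS = {  # P(t_i | t_{i-1}); each row is the conditioning (previous) tag
    "<s>":  {"Noun": .3, "Verb": .1, "Det": .3, "Prep": .2, "Adv": .1, "</s>": 0},
    "Noun": {"Noun": .2, "Verb": .4, "Det": .01, "Prep": .3, "Adv": .04, "</s>": .05},
    "Verb": {"Noun": .3, "Verb": .05, "Det": .3, "Prep": .2, "Adv": .1, "</s>": .05},
    "Det":  {"Noun": .9, "Verb": .01, "Det": .01, "Prep": .01, "Adv": .07, "</s>": 0},
    "Prep": {"Noun": .4, "Verb": .05, "Det": .4, "Prep": .1, "Adv": .05, "</s>": 0},
    "Adv":  {"Noun": .1, "Verb": .5, "Det": .1, "Prep": .1, "Adv": .1, "</s>": .1},
}
EMIT = {  # P(w_i | t_i); words not listed have probability 0
    "Noun": {"cat": .5, "doctor": .4, "is": .1},
    "Verb": {"doctor": .1, "is": .9},
    "Det":  {"a": .3, "the": .7},
    "Prep": {"in": 1.0},
    "Adv":  {"in": .1, "very": .9},
}
TAGS = ["Noun", "Verb", "Det", "Prep", "Adv"]

def viterbi(words):
    """Return the best tag sequence and its probability v(</s>)."""
    v = [{}]        # v[i][t]: prob of best sequence for w_1..w_i ending in t
    back = [{}]     # back[i][t]: best previous tag (the backtrace)
    for t in TAGS:  # first column: only <s> can precede
        v[0][t] = TRANS["<s>"][t] * EMIT[t].get(words[0], 0.0)
        back[0][t] = "<s>"
    for i in range(1, len(words)):
        v.append({}); back.append({})
        for t in TAGS:
            # max over each possible previous tag t'
            prev = max(TAGS, key=lambda tp: v[i-1][tp] * TRANS[tp][t])
            v[i][t] = v[i-1][prev] * TRANS[prev][t] * EMIT[t].get(words[i], 0.0)
            back[i][t] = prev
    # end of sentence: one more transition, into </s>
    last = max(TAGS, key=lambda tp: v[-1][tp] * TRANS[tp]["</s>"])
    prob = v[-1][last] * TRANS[last]["</s>"]
    tags = [last]   # follow the backtraces right to left
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags)), prob

print(viterbi("the doctor is in".split()))
# (['Det', 'Noun', 'Verb', 'Adv'], ~2.72e-05), matching the chart
```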
Implementation and efficiency

• For sequence length N with T possible tags,
  – Enumeration takes O(T^N) time and O(N) space.
  – Bigram Viterbi takes O(T²N) time and O(TN) space.
  – Viterbi is exhaustive: further speedups might be had using methods that prune the search space.
• As with N-gram models, chart probs get really tiny really fast, causing underflow.
  – So, we use costs (negative log probs) instead.
  – Take the minimum over a sum of costs, instead of the maximum over a product of probs.
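A small sketch of the costs trick (my code, not from the slides): negative log probabilities turn long products into sums, which never underflow.

```python
import math

INF = float("inf")

def cost(p):
    """Negative log probability; INF stands in for -log 0."""
    return -math.log(p) if p > 0 else INF

# Viterbi recurrence with probabilities:
#   v(t,i) = max_{t'} v(t',i-1) * P(t|t') * P(w_i|t)
# the same recurrence with costs:
#   c(t,i) = min_{t'} c(t',i-1) + cost(P(t|t')) + cost(P(w_i|t))
```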
Higher-order Viterbi

• For a tag trigram model with T possible tags, we effectively need T² states
  – n-gram Viterbi requires T^{n−1} states, takes O(T^n N) time and O(T^{n−1} N) space.

[Figure: composite states for a trigram model, e.g. state (Noun, Verb) can transition to (Verb, Noun), (Verb, Prep), (Verb, Verb), or (Verb, </s>).]
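A sketch of what "T² states" means in practice, assuming tag-pair composite states (hypothetical code, not from the lecture):

```python
tags = ["Noun", "Verb", "Det", "Prep", "Adv"]

# Each trigram-Viterbi state is a pair of adjacent tags: T^2 states in all.
states = [(t1, t2) for t1 in tags for t2 in tags]

def successors(state):
    """(t1, t2) can only move to states of the form (t2, t3)."""
    _, t2 = state
    return [(t2, t3) for t3 in tags]
```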
HMMs: what else?

• Using Viterbi, we can find the best tags for a sentence (decoding), and get P(W,T).
• We might also want to
  – Compute the likelihood P(W), i.e., the probability of a sentence regardless of its tags (a language model!)
  – Learn the best set of parameters (transition & emission probs.) given only an unannotated corpus of sentences.
Computing the likelihood

• From probability theory, we know that

  P(W) = ∑_T P(W,T)

• There are an exponential number of Ts.
• Again, by computing and storing partial results, we can solve efficiently.
• (Advanced slides show the algorithm for those who are interested!)
Summary

• HMM: a generative model of sentences using a hidden state sequence
• Greedy tagging: fast but suboptimal
• Dynamic programming algorithms to compute
  – Best tag sequence given words (Viterbi algorithm)
  – Likelihood (forward algorithm; see advanced slides)
  – Best parameters from an unannotated corpus (forward-backward algorithm, an instance of EM; see advanced slides)
Advanced Topics

(the following slides are just for people who are interested)
Notation

• Sequence of observations over time: o1, o2, …, oN
  – here, words in a sentence
• Vocabulary size V of possible observations
• Set of possible states q1, q2, …, qT (see note, next slide)
  – here, tags
• A, a T×T matrix of transition probabilities
  – a_ij: the prob of transitioning from state i to j.
• B, a T×V matrix of output probabilities
  – b_i(o_t): the prob of emitting o_t from state i.
Note on notation

• J&M use q1, q2, …, qN for the set of states, but also use q1, q2, …, qN for the state sequence over time.
  – So, just seeing q1 is ambiguous (though usually disambiguated from context).
  – I'll instead use qi for state names, and qn for the state at time n.
  – So we could have qn = qi, meaning: the state we're in at time n is qi.
HMM example w/ new notation

• States {q1, q2} (or {<s>, q1, q2}): think NN, VB
• Output symbols {x, y, z}: think chair, dog, help

[Figure, adapted from Manning & Schuetze, Fig 9.2: Start goes to q1; transitions q1→q1 = .7, q1→q2 = .3, q2→q1 = .5, q2→q2 = .5; q1 emits x, y, z with probs .6, .1, .3; q2 emits x, y, z with probs .1, .7, .2.]
HMM example w/ new notation

• A possible sequence of outputs for this HMM:
  z y y x y z x z z
• A possible sequence of states for this HMM:
  q1 q2 q2 q1 q1 q2 q1 q1 q1
• For these examples, N = 9, q3 = q2 and o3 = y
Transition and Output Probabilities

• Transition matrix A: a_ij = P(qj | qi)

         q1   q2
  <s>    1    0
  q1     .7   .3
  q2     .5   .5

  Ex: P(qn = q2 | qn−1 = q1) = .3

• Output matrix B: b_i(o) = P(o | qi)

         x    y    z
  q1     .6   .1   .3
  q2     .1   .7   .2

  Ex: P(on = y | qn = q1) = .1
Forward algorithm

• Use a table with cells α(j,t): the probability of being in state j after seeing o1 … ot (the forward probability).

  α(j,t) = P(o1, o2, … ot, q_t = j | λ)

• Fill in columns from left to right, with

  α(j,t) = ∑_{i=1}^{T} α(i, t−1) · a_ij · b_j(o_t)

  – Same as Viterbi, but sum instead of max (and no backtrace).

Note: because there's a sum, we can't use the trick that replaces probs with costs. For implementation info, see http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf and http://stackoverflow.com/questions/13391625/underflow-in-forward-algorithm-for-hmms .
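One standard fix discussed in the links above is per-column rescaling: normalize each α column and accumulate the log of the normalizers. A sketch under that assumption (function and argument names are mine; it assumes the observation sequence has nonzero probability):

```python
import math

def forward_logprob(obs, A, B, states):
    """log P(O | lambda) via the forward algorithm with per-column rescaling.

    A[i][j] = transition prob, B[j][o] = emission prob, "<s>" = start state.
    """
    logp = 0.0
    alpha = {j: A["<s>"][j] * B[j][obs[0]] for j in states}
    for t, o in enumerate(obs):
        if t > 0:
            alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                     for j in states}
        z = sum(alpha.values())        # normalizer: P(o_t | o_1..o_{t-1})
        logp += math.log(z)            # log P(o_1..o_t) accumulates
        alpha = {j: a / z for j, a in alpha.items()}  # rescaled column
    return logp
```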
Example

• Suppose O = x z y. Our initially empty table:

  α     o1=x  o2=z  o3=y
  q1
  q2
Filling the first column

  α(1,1) = a_{<s>,1} · b_1(x) = 1(.6) = .6
  α(2,1) = a_{<s>,2} · b_2(x) = 0(.1) = 0

  α     o1=x  o2=z  o3=y
  q1    .6
  q2    0
Starting the second column

  α(1,2) = ∑_{i=1}^{T} α(i,1) · a_i1 · b_1(z)
         = α(1,1) · a_11 · b_1(z) + α(2,1) · a_21 · b_1(z)
         = .6(.7)(.3) + 0(.5)(.3)
         = .126

  α     o1=x  o2=z  o3=y
  q1    .6    .126
  q2    0
Finishing the second column

  α(2,2) = ∑_{i=1}^{T} α(i,1) · a_i2 · b_2(z)
         = α(1,1) · a_12 · b_2(z) + α(2,1) · a_22 · b_2(z)
         = .6(.3)(.2) + 0(.5)(.2)
         = .036

  α     o1=x  o2=z  o3=y
  q1    .6    .126
  q2    0     .036
Third column and finish

• Add up all probabilities in the last column to get the probability of the entire sequence:

  P(O|λ) = ∑_{i=1}^{T} α(i, N) = .01062 + .03906 = .04968

  α     o1=x  o2=z  o3=y
  q1    .6    .126  .01062
  q2    0     .036  .03906
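A minimal sketch of the (unscaled) forward algorithm that reproduces the table above; the A and B dicts below just restate the toy model's matrices, and the rest is my illustrative code.

```python
A = {"<s>": {"q1": 1.0, "q2": 0.0},
     "q1":  {"q1": 0.7, "q2": 0.3},
     "q2":  {"q1": 0.5, "q2": 0.5}}
B = {"q1": {"x": 0.6, "y": 0.1, "z": 0.3},
     "q2": {"x": 0.1, "y": 0.7, "z": 0.2}}
STATES = ["q1", "q2"]

def forward(obs):
    """P(O | lambda): like Viterbi, but summing instead of maximizing."""
    alpha = {j: A["<s>"][j] * B[j][obs[0]] for j in STATES}  # first column
    for o in obs[1:]:
        # alpha(j,t) = sum_i alpha(i,t-1) * a_ij * b_j(o_t)
        alpha = {j: sum(alpha[i] * A[i][j] for i in STATES) * B[j][o]
                 for j in STATES}
    return sum(alpha.values())  # add up the last column

print(forward(["x", "z", "y"]))  # ~0.04968, matching the table
```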
Learning

• Given only the output sequence, learn the best set of parameters λ = (A, B).
• Assume 'best' = maximum likelihood.
• Other definitions are possible; we won't discuss them here.
Unsupervised learning

• Training an HMM from an annotated corpus is simple.
  – Supervised learning: we have examples labelled with the right 'answers' (here, tags): no hidden variables in training.
• Training from an unannotated corpus is trickier.
  – Unsupervised learning: we have no examples labelled with the right 'answers': all we see are outputs, and the state sequence is hidden.
Circularity

• If we know the state sequence, we can find the best λ.
  – E.g., use MLE:  P(qj | qi) = C(qi → qj) / C(qi)
• If we know λ, we can find the best state sequence.
  – use Viterbi
• But we don't know either!
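A sketch of that MLE step, given a hypothetical corpus of tag sequences (each including <s> and </s> markers); this is just the relative-frequency formula above in code.

```python
from collections import Counter

def mle_transitions(tag_sequences):
    """Estimate P(q_j | q_i) = C(q_i -> q_j) / C(q_i) by relative frequency."""
    bigrams, unigrams = Counter(), Counter()
    for seq in tag_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return {(i, j): c / unigrams[i] for (i, j), c in bigrams.items()}

# e.g. mle_transitions([["<s>", "Det", "Noun", "</s>"]])
```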
Expectation-maximization (EM)

As in spelling correction, we can use EM to bootstrap, iteratively updating the parameters and hidden variables.

• Initialize parameters λ(0)
• At each iteration k,
  – E-step: Compute expected counts using λ(k−1)
  – M-step: Set λ(k) using MLE on the expected counts
• Repeat until λ doesn't change (or other stopping criterion).
Expected counts??

Counting transitions from qi → qj:

• Real counts:
  – count 1 each time we see qi → qj in the true tag sequence.
• Expected counts:
  – With current λ, compute probs of all possible tag sequences.
  – If sequence Q has probability p, count p for each qi → qj in Q.
  – Add up these fractional counts across all possible sequences.
Example

• Notionally, we compute expected counts as follows. (Only sequences starting with q1 are possible here, since a_{<s>,q2} = 0.)

  Observations:        x   z   y

  Possible sequence    Probability of sequence
  Q1 = q1 q1 q1        p1
  Q2 = q1 q2 q1        p2
  Q3 = q1 q1 q2        p3
  Q4 = q1 q2 q2        p4

  Ĉ(q1 → q1) = 2·p1 + p3
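For this toy model the expected counts really can be computed by brute-force enumeration. A sketch (my code, reusing the A and B matrices from the earlier slides, and normalizing the joint sequence probabilities by P(O|λ) so they sum to one):

```python
import itertools

A = {"<s>": {"q1": 1.0, "q2": 0.0},
     "q1":  {"q1": 0.7, "q2": 0.3},
     "q2":  {"q1": 0.5, "q2": 0.5}}
B = {"q1": {"x": 0.6, "y": 0.1, "z": 0.3},
     "q2": {"x": 0.1, "y": 0.7, "z": 0.2}}
obs = ["x", "z", "y"]

expected = {}  # expected count of each q_i -> q_j transition
total = 0.0    # accumulates P(O | lambda)
for Q in itertools.product(["q1", "q2"], repeat=len(obs)):
    # joint probability P(O, Q | lambda) of this state sequence
    p, prev = 1.0, "<s>"
    for q, o in zip(Q, obs):
        p *= A[prev][q] * B[q][o]
        prev = q
    total += p
    # count p for each transition occurring within Q
    for a, b in zip(Q, Q[1:]):
        expected[(a, b)] = expected.get((a, b), 0.0) + p

for trans, c in sorted(expected.items()):
    print(trans, c / total)  # fractional counts, normalized by P(O|lambda)
```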
Forward-Backward algorithm

• As usual, avoid enumerating all possible sequences.
• The Forward-Backward (Baum-Welch) algorithm computes expected counts using forward probabilities and backward probabilities:

  β(j,t) = P(o_{t+1}, o_{t+2}, … o_N | q_t = j, λ)

  – For details, see J&M 6.5.
• The EM idea is much more general: it can be used for many latent-variable models.
Guarantees

• EM is guaranteed to find a local maximum of the likelihood.

[Figure: P(O|λ) plotted against values of λ, illustrating local vs. global maxima.]

• Not guaranteed to find the global maximum.
• Practical issues: initialization, random restarts, early stopping.
• Fact is, it doesn't work well for learning POS taggers!