Lecture 14: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ANLP | 30 October 2017
Recap: tagging

• POS tagging is a sequence labelling task.
• We can tackle it with a model (HMM) that uses two sources of information:
  – The word itself
  – The tags assigned to surrounding words
• The second source of information means we can't just tag each word independently.
Local Tagging

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• Choosing the best tag for each word independently, i.e. not considering tag context, gives the wrong answer (<s> CD NN NN </s>).
• Though NN is more frequent for 'bit', tagging it as VBD may yield a better sequence (<s> CD NN VBD </s>)
  – because P(VBD|NN) and P(</s>|VBD) are high.
Recap: HMM

• Elements of HMM:
  – Set of states (tags)
  – Output alphabet (word types)
  – Start state (beginning of sentence)
  – State transition probabilities P(t_i | t_{i−1})
  – Output probabilities from each state P(w_i | t_i)
Recap: HMM

• Given a sentence W = w_1 … w_n with tags T = t_1 … t_n, compute P(W,T) as:

  P(W,T) = ∏_{i=1}^{n} P(t_i | t_{i−1}) P(w_i | t_i)

• But we want to find argmax_T P(T|W) without enumerating all possible tag sequences T:
  – Use a greedy approximation, or
  – Use the Viterbi algorithm to store partial computations.
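For a given tag sequence, the joint probability above is just a left-to-right product. A minimal sketch (not from the lecture): it assumes hypothetical dicts `trans[prev][tag]` and `emit[tag][word]` holding P(t_i | t_{i−1}) and P(w_i | t_i), with "<s>" and "</s>" as start/end pseudo-tags, and it includes the end-of-sentence transition used later in the worked Viterbi example.

```python
def joint_prob(words, tags, trans, emit):
    """P(W,T): multiply transition and emission probs left to right."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * emit[t].get(w, 0.0)
        prev = t
    return p * trans[prev]["</s>"]  # end-of-sentence transition
```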
Greedy Tagging

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• For i = 1 to N: choose the tag that maximizes
  – transition probability P(t_i | t_{i−1}) ×
  – emission probability P(w_i | t_i)
• This uses tag context but is still suboptimal. Why?
  – It commits to a tag before seeing subsequent tags.
  – It could be the case that ALL possible next tags have low transition probabilities. E.g., if a tag is unlikely to occur at the end of the sentence, that is disregarded when going left to right.
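For concreteness, a minimal sketch of this greedy tagger (my code, assuming the same hypothetical `trans`/`emit` dict shapes as before). Each token's tag is fixed the moment it is chosen, which is exactly where the suboptimality comes from.

```python
def greedy_tag(words, tags, trans, emit):
    """Left-to-right greedy tagging: one argmax per token, O(TN) overall."""
    out = []
    prev = "<s>"
    for w in words:
        # commit to the tag maximizing transition * emission probability;
        # later words cannot revise this choice
        best = max(tags, key=lambda t: trans[prev][t] * emit[t].get(w, 0.0))
        out.append(best)
        prev = best
    return out
```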
Greedy vs. Dynamic Programming

• The greedy algorithm is fast: we just have to make one decision per token, and we're done.
  – Runtime complexity?
  – O(TN) with T tags, length-N sentence
• But subsequent words have no effect on each decision, so the result is likely to be suboptimal.
• Dynamic programming search gives an optimal global solution, but requires some bookkeeping (= more computation). It postpones the decision about any tag until we can be sure it's optimal.
Viterbi Tagging: intuition

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• Suppose we have already computed
  a) The best tag sequence for <s> … bit that ends in NN.
  b) The best tag sequence for <s> … bit that ends in VBD.
• Then, the best full sequence would be either
  – sequence (a) extended to include </s>, or
  – sequence (b) extended to include </s>.
Viterbi Tagging: intuition

Words:           <s>   one   dog   bit   </s>
Possible tags    <s>   CD    NN    NN    </s>
(ordered by            NN    VB    VBD
frequency for          PRP
each word):

• But similarly, to get
  a) The best tag sequence for <s> … bit that ends in NN,
• we could extend one of:
  – The best tag sequence for <s> … dog that ends in NN.
  – The best tag sequence for <s> … dog that ends in VB.
• And so on…
Viterbi: high-level picture

• Want to find argmax_T P(T|W)
• Intuition: the best path of length i ending in state t must include the best path of length i−1 to the previous state. So,
  – Find the best path of length i−1 to each state.
  – Consider extending each of those by 1 step, to state t.
  – Take the best of those options as the best path to state t.
Viterbi algorithm

• Use a chart to store partial results as we go:
  – a T × N table, where v(t,i) is the probability* of the best state sequence for w_1 … w_i that ends in state t.
• Fill in columns from left to right, with

  v(t,i) = max_{t′} v(t′, i−1) · P(t | t′) · P(w_i | t)

  – The max is over each possible previous tag t′.
• Store a backtrace to show, for each cell, which state at i−1 we came from.

*Specifically, v(t,i) stores the max of the joint probability P(w_1 … w_i, t_1 … t_{i−1}, t_i = t | λ).
Transition and Output Probabilities

Transition matrix, P(t_i | t_{i−1}):

         Noun  Verb  Det   Prep  Adv   </s>
  <s>    .3    .1    .3    .2    .1    0
  Noun   .2    .4    .01   .3    .04   .05
  Verb   .3    .05   .3    .2    .1    .05
  Det    .9    .01   .01   .01   .07   0
  Prep   .4    .05   .4    .1    .05   0
  Adv    .1    .5    .1    .1    .1    .1

Emission matrix, P(w_i | t_i):

         a     cat   doctor  in    is    the   very
  Noun   0     .5    .4      0     .1    0     0
  Verb   0     0     .1      0     .9    0     0
  Det    .3    0     0       0     0     .7    0
  Prep   0     0     0       1.0   0     0     0
  Adv    0     0     0       .1    0     0     .9
Example

Suppose W = the doctor is in. Our initially empty table:

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun
  Verb
  Det
  Prep
  Adv
Filling in the first column

  v(Noun, the) = P(Noun | <s>) P(the | Noun) = .3(0) = 0
  …
  v(Det, the) = P(Det | <s>) P(the | Det) = .3(.7) = .21

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0
  Verb   0
  Det    .21
  Prep   0
  Adv    0
The second column

  P(Noun | Det) P(doctor | Noun) = .9(.4) = .36

  v(Noun, doctor) = max_{t′} v(t′, the) · P(Noun | t′) · P(doctor | Noun)
                  = max{0, 0, .21(.36), 0, 0} = .0756

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0       .0756
  Verb   0
  Det    .21
  Prep   0
  Adv    0
The second column

  P(Verb | Det) P(doctor | Verb) = .01(.1) = .001

  v(Verb, doctor) = max_{t′} v(t′, the) · P(Verb | t′) · P(doctor | Verb)
                  = max{0, 0, .21(.001), 0, 0} = .00021

  v      w1=the  w2=doctor  w3=is  w4=in  </s>
  Noun   0       .0756
  Verb   0       .00021
  Det    .21     0
  Prep   0       0
  Adv    0       0
The third column

  P(Noun | Noun) P(is | Noun) = .2(.1) = .02
  P(Noun | Verb) P(is | Noun) = .3(.1) = .03

  v(Noun, is) = max_{t′} v(t′, doctor) · P(Noun | t′) · P(is | Noun)
              = max{.0756(.02), .00021(.03), 0, 0, 0} = .001512

  v      w1=the  w2=doctor  w3=is    w4=in  </s>
  Noun   0       .0756      .001512
  Verb   0       .00021
  Det    .21     0
  Prep   0       0
  Adv    0       0
The third column

  P(Verb | Noun) P(is | Verb) = .4(.9) = .36
  P(Verb | Verb) P(is | Verb) = .05(.9) = .045

  v(Verb, is) = max_{t′} v(t′, doctor) · P(Verb | t′) · P(is | Verb)
              = max{.0756(.36), .00021(.045), 0, 0, 0} = .027216

  v      w1=the  w2=doctor  w3=is    w4=in  </s>
  Noun   0       .0756      .001512
  Verb   0       .00021     .027216
  Det    .21     0          0
  Prep   0       0          0
  Adv    0       0          0
The fourth column

  P(Prep | Noun) P(in | Prep) = .3(1.0) = .3
  P(Prep | Verb) P(in | Prep) = .2(1.0) = .2

  v(Prep, in) = max_{t′} v(t′, is) · P(Prep | t′) · P(in | Prep)
              = max{.001512(.3), .027216(.2), 0, 0, 0} = .005443

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0
The fourth column

  P(Adv | Noun) P(in | Adv) = .04(.1) = .004
  P(Adv | Verb) P(in | Adv) = .1(.1) = .01

  v(Adv, in) = max_{t′} v(t′, is) · P(Adv | t′) · P(in | Adv)
             = max{.001512(.004), .027216(.01), 0, 0, 0} = .000272

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
End of sentence

  P(</s> | Prep) = 0
  P(</s> | Adv) = .1

  v(</s>) = max_{t′} v(t′, in) · P(</s> | t′)
          = max{0, 0, 0, .005443(0), .000272(.1)} = .0000272

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
Completed Viterbi Chart

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272
Following the Backtraces

  v      w1=the  w2=doctor  w3=is    w4=in     </s>
  Noun   0       .0756      .001512  0         .0000272
  Verb   0       .00021     .027216  0
  Det    .21     0          0        0
  Prep   0       0          0        .005443
  Adv    0       0          0        .000272

• Starting from v(</s>), follow the backtraces right to left: </s> came from Adv (not Prep, since P(</s> | Prep) = 0), Adv came from Verb, Verb from Noun, and Noun from Det.
• Best tag sequence: Det Noun Verb Adv
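The whole worked example fits in a short program. The following is an illustrative sketch only (dicts rather than a T × N array, names of my choosing), using the transition and emission tables from the earlier slide; running it reproduces the chart's winning path and final probability.

```python
TRANS = {  # P(t_i | t_{i-1}); each row is the conditioning (previous) tag
    "<s>":  {"Noun": .3, "Verb": .1, "Det": .3, "Prep": .2, "Adv": .1, "</s>": 0},
    "Noun": {"Noun": .2, "Verb": .4, "Det": .01, "Prep": .3, "Adv": .04, "</s>": .05},
    "Verb": {"Noun": .3, "Verb": .05, "Det": .3, "Prep": .2, "Adv": .1, "</s>": .05},
    "Det":  {"Noun": .9, "Verb": .01, "Det": .01, "Prep": .01, "Adv": .07, "</s>": 0},
    "Prep": {"Noun": .4, "Verb": .05, "Det": .4, "Prep": .1, "Adv": .05, "</s>": 0},
    "Adv":  {"Noun": .1, "Verb": .5, "Det": .1, "Prep": .1, "Adv": .1, "</s>": .1},
}
EMIT = {  # P(w_i | t_i); words not listed have probability 0
    "Noun": {"cat": .5, "doctor": .4, "is": .1},
    "Verb": {"doctor": .1, "is": .9},
    "Det":  {"a": .3, "the": .7},
    "Prep": {"in": 1.0},
    "Adv":  {"in": .1, "very": .9},
}
TAGS = ["Noun", "Verb", "Det", "Prep", "Adv"]

def viterbi(words):
    """Return the best tag sequence and its probability v(</s>)."""
    v = [{}]        # v[i][t]: prob of best sequence for w_1..w_i ending in t
    back = [{}]     # back[i][t]: best previous tag (the backtrace)
    for t in TAGS:  # first column: only <s> can precede
        v[0][t] = TRANS["<s>"][t] * EMIT[t].get(words[0], 0.0)
        back[0][t] = "<s>"
    for i in range(1, len(words)):
        v.append({}); back.append({})
        for t in TAGS:
            # max over each possible previous tag t'
            prev = max(TAGS, key=lambda tp: v[i-1][tp] * TRANS[tp][t])
            v[i][t] = v[i-1][prev] * TRANS[prev][t] * EMIT[t].get(words[i], 0.0)
            back[i][t] = prev
    # end of sentence: one more transition, into </s>
    last = max(TAGS, key=lambda tp: v[-1][tp] * TRANS[tp]["</s>"])
    prob = v[-1][last] * TRANS[last]["</s>"]
    tags = [last]   # follow the backtraces right to left
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags)), prob

print(viterbi("the doctor is in".split()))
# (['Det', 'Noun', 'Verb', 'Adv'], ~2.72e-05), matching the chart
```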
Implementation and efficiency

• For sequence length N with T possible tags,
  – Enumeration takes O(T^N) time and O(N) space.
  – Bigram Viterbi takes O(T²N) time and O(TN) space.
  – Viterbi is exhaustive: further speedups might be had using methods that prune the search space.
• As with N-gram models, chart probs get really tiny really fast, causing underflow.
  – So, we use costs (negative log probs) instead.
  – Take the minimum over a sum of costs, instead of the maximum over a product of probs.
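A small sketch of the costs trick (my code, not from the slides): negative log probabilities turn long products into sums, which never underflow.

```python
import math

INF = float("inf")

def cost(p):
    """Negative log probability; INF stands in for -log 0."""
    return -math.log(p) if p > 0 else INF

# Viterbi recurrence with probabilities:
#   v(t,i) = max_{t'} v(t',i-1) * P(t|t') * P(w_i|t)
# the same recurrence with costs:
#   c(t,i) = min_{t'} c(t',i-1) + cost(P(t|t')) + cost(P(w_i|t))
```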
Higher-order Viterbi

• For a tag trigram model with T possible tags, we effectively need T² states
  – n-gram Viterbi requires T^{n−1} states, takes O(T^n N) time and O(T^{n−1} N) space.

[Figure: composite states for a trigram model, e.g. state (Noun, Verb) can transition to (Verb, Noun), (Verb, Prep), (Verb, Verb), or (Verb, </s>).]
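A sketch of what "T² states" means in practice, assuming tag-pair composite states (hypothetical code, not from the lecture):

```python
tags = ["Noun", "Verb", "Det", "Prep", "Adv"]

# Each trigram-Viterbi state is a pair of adjacent tags: T^2 states in all.
states = [(t1, t2) for t1 in tags for t2 in tags]

def successors(state):
    """(t1, t2) can only move to states of the form (t2, t3)."""
    _, t2 = state
    return [(t2, t3) for t3 in tags]
```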
HMMs: what else?

• Using Viterbi, we can find the best tags for a sentence (decoding), and get P(W,T).
• We might also want to
  – Compute the likelihood P(W), i.e., the probability of a sentence regardless of its tags (a language model!)
  – Learn the best set of parameters (transition & emission probs.) given only an unannotated corpus of sentences.
Computing the likelihood

• From probability theory, we know that

  P(W) = ∑_T P(W,T)

• There are an exponential number of Ts.
• Again, by computing and storing partial results, we can solve efficiently.
• (Advanced slides show the algorithm for those who are interested!)
Summary

• HMM: a generative model of sentences using a hidden state sequence
• Greedy tagging: fast but suboptimal
• Dynamic programming algorithms to compute
  – Best tag sequence given words (Viterbi algorithm)
  – Likelihood (forward algorithm; see advanced slides)
  – Best parameters from an unannotated corpus (forward-backward algorithm, an instance of EM; see advanced slides)
Advanced Topics

(the following slides are just for people who are interested)
Notation

• Sequence of observations over time: o1, o2, …, oN
  – here, words in a sentence
• Vocabulary size V of possible observations
• Set of possible states q1, q2, …, qT (see note, next slide)
  – here, tags
• A, a T×T matrix of transition probabilities
  – a_ij: the prob of transitioning from state i to j.
• B, a T×V matrix of output probabilities
  – b_i(o_t): the prob of emitting o_t from state i.
Note on notation

• J&M use q1, q2, …, qN for the set of states, but also use q1, q2, …, qN for the state sequence over time.
  – So, just seeing q1 is ambiguous (though usually disambiguated from context).
  – I'll instead use qi for state names, and qn for the state at time n.
  – So we could have qn = qi, meaning: the state we're in at time n is qi.
HMM example w/ new notation

• States {q1, q2} (or {<s>, q1, q2}): think NN, VB
• Output symbols {x, y, z}: think chair, dog, help

[Figure, adapted from Manning & Schuetze, Fig 9.2: Start goes to q1; transitions q1→q1 = .7, q1→q2 = .3, q2→q1 = .5, q2→q2 = .5; q1 emits x, y, z with probs .6, .1, .3; q2 emits x, y, z with probs .1, .7, .2.]
HMM example w/ new notation

• A possible sequence of outputs for this HMM:
  z y y x y z x z z
• A possible sequence of states for this HMM:
  q1 q2 q2 q1 q1 q2 q1 q1 q1
• For these examples, N = 9, q3 = q2 and o3 = y
Transition and Output Probabilities

• Transition matrix A: a_ij = P(qj | qi)

         q1   q2
  <s>    1    0
  q1     .7   .3
  q2     .5   .5

  Ex: P(qn = q2 | qn−1 = q1) = .3

• Output matrix B: b_i(o) = P(o | qi)

         x    y    z
  q1     .6   .1   .3
  q2     .1   .7   .2

  Ex: P(on = y | qn = q1) = .1
Forward algorithm

• Use a table with cells α(j,t): the probability of being in state j after seeing o1 … ot (the forward probability).

  α(j,t) = P(o1, o2, … ot, q_t = j | λ)

• Fill in columns from left to right, with

  α(j,t) = ∑_{i=1}^{T} α(i, t−1) · a_ij · b_j(o_t)

  – Same as Viterbi, but sum instead of max (and no backtrace).

Note: because there's a sum, we can't use the trick that replaces probs with costs. For implementation info, see http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf and http://stackoverflow.com/questions/13391625/underflow-in-forward-algorithm-for-hmms .
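One standard fix discussed in the links above is per-column rescaling: normalize each α column and accumulate the log of the normalizers. A sketch under that assumption (function and argument names are mine; it assumes the observation sequence has nonzero probability):

```python
import math

def forward_logprob(obs, A, B, states):
    """log P(O | lambda) via the forward algorithm with per-column rescaling.

    A[i][j] = transition prob, B[j][o] = emission prob, "<s>" = start state.
    """
    logp = 0.0
    alpha = {j: A["<s>"][j] * B[j][obs[0]] for j in states}
    for t, o in enumerate(obs):
        if t > 0:
            alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                     for j in states}
        z = sum(alpha.values())        # normalizer: P(o_t | o_1..o_{t-1})
        logp += math.log(z)            # log P(o_1..o_t) accumulates
        alpha = {j: a / z for j, a in alpha.items()}  # rescaled column
    return logp
```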
Example

• Suppose O = x z y. Our initially empty table:

  α     o1=x  o2=z  o3=y
  q1
  q2
Filling the first column

  α(1,1) = a_{<s>,1} · b_1(x) = 1(.6) = .6
  α(2,1) = a_{<s>,2} · b_2(x) = 0(.1) = 0

  α     o1=x  o2=z  o3=y
  q1    .6
  q2    0
Starting the second column

  α(1,2) = ∑_{i=1}^{T} α(i,1) · a_i1 · b_1(z)
         = α(1,1) · a_11 · b_1(z) + α(2,1) · a_21 · b_1(z)
         = .6(.7)(.3) + 0(.5)(.3)
         = .126

  α     o1=x  o2=z  o3=y
  q1    .6    .126
  q2    0
Finishing the second column

  α(2,2) = ∑_{i=1}^{T} α(i,1) · a_i2 · b_2(z)
         = α(1,1) · a_12 · b_2(z) + α(2,1) · a_22 · b_2(z)
         = .6(.3)(.2) + 0(.5)(.2)
         = .036

  α     o1=x  o2=z  o3=y
  q1    .6    .126
  q2    0     .036
Third column and finish

• Add up all probabilities in the last column to get the probability of the entire sequence:

  P(O|λ) = ∑_{i=1}^{T} α(i, N) = .01062 + .03906 = .04968

  α     o1=x  o2=z  o3=y
  q1    .6    .126  .01062
  q2    0     .036  .03906
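A minimal sketch of the (unscaled) forward algorithm that reproduces the table above; the A and B dicts below just restate the toy model's matrices, and the rest is my illustrative code.

```python
A = {"<s>": {"q1": 1.0, "q2": 0.0},
     "q1":  {"q1": 0.7, "q2": 0.3},
     "q2":  {"q1": 0.5, "q2": 0.5}}
B = {"q1": {"x": 0.6, "y": 0.1, "z": 0.3},
     "q2": {"x": 0.1, "y": 0.7, "z": 0.2}}
STATES = ["q1", "q2"]

def forward(obs):
    """P(O | lambda): like Viterbi, but summing instead of maximizing."""
    alpha = {j: A["<s>"][j] * B[j][obs[0]] for j in STATES}  # first column
    for o in obs[1:]:
        # alpha(j,t) = sum_i alpha(i,t-1) * a_ij * b_j(o_t)
        alpha = {j: sum(alpha[i] * A[i][j] for i in STATES) * B[j][o]
                 for j in STATES}
    return sum(alpha.values())  # add up the last column

print(forward(["x", "z", "y"]))  # ~0.04968, matching the table
```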
Learning

• Given only the output sequence, learn the best set of parameters λ = (A, B).
• Assume 'best' = maximum likelihood.
• Other definitions are possible; we won't discuss them here.
Unsupervised learning

• Training an HMM from an annotated corpus is simple.
  – Supervised learning: we have examples labelled with the right 'answers' (here, tags): no hidden variables in training.
• Training from an unannotated corpus is trickier.
  – Unsupervised learning: we have no examples labelled with the right 'answers': all we see are outputs, and the state sequence is hidden.
Circularity

• If we know the state sequence, we can find the best λ.
  – E.g., use MLE:  P(qj | qi) = C(qi → qj) / C(qi)
• If we know λ, we can find the best state sequence.
  – use Viterbi
• But we don't know either!
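A sketch of that MLE step, given a hypothetical corpus of tag sequences (each including <s> and </s> markers); this is just the relative-frequency formula above in code.

```python
from collections import Counter

def mle_transitions(tag_sequences):
    """Estimate P(q_j | q_i) = C(q_i -> q_j) / C(q_i) by relative frequency."""
    bigrams, unigrams = Counter(), Counter()
    for seq in tag_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return {(i, j): c / unigrams[i] for (i, j), c in bigrams.items()}

# e.g. mle_transitions([["<s>", "Det", "Noun", "</s>"]])
```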
Expectation-maximization (EM)

As in spelling correction, we can use EM to bootstrap, iteratively updating the parameters and hidden variables.

• Initialize parameters λ(0)
• At each iteration k,
  – E-step: Compute expected counts using λ(k−1)
  – M-step: Set λ(k) using MLE on the expected counts
• Repeat until λ doesn't change (or other stopping criterion).
Expected counts??

Counting transitions from qi → qj:

• Real counts:
  – count 1 each time we see qi → qj in the true tag sequence.
• Expected counts:
  – With current λ, compute probs of all possible tag sequences.
  – If sequence Q has probability p, count p for each qi → qj in Q.
  – Add up these fractional counts across all possible sequences.
Example

• Notionally, we compute expected counts as follows. (Only sequences starting with q1 are possible here, since a_{<s>,q2} = 0.)

  Observations:        x   z   y

  Possible sequence    Probability of sequence
  Q1 = q1 q1 q1        p1
  Q2 = q1 q2 q1        p2
  Q3 = q1 q1 q2        p3
  Q4 = q1 q2 q2        p4

  Ĉ(q1 → q1) = 2·p1 + p3
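For this toy model the expected counts really can be computed by brute-force enumeration. A sketch (my code, reusing the A and B matrices from the earlier slides, and normalizing the joint sequence probabilities by P(O|λ) so they sum to one):

```python
import itertools

A = {"<s>": {"q1": 1.0, "q2": 0.0},
     "q1":  {"q1": 0.7, "q2": 0.3},
     "q2":  {"q1": 0.5, "q2": 0.5}}
B = {"q1": {"x": 0.6, "y": 0.1, "z": 0.3},
     "q2": {"x": 0.1, "y": 0.7, "z": 0.2}}
obs = ["x", "z", "y"]

expected = {}  # expected count of each q_i -> q_j transition
total = 0.0    # accumulates P(O | lambda)
for Q in itertools.product(["q1", "q2"], repeat=len(obs)):
    # joint probability P(O, Q | lambda) of this state sequence
    p, prev = 1.0, "<s>"
    for q, o in zip(Q, obs):
        p *= A[prev][q] * B[q][o]
        prev = q
    total += p
    # count p for each transition occurring within Q
    for a, b in zip(Q, Q[1:]):
        expected[(a, b)] = expected.get((a, b), 0.0) + p

for trans, c in sorted(expected.items()):
    print(trans, c / total)  # fractional counts, normalized by P(O|lambda)
```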
Forward-Backward algorithm

• As usual, avoid enumerating all possible sequences.
• The Forward-Backward (Baum-Welch) algorithm computes expected counts using forward probabilities and backward probabilities:

  β(j,t) = P(o_{t+1}, o_{t+2}, … o_N | q_t = j, λ)

  – For details, see J&M 6.5.
• The EM idea is much more general: it can be used for many latent-variable models.
Guarantees

• EM is guaranteed to find a local maximum of the likelihood.

[Figure: P(O|λ) plotted against values of λ, illustrating local vs. global maxima.]

• Not guaranteed to find the global maximum.
• Practical issues: initialization, random restarts, early stopping.
• Fact is, it doesn't work well for learning POS taggers!