Lecture 5: Sequence Models II
Alan Ritter (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)
Recall: HMMs

‣ Input x = (x1, ..., xn), output y = (y1, ..., yn)

   y1 → y2 → … → yn
   ↓    ↓        ↓
   x1   x2       xn

   P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi−1) ∏_{i=1..n} P(xi | yi)

‣ Training: maximum likelihood estimation (with smoothing)

‣ Inference problem:  argmax_y P(y | x) = argmax_y P(y, x) / P(x)

‣ Viterbi:  score_i(s) = max_{yi−1} P(s | yi−1) P(xi | s) score_{i−1}(yi−1)
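The Viterbi recurrence above can be sketched directly in code. This is a minimal illustration for a probability-space HMM (a real implementation would work in log space to avoid underflow); the function name and array layout are ours, not from the slides:

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most likely tag sequence under an HMM.

    init[s]     = P(y1 = s)
    trans[s, t] = P(y_i = t | y_{i-1} = s)
    emit[s, o]  = P(x_i = o | y_i = s)
    obs         = list of observation indices x1..xn
    """
    n, S = len(obs), len(init)
    score = np.zeros((n, S))            # score[i, s] = best prob of any path ending in s at i
    back = np.zeros((n, S), dtype=int)  # backpointers
    score[0] = init * emit[:, obs[0]]
    for i in range(1, n):
        # cand[s_prev, s] = score of extending the best path at s_prev with tag s
        cand = score[i - 1][:, None] * trans * emit[:, obs[i]][None, :]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    # follow backpointers from the best final state
    tags = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]
```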
This Lecture

‣ Named entity recognition (NER)
‣ CRFs: model (+ features for NER), inference, learning
‣ (if time) Beam search
Named Entity Recognition

Barack Obama will travel to Hangzhou today for the G20 meeting.
[ PERSON    ]               [ LOC  ]               [ORG]
B-PER  I-PER  O    O      O  B-LOC    O     O   O  B-ORG  O      O

‣ BIO tagset: begin, inside, outside

‣ Sequence of tags: should we use an HMM?
‣ Why might an HMM not do so well here?
   ‣ Lots of O’s, so tags aren’t as informative about context
   ‣ Insufficient features/capacity with multinomials (especially for unks)
CRFs
Conditional Random Fields

‣ HMMs are expressible as Bayes nets (factor graphs)

   y1 → y2 → … → yn
   ↓    ↓        ↓
   x1   x2       xn

‣ This reflects the following decomposition:

   P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ Locally normalized model: each factor is a probability distribution that normalizes
Conditional Random Fields

‣ HMMs:  P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ CRFs: discriminative models with the following globally-normalized form:

   P(y | x) = (1/Z) ∏_k exp(φ_k(x, y))

   Z is the normalizer; each φ_k is any real-valued scoring function of its arguments

‣ Naive Bayes : logistic regression :: HMMs : CRFs
   (local vs. global normalization <-> generative vs. discriminative)

‣ Locally normalized discriminative models do exist (MEMMs)

‣ How do we max over y? Intractable in general; can we fix this?
Sequential CRFs

‣ HMMs:  P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ CRFs:  P(y | x) ∝ ∏_k exp(φ_k(x, y))

‣ Sequential CRF: factors φo (initial), φt (transition), φe (emission) over the chain

   y1 ─φt─ y2 ─ … ─ yn
   │φe     │φe      │φe
   x1      x2       xn

   P(y | x) ∝ exp(φo(y1)) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(xi, yi))
Sequential CRFs

   P(y | x) ∝ exp(φo(y1)) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(xi, yi))

‣ We condition on x, so every factor can depend on all of x (including transitions, but we won’t do this); the emission factors become

   ∏_{i=1..n} exp(φe(yi, i, x))

   where i is the token index; it lets us look at the current word

‣ y can’t depend arbitrarily on x in a generative model
Sequential CRFs

‣ Notation: omit x from the factor graph entirely (implicit)
‣ Don’t include the initial distribution; it can be baked into the other factors

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))
Sequential CRFs: Feature Functions

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ Phis can be almost anything! Here we use linear functions of sparse features:

   φe(yi, i, x) = w⊤ fe(yi, i, x)        φt(yi−1, yi) = w⊤ ft(yi−1, yi)

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Looks like our single-weight-vector multiclass logistic regression model
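One common way to realize these sparse feature functions is as string-keyed indicator features, each firing with value 1. The templates below are illustrative (current word, previous word, capitalization shape), not the exact ones from the slides:

```python
from collections import defaultdict

def emission_features(tag, i, words):
    """Sparse indicator features f_e(y_i, i, x), as string keys with value 1."""
    w = words[i]
    return [
        f"tag={tag}&curr={w}",
        f"tag={tag}&prev={words[i-1] if i > 0 else '<s>'}",
        f"tag={tag}&shape={'Xx' if w[:1].isupper() else 'x'}",
    ]

def score_emission(weights, tag, i, words):
    """phi_e = w^T f_e: sum the weights of the indicators that fire."""
    return sum(weights[f] for f in emission_features(tag, i, words))

weights = defaultdict(float)
weights["tag=B-LOC&curr=Hangzhou"] = 2.0
weights["tag=B-LOC&prev=to"] = 1.0
words = "Barack Obama will travel to Hangzhou".split()
print(score_emission(weights, "B-LOC", 5, words))  # 3.0 (both indicators fire)
```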
Basic Features for NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
                         O  B-LOC

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

Transitions:  ft(yi−1, yi) = Ind[yi−1 & yi]  =  Ind[O, B-LOC]

Emissions:    fe(y6, 6, x) = Ind[B-LOC & Current word = Hangzhou]
                             Ind[B-LOC & Prev word = to]
Features for NER                                    φe(yi, i, x)

Leicestershire is a nice place to visit…            LOC
I took a vacation to Boston                         LOC
Apple released a new version…                       ORG
According to the New York Times…                    ORG
Texas governor Greg Abbott said…                    LOC, PER
Leonardo DiCaprio won an award…                     PER

‣ Word features (can use in HMM)
   ‣ Capitalization
   ‣ Word shape
   ‣ Prefixes/suffixes
   ‣ Lexical indicators
‣ Context features (can’t use in HMM!)
   ‣ Words before/after
   ‣ Tags before/after
‣ Gazetteers
‣ Word clusters
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference
‣ Learning
Computing (arg)maxes

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ argmax_y P(y | x): can use Viterbi exactly as in the HMM case

   max_{y1,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · e^{φe(y2,2,x)} e^{φt(y1,y2)} e^{φe(y1,1,x)}

   = max_{y2,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · e^{φe(y2,2,x)} max_{y1} e^{φt(y1,y2)} e^{φe(y1,1,x)}

   = max_{y3,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · max_{y2} e^{φt(y2,y3)} e^{φe(y2,2,x)} max_{y1} e^{φt(y1,y2)} score1(y1)

   where score1(y1) = e^{φe(y1,1,x)}

‣ exp(φt(yi−1, yi)) and exp(φe(yi, i, x)) play the role of the Ps now; same dynamic program
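Because the potentials only multiply, the same dynamic program runs additively over log-potentials. A minimal sketch (array layout and names are ours): the scores are exactly the φt and φe values, no probabilities needed.

```python
import numpy as np

def crf_viterbi(phi_t, phi_e):
    """Viterbi over CRF log-potentials.

    phi_t[s, t]: transition score for tag s followed by tag t
    phi_e[i, s]: emission score for tag s at position i
    Returns the argmax tag sequence."""
    n, S = phi_e.shape
    score = phi_e[0].copy()
    back = np.zeros((n, S), dtype=int)
    for i in range(1, n):
        # add, not multiply: we are in log space
        cand = score[:, None] + phi_t + phi_e[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]
```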
Inference in General CRFs

‣ Can do inference in any tree-structured CRF
‣ Max-product algorithm: generalization of Viterbi to arbitrary tree-structured graphs (sum-product is the generalization of forward-backward)
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference: argmax P(y|x) from Viterbi
‣ Learning
Training CRFs

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Logistic regression:  P(y | x) ∝ exp w⊤ f(x, y)

‣ Maximize  L(y*, x) = log P(y* | x)

‣ Gradient is completely analogous to logistic regression:

   ∂/∂w L(y*, x) = Σ_{i=2..n} ft(y*_{i−1}, y*_i) + Σ_{i=1..n} fe(y*_i, i, x)
                   − E_y [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

   (the expectation looks intractable!)
Training CRFs

‣ Let’s focus on the emission feature expectation:

   E_y [ Σ_{i=1..n} fe(yi, i, x) ]
     = Σ_{y ∈ Y} P(y | x) [ Σ_{i=1..n} fe(yi, i, x) ]
     = Σ_{i=1..n} Σ_{y ∈ Y} P(y | x) fe(yi, i, x)
     = Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)
Computing Marginals

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ Normalizing constant:  Z = Σ_y ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))
‣ Analogous to P(x) for HMMs

‣ For both HMMs and CRFs:

   P(yi = s | x) = forward_i(s) backward_i(s) / Σ_{s'} forward_i(s') backward_i(s')

   (the denominator is Z for CRFs, P(x) for HMMs)
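The forward-backward marginals can be sketched as follows. For clarity this works in potential (exp) space; a real implementation would stay in log space with log-sum-exp. Names and array layout are ours:

```python
import numpy as np

def marginals(phi_t, phi_e):
    """Token marginals P(y_i = s | x) for a linear-chain CRF via forward-backward.

    phi_t[s, t]: transition log-potentials; phi_e[i, s]: emission log-potentials."""
    n, S = phi_e.shape
    T, E = np.exp(phi_t), np.exp(phi_e)
    fwd = np.zeros((n, S))
    bwd = np.ones((n, S))          # backward_n(s) = 1
    fwd[0] = E[0]
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ T) * E[i]
    for i in range(n - 2, -1, -1):
        bwd[i] = T @ (E[i + 1] * bwd[i + 1])
    m = fwd * bwd
    return m / m.sum(axis=1, keepdims=True)  # each row's normalizer is Z
```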
Posteriors vs. Probabilities

   P(yi = s | x) = forward_i(s) backward_i(s) / Σ_{s'} forward_i(s') backward_i(s')

‣ Posterior is derived from the parameters and the data (conditioned on x!)

        P(xi | yi), P(yi | yi−1)              P(yi | x), P(yi−1, yi | x)

HMM     model parameter (usually              inferred quantity from
        multinomial distribution)             forward-backward

CRF     undefined (model is by                inferred quantity from
        definition conditioned on x)          forward-backward
Training CRFs

‣ For emission features:

   ∂/∂w L(y*, x) = Σ_{i=1..n} fe(y*_i, i, x) − Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)

   (gold features minus expected features under the model)

‣ Transition features: need to compute P(yi = s1, yi+1 = s2 | x) using forward-backward as well
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference: argmax P(y|x) from Viterbi

‣ Learning: run forward-backward to compute posterior probabilities; then

   ∂/∂w L(y*, x) = Σ_{i=1..n} fe(y*_i, i, x) − Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)
Pseudocode

for each epoch
   for each example
      extract features on each emission and transition (lookup in cache)
      compute potentials phi based on features + weights
      compute marginal probabilities with forward-backward
      accumulate gradient over all emissions and transitions
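The loop above can be sketched end to end for the simplest feature set: one indicator per (tag, word) emission and one per (tag, tag) transition, stored as dense weight matrices instead of a sparse cache. This is a teaching sketch under those assumptions, not a production trainer:

```python
import numpy as np

def crf_epoch(sentences, W_e, W_t, lr=0.1):
    """One epoch of gradient ascent on log P(y*|x) for a linear-chain CRF with
    indicator features: W_e[s, word] emission weights, W_t[s, t] transition weights."""
    S = W_e.shape[0]
    for words, tags in sentences:
        n = len(words)
        E = np.exp(W_e[:, words].T)   # potentials e^{phi_e}: shape (n, S)
        T = np.exp(W_t)
        # forward-backward
        fwd = np.zeros((n, S)); bwd = np.ones((n, S))
        fwd[0] = E[0]
        for i in range(1, n):
            fwd[i] = (fwd[i - 1] @ T) * E[i]
        for i in range(n - 2, -1, -1):
            bwd[i] = T @ (E[i + 1] * bwd[i + 1])
        Z = fwd[-1].sum()
        # gradient = gold features minus expected features
        for i in range(n):
            p_i = fwd[i] * bwd[i] / Z                 # P(y_i = s | x)
            W_e[tags[i], words[i]] += lr
            W_e[:, words[i]] -= lr * p_i
        for i in range(n - 1):
            # pairwise marginals P(y_i = s, y_{i+1} = t | x)
            p_pair = (fwd[i][:, None] * T * (E[i + 1] * bwd[i + 1])[None, :]) / Z
            W_t[tags[i], tags[i + 1]] += lr
            W_t -= lr * p_pair
    return W_e, W_t
```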
Structured Perceptron

‣ Structured perceptron update:

   ŷ = argmax_{y ∈ Y} w⊤ f(x, y)      (computed with the Viterbi algorithm)

   w = w + f(x, y*) − f(x, ŷ)

‣ Compare to gradient of CRF:

   ∂/∂w L(y*, x) = Σ_{i=2..n} ft(y*_{i−1}, y*_i) + Σ_{i=1..n} fe(y*_i, i, x)
                   − E_y [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ The perceptron replaces the expectation with an argmax
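The update can be sketched for the same (tag, word) / (tag, tag) indicator features as before: decode with Viterbi, then bump the gold features and demote the predicted ones. Names and feature templates are ours:

```python
import numpy as np

def viterbi_decode(W_e, W_t, words):
    """argmax_y w^T f(x, y) for indicator features, via Viterbi (additive scores)."""
    n, S = len(words), W_e.shape[0]
    score = W_e[:, words[0]].copy()
    back = np.zeros((n, S), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + W_t + W_e[:, words[i]][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

def perceptron_update(W_e, W_t, words, gold):
    """w += f(x, y*) - f(x, y_hat): no update when the prediction is already right."""
    pred = viterbi_decode(W_e, W_t, words)
    if pred == list(gold):
        return
    for i, w in enumerate(words):
        W_e[gold[i], w] += 1.0
        W_e[pred[i], w] -= 1.0
    for i in range(len(words) - 1):
        W_t[gold[i], gold[i + 1]] += 1.0
        W_t[pred[i], pred[i + 1]] -= 1.0
```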
NER

‣ CRF with lexical features can get around 85 F1 on this problem

‣ Other pieces of information that many systems capture

‣ World knowledge:

   The delegation met the president at the airport, Tanjug said.
Nonlocal Features

The delegation met the president at the airport, Tanjug said.     ORG? PER?

The news agency Tanjug reported on the outcome of the meeting.

‣ More complex factor graph structures can let you capture this, or just decode sentences in order and use features on previous sentences

Finkel and Manning (2008), Ratinov and Roth (2009)
Semi-Markov Models

Barack Obama | will travel to | Hangzhou | today for the | G20 | meeting.
    PER             O             LOC           O          ORG      O

‣ Chunk-level prediction rather than token-level BIO
‣ y is a set of touching spans of the sentence
‣ Pros: features can look at a whole span at once
‣ Cons: there’s an extra factor of n in the dynamic programs

Sarawagi and Cohen (2004)
Evaluating NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
[ PERSON    ]               [ LOC  ]               [ORG]
B-PER  I-PER  O    O      O  B-LOC    O     O   O  B-ORG  O      O

‣ Prediction of all Os still gets 66% accuracy on this example!

‣ What we really want to know: how many named entity chunk predictions did we get right?
   ‣ Precision: of the ones we predicted, how many are right?
   ‣ Recall: of the gold named entities, how many did we find?
   ‣ F-measure: harmonic mean of these two
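Chunk-level evaluation can be sketched as: extract labeled spans from the BIO tags, then compare gold and predicted span sets. (A simplified sketch; tools like the seqeval library handle edge cases such as stray I- tags more carefully.)

```python
def chunks(tags):
    """Extract (label, start, end) chunks from a BIO tag sequence."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):           # sentinel flushes the last chunk
        if start is not None and not t.startswith("I-"):
            out.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(out)

def f1(gold_tags, pred_tags):
    """Chunk-level F-measure: harmonic mean of precision and recall."""
    gold, pred = chunks(gold_tags), chunks(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0            # precision
    r = tp / len(gold) if gold else 0.0            # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```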
How well do NER systems do?

‣ Ratinov and Roth (2009)
‣ Lample et al. (2016)
‣ BiLSTM-CRF + ELMo (Peters et al., 2018): 92.2
Beam Search
Viterbi Time Complexity

Fed   raises  interest  rates  0.5  percent
VBD   VBZ     VB        VBZ    CD   NN
VBN   NNS     VBP       NNS
NNP           NN

‣ n-word sentence, s tags to consider: what is the time complexity?
‣ O(ns^2); s is ~40 for POS, n is ~20

‣ Many tags are totally implausible
‣ Can any of these be:
   ‣ Determiners?
   ‣ Prepositions?
   ‣ Adjectives?
‣ Features quickly eliminate many outcomes from consideration; don’t need to consider these going forward
Beam Search
‣ Maintain a beam of k plausible states at the current time step
‣ Expand all states, only keep k top hypotheses at new time step
[beam search lattice figure over "Fed raises": after "Fed", candidates VBD +1.2, NNP +0.9, VBN +0.7, NN +0.3, with low-scoring VBZ -2.0 and NNS -1.0 marked "Not expanded"; after "raises", VBZ +1.2 and NNS -1.0 stay on the beam while DT -5.3 and PRP -5.8 are not expanded]
‣ Maintain priority queue to efficiently add things
‣ Beam size of k, time complexity O(nks log(ks))
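The beam update above can be sketched as follows. This is an illustrative implementation under an assumed interface (the `score(prev_tag, tag, word)` function and the toy weights in the usage below are my own, not from the lecture): each step expands all k hypotheses by s tags and keeps the k best, giving roughly O(nks log(ks)) time via a heap-based top-k selection.

```python
import heapq

def beam_search(words, tags, score, k=2):
    """Approximate best tag sequence with a beam of k hypotheses."""
    beam = [(0.0, [])]                     # list of (total score, tag sequence)
    for w in words:                        # n positions
        candidates = []
        for total, seq in beam:            # k hypotheses on the beam
            prev = seq[-1] if seq else None
            for t in tags:                 # s expansions each
                candidates.append((total + score(prev, t, w), seq + [t]))
        # keep the k best of the k*s candidates: the log(ks) factor
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

Usage on a toy version of the slide's example, with made-up emission and transition scores:

```python
TAGS = ["NNP", "VBZ", "NNS"]
def toy_score(prev, tag, word):
    emit = {("Fed", "NNP"): 0.9, ("raises", "VBZ"): 1.2, ("raises", "NNS"): -1.0}
    trans = {("NNP", "VBZ"): 0.5}
    return emit.get((word, tag), -2.0) + trans.get((prev, tag), 0.0)
beam_search(["Fed", "raises"], TAGS, toy_score, k=2)
```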
How good is beam search?
‣ k = 1: greedy search
‣ Choosing beam size:
‣ 2 is usually better than 1
‣ Usually don't use larger than 50
‣ Depends on problem structure
‣ If beam search is much faster than computing full sums, can use structured perceptron instead of CRFs
‣ Very similar to structured SVM
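To see why k = 1 can lose, here is a tiny self-contained comparison (the toy scoring function is my own invention): greedy search commits to the locally best first tag and can miss the globally best sequence that exhaustive search finds.

```python
from itertools import product

def seq_score(seq, words, score):
    """Total score of a full tag sequence under score(prev, tag, word)."""
    prev, total = None, 0.0
    for t, w in zip(seq, words):
        total += score(prev, t, w)
        prev = t
    return total

def greedy(words, tags, score):
    """Beam search with k = 1: pick the locally best tag at each step."""
    seq, prev = [], None
    for w in words:
        best = max(tags, key=lambda t: score(prev, t, w))
        seq.append(best)
        prev = best
    return seq

def exhaustive(words, tags, score):
    """Exact search: try all s^n tag sequences (only feasible for toys)."""
    return list(max(product(tags, repeat=len(words)),
                    key=lambda seq: seq_score(seq, words, score)))
```

With a scoring function where the best first tag leads into a bad transition, greedy's sequence scores strictly worse than the exhaustive optimum.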