Lecture 5: Sequence Models II
Alan Ritter (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)
Recall: HMMs

‣ Input x = (x1, ..., xn), output y = (y1, ..., yn)

   y1 → y2 → … → yn
   ↓    ↓        ↓
   x1   x2       xn

   P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi−1) ∏_{i=1..n} P(xi | yi)

‣ Training: maximum likelihood estimation (with smoothing)

‣ Inference problem:  argmax_y P(y | x) = argmax_y P(y, x) / P(x)

‣ Viterbi:  score_i(s) = max_{yi−1} P(s | yi−1) P(xi | s) score_{i−1}(yi−1)
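The Viterbi recurrence above can be sketched directly in code. This is a minimal illustration for a probability-space HMM (a real implementation would work in log space to avoid underflow); the function name and array layout are ours, not from the slides:

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most likely tag sequence under an HMM.

    init[s]     = P(y1 = s)
    trans[s, t] = P(y_i = t | y_{i-1} = s)
    emit[s, o]  = P(x_i = o | y_i = s)
    obs         = list of observation indices x1..xn
    """
    n, S = len(obs), len(init)
    score = np.zeros((n, S))            # score[i, s] = best prob of any path ending in s at i
    back = np.zeros((n, S), dtype=int)  # backpointers
    score[0] = init * emit[:, obs[0]]
    for i in range(1, n):
        # cand[s_prev, s] = score of extending the best path at s_prev with tag s
        cand = score[i - 1][:, None] * trans * emit[:, obs[i]][None, :]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    # follow backpointers from the best final state
    tags = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]
```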
This Lecture

‣ Named entity recognition (NER)
‣ CRFs: model (+ features for NER), inference, learning
‣ (if time) Beam search
Named Entity Recognition

Barack Obama will travel to Hangzhou today for the G20 meeting.
[ PERSON    ]               [ LOC  ]               [ORG]
B-PER  I-PER  O    O      O  B-LOC    O     O   O  B-ORG  O      O

‣ BIO tagset: begin, inside, outside

‣ Sequence of tags: should we use an HMM?
‣ Why might an HMM not do so well here?
   ‣ Lots of O’s, so tags aren’t as informative about context
   ‣ Insufficient features/capacity with multinomials (especially for unks)
CRFs
Conditional Random Fields

‣ HMMs are expressible as Bayes nets (factor graphs)

   y1 → y2 → … → yn
   ↓    ↓        ↓
   x1   x2       xn

‣ This reflects the following decomposition:

   P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ Locally normalized model: each factor is a probability distribution that normalizes
Conditional Random Fields

‣ HMMs:  P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ CRFs: discriminative models with the following globally-normalized form:

   P(y | x) = (1/Z) ∏_k exp(φ_k(x, y))

   Z is the normalizer; each φ_k is any real-valued scoring function of its arguments

‣ Naive Bayes : logistic regression :: HMMs : CRFs
   (local vs. global normalization <-> generative vs. discriminative)

‣ Locally normalized discriminative models do exist (MEMMs)

‣ How do we max over y? Intractable in general; can we fix this?
Sequential CRFs

‣ HMMs:  P(y, x) = P(y1) P(x1 | y1) P(y2 | y1) P(x2 | y2) …

‣ CRFs:  P(y | x) ∝ ∏_k exp(φ_k(x, y))

‣ Sequential CRF: factors φo (initial), φt (transition), φe (emission) over the chain

   y1 ─φt─ y2 ─ … ─ yn
   │φe     │φe      │φe
   x1      x2       xn

   P(y | x) ∝ exp(φo(y1)) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(xi, yi))
Sequential CRFs

   P(y | x) ∝ exp(φo(y1)) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(xi, yi))

‣ We condition on x, so every factor can depend on all of x (including transitions, but we won’t do this); the emission factors become

   ∏_{i=1..n} exp(φe(yi, i, x))

   where i is the token index; it lets us look at the current word

‣ y can’t depend arbitrarily on x in a generative model
Sequential CRFs

‣ Notation: omit x from the factor graph entirely (implicit)
‣ Don’t include the initial distribution; it can be baked into the other factors

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))
Sequential CRFs: Feature Functions

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ Phis can be almost anything! Here we use linear functions of sparse features:

   φe(yi, i, x) = w⊤ fe(yi, i, x)        φt(yi−1, yi) = w⊤ ft(yi−1, yi)

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Looks like our single-weight-vector multiclass logistic regression model
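One common way to realize these sparse feature functions is as string-keyed indicator features, each firing with value 1. The templates below are illustrative (current word, previous word, capitalization shape), not the exact ones from the slides:

```python
from collections import defaultdict

def emission_features(tag, i, words):
    """Sparse indicator features f_e(y_i, i, x), as string keys with value 1."""
    w = words[i]
    return [
        f"tag={tag}&curr={w}",
        f"tag={tag}&prev={words[i-1] if i > 0 else '<s>'}",
        f"tag={tag}&shape={'Xx' if w[:1].isupper() else 'x'}",
    ]

def score_emission(weights, tag, i, words):
    """phi_e = w^T f_e: sum the weights of the indicators that fire."""
    return sum(weights[f] for f in emission_features(tag, i, words))

weights = defaultdict(float)
weights["tag=B-LOC&curr=Hangzhou"] = 2.0
weights["tag=B-LOC&prev=to"] = 1.0
words = "Barack Obama will travel to Hangzhou".split()
print(score_emission(weights, "B-LOC", 5, words))  # 3.0 (both indicators fire)
```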
Basic Features for NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
                         O  B-LOC

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

Transitions:  ft(yi−1, yi) = Ind[yi−1 & yi]  =  Ind[O, B-LOC]

Emissions:    fe(y6, 6, x) = Ind[B-LOC & Current word = Hangzhou]
                             Ind[B-LOC & Prev word = to]
Features for NER                                    φe(yi, i, x)

Leicestershire is a nice place to visit…            LOC
I took a vacation to Boston                         LOC
Apple released a new version…                       ORG
According to the New York Times…                    ORG
Texas governor Greg Abbott said…                    LOC, PER
Leonardo DiCaprio won an award…                     PER

‣ Word features (can use in HMM)
   ‣ Capitalization
   ‣ Word shape
   ‣ Prefixes/suffixes
   ‣ Lexical indicators
‣ Context features (can’t use in HMM!)
   ‣ Words before/after
   ‣ Tags before/after
‣ Gazetteers
‣ Word clusters
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference
‣ Learning
Computing (arg)maxes

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ argmax_y P(y | x): can use Viterbi exactly as in the HMM case

   max_{y1,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · e^{φe(y2,2,x)} e^{φt(y1,y2)} e^{φe(y1,1,x)}

   = max_{y2,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · e^{φe(y2,2,x)} max_{y1} e^{φt(y1,y2)} e^{φe(y1,1,x)}

   = max_{y3,...,yn} e^{φt(yn−1,yn)} e^{φe(yn,n,x)} · · · max_{y2} e^{φt(y2,y3)} e^{φe(y2,2,x)} max_{y1} e^{φt(y1,y2)} score1(y1)

   where score1(y1) = e^{φe(y1,1,x)}

‣ exp(φt(yi−1, yi)) and exp(φe(yi, i, x)) play the role of the Ps now; same dynamic program
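Because the potentials only multiply, the same dynamic program runs additively over log-potentials. A minimal sketch (array layout and names are ours): the scores are exactly the φt and φe values, no probabilities needed.

```python
import numpy as np

def crf_viterbi(phi_t, phi_e):
    """Viterbi over CRF log-potentials.

    phi_t[s, t]: transition score for tag s followed by tag t
    phi_e[i, s]: emission score for tag s at position i
    Returns the argmax tag sequence."""
    n, S = phi_e.shape
    score = phi_e[0].copy()
    back = np.zeros((n, S), dtype=int)
    for i in range(1, n):
        # add, not multiply: we are in log space
        cand = score[:, None] + phi_t + phi_e[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]
```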
Inference in General CRFs

‣ Can do inference in any tree-structured CRF
‣ Max-product algorithm: generalization of Viterbi to arbitrary tree-structured graphs (sum-product is the generalization of forward-backward)
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference: argmax P(y|x) from Viterbi
‣ Learning
Training CRFs

   P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Logistic regression:  P(y | x) ∝ exp w⊤ f(x, y)

‣ Maximize  L(y*, x) = log P(y* | x)

‣ Gradient is completely analogous to logistic regression:

   ∂/∂w L(y*, x) = Σ_{i=2..n} ft(y*_{i−1}, y*_i) + Σ_{i=1..n} fe(y*_i, i, x)
                   − E_y [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

   (the expectation looks intractable!)
Training CRFs

‣ Let’s focus on the emission feature expectation:

   E_y [ Σ_{i=1..n} fe(yi, i, x) ]
     = Σ_{y ∈ Y} P(y | x) [ Σ_{i=1..n} fe(yi, i, x) ]
     = Σ_{i=1..n} Σ_{y ∈ Y} P(y | x) fe(yi, i, x)
     = Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)
Computing Marginals

   P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

‣ Normalizing constant:  Z = Σ_y ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))
‣ Analogous to P(x) for HMMs

‣ For both HMMs and CRFs:

   P(yi = s | x) = forward_i(s) backward_i(s) / Σ_{s'} forward_i(s') backward_i(s')

   (the denominator is Z for CRFs, P(x) for HMMs)
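The forward-backward marginals can be sketched as follows. For clarity this works in potential (exp) space; a real implementation would stay in log space with log-sum-exp. Names and array layout are ours:

```python
import numpy as np

def marginals(phi_t, phi_e):
    """Token marginals P(y_i = s | x) for a linear-chain CRF via forward-backward.

    phi_t[s, t]: transition log-potentials; phi_e[i, s]: emission log-potentials."""
    n, S = phi_e.shape
    T, E = np.exp(phi_t), np.exp(phi_e)
    fwd = np.zeros((n, S))
    bwd = np.ones((n, S))          # backward_n(s) = 1
    fwd[0] = E[0]
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ T) * E[i]
    for i in range(n - 2, -1, -1):
        bwd[i] = T @ (E[i + 1] * bwd[i + 1])
    m = fwd * bwd
    return m / m.sum(axis=1, keepdims=True)  # each row's normalizer is Z
```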
Posteriors vs. Probabilities

   P(yi = s | x) = forward_i(s) backward_i(s) / Σ_{s'} forward_i(s') backward_i(s')

‣ Posterior is derived from the parameters and the data (conditioned on x!)

        P(xi | yi), P(yi | yi−1)              P(yi | x), P(yi−1, yi | x)

HMM     model parameter (usually              inferred quantity from
        multinomial distribution)             forward-backward

CRF     undefined (model is by                inferred quantity from
        definition conditioned on x)          forward-backward
Training CRFs

‣ For emission features:

   ∂/∂w L(y*, x) = Σ_{i=1..n} fe(y*_i, i, x) − Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)

   (gold features minus expected features under the model)

‣ Transition features: need to compute P(yi = s1, yi+1 = s2 | x) using forward-backward as well
CRFs Outline

‣ Model:  P(y | x) = (1/Z) ∏_{i=2..n} exp(φt(yi−1, yi)) ∏_{i=1..n} exp(φe(yi, i, x))

          P(y | x) ∝ exp w⊤ [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ Inference: argmax P(y|x) from Viterbi

‣ Learning: run forward-backward to compute posterior probabilities; then

   ∂/∂w L(y*, x) = Σ_{i=1..n} fe(y*_i, i, x) − Σ_{i=1..n} Σ_s P(yi = s | x) fe(s, i, x)
Pseudocode

for each epoch
   for each example
      extract features on each emission and transition (lookup in cache)
      compute potentials phi based on features + weights
      compute marginal probabilities with forward-backward
      accumulate gradient over all emissions and transitions
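The loop above can be sketched end to end for the simplest feature set: one indicator per (tag, word) emission and one per (tag, tag) transition, stored as dense weight matrices instead of a sparse cache. This is a teaching sketch under those assumptions, not a production trainer:

```python
import numpy as np

def crf_epoch(sentences, W_e, W_t, lr=0.1):
    """One epoch of gradient ascent on log P(y*|x) for a linear-chain CRF with
    indicator features: W_e[s, word] emission weights, W_t[s, t] transition weights."""
    S = W_e.shape[0]
    for words, tags in sentences:
        n = len(words)
        E = np.exp(W_e[:, words].T)   # potentials e^{phi_e}: shape (n, S)
        T = np.exp(W_t)
        # forward-backward
        fwd = np.zeros((n, S)); bwd = np.ones((n, S))
        fwd[0] = E[0]
        for i in range(1, n):
            fwd[i] = (fwd[i - 1] @ T) * E[i]
        for i in range(n - 2, -1, -1):
            bwd[i] = T @ (E[i + 1] * bwd[i + 1])
        Z = fwd[-1].sum()
        # gradient = gold features minus expected features
        for i in range(n):
            p_i = fwd[i] * bwd[i] / Z                 # P(y_i = s | x)
            W_e[tags[i], words[i]] += lr
            W_e[:, words[i]] -= lr * p_i
        for i in range(n - 1):
            # pairwise marginals P(y_i = s, y_{i+1} = t | x)
            p_pair = (fwd[i][:, None] * T * (E[i + 1] * bwd[i + 1])[None, :]) / Z
            W_t[tags[i], tags[i + 1]] += lr
            W_t -= lr * p_pair
    return W_e, W_t
```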
Structured Perceptron

‣ Structured perceptron update:

   ŷ = argmax_{y ∈ Y} w⊤ f(x, y)      (computed with the Viterbi algorithm)

   w = w + f(x, y*) − f(x, ŷ)

‣ Compare to gradient of CRF:

   ∂/∂w L(y*, x) = Σ_{i=2..n} ft(y*_{i−1}, y*_i) + Σ_{i=1..n} fe(y*_i, i, x)
                   − E_y [ Σ_{i=2..n} ft(yi−1, yi) + Σ_{i=1..n} fe(yi, i, x) ]

‣ The perceptron replaces the expectation with an argmax
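The update can be sketched for the same (tag, word) / (tag, tag) indicator features as before: decode with Viterbi, then bump the gold features and demote the predicted ones. Names and feature templates are ours:

```python
import numpy as np

def viterbi_decode(W_e, W_t, words):
    """argmax_y w^T f(x, y) for indicator features, via Viterbi (additive scores)."""
    n, S = len(words), W_e.shape[0]
    score = W_e[:, words[0]].copy()
    back = np.zeros((n, S), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + W_t + W_e[:, words[i]][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

def perceptron_update(W_e, W_t, words, gold):
    """w += f(x, y*) - f(x, y_hat): no update when the prediction is already right."""
    pred = viterbi_decode(W_e, W_t, words)
    if pred == list(gold):
        return
    for i, w in enumerate(words):
        W_e[gold[i], w] += 1.0
        W_e[pred[i], w] -= 1.0
    for i in range(len(words) - 1):
        W_t[gold[i], gold[i + 1]] += 1.0
        W_t[pred[i], pred[i + 1]] -= 1.0
```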
NER

‣ CRF with lexical features can get around 85 F1 on this problem

‣ Other pieces of information that many systems capture

‣ World knowledge:

   The delegation met the president at the airport, Tanjug said.
Nonlocal Features

The delegation met the president at the airport, Tanjug said.     ORG? PER?

The news agency Tanjug reported on the outcome of the meeting.

‣ More complex factor graph structures can let you capture this, or just decode sentences in order and use features on previous sentences

Finkel and Manning (2008), Ratinov and Roth (2009)
Semi-Markov Models

Barack Obama | will travel to | Hangzhou | today for the | G20 | meeting.
    PER             O             LOC           O          ORG      O

‣ Chunk-level prediction rather than token-level BIO
‣ y is a set of touching spans of the sentence
‣ Pros: features can look at a whole span at once
‣ Cons: there’s an extra factor of n in the dynamic programs

Sarawagi and Cohen (2004)
Evaluating NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
[ PERSON    ]               [ LOC  ]               [ORG]
B-PER  I-PER  O    O      O  B-LOC    O     O   O  B-ORG  O      O

‣ Prediction of all Os still gets 66% accuracy on this example!

‣ What we really want to know: how many named entity chunk predictions did we get right?
   ‣ Precision: of the ones we predicted, how many are right?
   ‣ Recall: of the gold named entities, how many did we find?
   ‣ F-measure: harmonic mean of these two
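Chunk-level evaluation can be sketched as: extract labeled spans from the BIO tags, then compare gold and predicted span sets. (A simplified sketch; tools like the seqeval library handle edge cases such as stray I- tags more carefully.)

```python
def chunks(tags):
    """Extract (label, start, end) chunks from a BIO tag sequence."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):           # sentinel flushes the last chunk
        if start is not None and not t.startswith("I-"):
            out.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(out)

def f1(gold_tags, pred_tags):
    """Chunk-level F-measure: harmonic mean of precision and recall."""
    gold, pred = chunks(gold_tags), chunks(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0            # precision
    r = tp / len(gold) if gold else 0.0            # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```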
How well do NER systems do?

‣ Ratinov and Roth (2009)
‣ Lample et al. (2016)
‣ BiLSTM-CRF + ELMo (Peters et al., 2018): 92.2
Beam Search
Viterbi Time Complexity

Fed   raises  interest  rates  0.5  percent
VBD   VBZ     VB        VBZ    CD   NN
VBN   NNS     VBP       NNS
NNP           NN

‣ n-word sentence, s tags to consider: what is the time complexity?
‣ O(ns^2); s is ~40 for POS, n is ~20

‣ Many tags are totally implausible
‣ Can any of these be:
   ‣ Determiners?
   ‣ Prepositions?
   ‣ Adjectives?
‣ Features quickly eliminate many outcomes from consideration; don’t need to consider these going forward
Beam Search
‣ Maintain a beam of k plausible states at the current time step
‣ Expand all states, only keep k top hypotheses at new time step
[beam search lattice figure over "Fed raises": after "Fed", candidates VBD +1.2, NNP +0.9, VBN +0.7, NN +0.3, with low-scoring VBZ -2.0 and NNS -1.0 marked "Not expanded"; after "raises", VBZ +1.2 and NNS -1.0 stay on the beam while DT -5.3 and PRP -5.8 are not expanded]
‣ Maintain priority queue to efficiently add things
‣ Beam size of k, time complexity O(nks log(ks))
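The beam update above can be sketched as follows. This is an illustrative implementation under an assumed interface (the `score(prev_tag, tag, word)` function and the toy weights in the usage below are my own, not from the lecture): each step expands all k hypotheses by s tags and keeps the k best, giving roughly O(nks log(ks)) time via a heap-based top-k selection.

```python
import heapq

def beam_search(words, tags, score, k=2):
    """Approximate best tag sequence with a beam of k hypotheses."""
    beam = [(0.0, [])]                     # list of (total score, tag sequence)
    for w in words:                        # n positions
        candidates = []
        for total, seq in beam:            # k hypotheses on the beam
            prev = seq[-1] if seq else None
            for t in tags:                 # s expansions each
                candidates.append((total + score(prev, t, w), seq + [t]))
        # keep the k best of the k*s candidates: the log(ks) factor
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

Usage on a toy version of the slide's example, with made-up emission and transition scores:

```python
TAGS = ["NNP", "VBZ", "NNS"]
def toy_score(prev, tag, word):
    emit = {("Fed", "NNP"): 0.9, ("raises", "VBZ"): 1.2, ("raises", "NNS"): -1.0}
    trans = {("NNP", "VBZ"): 0.5}
    return emit.get((word, tag), -2.0) + trans.get((prev, tag), 0.0)
beam_search(["Fed", "raises"], TAGS, toy_score, k=2)
```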
How good is beam search?
‣ k = 1: greedy search
‣ Choosing beam size:
‣ 2 is usually better than 1
‣ Usually don't use larger than 50
‣ Depends on problem structure
‣ If beam search is much faster than computing full sums, can use structured perceptron instead of CRFs
‣ Very similar to structured SVM
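To see why k = 1 can lose, here is a tiny self-contained comparison (the toy scoring function is my own invention): greedy search commits to the locally best first tag and can miss the globally best sequence that exhaustive search finds.

```python
from itertools import product

def seq_score(seq, words, score):
    """Total score of a full tag sequence under score(prev, tag, word)."""
    prev, total = None, 0.0
    for t, w in zip(seq, words):
        total += score(prev, t, w)
        prev = t
    return total

def greedy(words, tags, score):
    """Beam search with k = 1: pick the locally best tag at each step."""
    seq, prev = [], None
    for w in words:
        best = max(tags, key=lambda t: score(prev, t, w))
        seq.append(best)
        prev = best
    return seq

def exhaustive(words, tags, score):
    """Exact search: try all s^n tag sequences (only feasible for toys)."""
    return list(max(product(tags, repeat=len(words)),
                    key=lambda seq: seq_score(seq, words, score)))
```

With a scoring function where the best first tag leads into a bad transition, greedy's sequence scores strictly worse than the exhaustive optimum.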