Algorithms for NLP
POS Tagging / Parsing I
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
What Needs to be Learned?
§ Emissions: P(x | phone class)
§ x is MFCC-valued
§ Transitions: P(state | prev state)
§ If between words, this is P(word | history)
§ If inside words, this is P(advance | phone class)
§ (Really a hierarchical model)
[Figure: HMM with hidden states s emitting observations x]
Estimation from Aligned Data
§ What if each time step was labeled with its (context-dependent sub)phone?
§ Can estimate P(x | /ae/) as empirical mean and (co-)variance of x's with label /ae/
§ Problem: Don't know alignment at the frame and phone level
[Figure: frames x labeled /k/ /ae/ /ae/ … /ae/ /t/]
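The estimate above is just per-label sufficient statistics. A minimal sketch, with scalar "frames" standing in for MFCC vectors (a real system would use vector means and covariance matrices); the toy data below is made up:

```python
from collections import defaultdict

def estimate_emissions(frames, labels):
    """Estimate a (mean, variance) Gaussian per phone label from
    frame-aligned data.  Frames are scalars here for simplicity."""
    by_phone = defaultdict(list)
    for x, phone in zip(frames, labels):
        by_phone[phone].append(x)
    params = {}
    for phone, xs in by_phone.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[phone] = (mean, var)
    return params

# Toy aligned data: each frame labeled with its phone
frames = [1.0, 1.2, 0.8, 3.0, 3.2]
labels = ["/k/", "/k/", "/k/", "/ae/", "/ae/"]
params = estimate_emissions(frames, labels)
```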
Forced Alignment
§ What if the acoustic model P(x | phone) was known?
§ … and also the correct sequence of words/phones
§ Can predict the best alignment of frames to phones
§ Called "forced alignment"

    ssssssssppppeeeeeeetshshshshllllaeaeaebbbbb
    "speech lab"
Forced Alignment
§ Create a new state space that forces the hidden variables to transition through phones in the (known) order
§ Still have uncertainty about durations
§ In this HMM, all the parameters are known
§ Transitions determined by known utterance
§ Emissions assumed to be known
§ Minor detail: self-loop probabilities
§ Just run Viterbi (or approximations) to get the best alignment

    /s/ /p/ /ee/ /ch/ /l/ /ae/ /b/
EM for Alignment
§ Input: acoustic sequences with word-level transcriptions
§ We don't know either the emission model or the frame alignments
§ Expectation Maximization (Hard EM for now)
§ Alternating optimization
§ Impute completions for unlabeled variables (here, the states at each time step)
§ Re-estimate model parameters (here, Gaussian means, variances, mixture ids)
§ Repeat
§ One of the earliest uses of EM!
Soft EM
§ Hard EM uses the best single completion
§ Here, single best alignment
§ Not always representative
§ Certainly bad when your parameters are initialized and the alignments are all tied
§ Uses the count of various configurations (e.g. how many tokens of /ae/ have self-loops)
§ What we'd really like is to know the fraction of paths that include a given completion
§ E.g. 0.32 of the paths align this frame to /p/, 0.21 align it to /ee/, etc.
§ Formally want to know the expected count of configurations
§ Key quantity: P(s_t | x)
Fractional Counts
§ Computing fractional (expected) counts
§ Compute forward/backward probabilities
§ For each position, compute marginal posteriors
§ Accumulate expectations
§ Re-estimate parameters (e.g. means, variances, self-loop probabilities) from ratios of these expected counts
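The forward/backward step can be sketched directly. A toy version for a discrete HMM (plain probabilities, no log-space or rescaling, which a real implementation would need); the two-state model and observations below are made up:

```python
def forward_backward(obs, states, start, trans, emit):
    """Posterior marginals P(state at t | observations) for a small HMM:
    alpha (forward) times beta (backward), normalized."""
    n = len(obs)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        alpha.append({s: emit[s][obs[t]] *
                      sum(alpha[t - 1][sp] * trans[sp][s] for sp in states)
                      for s in states})
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        beta[t] = {s: sum(trans[s][sn] * emit[sn][obs[t + 1]] * beta[t + 1][sn]
                          for sn in states) for s in states}
    z = sum(alpha[n - 1][s] for s in states)
    return [{s: alpha[t][s] * beta[t][s] / z for s in states}
            for t in range(n)]

# Toy two-phone model with binary acoustic observations
states = ["/p/", "/ee/"]
start = {"/p/": 0.5, "/ee/": 0.5}
trans = {"/p/": {"/p/": 0.6, "/ee/": 0.4}, "/ee/": {"/p/": 0.1, "/ee/": 0.9}}
emit = {"/p/": {"lo": 0.8, "hi": 0.2}, "/ee/": {"lo": 0.3, "hi": 0.7}}
post = forward_backward(["lo", "hi", "hi"], states, start, trans, emit)
```

Each `post[t]` is exactly the "fraction of paths" distribution the slide asks for: accumulating these across the data gives the expected counts used to re-estimate parameters.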
Staged Training and State Tying
§ Creating CD phones:
§ Start with monophone, do EM training
§ Clone Gaussians into triphones
§ Build decision tree and cluster Gaussians
§ Clone and train mixtures (GMMs)
§ General idea:
§ Introduce complexity gradually
§ Interleave constraint with flexibility
Parts-of-Speech (English)
§ One basic kind of linguistic structure: syntactic word classes

Open class (lexical) words:
§ Nouns: proper (IBM, Italy), common (cat / cats, snow)
§ Verbs: main (see, registered), auxiliary (can, had)
§ Adjectives (yellow)
§ Adverbs (slowly)
§ Numbers (122,312, one)
§ … more

Closed class (functional) words:
§ Prepositions (to, with)
§ Particles (off, up)
§ Determiners (the, some)
§ Conjunctions (and, or)
§ Pronouns (he, its)
§ … more
CC    conjunction, coordinating                    and both but either or
CD    numeral, cardinal                            mid-1890 nine-thirty 0.5 one
DT    determiner                                   a all an every no that the
EX    existential there                            there
FW    foreign word                                 gemeinschaft hund ich jeux
IN    preposition or conjunction, subordinating    among whether out on by if
JJ    adjective or numeral, ordinal                third ill-mannered regrettable
JJR   adjective, comparative                       braver cheaper taller
JJS   adjective, superlative                       bravest cheapest tallest
MD    modal auxiliary                              can may might will would
NN    noun, common, singular or mass               cabbage thermostat investment subhumanity
NNP   noun, proper, singular                       Motown Cougar Yvette Liverpool
NNPS  noun, proper, plural                         Americans Materials States
NNS   noun, common, plural                         undergraduates bric-a-brac averages
POS   genitive marker                              ' 's
PRP   pronoun, personal                            hers himself it we them
PRP$  pronoun, possessive                          her his mine my our ours their thy your
RB    adverb                                       occasionally maddeningly adventurously
RBR   adverb, comparative                          further gloomier heavier less-perfectly
RBS   adverb, superlative                          best biggest nearest worst
RP    particle                                     aboard away back by on open through
TO    "to" as preposition or infinitive marker     to
UH    interjection                                 huh howdy uh whammo shucks heck
VB    verb, base form                              ask bring fire see take
VBD   verb, past tense                             pleaded swiped registered saw
VBG   verb, present participle or gerund           stirring focusing approaching erasing
VBN   verb, past participle                        dilapidated imitated reunified unsettled
VBP   verb, present tense, not 3rd person singular twist appear comprise mold postpone
VBZ   verb, present tense, 3rd person singular     bases reconstructs marks uses
WDT   WH-determiner                                that what whatever which whichever
WP    WH-pronoun                                   that what whatever which who whom
WP$   WH-pronoun, possessive                       whose
WRB   WH-adverb                                    however whenever where why
Part-of-Speech Ambiguity
§ Words can have multiple parts of speech:

    Fed   raises  interest  rates  0.5  percent
    NNP   NNS     NN        NNS    CD   NN
    VBN   VBZ     VBP       VBZ
    VBD           VB

§ Two basic sources of constraint:
§ Grammatical environment
§ Identity of the current word
§ Many more possible features:
§ Suffixes, capitalization, name databases (gazetteers), etc…
Why POS Tagging?
§ Useful in and of itself (more than you'd think)
§ Text-to-speech: record, lead
§ Lemmatization: saw[v] → see, saw[n] → saw
§ Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS}
§ Useful as a pre-processing step for parsing
§ Less tag ambiguity means fewer parses
§ However, some tag choices are better decided by parsers:

    DT  NN      IN  NN        VBD/VBN  NNS   VBD
    The average of  interbank  offered  rates plummeted …

    DT  NNP     NN     VBD  VBN   RP/IN  NN   NNS
    The Georgia branch had  taken on     loan commitments …
Classic Solution: HMMs
§ We want a model of sequences s and observations w
[Figure: HMM with states s0, s1 … sn emitting words w1 … wn]
§ Assumptions:
§ States are tag n-grams
§ Usually a dedicated start and end state/word
§ Tag/state sequence is generated by a markov model
§ Words are chosen independently, conditioned only on the tag/state
§ These are totally broken assumptions: why?
States
§ States encode what is relevant about the past
§ Transitions P(s | s') encode well-formed tag sequences
§ In a bigram tagger, states = tags:
    s0 = <¨>,  s1 = <t1>,  s2 = <t2>,  …  sn = <tn>
§ In a trigram tagger, states = tag pairs:
    s0 = <¨,¨>,  s1 = <¨,t1>,  s2 = <t1,t2>,  …  sn = <tn-1,tn>
Estimating Transitions
§ Use standard smoothing methods to estimate transitions:

    P(t_i | t_{i-1}, t_{i-2}) = λ2 P̂(t_i | t_{i-1}, t_{i-2}) + λ1 P̂(t_i | t_{i-1}) + (1 − λ1 − λ2) P̂(t_i)

§ Can get a lot fancier (e.g. KN smoothing) or use higher orders, but in this case it doesn't buy much
§ One option: encode more into the state, e.g. whether the previous word was capitalized (Brants 00)
§ BIG IDEA: The basic approach of state-splitting / refinement turns out to be very important in a range of tasks
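The interpolated estimate is a one-liner once the relative-frequency tables exist. A minimal sketch; the tables and λ values below are hypothetical inputs (in practice the λs are tuned, e.g. by deleted interpolation):

```python
def interp_trans(t, t1, t2, unigram, bigram, trigram, lam1, lam2):
    """Interpolated trigram transition estimate:
    lam2 * P^(t | t1, t2) + lam1 * P^(t | t1) + (1 - lam1 - lam2) * P^(t)."""
    return (lam2 * trigram.get((t2, t1, t), 0.0)
            + lam1 * bigram.get((t1, t), 0.0)
            + (1 - lam1 - lam2) * unigram.get(t, 0.0))

# Hypothetical relative frequencies from a tagged corpus
trigram = {("DT", "NN", "VBZ"): 0.5}
bigram = {("NN", "VBZ"): 0.3}
unigram = {"VBZ": 0.1}
p = interp_trans("VBZ", "NN", "DT", unigram, bigram, trigram,
                 lam1=0.3, lam2=0.5)
```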
Estimating Emissions
§ Emissions are trickier:
§ Words we've never seen before
§ Words which occur with tags we've never seen them with
§ One option: break out the fancy smoothing (e.g. KN, Good-Turing)
§ Issue: unknown words aren't black boxes:

    343,127.23   11-year   Minteria   reintroducibly

§ Basic solution: unknown word classes (affixes or shapes):

    D+,D+.D+   D+-x+   Xx+   x+-"ly"

§ Common approach: Estimate P(t|w) and invert
§ [Brants 00] used a suffix trie as its (inverted) emission model
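The shape classes above are easy to compute. One common variant among many (exact conventions differ between systems; this one collapses character runs with "+"):

```python
def word_shape(word):
    """Map a word to a coarse shape: digits -> D, uppercase -> X,
    lowercase -> x, other characters kept; repeated runs marked '+'."""
    out = []
    for ch in word:
        if ch.isdigit():
            c = "D"
        elif ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        else:
            c = ch
        if out and out[-1][0] == c:
            out[-1] = (c, True)   # extend a run of the same class
        else:
            out.append((c, False))
    return "".join(c + ("+" if run else "") for c, run in out)
```

For example, `word_shape("343,127.23")` gives `"D+,D+.D+"` and `word_shape("Minteria")` gives `"Xx+"`, matching the classes on the slide.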
Disambiguation (Inference)
§ Problem: find the most likely (Viterbi) sequence under the model
§ Given model parameters, we can score any tag sequence:

    Fed   raises  interest  rates  0.5  percent  .
    NNP   VBZ     NN        NNS    CD   NN       .

    P(NNP | <¨,¨>) P(Fed | NNP) P(VBZ | <¨,NNP>) P(raises | VBZ) P(NN | <NNP,VBZ>) …

§ In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence)

    NNP VBZ NN NNS CD NN    logP = -23
    NNP NNS NN NNS CD NN    logP = -29
    NNP VBZ VB NNS CD NN    logP = -27

    States: <¨,¨> <¨,NNP> <NNP,VBZ> <VBZ,NN> <NN,NNS> <NNS,CD> <CD,NN> <STOP>
Finding the Best Trajectory
§ Too many trajectories (state sequences) to list
§ Option 1: Beam Search
§ A beam is a set of partial hypotheses
§ Start with just the single empty trajectory
§ At each derivation step:
§ Consider all continuations of previous hypotheses
§ Discard most, keep top k, or those within a factor of the best
§ Beam search works ok in practice
§ … but sometimes you want the optimal answer
§ … and you need optimal answers to validate your beam search
§ … and there's usually a better option than naïve beams

    <>  →  Fed:NNP, Fed:VBN, Fed:VBD
    Fed:NNP  →  Fed:NNP raises:NNS,  Fed:NNP raises:VBZ
    Fed:VBN  →  Fed:VBN raises:NNS,  Fed:VBN raises:VBZ
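The expand-then-prune loop above can be sketched in a few lines. The local score function here is a made-up log-score table, standing in for whatever model (HMM, MEMM) supplies the scores:

```python
def beam_tag(words, tags, score, k=2):
    """Beam search over tag sequences: extend every hypothesis with
    every tag, then keep only the top-k partial hypotheses."""
    beam = [((), 0.0)]   # (partial tag sequence, log score)
    for w in words:
        cands = [(seq + (t,), lp + score(seq[-1] if seq else "<s>", t, w))
                 for seq, lp in beam for t in tags]
        cands.sort(key=lambda c: c[1], reverse=True)
        beam = cands[:k]   # discard most, keep top k
    return beam[0]

def toy_score(prev, tag, word):
    # Hypothetical local log-scores for illustration only
    table = {("<s>", "N", "Fed"): 0.0, ("<s>", "V", "Fed"): -2.0,
             ("N", "V", "raises"): 0.0, ("N", "N", "raises"): -1.0,
             ("V", "N", "raises"): -0.5, ("V", "V", "raises"): -3.0}
    return table.get((prev, tag, word), -5.0)

best_seq, best_lp = beam_tag(["Fed", "raises"], ["N", "V"], toy_score, k=2)
```

With `k` large enough to hold every hypothesis this is exact; the point of the beam is that small `k` usually still finds the best sequence.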
The State Lattice / Trellis
[Figure: trellis with states {^, N, V, J, D, $} at each of the positions START, Fed, raises, interest, rates, END]
The State Lattice / Trellis
[Figure: the same trellis over "START Fed raises interest rates END", with one trajectory through it]
The Viterbi Algorithm
§ Dynamic program for computing the score of a best path up to position i ending in state s:

    δ_i(s) = max_{s_0 … s_{i-1}} P(s_0 … s_{i-1} s, w_1 … w_{i-1})

§ Recurrence and base case:

    δ_i(s) = max_{s'} P(s | s') P(w_{i-1} | s') δ_{i-1}(s')

    δ_0(s) = 1 if s = <¨,¨>, 0 otherwise

§ Also can store a backtrace (but no one does):

    ψ_i(s) = argmax_{s'} P(s | s') P(w_{i-1} | s') δ_{i-1}(s')

§ Memoized solution
§ Iterative solution
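The iterative solution, sketched for a bigram HMM (states = tags); the trigram version on the slides just uses tag-pair states. The toy model below is invented for illustration:

```python
def viterbi(words, tags, start, trans, emit):
    """Viterbi decoding: delta[i][t] is the best score of any tag
    sequence for words[:i+1] ending in tag t; back stores argmaxes."""
    delta = [{t: start.get(t, 0.0) * emit[t].get(words[0], 0.0)
              for t in tags}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            prev = max(tags,
                       key=lambda tp: delta[-1][tp] * trans[tp].get(t, 0.0))
            scores[t] = (delta[-1][prev] * trans[prev].get(t, 0.0)
                         * emit[t].get(w, 0.0))
            ptrs[t] = prev
        delta.append(scores)
        back.append(ptrs)
    # Follow backpointers from the best final state
    t = max(tags, key=lambda s: delta[-1][s])
    seq = [t]
    for ptrs in reversed(back):
        t = ptrs[t]
        seq.append(t)
    return list(reversed(seq))

# Toy model
tags = ["N", "V"]
start = {"N": 0.8, "V": 0.2}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"Fed": 0.6, "raises": 0.1}, "V": {"Fed": 0.1, "raises": 0.5}}
path = viterbi(["Fed", "raises"], tags, start, trans, emit)
```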
So How Well Does It Work?
§ Choose the most common tag
§ 90.3% with a bad unknown word model
§ 93.7% with a good one
§ TnT (Brants, 2000):
§ A carefully smoothed trigram tagger
§ Suffix trees for emissions
§ 96.7% on WSJ text (SOTA is 97+%)
§ Noise in the data
§ Many errors in the training and test corpora
§ Probably about 2% guaranteed error from noise (on this data)

    chief executive officer:  NN NN NN  /  JJ NN NN  /  JJ JJ NN  /  NN JJ NN

    DT  NN      IN  NN        VBD     NNS   VBD
    The average of  interbank  offered rates plummeted …
Overview: Accuracies
§ Roadmap of (known / unknown) accuracies:
§ Most freq tag:     ~90% / ~50%
§ Trigram HMM:       ~95% / ~55%
§ TnT (HMM++):       96.2% / 86.0%
§ Maxent P(t|w):     93.7% / 82.6%
§ MEMM tagger:       96.9% / 86.9%
§ State-of-the-art:  97+% / 89+%
§ Upper bound:       ~98%

Most errors on unknown words.
Common Errors
§ Common errors [from Toutanova & Manning 00]:

    NN/JJ    NN
    official knowledge

    VBD   RP/IN  DT   NN
    made  up     the  story

    RB        VBD/VBN  NNS
    recently  sold     shares
Better Features
§ Can do surprisingly well just looking at a word by itself:
§ Word             the: the → DT
§ Lowercased word  Importantly: importantly → RB
§ Prefixes         unfathomable: un- → JJ
§ Suffixes         Surprisingly: -ly → RB
§ Capitalization   Meridian: CAP → NNP
§ Word shapes      35-year: d-x → JJ
§ Then build a maxent (or whatever) model to predict tag
§ Maxent P(t|w): 93.7% / 82.6%
Why Linear Context is Useful
§ Lots of rich local information!

    PRP  VBD   IN  RB    IN  PRP  VBD      .
    They left  as  soon  as  he   arrived  .
    (the first "as" looks like IN by itself, but the context demands RB)

§ We could fix this with a feature that looked at the next word

    NNP        NNS    VBD       VBN         .
    Intrinsic  flaws  remained  undetected  .
    (capitalized "Intrinsic" looks like NNP, but should be JJ)

§ We could fix this by linking capitalized words to their lowercase versions
§ Solution: discriminative sequence models (MEMMs, CRFs)
§ Reality check:
§ Taggers are already pretty good on newswire text…
§ What the world needs is taggers that work on other text!
Sequence-Free Tagging?
§ What about looking at a word and its environment, but no sequence information?
§ Add in previous / next word:      the __
§ Previous / next word shapes:     X __ X
§ Occurrence pattern features:     [X: x X occurs]
§ Crude entity detection:          __ ….. (Inc.|Co.)
§ Phrasal verb in sentence?        put …… __
§ Conjunctions of these things
§ All features except sequence: 96.6% / 86.8%
§ Uses lots of features: >200K
§ Why isn't this the standard approach?
Named Entity Recognition
§ Other sequence tasks use similar models
§ Example: named entity recognition (NER)

    Tim  Boon  has  signed  a  contract  extension  with  Leicestershire  which  will  keep  him  at  Grace  Road  .
    PER  PER   O    O       O  O         O          O     ORG             O      O     O     O    O   LOC    LOC   O

Local Context:
           Prev   Cur    Next
    State  Other  ???    ???
    Word   at     Grace  Road
    Tag    IN     NNP    NNP
    Sig    x      Xx     Xx
MEMM Taggers
§ Idea: left-to-right local decisions, condition on previous tags and also entire input
§ Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use to score sequences
§ This is referred to as an MEMM tagger [Ratnaparkhi 96]
§ Beam search effective! (Why?)
§ What about beam size 1?
§ Subtle issues with local normalization (cf. Lafferty et al 01)
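Beam size 1 means committing to the single best tag at every position, left to right. A minimal sketch; `toy_prob` below is a made-up stand-in for the trained maxent distribution P(t_i | w, t_{i-1}):

```python
def memm_greedy(words, local_prob):
    """MEMM decoding with beam size 1: at each position take the
    argmax of the local conditional distribution and move on."""
    prev, out = "<s>", []
    for w in words:
        dist = local_prob(w, prev)     # P(tag | word, previous tag)
        prev = max(dist, key=dist.get)
        out.append(prev)
    return out

def toy_prob(word, prev_tag):
    # Hypothetical local distributions for illustration only
    table = {("The", "<s>"): {"DT": 0.9, "NN": 0.1},
             ("can", "DT"): {"NN": 0.6, "MD": 0.4}}
    return table.get((word, prev_tag), {"NN": 1.0})

tags_out = memm_greedy(["The", "can"], toy_prob)
```

Because each decision conditions on the entire input (not just the current word), even this greedy version can use rich features; a larger beam or Viterbi just hedges against early mistakes.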
NER Features

Local Context:
           Prev   Cur    Next
    State  Other  ???    ???
    Word   at     Grace  Road
    Tag    IN     NNP    NNP
    Sig    x      Xx     Xx

Feature Weights:
    Feature Type           Feature    PERS    LOC
    Previous word          at         -0.73   0.94
    Current word           Grace       0.03   0.00
    Beginning bigram       <G          0.45  -0.04
    Current POS tag        NNP         0.47   0.45
    Prev and cur tags      IN NNP     -0.10   0.14
    Previous state         Other      -0.70  -0.92
    Current signature      Xx          0.80   0.46
    Prev state, cur sig    O-Xx        0.68   0.37
    Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
    P. state - p-cur sig   O-x-Xx     -0.20   0.82
    …
    Total:                            -0.58   2.68

Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.
Decoding
§ Decoding MEMM taggers:
§ Just like decoding HMMs, different local scores
§ Viterbi, beam search, posterior decoding
§ Viterbi algorithm (HMMs)
§ Viterbi algorithm (MEMMs)
§ General: the same recurrence works for any local scores that multiply along the sequence
Maximum Entropy II
§ Remember: maximum entropy objective
§ Problem: lots of features allow perfect fit to training set
§ Regularization (compare to smoothing): big weights are bad

Derivative for Maximum Entropy
§ For each feature n, the derivative is the total count of feature n in correct candidates, minus the expected count of feature n in predicted candidates, minus the regularization penalty on big weights
Perceptron
§ Linear models:
§ … that decompose along the sequence
§ … allow us to predict with the Viterbi algorithm
§ … which means we can train with the perceptron algorithm (or related updates, like MIRA)

[Collins 01]
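The structured perceptron update itself is tiny: after Viterbi predicts a sequence, add the gold sequence's features and subtract the predicted sequence's. A sketch with sparse feature-count dicts; the feature names below are made up:

```python
def perceptron_update(weights, gold_feats, pred_feats, lr=1.0):
    """One structured-perceptron step: w += f(gold) - f(predicted).
    Feature vectors are sparse {feature: count} dicts."""
    for f, c in gold_feats.items():
        weights[f] = weights.get(f, 0.0) + lr * c
    for f, c in pred_feats.items():
        weights[f] = weights.get(f, 0.0) - lr * c
    return weights

# Hypothetical features: tag-pair transitions and tag-word emissions
w = perceptron_update({},
                      {("DT", "NN"): 1, ("NN", "interest"): 1},   # gold
                      {("DT", "VB"): 1, ("VB", "interest"): 1})   # predicted
```

Note the update is zero whenever the prediction is already correct, since the two feature vectors cancel.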
CRFs
§ Like any maxent model, the derivative is empirical feature counts minus expected feature counts
§ So all we need is to be able to compute the expectation of each feature (for example the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs) under the model distribution
§ Critical quantity: counts of posterior marginals
Computing Posterior Marginals
§ How many (expected) times is word w tagged with s?
§ How to compute that marginal?
[Figure: trellis with states {^, N, V, J, D, $} over "START Fed raises interest rates END"]
Global Discriminative Taggers
§ Newer, higher-powered discriminative sequence models
§ CRFs (also perceptrons, M3Ns)
§ Do not decompose training into independent local regions
§ Can be deathly slow to train: require repeated inference on training set
§ Differences tend not to be too important for POS tagging
§ Differences more substantial on other sequence tasks
§ However: one issue worth knowing about in local models
§ "Label bias" and other explaining away effects
§ MEMM taggers' local scores can be near one without having both good "transitions" and "emissions"
§ This means that often evidence doesn't flow properly
§ Why isn't this a big deal for POS tagging?
§ Also: in decoding, condition on predicted, not gold, histories
Transformation-Based Learning
§ [Brill 95] presents a transformation-based tagger
§ Label the training set with most frequent tags:

    DT   MD   VBD  VBD     .
    The  can  was  rusted  .

§ Add transformation rules which reduce training mistakes:
§ MD → NN : DT __
§ VBD → VBN : VBD __ .
§ Stop when no transformations do sufficient good
§ Does this remind anyone of anything?
§ Probably the most widely used tagger (esp. outside NLP)
§ … but definitely not the most accurate: 96.6% / 82.0%
EngCG Tagger
§ English constraint grammar tagger
§ [Tapanainen and Voutilainen 94]
§ Something else you should know about
§ Hand-written and knowledge driven
§ "Don't guess if you know" (general point about modeling more structure!)
§ Tag set doesn't make all of the hard distinctions of the standard tag set (e.g. JJ/NN)
§ They get stellar accuracies: 99% on their tag set
§ Linguistic representation matters…
§ … but it's easier to win when you make up the rules
Domain Effects
§ Accuracies degrade outside of domain
§ Up to triple error rate
§ Usually make the most errors on the things you care about in the domain (e.g. protein names)
§ Open questions
§ How to effectively exploit unlabeled data from a new domain (what could we gain?)
§ How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)
Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
§ Raw sentences in
§ Tagged sentences out
§ Obvious thing to do:
§ Start with a (mostly) uniform HMM
§ Run EM
§ Inspect results
EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and reestimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params
§ Same quantities we needed to train a CRF!
The State Lattice / Trellis
[Figure: trellis with states {^, N, V, J, D, $} over "START Fed raises interest rates END"]
Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]
§ Setup:
§ You know the set of allowable tags for each word
§ Fix k training examples to their true labels
§ Learn P(w|t) on these examples
§ Learn P(t|t-1,t-2) on these examples
§ On n examples, re-estimate with EM
§ Note: we know allowed tags but not frequencies
Distributional Clustering

    ¨ the president said that the downturn was over ¨

    president  →  the __ of,      the __ said
    governor   →  the __ of,      the __ appointed
    said       →  sources __ ¨,   president __ that
    reported   →  sources __ ¨

Words with similar context distributions cluster together: {president, governor}, {said, reported}, {the, a}

[Finch and Chater 92, Schuetze 93, many others]
Distributional Clustering
§ Three main variants on the same idea:
§ Pairwise similarities and heuristic clustering
§ E.g. [Finch and Chater 92]
§ Produces dendrograms
§ Vector space methods
§ E.g. [Schuetze 93]
§ Models of ambiguity
§ Probabilistic methods
§ Various formulations, e.g. [Lee and Pereira 99]
Vector Space Version
§ [Schuetze 93] clusters words as points in R^n
§ Vectors too sparse, use SVD to reduce:

    M_w = (context counts for word w)   ≈   U S V

§ Cluster these 50-200 dim vectors instead.
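The raw context-count vectors are easy to build and compare. A sketch of the first step only: Schuetze 93 would SVD these sparse vectors down to 50-200 dimensions before clustering, which is omitted here; the toy sentences are made up:

```python
from collections import Counter
import math

def context_vectors(sents):
    """Count immediate left/right context words for each word type,
    giving the sparse vectors a distributional clusterer would reduce."""
    vecs = {}
    for sent in sents:
        padded = ["<s>"] + sent + ["</s>"]
        for i, w in enumerate(padded[1:-1], start=1):
            v = vecs.setdefault(w, Counter())
            v["L=" + padded[i - 1]] += 1
            v["R=" + padded[i + 1]] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[f] * v[f] for f in u if f in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sents = [["the", "president", "said"], ["the", "governor", "said"]]
vecs = context_vectors(sents)
```

Here "president" and "governor" end up with identical context vectors, exactly the signal the clusterings above exploit.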
A Probabilistic Version?

    P(S, C) = ∏_i P(c_i | c_{i-1}) P(w_i | c_i)

    ¨  the  president  said  that  the  downturn  was  over  ¨
       c1   c2         c3    c4    c5   c6        c7   c8

    P(S, C) = ∏_i P(c_i) P(w_i | c_i) P(w_{i-1}, w_{i+1} | c_i)

    ¨  the  president  said  that  the  downturn  was  over  ¨
       c1   c2         c3    c4    c5   c6        c7   c8
What Else?
§ Various newer ideas:
§ Context distributional clustering [Clark 00]
§ Morphology-driven models [Clark 03]
§ Contrastive estimation [Smith and Eisner 05]
§ Feature-rich induction [Haghighi and Klein 06]
§ Also:
§ What about ambiguous words?
§ Using wider context signatures has been used for learning synonyms (what's wrong with this approach?)
§ Can extend these ideas for grammar induction (later)
Parse Trees
The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market
Phrase Structure Parsing
§ Phrase structure parsing organizes syntax into constituents or brackets
§ In general, this involves nested trees
§ Linguists can, and do, argue about details
§ Lots of ambiguity
§ Not the only kind of syntax…

    new art critics write reviews with computers
[Figure: parse tree over the sentence, with S, VP, NP, N', and PP nodes]
Constituency Tests
§ How do we know what nodes go in the tree?
§ Classic constituency tests:
§ Substitution by proform
§ Question answers
§ Semantic grounds
§ Coherence
§ Reference
§ Idioms
§ Dislocation
§ Conjunction
§ Cross-linguistic arguments, too
Conflicting Tests
§ Constituency isn't always clear
§ Units of transfer:
§ think about ~ penser à
§ talk about ~ hablar de
§ Phonological reduction:
§ I will go → I'll go
§ I want to go → I wanna go
§ à le centre → au centre
§ Coordination
§ He went to and came from the store.

    La vélocité des ondes sismiques
Classical NLP: Parsing
§ Write symbolic or logical rules:

    Grammar (CFG)          Lexicon
    ROOT → S               NN → interest
    S → NP VP              NNS → raises
    NP → DT NN             VBP → interest
    NP → NN NNS            VBZ → raises
    NP → NP PP             …
    VP → VBP NP
    VP → VBP NP PP
    PP → IN NP

§ Use deduction systems to prove parses from words
§ Minimal grammar on "Fed raises" sentence: 36 parses
§ Simple 10-rule grammar: 592 parses
§ Real-size grammar: many millions of parses
§ This scaled very badly, didn't yield broad-coverage tools
Attachments
§ I cleaned the dishes from dinner
§ I cleaned the dishes with detergent
§ I cleaned the dishes in my pajamas
§ I cleaned the dishes in the sink
Syntactic Ambiguities I
§ Prepositional phrases: They cooked the beans in the pot on the stove with handles.
§ Particle vs. preposition: The puppy tore up the staircase.
§ Complement structures: The tourists objected to the guide that they couldn't hear. / She knows you like the back of her hand.
§ Gerund vs. participial adjective: Visiting relatives can be boring. / Changing schedules frequently confused passengers.
Syntactic Ambiguities II
§ Modifier scope within NPs: impractical design requirements / plastic cup holder
§ Multiple gap constructions: The chicken is ready to eat. / The contractors are rich enough to sue.
§ Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.
Dark Ambiguities
§ Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
§ Unknown words and new usages
§ Solution: We need mechanisms to focus attention on the best ones; probabilistic techniques do this

This analysis corresponds to the correct parse of "This will panic buyers!"
Probabilistic Context-Free Grammars
§ A context-free grammar is a tuple <N, T, S, R>
§ N : the set of non-terminals
§ Phrasal categories: S, NP, VP, ADJP, etc.
§ Parts-of-speech (pre-terminals): NN, JJ, DT, VB
§ T : the set of terminals (the words)
§ S : the start symbol
§ Often written as ROOT or TOP
§ Not usually the sentence non-terminal S
§ R : the set of rules
§ Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
§ Examples: S → NP VP, VP → VP CC VP
§ Also called rewrites, productions, or local trees
§ A PCFG adds:
§ A top-down production probability per rule P(Y1 Y2 … Yk | X)
Treebank Grammars
§ Need a PCFG for broad coverage parsing.
§ Can take a grammar right off the trees (doesn't work well):

    ROOT → S            1
    S → NP VP .         1
    NP → PRP            1
    VP → VBD ADJP       1
    …

§ Better results by enriching the grammar (e.g., lexicalization).
§ Can also get state-of-the-art parsers without lexicalization.
[Figure: example NP expansions from the treebank, e.g. DET NOUN, DET ADJ NOUN, NP PP, NP CONJ NP, PLURAL NOUN]
Treebank Grammar Scale
§ Treebank grammars can be enormous
§ As FSAs, the raw grammar has ~10K states, excluding the lexicon
§ Better parsers usually make the grammars larger, not smaller
[Figure: FSA over the right-hand sides of NP rules]
Chomsky Normal Form
§ Chomsky normal form:
§ All rules of the form X → Y Z or X → w
§ In principle, this is no limitation on the space of (P)CFGs
§ N-ary rules introduce new non-terminals
§ Unaries / empties are "promoted"
§ In practice it's kind of a pain:
§ Reconstructing n-aries is easy
§ Reconstructing unaries is trickier
§ The straightforward transformations don't preserve tree scores
§ Makes parsing algorithms simpler!

[Figure: VP → VBD NP PP PP binarized with intermediate dotted symbols [VP → VBD NP •] and [VP → VBD NP PP •]]
A Recursive Parser
§ Will this parser work?
§ Why or why not?
§ Memory requirements?

    bestScore(X, i, j, s)
      if (j == i+1)
        return tagScore(X, s[i])
      else
        return max score(X->YZ) *              // max over rules X->YZ
                   bestScore(Y, i, k, s) *     // and split points k
                   bestScore(Z, k, j, s)
A Memoized Parser
§ One small change:

    bestScore(X, i, j, s)
      if (scores[X][i][j] == null)
        if (j == i+1)
          score = tagScore(X, s[i])
        else
          score = max score(X->YZ) *
                      bestScore(Y, i, k, s) *
                      bestScore(Z, k, j, s)
        scores[X][i][j] = score
      return scores[X][i][j]

§ Can also organize things bottom-up
A Bottom-Up Parser (CKY)

    bestScore(s)
      for (i : [0, n-1])
        for (X : tags[s[i]])
          score[X][i][i+1] = tagScore(X, s[i])
      for (diff : [2, n])
        for (i : [0, n-diff])
          j = i + diff
          for (X -> Y Z : rule)
            for (k : [i+1, j-1])
              score[X][i][j] = max(score[X][i][j],
                                   score(X->YZ) * score[Y][i][k] * score[Z][k][j])

[Figure: X over span [i, j] built from Y over [i, k] and Z over [k, j]]
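The bestScore pseudocode above translates directly into a runnable sketch. The toy lexicon, rules, and probabilities below are made up; a real treebank grammar would of course be far larger:

```python
def cky(words, lexicon, rules):
    """CKY for a binary PCFG, following the bestScore pseudocode:
    fill tag spans of length 1, then build larger spans bottom-up.
    lexicon[word] = {tag: score}; rules = [(X, Y, Z, score), ...]."""
    n = len(words)
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in lexicon.get(w, {}).items():
            score[i][i + 1][tag] = p
    for diff in range(2, n + 1):
        for i in range(0, n - diff + 1):
            j = i + diff
            for (x, y, z, p) in rules:
                for k in range(i + 1, j):
                    if y in score[i][k] and z in score[k][j]:
                        s = p * score[i][k][y] * score[k][j][z]
                        if s > score[i][j].get(x, 0.0):
                            score[i][j][x] = s
    return score

# Toy grammar and sentence
lexicon = {"critics": {"NP": 0.4}, "write": {"V": 0.9}, "reviews": {"NP": 0.3}}
rules = [("VP", "V", "NP", 1.0), ("S", "NP", "VP", 1.0)]
chart = cky(["critics", "write", "reviews"], lexicon, rules)
```

Backpointers for tree recovery are omitted; adding them means recording (rule, k) alongside each max, mirroring the ψ backtrace in Viterbi tagging.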
Unary Rules
§ Unary rules?

    bestScore(X, i, j, s)
      if (j == i+1)
        return tagScore(X, s[i])
      else
        return max of
          max score(X->YZ) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
          max score(X->Y) * bestScore(Y, i, j, s)
CNF + Unary Closure
§ We need unaries to be non-cyclic
§ Can address by pre-calculating the unary closure
§ Rather than having zero or more unaries, always have exactly one
§ Alternate unary and binary layers
§ Reconstruct unary chains afterwards

[Figure: trees such as NP → DT NN and VP → VBD NP, with unary chains like SBAR → S → VP collapsed into single closed unary steps]
Alternating Layers

    bestScoreU(X, i, j, s)
      if (j == i+1)
        return tagScore(X, s[i])
      else
        return max max score(X->Y) *
                       bestScoreB(Y, i, j, s)

    bestScoreB(X, i, j, s)
      return max max score(X->YZ) *
                     bestScoreU(Y, i, k, s) *
                     bestScoreU(Z, k, j, s)
Memory
§ How much memory does this require?
§ Have to store the score cache
§ Cache size: |symbols| * n^2 doubles
§ For the plain treebank grammar:
§ X ~ 20K, n = 40, double ~ 8 bytes → ~256 MB
§ Big, but workable.
§ Pruning: Beams
§ score[X][i][j] can get too large (when?)
§ Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i, j]
§ Pruning: Coarse-to-Fine
§ Use a smaller grammar to rule out most X[i,j]
§ Much more on this later…
Time: Theory
§ How much time will it take to parse?
§ For each diff (<= n)
§ For each i (<= n)
§ For each rule X → Y Z
§ For each split point k: do constant work
§ Total time: |rules| * n^3
§ Something like 5 sec for an unoptimized parse of a 20-word sentence
Time: Practice
§ Parsing with the vanilla treebank grammar (~20K rules, not an optimized parser!):
[Figure: runtime vs. sentence length; observed exponent 3.6]
§ Why's it worse in practice?
§ Longer sentences "unlock" more of the grammar
§ All kinds of systems issues don't scale
Same-Span Reachability
[Figure: which nonterminals are reachable over the same span. One large mutually-reachable group: ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP; outside it: TOP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT]
Rule State Reachability
§ Many states are more likely to match larger spans!

    Example: NP CC •     NP over [0, n-1], CC over [n-1, n]: 1 alignment
    Example: NP CC NP •  the final NP can end at many positions: n-k alignments
Efficient CKY
§ Lots of tricks to make CKY efficient
§ Some of them are little engineering details:
§ E.g., first choose k, then enumerate through the Y:[i,k] which are non-zero, then loop through rules by left child.
§ Optimal layout of the dynamic program depends on grammar, input, even system details.
§ Another kind is more important (and interesting):
§ Many X[i,j] can be suppressed on the basis of the input string
§ We'll see this next class as figures-of-merit, A* heuristics, coarse-to-fine, etc
Agenda-Based Parsing
§ Agenda-based parsing is like graph search (but over a hypergraph)
§ Concepts:
§ Numbering: we number fenceposts between words
§ "Edges" or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
§ A chart: records edges we've expanded (cf. closed set)
§ An agenda: a queue which holds edges (cf. a fringe or open set)

    0 critics 1 write 2 reviews 3 with 4 computers 5        PP[3,5]
Word Items
§ Building an item for the first time is called discovery. Items go into the agenda on discovery.
§ To initialize, we discover all word items (with score 1.0).

    0 critics 1 write 2 reviews 3 with 4 computers 5

    AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
    CHART:  [EMPTY]
Unary Projection
§ When we pop a word item, the lexicon tells us the tag item successors (and scores) which go on the agenda

    0 critics 1 write 2 reviews 3 with 4 computers 5
    critics[0,1] → NNS[0,1],  write[1,2] → VBP[1,2],  reviews[2,3] → NNS[2,3],
    with[3,4] → IN[3,4],  computers[4,5] → NNS[4,5]
Item Successors
§ When we pop items off of the agenda:
§ Graph successors: unary projections (NNS → critics, NP → NNS)
    Y[i,j] with X → Y forms X[i,j]
§ Hypergraph successors: combine with items already in our chart
    Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
§ Enqueue / promote resulting items (if not in chart already)
§ Record backtraces as appropriate
§ Stick the popped edge in the chart (closed set)
§ Queries a chart must support:
§ Is edge X[i,j] in the chart? (What score?)
§ What edges with label Y end at position j?
§ What edges with label Z start at position i?
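The pop / combine / enqueue loop can be sketched as an unweighted recognizer (scores, backtraces, and indexed chart queries omitted; the chart queries are done here by scanning, where a real parser would index edges by endpoint). The toy grammar is made up:

```python
from collections import deque

def agenda_parse(words, lexicon, unary, binary):
    """Agenda-based recognition over labeled spans ("edges"): pop an
    edge, add its unary projections, combine it with adjacent chart
    edges via binary rules, and enqueue anything newly discovered."""
    agenda = deque((w, i, i + 1) for i, w in enumerate(words))
    chart = set()
    while agenda:
        edge = agenda.popleft()
        if edge in chart:          # already expanded
            continue
        chart.add(edge)
        y, i, j = edge
        for x in unary.get(y, []) + lexicon.get(y, []):
            agenda.append((x, i, j))           # graph successors
        for (z, jj, k) in list(chart):         # edges starting at j
            if jj == j:
                for x in binary.get((y, z), []):
                    agenda.append((x, i, k))   # hypergraph successors
        for (z, h, ii) in list(chart):         # edges ending at i
            if ii == i:
                for x in binary.get((z, y), []):
                    agenda.append((x, h, j))
    return chart

# Toy grammar over "critics write reviews"
lexicon = {"critics": ["NNS"], "write": ["VBP"], "reviews": ["NNS"]}
unary = {"NNS": ["NP"]}
binary = {("VBP", "NP"): ["VP"], ("NP", "VP"): ["S"]}
chart = agenda_parse(["critics", "write", "reviews"], lexicon, unary, binary)
```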
An Example

    0 critics 1 write 2 reviews 3 with 4 computers 5

Discovery order (edges added to the chart):
    NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5]
    NP[0,1], NP[2,3], NP[4,5]
    VP[1,2], S[0,2], ROOT[0,2]
    PP[3,5], VP[1,3], S[0,3], ROOT[0,3]
    NP[2,5], VP[1,5], S[0,5], ROOT[0,5]
Empty Elements
§ Sometimes we want to posit nodes in a parse tree that don't contain any pronounced words:

    I want you to parse this sentence
    I want [ ]  to parse this sentence

§ These are easy to add to an agenda-based parser!
§ For each position i, add the "word" edge e[i,i]
§ Add rules like NP → e to the grammar
§ That's it!

    0 I 1 like 2 to 3 parse 4 empties 5    (an e edge at every fencepost)
UCS / A*
§ With weighted edges, order matters
§ Must expand optimal parse from bottom up (subparses first)
§ CKY does this by processing smaller spans before larger ones
§ UCS pops items off the agenda in order of decreasing Viterbi score
§ A* search also well defined
§ You can also speed up the search without sacrificing optimality
§ Can select which items to process first
§ Can do with any "figure of merit" [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]
(Speech) Lattices
§ There was nothing magical about words spanning exactly one position.
§ When working with speech, we generally don't know how many words there are, or where they break.
§ We can represent the possibilities as a lattice and parse these just as easily.

[Figure: word lattice over one utterance with competing arcs: I, eyes, Ivan, saw, awe, of, 've, a, an, van]