Unsupervised Approaches to Sequence Tagging, Morphology Induction, and Lexical Resource Acquisition Reza Bosaghzadeh & Nathan Schneider LS2 ~ 1 December 2008


Unsupervised Methods
– Sequence Labeling (Part-of-Speech Tagging)
– Morphology Induction
– Lexical Resource Acquisition

She ran to the station quickly .
pronoun verb preposition det noun adverb

un-supervise-d learn-ing

Contrastive Estimation
Smith & Eisner (2005)

•  Already discussed in class
•  Key idea: exploits implicit negative evidence
– Mutating training examples often gives ungrammatical (negative) sentences
– During training, shift probability mass from generated negative examples to given positive examples
•  BUT: Requires a tagging dictionary, i.e. a list of possible tags for each word type
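The "mutating training examples" idea can be sketched as follows. This is a minimal illustration only: the two neighborhood functions are in the spirit of the transposition and deletion neighborhoods, and `score` is a hypothetical stand-in for the model's unnormalized score of a sentence (in the real model this would sum over tag sequences).

```python
import math


def trans1_neighborhood(sentence):
    """The sentence itself plus every variant with one adjacent word pair swapped."""
    words = sentence.split()
    neighbors = [" ".join(words)]
    for i in range(len(words) - 1):
        mutated = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.append(" ".join(mutated))
    return neighbors


def del1word_neighborhood(sentence):
    """The sentence itself plus every variant with exactly one word deleted."""
    words = sentence.split()
    neighbors = [" ".join(words)]
    for i in range(len(words)):
        neighbors.append(" ".join(words[:i] + words[i + 1:]))
    return neighbors


def contrastive_log_likelihood(sentence, score, neighborhood):
    """log( score(sentence) / sum of scores over its neighborhood ).

    `score` is a hypothetical positive scoring function supplied by the caller.
    Raising this quantity shifts mass from the mutated sentences to the observed one.
    """
    total = sum(score(n) for n in neighborhood(sentence))
    return math.log(score(sentence) / total)
```

For example, `trans1_neighborhood("She ran quickly")` contains the original sentence plus "ran She quickly" and "She quickly ran".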

Prototype-driven tagging
Haghighi & Klein (2006)

[Diagram: Unlabeled Data + Prototype List (columns: Target Label, Prototypes); Annotated Data]

slide courtesy Haghighi & Klein

Prototype-driven tagging
Haghighi & Klein (2006)

English POS

Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.

[Figure: a second copy of the same text, annotated with tags: NN VBN CC JJ CD PUNC IN NNS IN NNP RB DET]

Prototype List
NN     president     IN     of
VBD    said          NNS    shares
CC     and           TO     to
NNP    Mr.           PUNC   .
JJ     new           CD     million
DET    the           VBP    are

slide courtesy Haghighi & Klein

Information Extraction: Classified Ads

Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.

Labels: Features, Location, Terms, Restrict, Size

Prototype List
Label       Prototypes
FEATURE     kitchen, laundry
LOCATION    near, close
TERMS       paid, utilities
SIZE        large, feet
RESTRICT    cat, smoking

slide courtesy Haghighi & Klein

Prototype-driven tagging
Haghighi & Klein (2006)

•  Trigram tagger, same features as (Smith & Eisner 2005)
– Word type, suffixes up to length 3, contains-hyphen, contains-digit, initial capitalization
•  Tie each word to its most similar prototype, using a context-based similarity technique (Schütze 1993)
– SVD dimensionality reduction
– Cosine similarity between context vectors
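The word-level features listed above might be extracted along these lines. The feature names (`word=`, `suffix1=`, `init-cap`, and so on) are illustrative choices, not the names used in either paper.

```python
def word_features(word):
    """Return the set of active features for one word token.

    Covers the feature types named on the slide: word type, suffixes up to
    length 3, contains-hyphen, contains-digit, initial capitalization.
    """
    feats = {"word=" + word}
    for k in range(1, 4):
        if len(word) >= k:
            feats.add("suffix%d=%s" % (k, word[-k:]))
    if "-" in word:
        feats.add("contains-hyphen")
    if any(ch.isdigit() for ch in word):
        feats.add("contains-digit")
    if word[:1].isupper():
        feats.add("init-cap")
    return feats
```

For instance, "Walking" activates its word-type feature, the suffixes "g", "ng", and "ing", and the initial-capitalization feature.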

slide adapted from Haghighi & Klein
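A rough sketch of tying a word to its most similar prototype by cosine similarity of context vectors. As a simplifying assumption, the context vectors here are raw counts of the immediate left and right neighbors, and the SVD dimensionality-reduction step is omitted entirely; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Map each word to counts of its immediate left/right neighbors."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]]["L=" + padded[i - 1]] += 1
            vectors[padded[i]]["R=" + padded[i + 1]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_prototype(word, prototypes, vectors):
    """Return the prototype whose context vector is closest to the word's."""
    return max(prototypes, key=lambda p: cosine(vectors[word], vectors[p]))
```

With a toy corpus in which "chairman" and the prototype "president" both occur between "the" and "said", `nearest_prototype("chairman", ["president", "of"], vectors)` picks "president".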

Prototype-driven tagging
Haghighi & Klein (2006)

Pros
•  Doesn't require tagging dictionary
Cons
•  Still need a tag set
•  May be hard to choose good prototypes

Unsupervised POS tagging
The State of the Art

Best supervised result (CRF): 99.5%!


Unsupervised Approaches to Morphology

•  Morphology refers to the internal structure of words
– A morpheme is a minimal meaningful linguistic unit
– Morpheme segmentation is the process of dividing words into their component morphemes
    un-supervise-d learn-ing
– Word segmentation is the process of finding word boundaries in a stream of speech or text
    unsupervised_learning_of_natural_language

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

•  Learns inflectional paradigms from raw text
– Requires only a list of word types from a corpus
– Looks at word counts of substrings, and proposes (stem, suffix) pairings based on type frequency
•  3-stage algorithm
– Stage 1: Candidate paradigms based on frequencies
– Stages 2-3: Refinement of paradigm set via merging and filtering
•  Paradigms can be used for morpheme segmentation or stemming

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

speak       dance       buy
hablar      bailar      comprar
hablo       bailo       compro
hablamos    bailamos    compramos
hablan      bailan      compran
…           …           …

•  A sampling of Spanish verb conjugations (inflections)

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

speak       dance       buy
hablar      bailar      comprar
hablo       bailo       compro
hablamos    bailamos    compramos
hablan      bailan      compran
…           …           …

•  A proposed paradigm (correct): stems {habl, bail, compr} and suffixes {-ar, -o, -amos, -an}
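To make the notion of a candidate paradigm concrete, here is a deliberately naive sketch: it enumerates every (stem, suffix) split of every word type, groups stems that take exactly the same suffix set, and ranks groups by stems × suffixes. The ranking criterion and the function name are assumptions for illustration; this is not ParaMor's actual search procedure.

```python
from collections import defaultdict


def candidate_paradigms(words, min_suffixes=2):
    """Propose (stems, suffixes) candidates from a list of word types."""
    # Every split of every word contributes one suffix to its stem.
    suffixes_of = defaultdict(set)
    for word in set(words):
        for cut in range(1, len(word)):
            suffixes_of[word[:cut]].add(word[cut:])
    # Group stems that take exactly the same set of suffixes.
    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if len(suffixes) >= min_suffixes:
            stems_of[frozenset(suffixes)].add(stem)
    # Rank by how many word types the candidate covers (stems x suffixes).
    ranked = sorted(stems_of.items(),
                    key=lambda item: len(item[0]) * len(item[1]),
                    reverse=True)
    return [(stems, set(suffixes)) for suffixes, stems in ranked]
```

On the twelve Spanish forms in the table, the top-ranked candidate is stems {habl, bail, compr} with suffixes {ar, o, amos, an}. The same enumeration also produces lower-ranked candidates with wrong segmentations, such as stems {hab, bai} with suffixes {lar, lo, lamos, lan}.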

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

•  Two subsequent stages:
– Filtering out spurious paradigms (e.g. with incorrect segmentations)
– Merging partial paradigms to overcome sparsity: smoothing

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

speak       dance
hablar      bailar
hablo       bailo
hablamos    bailamos
hablan      bailan
…           …

•  For certain subsets of verbs, the algorithm may propose paradigms with spurious segmentations, like the one at left
•  The filtering stage of the algorithm weeds out these incorrect paradigms

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

•  What if not all conjugations were in the corpus?

speak       dance       buy
hablar      bailar      comprar
(unseen)    bailo       compro
hablamos    bailamos    compramos
hablan      (unseen)    (unseen)
…           …           …

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

•  Another stage of the algorithm merges these overlapping partial paradigms via clustering

speak       dance       buy
hablar      bailar      comprar
(unseen)    bailo       compro
hablamos    bailamos    compramos
hablan      (unseen)    (unseen)
…           …           …

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

speak       dance       buy
hablar      bailar      comprar
hablo       bailo       compro
hablamos    bailamos    compramos
hablan      bailan      compran
…           …           …

•  This amounts to smoothing, or "hallucinating" out-of-vocabulary items
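The effect of merging can be illustrated with a toy sketch: take the union of overlapping partial paradigms, then list the stem + suffix combinations that were never observed. The plain union used here is an assumption standing in for ParaMor's clustering step, and both function names are hypothetical.

```python
def merge_partial(paradigms):
    """Union the stems and suffixes of partial paradigms (a stand-in for clustering)."""
    stems, suffixes = set(), set()
    for part_stems, part_suffixes in paradigms:
        stems |= set(part_stems)
        suffixes |= set(part_suffixes)
    return stems, suffixes


def hallucinate(stems, suffixes, observed):
    """Return the word forms the merged paradigm predicts but the corpus lacks."""
    full = {stem + suffix for stem in stems for suffix in suffixes}
    return sorted(full - set(observed))
```

With the partial paradigms ({habl}, {ar, amos, an}) and ({bail, compr}, {ar, o, amos}) and the nine forms observed in the incomplete table, the merged paradigm predicts the three missing forms: bailan, compran, and hablo.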

ParaMor: Morphological paradigms
Monson et al. (2007, 2008)

•  Heuristic-based, deterministic algorithm can learn inflectional paradigms from raw text
•  Currently, ParaMor assumes suffix-based morphology
•  Paradigms can be used straightforwardly to predict segmentations
– Combining the outputs of ParaMor and Morfessor (another system) won the segmentation task at Morpho Challenge 2008 for every language: English, Arabic, Turkish, German, and Finnish
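One simple way a learned paradigm could be turned into a segmenter, assuming suffix-based morphology as the slide states: split a word wherever a known stem is followed by a known suffix from the same paradigm. This is an illustrative sketch with a hypothetical function name, not the procedure used in the Morpho Challenge submission.

```python
def segment(word, paradigms):
    """Split `word` into [stem, suffix] using the first paradigm that licenses a split.

    `paradigms` is a list of (stems, suffixes) pairs. Words that no paradigm
    covers are returned unsegmented.
    """
    for stems, suffixes in paradigms:
        for cut in range(1, len(word)):
            stem, suffix = word[:cut], word[cut:]
            if stem in stems and suffix in suffixes:
                return [stem, suffix]
    return [word]
```

Given the paradigm ({habl, bail, compr}, {ar, o, amos, an}), `segment("hablamos", ...)` yields `["habl", "amos"]`, while a word outside the paradigm comes back whole.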

Bayesian word segmentation
Goldwater et al. (2006; in submission)

•  Word segmentation results – comparison

[Table: comparison including Goldwater et al. Unigram DP and Goldwater et al. Bigram HDP; values not preserved in the transcript]

•  See Narges & Andreas's presentation for more on this model

table from Goldwater et al. (in submission)

Multilingual morpheme segmentation
Snyder & Barzilay (2008)

speak (ES)    speak (FR)
hablar        parler
hablo         parle
hablamos      parlons
hablan        parlent
…             …

•  Abstract morphemes cross languages: (ar, er), (o, e), (amos, ons), (an, ent), (habl, parl)
•  Considers parallel phrases and tries to find morpheme correspondences
•  Stray morphemes don't correspond across languages

Morphology Papers: Inputs & Outputs

•  What does "unsupervised" mean for each approach?


Bilingual lexicons from monolingual corpora
Haghighi et al. (2008)

[Diagram: Source Text yields Source Words s (state, world, name, nation); Target Text yields Target Words t (estado, política, mundo, nombre); a Matching m links source words to target words]

diagram courtesy Haghighi et al.

Used a variant of CCA (Canonical Correlation Analysis)

Bilingual Lexicons from Monolingual Corpora
Haghighi et al. (2008)

Data Representation

Source Text: state
    Orthographic Features: #st 1.0, tat 1.0, te# 1.0
    Context Features: world, politics, society (values 5.0, 20.0, 10.0; pairing not preserved in the transcript)

Target Text: estado
    Orthographic Features: #es 1.0, sta 1.0, do# 1.0
    Context Features: mundo, politica, sociedad (values 10.0, 17.0, 6.0; pairing not preserved in the transcript)

slide courtesy Haghighi et al.
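The two feature families in this representation could be computed roughly as below. The orthographic features are character trigrams over the word padded with "#" boundary marks, matching the "#st", "tat", "te#" examples; the context window size and both function names are assumptions made for the sketch.

```python
from collections import Counter


def orthographic_features(word, n=3):
    """Character n-grams of the word padded with '#' boundary markers."""
    padded = "#" + word + "#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def context_features(word, sentences, window=2):
    """Counts of words occurring within `window` positions of `word`."""
    counts = Counter()
    for sent in sentences:
        for i, token in enumerate(sent):
            if token == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(t for j, t in enumerate(sent[lo:hi], start=lo) if j != i)
    return counts
```

Under this scheme "state" and "estado" share the trigram "sta", which is the kind of orthographic overlap that can help match cognates across languages.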

Feature Experiments

Precision (axis 0 to 100)
EditDist    61.1
Ortho       80.1
Context     80.2
MCCA        89.0

•  MCCA: Orthographic and context features

4k EN-ES Wikipedia Articles
slide courtesy Haghighi et al.

Narrative events
Chambers & Jurafsky (2008)

•  Given a corpus, identifies related events that constitute a "narrative" and (when possible) predicts their typical temporal ordering
– E.g.: CRIMINAL PROSECUTION narrative, with verbs: arrest, accuse, plead, testify, acquit/convict
•  Key insight: related events tend to share a participant in a document
– The common participant may fill different syntactic/semantic roles with respect to verbs: arrest.OBJECT, accuse.OBJECT, plead.SUBJECT
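The shared-participant insight suggests a simple counting scheme, sketched here under strong assumptions: each document has already been parsed and coreference-resolved into hypothetical `(verb, role, entity_id)` triples, and two events count as related whenever the same entity fills a role for both. Turning these counts into an association score is left out.

```python
from collections import Counter
from itertools import combinations


def count_event_pairs(documents):
    """Count pairs of (verb, role) events that share a participant within a document.

    `documents` is a list of documents; each document is a list of
    (verb, role, entity_id) triples (an assumed preprocessing output).
    """
    pair_counts = Counter()
    for doc in documents:
        events_by_entity = {}
        for verb, role, entity in doc:
            events_by_entity.setdefault(entity, set()).add((verb, role))
        for events in events_by_entity.values():
            # Sort so each unordered pair is always stored under the same key.
            for a, b in combinations(sorted(events), 2):
                pair_counts[(a, b)] += 1
    return pair_counts
```

In a document where one entity is the object of "arrest" and "accuse" and the subject of "plead", all three pairings of those events are counted once.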

Narrative events
Chambers & Jurafsky (2008)

•  A temporal classifier can reconstruct pairwise canonical event orderings, producing a directed graph for each narrative
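Once pairwise "before" decisions are available, assembling them into a directed graph and reading off one consistent ordering is a topological sort. In this sketch the pairs are hypothetical classifier outputs, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def order_events(before_pairs):
    """Build a directed graph from (earlier, later) pairs and return one consistent order.

    Raises graphlib.CycleError if the pairwise decisions contradict each other.
    """
    predecessors = {}
    for earlier, later in before_pairs:
        predecessors.setdefault(later, set()).add(earlier)
        predecessors.setdefault(earlier, set())
    return list(TopologicalSorter(predecessors).static_order())
```

For a chain of decisions such as arrest before accuse, accuse before plead, plead before testify, and testify before convict, the result lists the events in that order. When the pairwise decisions only partially constrain the order, the sort returns one of several valid orderings.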

Statistical verb lexicon
Grenager & Manning (2006)

•  From dependency parses, a generative model predicts for each verb:
– PropBank-style semantic roles: ARG0, ARG1, etc. (do not necessarily correspond across verbs)
– The roles' syntactic realizations, e.g.:

    He gave me a cookie
    subj ARG0    verb give    np#1 ARG2    np#2 ARG1

    He gave a cookie to me
    subj ARG0    verb give    np#2 ARG1    pp_to ARG2

•  Used for semantic role labeling

"Semanticity": Our proposed scale of semantic richness

•  text < POS < syntax/morphology/alignments < coreference/semantic roles/temporal ordering < translations/narrative event sequences
•  We score each model's inputs and outputs on this scale, and call the input-to-output increase "semantic gain"
– Haghighi et al.'s bilingual lexicon induction wins in this respect, going from raw text to lexical translations
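The scale and the gain computation can be written down directly. Two assumptions are made here that the slide does not state: the five tiers are numbered 0 through 4, and a model with several kinds of input or output is scored by its richest one.

```python
# Tiers of the semanticity scale, from least to most semantically rich.
SCALE = [
    {"text"},
    {"POS"},
    {"syntax", "morphology", "alignments"},
    {"coreference", "semantic roles", "temporal ordering"},
    {"translations", "narrative event sequences"},
]


def level(kind):
    """Return the tier index (0-4, an assumed numbering) of a representation kind."""
    for index, tier in enumerate(SCALE):
        if kind in tier:
            return index
    raise ValueError("not on the scale: %r" % kind)


def semantic_gain(inputs, outputs):
    """Input-to-output increase on the scale, scoring each side by its richest item."""
    return max(level(k) for k in outputs) - max(level(k) for k in inputs)
```

Under this numbering, going from raw text to lexical translations gives `semantic_gain(["text"], ["translations"])` equal to 4, the largest gain the scale allows.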

Semantic Gain: Comparison of Methods

Robustness to language variation

•  About half of the papers we examined had English-only evaluations
•  We considered which techniques were most adaptable to other (esp. resource-poor) languages. Two main factors:
– Reliance on existing tools/resources for preprocessing (parsers, coreference resolvers, …)
– Any linguistic specificity in the model (e.g. suffix-based morphology)

Summary

We examined three areas of unsupervised NLP:

1.   Sequence tagging: How can we predict POS (or topic) tags for words in sequence?
2.   Morphology: How are words put together from morphemes (and how can we break them apart)?
3.   Lexical resources: How can we identify lexical translations, semantic roles and argument frames, or narrative event sequences from text?

In eight recent papers we found a variety of approaches, including heuristic algorithms, Bayesian methods, and EM-style techniques.

Thanks to Noah and Kevin for their feedback on the paper; Andreas and Narges for their collaboration on the presentations; and all of you for giving us your attention!

Questions?

[Recap graphics: un-supervise-d learn-ing; hablar/bailar, hablo/bailo, hablamos/bailamos, hablan/bailan; subj=give.ARG0 verb=give np#1=give.ARG2 np#2=give.ARG1; prototype list (Target Label, Prototypes)]

Improvement Ideas

•  POS Tagging: Learn the tag set
•  Morphology: Non-agglomerative Morphology, Also parses
•  Lexical Resources: Try word classes
•  All: Language variability