Page 1:

Unsupervised Approaches to Sequence Tagging, Morphology Induction, and

Lexical Resource Acquisition

Reza Bosaghzadeh & Nathan Schneider

LS2 ~ 1 December 2008

Page 2:

Unsupervised Approaches to Morphology

• Morphology refers to the internal structure of words
  – A morpheme is a minimal meaningful linguistic unit
  – Morpheme segmentation is the process of dividing words into their component morphemes

      unsupervised learning

Page 3:

Unsupervised Approaches to Morphology

• Morphology refers to the internal structure of words
  – A morpheme is a minimal meaningful linguistic unit
  – Morpheme segmentation is the process of dividing words into their component morphemes

      un-supervise-d learn-ing

  – Word segmentation is the process of finding word boundaries in a stream of speech or text

      unsupervisedlearningofnaturallanguage

Page 4:

Unsupervised Approaches to Morphology

• Morphology refers to the internal structure of words
  – A morpheme is a minimal meaningful linguistic unit
  – Morpheme segmentation is the process of dividing words into their component morphemes

      un-supervise-d learn-ing

  – Word segmentation is the process of finding word boundaries in a stream of speech or text

      unsupervised learning of natural language
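
As an aside on the word segmentation task, here is a minimal dictionary-driven sketch (ours, not any surveyed model's; a real unsupervised segmenter must also learn the vocabulary rather than being handed it):

```python
def segment(text, vocab):
    """Recover word boundaries in an unsegmented character stream by
    longest-prefix matching with backtracking against a known vocab.
    Returns None if no complete segmentation exists."""
    if not text:
        return []
    for end in range(len(text), 0, -1):       # prefer the longest prefix
        if text[:end] in vocab:
            rest = segment(text[end:], vocab)
            if rest is not None:
                return [text[:end]] + rest
    return None

vocab = {"unsupervised", "learning", "of", "natural", "language"}
print(segment("unsupervisedlearningofnaturallanguage", vocab))
# ['unsupervised', 'learning', 'of', 'natural', 'language']
```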

Page 5:

ParaMor: Monson et al. (2007, 2008)

• Learns inflectional paradigms from raw text
  – Requires only the vocabulary of a corpus
  – Looks at word counts of substrings, and proposes (stem, suffix) pairings based on type frequency
• 3-stage algorithm
  – Stage 1: Candidate paradigms based on frequencies (sketched below)
  – Stages 2-3: Refinement of the paradigm set via merging and filtering
• Paradigms can be used for morpheme segmentation or stemming
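
A toy sketch of the stage-1 intuition, with invented details (the tiny vocabulary and the min_stems threshold); ParaMor's actual candidate search is greedier and frequency-driven:

```python
from collections import defaultdict

def candidate_suffixes(vocab, min_stems=2):
    """Try every (stem, suffix) split of every word and keep suffixes
    that attach to several distinct stems (type frequency)."""
    stems_of = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word)):
            stems_of[word[i:]].add(word[:i])   # suffix -> stems seen with it
    return {suf: stems for suf, stems in stems_of.items()
            if len(stems) >= min_stems}

vocab = ["hablar", "hablo", "hablamos", "bailar", "bailo", "bailamos"]
for suf, stems in sorted(candidate_suffixes(vocab).items()):
    print(f"-{suf}: {sorted(stems)}")
# -ar, -o, and -amos each attach to both 'habl' and 'bail', but so do
# spurious candidates like -lar/-lo/-lamos with 'hab'/'bai', which is
# exactly what stages 2-3 must merge and filter away.
```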

Page 6:

ParaMor: Monson et al. (2007, 2008)

  speak      dance
  hablar     bailar
  hablo      bailo
  hablamos   bailamos
  hablan     bailan
  …          …

• A sampling of Spanish verb conjugations

Page 7:

ParaMor: Monson et al. (2007, 2008)

  speak      dance
  hablar     bailar
  hablo      bailo
  hablamos   bailamos
  hablan     bailan
  …          …

• A proposed paradigm with stems {habl, bail} and suffixes {-ar, -o, -amos, -an}

Page 8:

ParaMor: Monson et al. (2007, 2008)

  speak      dance      buy
  hablar     bailar     comprar
  hablo      bailo      compro
  hablamos   bailamos   compramos
  hablan     bailan     compran
  …          …          …

• Same paradigm as on the previous slide, but with stems {habl, bail, compr}

Page 9:

ParaMor: Monson et al. (2007, 2008)

  speak      dance
  hablar     bailar
  hablo      bailo
  hablamos   bailamos
  hablan     bailan
  …          …

• From just this list, other paradigm analyses (which happen to be incorrect) are possible

Page 10:

ParaMor: Monson et al. (2007, 2008)

  speak      dance
  hablar     bailar
  hablo      bailo
  hablamos   bailamos
  hablan     bailan
  …          …

• Another possibility: stems {hab, bai}, suffixes {-lar, -lo, -lamos, -lan}

Page 11:

ParaMor: Monson et al. (2007, 2008)

  speak      dance      buy
  hablar     bailar     comprar
  hablo      bailo      compro
  hablamos   bailamos   compramos
  hablan     bailan     compran
  …          …          …

• Spurious segmentations: this paradigm doesn’t generalize to comprar (or most verbs)

Page 12:

ParaMor: Monson et al. (2007, 2008)

• What if not all conjugations were in the corpus?

  speak      dance      buy
  hablar     bailar     comprar
             bailo      compro
  hablamos   bailamos   compramos
  hablan
  …          …          …

Page 13:

ParaMor: Monson et al. (2007, 2008)

• We have two similar paradigms that we want to merge

  speak      dance      buy
  hablar     bailar     comprar
             bailo      compro
  hablamos   bailamos   compramos
  hablan
  …          …          …

Page 14:

ParaMor: Monson et al. (2007, 2008)

  speak      dance      buy
  hablar     bailar     comprar
  hablo      bailo      compro
  hablamos   bailamos   compramos
  hablan     bailan     compran
  …          …          …

• This amounts to smoothing, or “hallucinating” out-of-vocabulary items
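
A toy sketch of that merge-and-smooth idea, with invented specifics (Jaccard similarity and the 0.5 threshold are ours); ParaMor's actual merging criteria differ:

```python
def merge_if_similar(p1, p2, threshold=0.5):
    """Union two (stems, suffixes) candidate paradigms when their
    suffix sets overlap enough. The merged paradigm licenses forms
    never seen in the corpus, i.e. smoothing."""
    s1, s2 = p1["suffixes"], p2["suffixes"]
    if len(s1 & s2) / len(s1 | s2) >= threshold:   # Jaccard similarity
        return {"stems": p1["stems"] | p2["stems"], "suffixes": s1 | s2}
    return None

p1 = {"stems": {"habl"},          "suffixes": {"ar", "amos", "an"}}
p2 = {"stems": {"bail", "compr"}, "suffixes": {"ar", "o", "amos"}}
merged = merge_if_similar(p1, p2)
print(sorted(stem + suf for stem in merged["stems"]
                        for suf in merged["suffixes"]))
# includes 'hablo', 'bailan', and 'compran', none of which were in the corpus
```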

Page 15:

ParaMor: Monson et al. (2007, 2008)

• A heuristic-based, deterministic algorithm can learn inflectional paradigms from raw text
• Paradigms can be used straightforwardly to predict segmentations (see the sketch below)
  – Combining the outputs of ParaMor and Morfessor (another system) won the segmentation task at MorphoChallenge 2008 for every language: English, Arabic, Turkish, German, and Finnish
• Currently, ParaMor assumes suffix-based morphology
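
One straightforward way to turn a learned paradigm into a segmenter (our illustration; ParaMor's exact segmentation procedure may differ): strip the longest suffix whose remaining stem the paradigm also licenses.

```python
def segment_word(word, paradigm):
    """Split word into stem-suffix using a learned paradigm.
    Prefers the longest matching suffix; returns the word unsplit
    if the paradigm doesn't apply."""
    for suffix in sorted(paradigm["suffixes"], key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in paradigm["stems"]:
            return f"{stem}-{suffix}"
    return word

paradigm = {"stems": {"habl", "bail", "compr"},
            "suffixes": {"ar", "o", "amos", "an"}}
print(segment_word("compramos", paradigm))  # compr-amos
print(segment_word("gato", paradigm))       # gato (no analysis)
```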

Page 16:

Goldwater et al. (2006; in submission)

• Word segmentation results – comparison
  [Chart comparing the Goldwater Unigram DP and Goldwater Bigram HDP models]
• See Narges & Andreas’s presentation for more on this model

Page 17:

Multilingual morpheme segmentation: Snyder & Barzilay (2008)

  speak (ES)   speak (FR)
  hablar       parler
  hablo        parle
  hablamos     parlons
  hablan       parlent
  …            …

• Abstract morphemes cross languages: (ar, er), (o, e), (amos, ons), (an, ent), (habl, parl)
• Considers parallel phrases and tries to find morpheme correspondences (sketched below)
• Stray morphemes don’t correspond across languages
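
A much-simplified counting caricature of that correspondence idea (the paper actually uses a hierarchical Bayesian model over parallel phrases; the tie-breaking rule here is our invention):

```python
from collections import Counter

def align_morphemes(parallel_words):
    """Find the (ES stem, FR stem) prefix pair shared by the most
    parallel word pairs (preferring longer stems on ties), then read
    off the corresponding abstract suffix pairs."""
    stem_counts = Counter()
    for es, fr in parallel_words:
        for i in range(1, len(es)):
            for j in range(1, len(fr)):
                stem_counts[(es[:i], fr[:j])] += 1
    es_stem, fr_stem = max(stem_counts, key=lambda p:
                           (stem_counts[p], len(p[0]) + len(p[1])))
    suffixes = [(es[len(es_stem):], fr[len(fr_stem):])
                for es, fr in parallel_words
                if es.startswith(es_stem) and fr.startswith(fr_stem)]
    return (es_stem, fr_stem), suffixes

pairs = [("hablar", "parler"), ("hablo", "parle"),
         ("hablamos", "parlons"), ("hablan", "parlent")]
print(align_morphemes(pairs))
# (('habl', 'parl'), [('ar', 'er'), ('o', 'e'), ('amos', 'ons'), ('an', 'ent')])
```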

Page 18:

Morphology papers: inputs & outputs

Page 19:

Narrative events: Chambers & Jurafsky (2008)

• Given a corpus, identifies related events that constitute a “narrative” and (when possible) predicts their typical temporal ordering
  – E.g.: CRIMINAL PROSECUTION narrative, with verbs: arrest, accuse, plead, testify, acquit/convict
• Key insight: related events tend to share a participant in a document
  – The common participant may fill different syntactic/semantic roles with respect to verbs: arrest.OBJECT, accuse.OBJECT, plead.SUBJECT
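
A minimal sketch of that insight, assuming coreference chains are already given (the paper ranks event pairs by pointwise mutual information; we just count):

```python
from collections import Counter
from itertools import combinations

def narrative_pair_counts(documents):
    """Count how often two (verb, role) event slots share a coreferent
    participant within a document. Each element of `documents` maps an
    entity to the set of (verb, role) slots it fills in that document.
    High-count pairs are candidates for the same narrative."""
    counts = Counter()
    for entity_slots in documents:
        for slots in entity_slots.values():
            for a, b in combinations(sorted(slots), 2):
                counts[(a, b)] += 1
    return counts

doc = {"Smith":  {("arrest", "OBJECT"), ("accuse", "OBJECT"),
                  ("plead", "SUBJECT")},
       "police": {("arrest", "SUBJECT")}}
for pair, n in narrative_pair_counts([doc]).most_common(3):
    print(pair, n)
# e.g. (('accuse', 'OBJECT'), ('arrest', 'OBJECT')) 1
```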

Page 20:

Narrative events: Chambers & Jurafsky (2008)

• A temporal classifier can reconstruct pairwise canonical event orderings, producing a directed graph for each narrative
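
A small sketch of the graph-building step, assuming the pairwise decisions are given; `before(a, b)` below is a hypothetical stand-in for the paper's trained temporal classifier:

```python
def narrative_graph(events, before):
    """Build a directed graph of canonical event order from pairwise
    decisions: an edge a -> b means a canonically precedes b."""
    return {a: [b for b in events if a != b and before(a, b)]
            for a in events}

events = ["arrest", "accuse", "plead", "convict"]
rank = {e: i for i, e in enumerate(events)}          # toy stand-in classifier
graph = narrative_graph(events, lambda a, b: rank[a] < rank[b])
print(graph)
# {'arrest': ['accuse', 'plead', 'convict'], 'accuse': ['plead', 'convict'], ...}
```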

Page 21:

Narrative events: Grenager & Manning (2006)

• From dependency parses, a generative model predicts semantic roles corresponding to each verb’s arguments, as well as their syntactic realizations
  – PropBank-style: ARG0, ARG1, etc. per verb (do not necessarily correspond across verbs)
  – Learned syntactic patterns of the form:
      (subj=give.ARG0, verb=give, np#1=give.ARG2, np#2=give.ARG1) or
      (subj=give.ARG0, verb=give, np#2=give.ARG1, pp_to=give.ARG2)
• Used for semantic role labeling
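
A toy sketch of applying such patterns for role labeling, assuming the patterns are already learned (the verb slot is omitted for brevity, and the real model scores patterns probabilistically rather than taking the first exact match):

```python
def label_roles(observed_slots, patterns):
    """Assign roles by matching an observed argument frame against
    learned syntactic patterns for the verb. `patterns` is a list of
    {syntactic position: role} dicts; the first whose positions match
    the observation wins."""
    for pattern in patterns:
        if set(pattern) == set(observed_slots):
            return {pos: pattern[pos] for pos in observed_slots}
    return None

give_patterns = [
    {"subj": "ARG0", "np#1": "ARG2", "np#2": "ARG1"},   # "gave him a book"
    {"subj": "ARG0", "np#2": "ARG1", "pp_to": "ARG2"},  # "gave a book to him"
]
print(label_roles(["subj", "np#2", "pp_to"], give_patterns))
# {'subj': 'ARG0', 'np#2': 'ARG1', 'pp_to': 'ARG2'}
```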

Page 22:

“Semanticity”: Our proposed scale of semantic richness

• text < POS < syntax/morphology/alignments < coreference/semantic roles/temporal ordering < translations/narrative event sequences

• We score each model’s inputs and outputs on this scale, and call the input-to-output increase “semantic gain” (see the sketch below)
  – Haghighi et al.’s bilingual lexicon induction wins in this respect, going from raw text to lexical translations
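
A trivial rendering of the scale as code, with invented integer ranks, just to make “semantic gain” concrete:

```python
# Invented ranks for the semanticity scale from the bullet above.
SEMANTICITY = {"text": 0, "POS": 1,
               "syntax/morphology/alignments": 2,
               "coreference/semantic roles/temporal ordering": 3,
               "translations/narrative event sequences": 4}

def semantic_gain(model_input, model_output):
    """Input-to-output increase on the semanticity scale."""
    return SEMANTICITY[model_output] - SEMANTICITY[model_input]

# Haghighi et al.: raw text in, lexical translations out.
print(semantic_gain("text", "translations/narrative event sequences"))  # 4
```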

Page 23:

Robustness to language variation

• About half of the papers we examined had English-only evaluations
• We considered which techniques were most adaptable to other (esp. resource-poor) languages. Two main factors:
  – Reliance on existing tools/resources for preprocessing (parsers, coreference resolvers, …)
  – Any linguistic specificity in the model (e.g. suffix-based morphology)

Page 24:

Summary

We examined three areas of unsupervised NLP:

1. Sequence tagging: How can we predict POS (or topic) tags for words in sequence?

2. Morphology: How are words put together from morphemes (and how can we break them apart)?

3. Lexical resources: How can we identify lexical translations, semantic roles and argument frames, or narrative event sequences from text?

In eight recent papers we found a variety of approaches, including heuristic algorithms, Bayesian methods, and EM-style techniques.

Page 25:

Thanks to Noah and Kevin for their feedback on the paper; Andreas and Narges for their collaboration on the presentations; and all of you for giving us your attention!

Questions?
