Page 1: Distributional Cues to Word Boundaries: Context Is Important

Distributional Cues to Word Boundaries: Context Is Important

Sharon Goldwater, Stanford University

Tom Griffiths, UC Berkeley

Mark Johnson, Microsoft Research / Brown University

Page 2: Distributional Cues to Word Boundaries: Context Is Important

Word segmentation

One of the first problems infants must solve when learning language.

Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.

Statistics may provide initial bootstrapping. Used very early (Thiessen & Saffran, 2003). Language-independent.

Page 3: Distributional Cues to Word Boundaries: Context Is Important

Distributional segmentation

Work on distributional segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).

What do TPs have to say about words?

1. A word is a unit whose beginning predicts its end, but it does not predict other words.

Or…

2. A word is a unit whose beginning predicts its end, and it also predicts future words.

Page 4: Distributional Cues to Word Boundaries: Context Is Important

Interpretation of TPs

Most previous work assumes words are statistically independent.

Experimental work: Saffran et al. (1996), many others.

Computational work: Brent (1999).

What about words predicting other words?

tupiro golabu bidaku padoti

golabubidakugolabutupiropadotibidakupadotitupirobidakugolabutupiropadotibidakutupiro…
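
To make the TP idea concrete, here is a minimal Python sketch (an illustration added to this transcript, not the authors' code) that estimates transitional probabilities over the syllable stream above and posits boundaries where the TP drops; the two-letter syllable split and the TP-below-1 boundary rule are assumptions of this toy example.

```python
# Transitional-probability segmentation in the style of Saffran et al. (1996),
# applied to the artificial stream above.
from collections import Counter

stream = ("golabu bidaku golabu tupiro padoti bidaku padoti "
          "tupiro bidaku golabu tupiro padoti bidaku tupiro").replace(" ", "")

# Split the unsegmented stream into its two-letter syllables.
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]

# TP(y | x) = count(x followed by y) / count(x)
pair_counts = Counter(zip(syllables, syllables[1:]))
first_counts = Counter(syllables[:-1])
tp = {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

# Posit a word boundary wherever the TP to the next syllable drops below 1
# (within-word TPs are exactly 1.0 in this four-word toy lexicon).
words, current = [], [syllables[0]]
for prev, syl in zip(syllables, syllables[1:]):
    if tp[(prev, syl)] < 1.0:
        words.append("".join(current))
        current = []
    current.append(syl)
words.append("".join(current))
print(words)  # recovers the word sequence: golabu, bidaku, golabu, tupiro, ...
```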

Page 5: Distributional Cues to Word Boundaries: Context Is Important

Questions

If a learner assumes that words are independent units, what is learned (from more realistic input)?

What if the learner assumes that words are units that help predict other units?

Approach: use a Bayesian “ideal observer” model to examine the consequences of making these different assumptions. What kinds of words are learned?

Page 6: Distributional Cues to Word Boundaries: Context Is Important

Two kinds of models

Unigram model: words are independent.

Generate a sentence by generating each word independently.

[Slide figure: the sentence "look at that" is generated by drawing each word independently from the same distribution, e.g. look .1, that .2, at .4, …]
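
A minimal Python sketch of this generative process (an illustration added here, not the authors' code); the word probabilities are toy values.

```python
# Unigram generation: every word of the sentence is drawn independently from
# the same distribution. The fourth word "doggie" and its probability are an
# assumption added so the toy distribution sums to 1.
import random

unigram = {"look": 0.1, "that": 0.2, "at": 0.4, "doggie": 0.3}

def generate_unigram_sentence(length=3):
    words = list(unigram)
    probs = list(unigram.values())
    # Each word is sampled independently; context plays no role.
    return " ".join(random.choices(words, weights=probs, k=length))

print(generate_unigram_sentence())  # e.g. "at at that"
```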

Page 7: Distributional Cues to Word Boundaries: Context Is Important

Two kinds of models

Bigram model: words predict other words.

Generate a sentence by generating each word, conditioned on the previous word.

[Slide figure: the sentence "look at that" is generated by drawing each word from a distribution that depends on the previous word, e.g. look .1, that .3, at .5, …; look .4, that .2, at .1, …; look .1, that .5, at .1, …]
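
A matching Python sketch of bigram generation (again an illustration, with made-up probabilities and an assumed "<s>" start symbol).

```python
# Bigram generation: each word is drawn from a distribution conditioned on
# the previous word.
import random

bigram = {
    "<s>":  {"look": 0.5, "that": 0.2, "at": 0.3},
    "look": {"at": 0.7, "that": 0.3},
    "at":   {"that": 0.8, "look": 0.2},
    "that": {"look": 0.4, "at": 0.6},
}

def generate_bigram_sentence(length=3):
    prev, words = "<s>", []
    for _ in range(length):
        choices = bigram[prev]
        # The distribution over the next word depends on the previous word.
        prev = random.choices(list(choices), weights=list(choices.values()))[0]
        words.append(prev)
    return " ".join(words)

print(generate_bigram_sentence())  # e.g. "look at that"
```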

Page 8: Distributional Cues to Word Boundaries: Context Is Important

Bayesian learning

The Bayesian learner seeks to identify an explanatory linguistic hypothesis that accounts for the observed data and conforms to prior expectations.

Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Page 9: Distributional Cues to Word Boundaries: Context Is Important

Bayesian segmentation

In the domain of segmentation, we have:

Data: unsegmented corpus (transcriptions).

Hypotheses: sequences of word tokens.

P(h | d) ∝ P(d | h) P(h)

Likelihood P(d | h): = 1 if concatenating the hypothesized words forms the corpus, = 0 otherwise.

Prior P(h): encodes unigram or bigram assumption (also others).

So the optimal solution is the segmentation, among those consistent with the corpus, with the highest prior probability.
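
To make this concrete, here is a small Python sketch (an illustration, not the authors' implementation) that enumerates the segmentations of one short utterance and scores each as likelihood times prior; the toy prior used here only penalizes total length and merely stands in for the real model.

```python
# Ideal-observer scoring of candidate segmentations for one utterance.
from itertools import combinations

def segmentations(corpus):
    """Yield every way of splitting the string into a sequence of words."""
    n = len(corpus)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [corpus[i:j] for i, j in zip(bounds, bounds[1:])]

def likelihood(words, corpus):
    # P(d | h): 1 if concatenating the hypothesized words forms the corpus.
    return 1.0 if "".join(words) == corpus else 0.0

def toy_prior(words):
    return 2.0 ** -sum(len(w) + 1 for w in words)

corpus = "lookatthat"
best = max(segmentations(corpus),
           key=lambda ws: likelihood(ws, corpus) * toy_prior(ws))
# Under this toy prior the unsegmented utterance wins; the GGJ priors
# (Pages 11-12 and 19) instead trade off lexicon size, word frequency,
# and word length across the whole corpus.
print(best)
```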

Page 10: Distributional Cues to Word Boundaries: Context Is Important

Brent (1999)

Describes a Bayesian unigram model for segmentation. Prior favors solutions with fewer words, shorter words.

Problems with Brent's system:

Learning algorithm is approximate (non-optimal).

Difficult to extend to incorporate bigram info.

Page 11: Distributional Cues to Word Boundaries: Context Is Important

A new unigram model

Assumes word wi is generated as follows:

1. Is wi a novel lexical item?

P(yes) = α / (n + α)

P(no) = n / (n + α)

(n = number of word tokens generated so far; α = the model's novelty parameter.)

Fewer word types = Higher probability

Page 12: Distributional Cues to Word Boundaries: Context Is Important

A new unigram model

Assume word wi is generated as follows:

2. If novel, generate phonemic form x1…xm :

If not, choose lexical identity of wi from previously occurring words:

If novel:  P(wi = x1…xm) = P(x1) × P(x2) × … × P(xm)

If not novel:  P(wi = ℓ) = count(ℓ) / n

Shorter words = Higher probability

Power law = Higher probability
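
The following Python sketch (an assumption of this write-up, not the GGJ code) samples a toy corpus from this unigram generative process, showing how the novelty decision, phoneme-by-phoneme generation of new words, and count-proportional reuse fit together; all parameter values are illustrative.

```python
# Unigram generative process of Pages 11-12: each word is either a novel
# phoneme string or a copy of a previously generated word, chosen in
# proportion to its count. ALPHA, STOP, and the phoneme set are toy values.
import random
from collections import Counter

ALPHA = 0.5                  # novelty parameter
PHONEMES = list("abdgiktu")  # toy phoneme inventory
STOP = 0.4                   # probability of ending a novel word

def generate_novel_word():
    # Each extra phoneme costs a factor, so shorter words are more probable.
    word = random.choice(PHONEMES)
    while random.random() > STOP:
        word += random.choice(PHONEMES)
    return word

def generate_corpus(num_tokens=20):
    tokens, counts, n = [], Counter(), 0
    for _ in range(num_tokens):
        if random.random() < ALPHA / (n + ALPHA):   # P(novel) = alpha / (n + alpha)
            w = generate_novel_word()
        else:
            # P(wi = l) = count(l) / n: frequent words get reused more,
            # which produces the power-law behaviour noted on Page 12.
            w = random.choices(list(counts), weights=list(counts.values()))[0]
        tokens.append(w)
        counts[w] += 1
        n += 1
    return tokens

print(generate_corpus())
```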

Page 13: Distributional Cues to Word Boundaries: Context Is Important

Advantages of our model

           Unigram?   Bigram?   Algorithm?
Brent      Yes        No        Approximate (non-optimal)
GGJ        Yes        Yes       Searches for the highest-probability segmentation

Page 14: Distributional Cues to Word Boundaries: Context Is Important

Unigram model: simulations

Same corpus as Brent: 9790 utterances of phonemically transcribed child-directed speech (19-23 months).

Average utterance length: 3.4 words.

Average word length: 2.9 phonemes.

Example input: yuwanttusiD6bUklUkD*z6b7wIThIzh&t&nd6dOgiyuwanttulUk&tDIs...

Page 15: Distributional Cues to Word Boundaries: Context Is Important

Example results

Page 16: Distributional Cues to Word Boundaries: Context Is Important

Comparison to previous results

Proposed boundaries are more accurate than Brent’s, but fewer proposals are made.

Result: word tokens are less accurate.

         Boundary Precision   Boundary Recall
Brent    .80                  .85
GGJ      .92                  .62

         Token F-score
Brent    .68
GGJ      .54

Precision: #correct / #found

Recall: #correct / #true

F-score: the harmonic mean of precision and recall.
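
For reference, a small Python helper (illustrative, not from the paper) that computes these scores from sets of proposed and true boundary positions.

```python
# Boundary precision, recall, and F-score as used in the tables above.
def boundary_scores(proposed: set, true: set):
    correct = len(proposed & true)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Example: boundary positions given as character indices in one utterance.
print(boundary_scores(proposed={4, 6}, true={4, 6, 10}))
```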

Page 17: Distributional Cues to Word Boundaries: Context Is Important

What happened?

Model assumes (falsely) that words have the same probability regardless of context.

Positing amalgams allows the model to capture word-to-word dependencies.

P(D&t) = .024    P(D&t | WAts) = .46    P(D&t | tu) = .0019    (D&t = "that", WAts = "what's", tu = "to")

Page 18: Distributional Cues to Word Boundaries: Context Is Important

What about other unigram models?

Brent’s learning algorithm is insufficient to identify the optimal segmentation.

Our solution has higher probability under his model than his own solution does.

On a randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%.

Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Page 19: Distributional Cues to Word Boundaries: Context Is Important

Bigram model

Assume word wi is generated as follows:

1. Is (wi-1,wi) a novel bigram?

2. If novel, generate wi using unigram model.

If not, choose lexical identity of wi from words previously occurring after wi-1.

For step 1:

P(yes) = β / (n(wi-1) + β)

P(no) = n(wi-1) / (n(wi-1) + β)

For step 2, when the bigram is not novel:

P(wi = ℓ | wi-1 = ℓ′) = count(ℓ′, ℓ) / count(ℓ′)

(n(wi-1) = number of bigrams generated so far whose first word is wi-1; β = the bigram model's novelty parameter; count(ℓ′, ℓ) = number of times ℓ has followed ℓ′.)
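
As with the unigram model, a short Python sketch (an assumption, not the GGJ code) of this bigram generative process; the parameter values and the stand-in unigram generator are illustrative.

```python
# Bigram generative process of Page 19: a word is either the second element
# of a novel bigram (generated by the unigram model) or a copy of a word
# previously seen after the same left context.
import random
from collections import Counter, defaultdict

BETA = 0.5
PHONEMES = list("abdgiktu")

def generate_unigram_word():
    # Placeholder for the unigram model of Pages 11-12: a short random string.
    return "".join(random.choices(PHONEMES, k=random.randint(1, 4)))

def generate_bigram_corpus(num_tokens=20):
    tokens, prev = [], "<s>"
    followers = defaultdict(Counter)   # followers[l'][l] = count(l', l)
    for _ in range(num_tokens):
        n_prev = sum(followers[prev].values())
        if random.random() < BETA / (n_prev + BETA):   # novel bigram
            w = generate_unigram_word()
        else:
            # P(wi = l | wi-1 = l') = count(l', l) / count(l')
            opts = followers[prev]
            w = random.choices(list(opts), weights=list(opts.values()))[0]
        followers[prev][w] += 1
        tokens.append(w)
        prev = w
    return tokens

print(generate_bigram_corpus())
```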

Page 20: Distributional Cues to Word Boundaries: Context Is Important

Example results

Page 21: Distributional Cues to Word Boundaries: Context Is Important

Quantitative evaluation

Compared to unigram model, more boundaries are proposed, with no loss in accuracy:

Accuracy is higher than previous models:

                  Boundary Precision   Boundary Recall
GGJ (unigram)     .92                  .62
GGJ (bigram)      .92                  .84

                  Token F-score   Type F-score
Brent (unigram)   .68             .52
GGJ (bigram)      .77             .63

Page 22: Distributional Cues to Word Boundaries: Context Is Important

Conclusion

Different assumptions about what defines a word lead to different segmentations.

Beginning of word predicts end of word: optimal solution undersegments, finding common multi-word units.

Word also predicts next word: segmentation is more accurate, adult-like.

Important to consider how transitional probabilities and other statistics are used.

Page 23: Distributional Cues to Word Boundaries: Context Is Important
Page 24: Distributional Cues to Word Boundaries: Context Is Important

Constraints on learning

Algorithms can impose implicit constraints.

Implication: the learning process prevents the learner from identifying the best solutions.

Specifics of the algorithm are critical, but their effects are hard to determine.

Prior imposes explicit constraints.

States general expectations about the nature of language.

Assumes humans are good at learning.

Page 25: Distributional Cues to Word Boundaries: Context Is Important

Algorithmic constraints

Venkataraman (2001) and Batchelder (2002) describe unigram model-based approaches to segmentation, with no prior.

Venkataraman's algorithm penalizes novel words.

Batchelder's algorithm penalizes long words.

Without algorithmic constraints, these models would memorize every utterance whole (insert no word boundaries).

Page 26: Distributional Cues to Word Boundaries: Context Is Important

Remaining questions

Are multi-word chunks sufficient as an initial bootstrapping step in humans?

(cf. Swingley, 2005)

Do children go through a stage with many chunks like these?

(cf. MacWhinney, ??)

Are humans able to segment based on bigram statistics?

Page 27: Distributional Cues to Word Boundaries: Context Is Important