Page 1

A Bayesian approach to word segmentation: Theoretical and experimental results

Sharon Goldwater

Department of Linguistics

Stanford University

Page 2

Word segmentation

One of the first problems infants must solve when learning language.

Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.

Statistics may provide initial bootstrapping. Used very early (Thiessen & Saffran, 2003). Language-independent.

Page 3

Modeling statistical segmentation

Previous work often focuses on how statistical information (e.g., transitional probabilities) can be used to segment speech.

Bayesian approach asks what information should be used by a successful learner.

What statistics should be collected? What assumptions (by the learner) constrain possible generalizations?

Page 4

Outline

1. Computational model and theoretical results: What are the consequences of using different sorts of information for optimal word segmentation? (joint work with Tom Griffiths and Mark Johnson)

2. Modeling experimental data: Do humans behave optimally? (joint work with Mike Frank, Vikash Mansinghka, Tom Griffiths, and Josh Tenenbaum)

Page 5

Statistical segmentation

Work on statistical segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).

P(syli | syli-1) is often lower at word boundaries.

What do TPs have to say about words?

A word is a unit whose beginning predicts its end, but it does not predict other words.

Or…

A word is a unit whose beginning predicts its end, and it also predicts future words.

Page 6

Interpretation of TPs

Most previous work assumes words are statistically independent. Experimental work: Saffran et al. (1996), many others.

Computational work: Brent (1999).

What about words predicting other words?

tupiro golabu bidaku padoti

golabubidakugolabutupiropadotibidakupadotitupi…
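To make the TP idea concrete, here is a minimal Python sketch (not from the talk) that estimates P(syli | syli-1) from a toy syllable stream built from the lexicon above; the two-character syllabification and the helper names are illustrative assumptions.

```python
from collections import Counter

def syllabify(stream, syllable_len=2):
    """Split a stream of CV syllables into fixed-length chunks (illustrative assumption)."""
    return [stream[i:i + syllable_len] for i in range(0, len(stream), syllable_len)]

# A short toy stream built from the lexicon above, concatenated with no pauses.
stream = "".join(["golabu", "bidaku", "golabu", "tupiro", "padoti", "bidaku", "padoti", "tupiro"])
syllables = syllabify(stream)

# Count adjacent-syllable bigrams and the unigrams that condition them.
bigram_counts = Counter(zip(syllables, syllables[1:]))
unigram_counts = Counter(syllables[:-1])

def transitional_prob(s1, s2):
    """Estimate P(syl_i = s2 | syl_i-1 = s1) from the stream."""
    return bigram_counts[(s1, s2)] / unigram_counts[s1]

# Within-word transitions (e.g., 'go' -> 'la') have higher TP than
# across-word transitions (e.g., 'bu' -> 'bi'), so TPs dip at word boundaries.
print(transitional_prob("go", "la"))  # within-word: 1.0 in this toy stream
print(transitional_prob("bu", "bi"))  # across words: 0.5 here
```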

Page 7

Questions

If a learner assumes that words are independent units, what is learned (from more realistic input)? Unigram model: generate each word independently.

What if the learner assumes that words are units that help predict other units? Bigram model: generate each word conditioned on the previous word.

Approach: use a Bayesian ideal observer model to examine the consequences of each assumption.

Page 8

Bayesian learning

The Bayesian learner seeks to identify an explanatory linguistic hypothesis that accounts for the observed data and conforms to prior expectations.

Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Page 9

Bayesian segmentation

In the domain of segmentation, we have:
Data: unsegmented corpus (transcriptions).
Hypotheses: sequences of word tokens.

The optimal solution is the segmentation with the highest posterior probability (equivalently, since the likelihood is 0 or 1, the consistent segmentation with the highest prior probability):

P(h | d) ∝ P(d | h) P(h)

Likelihood P(d | h): = 1 if concatenating the words forms the corpus, = 0 otherwise.
Prior P(h): encodes the unigram or bigram assumption (also others).
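As a rough illustration of this setup (a sketch under simplifying assumptions, not the authors' implementation), the following Python scores a candidate segmentation: the likelihood is 1 only if the hypothesized words concatenate back to the corpus, and a stand-in unigram score plays the role of the prior P(h).

```python
import math
from collections import Counter

def log_posterior(hypothesis, corpus):
    """Score a segmentation hypothesis: log P(d|h) + log P(h).

    hypothesis: list of word tokens; corpus: the unsegmented string.
    P(d|h) = 1 if the words concatenate to the corpus, else 0.
    The "prior" here is a stand-in unigram score (relative frequencies);
    the models in the talk use Dirichlet-process priors instead.
    """
    if "".join(hypothesis) != corpus:
        return float("-inf")  # P(d|h) = 0: inconsistent hypothesis
    counts = Counter(hypothesis)
    n = len(hypothesis)
    return sum(c * math.log(c / n) for c in counts.values())

corpus = "whatsthatthedoggiewheresthedoggie"
h1 = ["whats", "that", "the", "doggie", "wheres", "the", "doggie"]
h2 = ["whatsthatthedoggie", "wheresthedoggie"]
print(log_posterior(h1, corpus), log_posterior(h2, corpus))
```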

Page 10

Brent (1999)

Describes a Bayesian unigram model for segmentation. Prior favors solutions with fewer words, shorter words.

Problems with Brent’s system: Learning algorithm is approximate (non-optimal). Difficult to extend to incorporate bigram info.

Page 11

A new unigram model (Dirichlet process)

Assume word wi is generated as follows:

1. Is wi a novel lexical item?

P(yes) = α / (n + α)
P(no) = n / (n + α)

(n = number of word tokens generated so far; α = a parameter of the model)

Fewer word types = Higher probability

Page 12

A new unigram model (Dirichlet process)

Assume word wi is generated as follows:

2. If novel, generate phonemic form x1…xm:

P(wi = x1…xm) = P(x1) P(x2) … P(xm)

Shorter words = Higher probability

If not, choose lexical identity of wi from previously occurring words:

P(wi = l) = count(l) / n

Power law = Higher probability
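A minimal Python sketch of this two-step generative process (illustrative only: the phoneme inventory, stop probability, and α below are made-up placeholders, not the authors' settings):

```python
import random
from collections import Counter

def generate_word(lexicon_counts, n, alpha=1.0, phonemes="abdgiktuw", stop_prob=0.3):
    """Generate word wi under the Dirichlet-process unigram model (sketch).

    With probability alpha / (n + alpha), create a novel word by sampling
    phonemes one at a time (so shorter words are more probable); otherwise
    reuse a previously generated word with probability count(l) / n.
    """
    if random.random() < alpha / (n + alpha):
        word = random.choice(phonemes)
        while random.random() > stop_prob:
            word += random.choice(phonemes)
        return word
    words, counts = zip(*lexicon_counts.items())
    return random.choices(words, weights=counts)[0]  # rich-get-richer reuse

# Generate a toy corpus of 20 word tokens.
counts, n, tokens = Counter(), 0, []
for _ in range(20):
    w = generate_word(counts, n)
    counts[w] += 1
    n += 1
    tokens.append(w)
print(tokens)
```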

Page 13

Unigram model: simulations

Same corpus as Brent (Bernstein-Ratner, 1987): 9790 utterances of phonemically transcribed child-directed speech (19-23 months). Average utterance length: 3.4 words. Average word length: 2.9 phonemes.

Example input: yuwanttusiD6bUklUkD*z6b7wIThIzh&t&nd6dOgiyuwanttulUk&tDIs...

Page 14

Example results

Page 15

Comparison to previous results

Proposed boundaries are more accurate than Brent’s, but fewer proposals are made.

Result: word tokens are less accurate.

        Boundary Precision   Boundary Recall
Brent   .80                  .85
GGJ     .92                  .62

        Token F-score
Brent   .68
GGJ     .54

Precision: #correct / #found = hits / (hits + false alarms)
Recall: #correct / #true = hits / (hits + misses)
F-score: the harmonic mean of precision and recall.
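For concreteness, a small Python sketch (not from the talk) that computes boundary precision, recall, and F-score by comparing proposed and true boundary positions:

```python
def boundary_positions(words):
    """Return the set of word-internal boundary positions (character offsets)."""
    positions, offset = set(), 0
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions

def boundary_scores(proposed, gold):
    """Boundary precision, recall, and F-score (harmonic mean)."""
    p_bounds, g_bounds = boundary_positions(proposed), boundary_positions(gold)
    hits = len(p_bounds & g_bounds)
    precision = hits / len(p_bounds) if p_bounds else 1.0
    recall = hits / len(g_bounds) if g_bounds else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = ["whats", "that", "the", "doggie"]
proposed = ["whats", "thatthe", "doggie"]   # undersegmented hypothesis
print(boundary_scores(proposed, gold))       # high precision, lower recall
```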

Page 16

What happened?

Model assumes (falsely) that words have the same probability regardless of context.

Positing amalgams allows the model to capture word-to-word dependencies.

P(D&t) = .024 P(D&t|WAts) = .46 P(D&t|tu) = .0019

Page 17

What about other unigram models?

Brent’s learning algorithm is insufficient to identify the optimal segmentation: our solution has higher probability under his model than his own solution does. On a randomly permuted corpus, our system achieves 96% accuracy; Brent’s gets 81%.

Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Page 18

Bigram model (hierarchical Dirichlet process)

Assume word wi is generated as follows:

1. Is (wi-1,wi) a novel bigram?

2. If novel, generate wi using unigram model (almost).

If not, choose lexical identity of wi from words previously occurring after wi-1.

P(yes) = β / (n(wi-1) + β)
P(no) = n(wi-1) / (n(wi-1) + β)

(n(wi-1) = number of bigram tokens seen so far that begin with wi-1; β = a parameter of the model)

P(wi = l | wi-1 = l′) = count(l′, l) / count(l′)
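A hedged Python sketch of the bigram generation step (β, the base distribution, and the helper names are illustrative placeholders, not the authors' code):

```python
import random
from collections import Counter, defaultdict

def generate_next_word(prev, bigram_counts, generate_from_base, beta=1.0):
    """Generate wi given wi-1 under the bigram (hierarchical DP) model (sketch).

    With probability beta / (n(prev) + beta) the bigram is novel and wi is drawn
    from the base (unigram-style) distribution; otherwise wi is chosen in
    proportion to count(prev, w) among words previously seen after prev.
    """
    following = bigram_counts[prev]           # Counter of words seen after prev
    n_prev = sum(following.values())
    if random.random() < beta / (n_prev + beta):
        return generate_from_base()           # novel bigram: back off
    words, counts = zip(*following.items())
    return random.choices(words, weights=counts)[0]

# Toy usage with a uniform base distribution over a small lexicon.
bigram_counts = defaultdict(Counter)
base = lambda: random.choice(["the", "doggie", "whats", "that", "wheres"])
prev = "the"
for _ in range(10):
    w = generate_next_word(prev, bigram_counts, base)
    bigram_counts[prev][w] += 1
    prev = w
print(dict(bigram_counts))
```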

Page 19

Example results

Page 20

Quantitative evaluation

Compared to the unigram model, more boundaries are proposed, with no loss in accuracy:

                  Boundary Precision   Boundary Recall
GGJ (unigram)     .92                  .62
GGJ (bigram)      .92                  .84

Accuracy is higher than previous models:

                  Token F-score   Type F-score
Brent (unigram)   .68             .52
GGJ (bigram)      .77             .63

Page 21

Summary

Two different assumptions about what defines a word are consistent with behavioral evidence.

Different assumptions lead to different results. Beginning of word predicts end of word:

Optimal solution undersegments, finding common multi-word units.

Word also predicts next word:

Segmentation is more accurate, adult-like.

Page 22

Remaining questions

Is unigram segmentation sufficient to start bootstrapping other cues (e.g., stress)?

How prevalent are multi-word chunks in infant vocabulary?

Are humans able to segment based on bigram statistics?

Is there any evidence that human performance is consistent with Bayesian predictions?

Page 23

Testing model predictions

Goal: compare our model (and others) to human performance in a Saffran-style experiment.

Problem: all models have near-perfect accuracy on experimental stimuli.

Solution: compare changes in model performance relative to humans as task difficulty is varied.

tupiro golabu bidaku padoti

golabubidakugolabutupiropadotibidakupadotitupiro…

Page 24

Experimental method

Examine segmentation performance under different utterance length conditions.

Example lexicon: lagi dazu tigupi bavulu kabitudu kipavazi

Conditions:

# wds/utt   # utts   tot # wds
1           1200     1200
2            600     1200
4            300     1200
6            200     1200
8            150     1200
12           100     1200
24            50     1200
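To illustrate the design, a small Python sketch that builds training utterances for one words-per-utterance condition while holding the total at 1200 word tokens; the lexicon is from the slide, but the uniform random sampling is an assumption about how the stimuli were constructed.

```python
import random

lexicon = ["lagi", "dazu", "tigupi", "bavulu", "kabitudu", "kipavazi"]

def make_condition(words_per_utt, total_words=1200, seed=0):
    """Return a list of utterances (concatenated words, no pauses within an utterance)."""
    rng = random.Random(seed)
    n_utts = total_words // words_per_utt
    utterances = []
    for _ in range(n_utts):
        utt_words = [rng.choice(lexicon) for _ in range(words_per_utt)]
        utterances.append("".join(utt_words))
    return utterances

for k in (1, 2, 4, 6, 8, 12, 24):
    utts = make_condition(k)
    print(k, "words/utt ->", len(utts), "utterances")  # e.g. 4 -> 300
```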

Page 25

Procedure

Training: adult subjects listened to synthesized utterances in one length condition. No pauses between syllables within utterances. 500 ms pauses between utterances.

Testing: 2AFC between words and part-word distractors.

Lexicon: lagi dazu tigupi bavulu kabitudu kipavazi

lagitigupibavulukabitudulagikipavazidazukipavazibavululagitigupikabitudukipavazitigupidazukabitudulagitigupi …

Page 26

Human performance

Page 27

Model comparison

Evaluated six different models. Each model was trained and tested on the same stimuli as humans.

To simulate 2AFC, produce a score s(w) for each word in the choice pair and use the Luce choice rule:

P(w1) = s(w1) / (s(w1) + s(w2))

Compute the best linear fit of each model to the human data, then calculate the correlation.
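A brief sketch of the 2AFC simulation and fit, assuming numpy is available; the model scores and human accuracies below are hypothetical placeholder numbers, not data from the study.

```python
import numpy as np

def luce_choice(s_word, s_partword):
    """Probability of choosing the correct word in a 2AFC trial (Luce choice rule)."""
    return s_word / (s_word + s_partword)

# Hypothetical model scores and human accuracies for the 7 length conditions.
model_word_scores     = np.array([0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2])
model_partword_scores = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
human_accuracy        = np.array([0.88, 0.85, 0.80, 0.76, 0.73, 0.70, 0.66])

model_accuracy = luce_choice(model_word_scores, model_partword_scores)

# Best linear fit of model predictions to human data, then the correlation.
slope, intercept = np.polyfit(model_accuracy, human_accuracy, 1)
r = np.corrcoef(model_accuracy, human_accuracy)[0, 1]
print(f"fit: {slope:.2f} * model + {intercept:.2f}, r = {r:.2f}")
```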

Page 28

Models used

Three local-statistic models, all similar to transitional probabilities (TP): segment at minima of P(syli | syli-1); s(w) = minimum TP in w.

Swingley (2005): builds a lexicon using local-statistic and frequency thresholds; s(w) = max threshold at which w appears in the lexicon.

PARSER (Perruchet and Vinter, 1998): incorporates principles of lexical competition and memory decay; s(w) = P(w) as defined by the model.

GGJ (our Bayesian model): s(w) = P(w) as defined by the model.

Page 29

Results: linear fit

Page 30

Results: words vs. part-words

Page 31

Summary

Statistical segmentation is more difficult when utterances contain more words.

Gradual decay in performance is predicted by Bayesian model, but not by others tested.

Bayes predicts difficulty is primarily due to effects of competition: in longer utterances, correct words are less probable because more other possibilities exist. Local-statistic approaches don’t model competition.

Page 32

Continuing work

Experiments with other task modifications will further test our model’s predictions.

Vary the length of exposure to training stimulus: Bayes: longer exposure => better performance. TPs: no effect of exposure.

Vary the number of lexical items: Bayes: larger lexicon => worse performance. TPs: larger lexicon => better performance.

Page 33

Conclusions

Computer simulations and experimental work suggest that

Unigram assumption causes ideal learners to undersegment fluent speech.

Human word segmentation may approximate Bayesian ideal learning.

Page 34

Page 35

Bayesian segmentation

Input data:
whatsthatthedoggiewheresthedoggie...

Some hypotheses:
whatsthatthedoggiewheresthedoggie
wh at sth atthedo ggiewh eres thedo ggie
whats thatthe doggiewheres the doggie
w h a t s t h a tt h e d o g g i ew h e r e s t h e d o g g i e

Page 36

Search algorithm

Model defines a distribution over hypotheses. We use Gibbs sampling to find a good hypothesis.

Iterative procedure produces samples from the posterior distribution of hypotheses.

A batch algorithm (but online algorithms are possible, e.g., particle filtering).

(Figure: sketch of the posterior distribution P(h|d) over hypotheses h.)

Page 37

Gibbs sampler

1. Consider two hypotheses differing by a single word boundary:

whats.thatthe.doggiewheres.the.doggie
whats.thatthe.dog.giewheres.the.doggie

2. Calculate probabilities of the words that differ, given the current analysis of all other words. The model is exchangeable: the probability of a set of outcomes does not depend on their ordering.

3. Sample one of the two hypotheses according to the ratio of probabilities.
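A schematic Python sketch of one such boundary decision; the scoring function here is a placeholder (the real sampler computes the DP/HDP probabilities of the affected words given all other words):

```python
import math
import random

def gibbs_boundary_step(words, index, split_point, log_prob):
    """Resample a single potential boundary inside words[index].

    h1 keeps words[index] whole; h2 splits it at split_point. Only the words
    that differ are scored (exchangeability lets us ignore ordering), and one
    hypothesis is sampled in proportion to its probability.
    """
    whole = words[index]
    left, right = whole[:split_point], whole[split_point:]
    logp_h1 = log_prob([whole])           # e.g. score of "doggie"
    logp_h2 = log_prob([left, right])     # e.g. score of "dog" + "gie"
    p_h1 = 1.0 / (1.0 + math.exp(logp_h2 - logp_h1))
    if random.random() < p_h1:
        return words                                           # no boundary
    return words[:index] + [left, right] + words[index + 1:]   # insert boundary

# Toy usage with a placeholder scorer penalizing word count and total length.
toy_log_prob = lambda ws: sum(-0.3 * len(w) - 0.5 for w in ws)
print(gibbs_boundary_step(["whats", "thatthe", "doggie"], 2, 3, toy_log_prob))
```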

Page 38

Models used

Transitional probabilities (TP): segment at minima of P(syli | syli-1); s(w) = minimum TP in w (equivalently, use the product).

Smoothed transitional probabilities: avoid zero counts by using add-λ smoothing.

Mutual information (MI): segment where MI between syllables is lowest; s(w) = minimum MI in w (equivalently, use the sum).
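A compact Python sketch (illustrative; λ, the vocabulary size, and the toy stream are assumptions) of add-λ smoothed TP and pointwise mutual information scores of the kind described here:

```python
import math
from collections import Counter

def word_scores(syllables, vocab_size, lam=0.5):
    """Build smoothed-TP and MI statistics from a syllable stream, plus a min-score s(w)."""
    bigrams = Counter(zip(syllables, syllables[1:]))
    unigrams = Counter(syllables)
    total = len(syllables)

    def smoothed_tp(a, b):
        # add-lambda smoothing avoids zero counts for unseen transitions
        return (bigrams[(a, b)] + lam) / (unigrams[a] + lam * vocab_size)

    def mi(a, b):
        # pointwise mutual information of adjacent syllables (smoothed estimates)
        p_ab = (bigrams[(a, b)] + lam) / (total - 1 + lam * vocab_size ** 2)
        p_a = (unigrams[a] + lam) / (total + lam * vocab_size)
        p_b = (unigrams[b] + lam) / (total + lam * vocab_size)
        return math.log(p_ab / (p_a * p_b))

    def score(word_sylls, stat):
        # s(w) = minimum statistic over the word's internal syllable transitions
        return min(stat(a, b) for a, b in zip(word_sylls, word_sylls[1:]))

    return smoothed_tp, mi, score

syllables = ["go", "la", "bu", "bi", "da", "ku", "go", "la", "bu"]
tp, mi, score = word_scores(syllables, vocab_size=6)
print(score(["go", "la", "bu"], tp), score(["bu", "bi"], mi))
```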

Page 39

Models used

Swingley (2005): builds a lexicon, including syllable sequences above some threshold for both MI and n-gram frequency; s(w) = max threshold at which w appears in the lexicon.

PARSER (Perruchet and Vinter, 1998): a lexicon-based model incorporating principles of lexical competition and memory decay; s(w) = P(w) as defined by the model.

GGJ (our Bayesian model): s(w) = P(w) as defined by the model.

Page 40

Results: linear fit

Page 41

Continuing work

Comparisons to human data and other models: Which words/categories are most robust?

Compare to Frequent Frames predictions (Mintz, 2003).

Compare to corpus data from children’s production.

Modeling cue combination: integrate morphology into the syntactic model; model experimental work on cue combination in category learning.

Page 42