Ngram models and the Sparsity problem

Jan 14, 2016

Nathan Pearson
Transcript
Page 1

Ngram models and the Sparsity problem

Page 2

The task

• Find a probability distribution for the current word in a text (utterance, etc.), given what the last n words have been. (n = 0,1,2,3)

• Why this is reasonable

• What the problems are
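In symbols (a restatement of the task above, not text from the slide), we want to estimate

\[
P(w_i \mid w_{i-n}, \ldots, w_{i-1}), \qquad n \in \{0, 1, 2, 3\},
\]

so for n = 0 we use no context at all (a unigram model), and for n = 1, 2, 3 we condition on the last one, two, or three words.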

Page 3

Why this is reasonable

The last few words tell us a lot about the next word:

• collocations

• prediction of the current word's category: "the" is followed by nouns or adjectives

• semantic domain

Page 4

Reminder about applications

• Speech recognition

• Handwriting recognition

• POS tagging

Page 5

Problem of sparsity

• Words are very rare events (even if we’re not aware of it), so

• what feel like perfectly common sequences of words may be too rare to actually appear in our training corpus

Page 6

What’s the next word?

in a ____
with a ____
the last ____
shot a ____
open the ____
over my ____
President Bill ____
keep tabs ____

Page 7

borrowed from Henke, based on Manning and Schütze

Example:

Corpus: five Jane Austen novels

N = 617,091 words

V = 14,585 unique words

Task: predict the next word of the trigram “inferior to ________”

from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”

Page 8

borrowed from Henke, based on Manning and Schütze

Instances in the Training Corpus: “inferior to ________”

Page 9

Maximum Likelihood Estimate:

borrowed from Henke, based on Manning and Schütze
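The estimate itself did not survive in this transcript; the standard maximum likelihood estimate it refers to (as in Manning and Schütze) is presumably

\[
P_{\mathrm{ML}}(w_3 \mid w_1 w_2) \;=\; \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)},
\]

e.g. the probability of "both" after "inferior to" is the count of "inferior to both" divided by the count of "inferior to" in the training corpus.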

Page 10

Maximum Likelihood Distribution = D_ML

• Probability is assigned in exact proportion to each n-gram's count in the training corpus.

• Anything not found in the training corpus gets probability 0.
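A minimal sketch of such a maximum likelihood model for bigrams (illustrative Python; the function name and toy corpus are mine, not from the slides):

from collections import Counter

def mle_bigram_model(tokens):
    """Maximum likelihood bigram model: P(w2 | w1) = C(w1 w2) / C(w1)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    def prob(w1, w2):
        # Anything not found in the training corpus gets probability 0.
        if history_counts[w1] == 0:
            return 0.0
        return bigram_counts[(w1, w2)] / history_counts[w1]
    return prob

p = mle_bigram_model("the cat sat on the mat".split())
print(p("the", "cat"))   # 0.5 -- "the" occurs twice, once followed by "cat"
print(p("the", "dog"))   # 0.0 -- unseen bigram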

Page 11

borrowed from Henke, based on Manning and Schütze

Actual Probability Distribution:

Page 12

Conundrum

• Do we stick very tight to the “Maximum Likelihood” model, assigning zero probability to sequences not seen in the training corpus?

• Answer: we simply cannot; the results are just too bad.

Page 13

Smoothing

• We need, therefore, some “smoothing” procedure

• which adds some of the probability mass to unseen n-grams

• and must therefore take away some of the probability mass from observed n-grams

Page 14

Laplace / Lidstone / Jeffreys-Perks

Three closely related ideas that are widely used.

Page 15

“Sum of counts” method of creating a distribution

You can always get a distribution from a set of counts by dividing each count by the total count of the set.

“bins”: name for the different preceding n-grams that we keep track of. Each bin gets a probability, and they must sum to 1.0
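A tiny illustration of the sum-of-counts recipe (the numbers are made up, not from the slides):

counts = {"bin_a": 6, "bin_b": 3, "bin_c": 1}            # hypothetical bin counts
total = sum(counts.values())                             # 10
distribution = {b: c / total for b, c in counts.items()}
# {'bin_a': 0.6, 'bin_b': 0.3, 'bin_c': 0.1} -- the probabilities sum to 1.0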

Page 16

Zero knowledge

Suppose we give a count of 1 to every possible bin in our model.

If our model is a bigram model, we give a count of 1 to each of the V^2 conceivable bigrams. (V bins if unigram, V^3 if trigram, etc.)

Admittedly, this model assumes zero knowledge of the language….

We get a distribution over the bins by assigning probability 1/V^2 to each bin. Call this distribution D_N.

Page 17

Too much knowledge

• Give each bin exactly the number of counts that it earns from the training corpus.

• If we are making a bigram model, then there are V^2 bins, and those bigrams that do not appear in the training corpus get a count of 0.

• We get the Maximum Likelihood distribution by dividing by the total count = N.

Page 18

Laplace (“Adding one”)

Add the bin counts from the zero-knowledge case (1 for each bin, V^2 of them in the bigram case) to the bin counts from the too-much-knowledge case (the count earned in the training corpus).

• Divide by the total number of counts = V^2 + N

• Formula: each bin gets probability (Count in corpus + 1) / (V^2 + N)
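A minimal sketch of the add-one recipe for bigram bins (illustrative Python; the function name and toy corpus are mine, not from the slides):

from collections import Counter

def laplace_bigram_prob(bigram, bigram_counts, N, V):
    """Add-one (Laplace) probability of one bigram bin:
    (count in corpus + 1) / (V**2 + N)."""
    return (bigram_counts[bigram] + 1) / (V**2 + N)

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = sum(bigram_counts.values())      # 5 bigram tokens in the corpus
V = len(set(tokens))                 # 5 word types, so V**2 = 25 bins
print(laplace_bigram_prob(("the", "cat"), bigram_counts, N, V))   # (1 + 1) / 30
print(laplace_bigram_prob(("the", "dog"), bigram_counts, N, V))   # (0 + 1) / 30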

Page 19

Difference with book

Page 20

Lidstone’s Law

Choose a number λ, between 0 and 1, for the count in the No-Knowledge distribution.

Then the count in each bin is Count in corpus + λ,

and we assign it probability (where the number of bins is V^2, because we’re considering a bigram model):

(Count in corpus + λ) / (N + λ·V^2)

If λ = 1, this is Laplace;

If λ = 0.5, this is the Jeffreys-Perks law;

If λ = 0, this is Maximum Likelihood.
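The add-one sketch above generalizes directly to Lidstone's law; lam = 1 reproduces Laplace, lam = 0.5 gives Jeffreys-Perks, and lam = 0 gives the maximum likelihood estimate:

def lidstone_bigram_prob(bigram, bigram_counts, N, V, lam):
    """Add-lambda (Lidstone) probability of one bigram bin:
    (count in corpus + lam) / (N + lam * V**2)."""
    return (bigram_counts[bigram] + lam) / (N + lam * V**2)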

Page 21

Another way to say this…

• We can also think of Laplace as a weighted average of two distributions, the No Knowledge distribution and the Maximum Likelihood distribution…

Page 22

2. Averaging distributions

Remember this:

If you take weighted averages of distributions of this form:

λ · (distribution D1) + (1 - λ) · (distribution D2)

the result is a distribution: all the numbers sum to 1.0.

This means that you split the probability mass between the two distributions (in proportion λ to 1 - λ), then divide up those smaller portions exactly according to D1 and D2.
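To see why the weighted average is itself a distribution, sum over all bins x:

\[
\sum_x \bigl(\lambda D_1(x) + (1-\lambda) D_2(x)\bigr)
= \lambda \sum_x D_1(x) + (1-\lambda) \sum_x D_2(x)
= \lambda + (1-\lambda) = 1 .
\]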

Page 23

“Adding 1” (Laplace)

Is it clear that the Laplace ("adding 1") distribution can be written as

(V^2 / (V^2 + N)) · D_NoKnowledge + (N / (V^2 + N)) · D_MaxLikelihood,

and that the two weights sum to 1:

V^2 / (V^2 + N) + N / (V^2 + N) = 1 ?
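Checking this bin by bin, where C is the bin's count in the training corpus:

\[
\frac{V^2}{V^2+N}\cdot\frac{1}{V^2} \;+\; \frac{N}{V^2+N}\cdot\frac{C}{N}
\;=\; \frac{1}{V^2+N} + \frac{C}{V^2+N}
\;=\; \frac{C+1}{V^2+N},
\]

which is exactly the "adding 1" probability from the Laplace slide.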

Page 24

this is a special case of

λ · D_N + (1 - λ) · D_ML

where λ = V^2 / (V^2 + N).

How big is λ? If V = 50,000, then V^2 = 2,500,000,000. This means that if our corpus is two and a half billion words, we are still reserving half of our probability mass for zero knowledge:

λ = V^2 / (V^2 + N) = 2,500,000,000 / 5,000,000,000 = 0.5

and that's too much.

Page 25

Good-Turing discounting

• The central problem is to assign probability mass to unseen examples, especially unseen bigrams (or trigrams), based on the known vocabulary.

• Good-Turing estimation says that a good estimate of the total probability of all unseen n-grams is the fraction of the corpus made up of n-grams seen exactly once: N_1 / N (where N_1 is the number of n-grams seen exactly once).

Page 26

So we take the probability mass assigned empirically to the n-grams seen once and give it to the unseen n-grams. We know how many unseen n-grams there are: if the vocabulary is of size V, there are V^n possible n-grams, so if we have seen T distinct n-grams, V^n - T are unseen.

Each unseen n-gram then gets probability:

(N_1 / N) · 1 / (V^n - T)

Page 27

• So the unseen n-grams have taken all of the probability mass that had been earned by the n-grams seen once. The n-grams seen once therefore grab all of the probability mass earned by the n-grams seen twice, distributed uniformly among them:

each n-gram seen once gets probability (N_2 / N) · 1 / N_1

where N_2 is the number of n-grams seen exactly twice.

Page 28

So n-grams seen twice will take all the probability mass earned by n-grams seen three times…and we stop this foolishness around the time when observed frequencies are reliable, around 10 times.

[Diagram: a row of count classes (n-grams seen 1x, 2x, 3x, 4x, 5x) above a row of model predictions; each class is assigned the probability mass earned by the class one step above it, and the mass of the 1x class goes to all the unseen n-grams.]
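A rough sketch of the reassignment scheme as these slides describe it (illustrative Python; a simplification of full Good-Turing discounting, and note that, as presented, the shifted masses are not renormalized):

from collections import Counter

def shifted_mass_per_ngram(ngram_counts, V, n, cutoff=10):
    """Probability per n-gram, by how often it was seen: unseen n-grams share
    N1/N, n-grams seen c times (c < cutoff) share N_{c+1}/N, and n-grams seen
    at least `cutoff` times keep their maximum likelihood estimate."""
    N = sum(ngram_counts.values())              # total n-gram tokens
    Nc = Counter(ngram_counts.values())         # Nc[c] = number of n-grams seen c times
    T = len(ngram_counts)                       # distinct n-grams seen
    unseen = V ** n - T                         # number of unseen n-grams
    per_ngram = {0: (Nc[1] / N) / unseen if unseen else 0.0}
    for c in range(1, cutoff):
        if Nc[c]:
            per_ngram[c] = (Nc[c + 1] / N) / Nc[c]
    for c in Nc:
        if c >= cutoff:
            per_ngram[c] = c / N                # reliable frequencies: plain MLE
    return per_ngram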

Page 29

Witten-Bell discounting

Let’s try to estimate the probability of all of the unseen N-grams of English, given a corpus.

First guess: the probability of hitting a new word in a corpus is roughly equal to the number of new (distinct) words encountered in the observed corpus divided by the number of tokens (likewise for bigrams and other n-grams): prob ≈ #distinct words / #words ?

Page 30

That over-estimates… because at the beginning, almost every word looks new and unseen!

So we must either decrease the numerator or increase the denominator.

Witten-Bell: Suppose we have a data structure that keeps track of the words we have seen. As we read a corpus, for each word we ask it: have you seen this before? If it says No, we say: add it to your memory (that's a separate function). The probability of new words is estimated by the proportion of calls to this data structure that are "Add" calls.
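One way to make that counting concrete (an interpretation of the slide, not its code): each word triggers one "have you seen this?" query, and each new word additionally triggers one "Add" call, so the fraction of calls that are Adds works out to #distinct words / (#distinct words + #words).

def estimate_prob_new_word(tokens):
    """Fraction of calls to the word memory that are 'Add' calls."""
    seen = set()
    queries = 0          # "have you seen this before?" -- one per token
    adds = 0             # "add it to your memory"      -- one per new word type
    for w in tokens:
        queries += 1
        if w not in seen:
            adds += 1
            seen.add(w)
    return adds / (queries + adds)

print(estimate_prob_new_word("the cat sat on the mat".split()))   # 5 / (6 + 5)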

Page 31

• Estimate prob (unseen word) as

#distinct words / (#distinct words + #words) = K

And then distribute K uniformly over unseen unigrams (that’s hard…) or n-grams, and reduce the probability given to seen n-grams

Page 32

• Therefore, the estimated real probability of seeing one of the N-grams we have already seen is T/(T + B), and the estimate of seeing a new N-gram at any moment is B/(T + B) (on these slides, T is evidently the total count of observed N-grams and B the number of distinct N-grams seen, matching #words and #distinct words above).

• So we want to distribute B/(T + B) over the unseen N-grams.
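If there are Z unseen N-grams (Z = V^n - B for a vocabulary of size V), then following the uniform split mentioned on the previous slide, each individual unseen N-gram would get

\[
\frac{B}{(T+B)\,Z},
\]

and the probabilities of the seen N-grams are reduced so that everything still sums to 1.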