Page 1: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Smoothing N-gram Language Models

Shallow Processing Techniques for NLP, Ling570

October 24, 2011

Page 2: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Roadmap
Comparing N-gram Models

Managing Sparse Data: Smoothing
Add-one smoothing
Good-Turing smoothing
Interpolation
Backoff

Extended N-gram Models
Class-based n-grams
Long-distance models

Page 3: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Perplexity Model Comparison

Compare models with different history lengths

Train models: 38 million words – Wall Street Journal

Page 4: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Perplexity Model Comparison

Compare models with different history lengths

Train models: 38 million words – Wall Street Journal

Compute perplexity on a held-out test set: 1.5 million words (~20K unique, smoothed)

Page 5: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Perplexity Model Comparison

Compare models with different history lengths

Train models: 38 million words – Wall Street Journal

Compute perplexity on a held-out test set: 1.5 million words (~20K unique, smoothed)

N-gram Order | Perplexity
Unigram | 962
Bigram | 170
Trigram | 109
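For reference, perplexity is 2 raised to the negative average per-word log2 probability of the test set. A minimal Python sketch, where the function name and toy numbers are only illustrative:

```python
import math

def perplexity(log2_probs):
    # PP = 2 ** (-average per-word log2 probability)
    avg = sum(log2_probs) / len(log2_probs)
    return 2 ** (-avg)

# Toy check: a model that gives every test word probability 1/8 has perplexity 8.
print(perplexity([math.log2(1 / 8)] * 4))  # 8.0
```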

Page 6: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Smoothing

Page 7: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Problem: Sparse Data
Goal: Accurate estimates of probabilities

Current maximum likelihood estimates, e.g. PMLE(wi | wi-1) = C(wi-1 wi) / C(wi-1)

Work fairly well for frequent events

Page 8: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Problem: Sparse Data
Goal: Accurate estimates of probabilities

Current maximum likelihood estimates, e.g. PMLE(wi | wi-1) = C(wi-1 wi) / C(wi-1)

Work fairly well for frequent events

Problem: Corpora are limited, so many events have zero counts (event = n-gram)

Page 9: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Problem: Sparse Data
Goal: Accurate estimates of probabilities

Current maximum likelihood estimates, e.g. PMLE(wi | wi-1) = C(wi-1 wi) / C(wi-1)

Work fairly well for frequent events

Problem: Corpora are limited, so many events have zero counts (event = n-gram)

Approach: “Smoothing”
Shave some probability mass from higher counts to put on the (erroneous) zero counts

Page 10: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

How much of a problem is it?

Consider Shakespeare:
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...

Page 11: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

How much of a problem is it?

Consider Shakespeare:
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...

So, 99.96% of the possible bigrams were never seen (have zero entries in the table)

Page 12: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

How much of a problem is it?

Consider Shakespeare:
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...

So, 99.96% of the possible bigrams were never seen (have zero entries in the table)

Does that mean that any sentence that contains one of those bigrams should have a probability of 0?

Page 13: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

What are Zero Counts? Some of those zeros are really zeros...

Things that really can’t or shouldn’t happen.

Page 14: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

What are Zero Counts? Some of those zeros are really zeros...

Things that really can’t or shouldn’t happen.

On the other hand, some of them are just rare events. If the training corpus had been a little bigger they would have had a count (probably a count of 1!).

Page 15: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

What are Zero Counts? Some of those zeros are really zeros...

Things that really can’t or shouldn’t happen.

On the other hand, some of them are just rare events. If the training corpus had been a little bigger they would have had a count (probably a count of 1!).

Zipf’s Law (long-tail phenomenon):
A small number of events occur with high frequency
A large number of events occur with low frequency
You can quickly collect statistics on the high-frequency events
You might have to wait an arbitrarily long time to get valid statistics on low-frequency events

Page 16: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing
Add 1 to all counts (aka Add-one Smoothing)

Page 17: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing
Add 1 to all counts (aka Add-one Smoothing)

V: size of vocabulary; N: size of corpus

Unigram: PMLE

Page 18: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing
Add 1 to all counts (aka Add-one Smoothing)

V: size of vocabulary; N: size of corpus

Unigram: PMLE: P(wi) = C(wi)/N

PLaplace(wi) = (C(wi) + 1) / (N + V)

Page 19: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing
Add 1 to all counts (aka Add-one Smoothing)

V: size of vocabulary; N: size of corpus

Unigram: PMLE: P(wi) = C(wi)/N

PLaplace(wi) = (C(wi) + 1) / (N + V)

Bigram: PLaplace(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V)

Page 20: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing
Add 1 to all counts (aka Add-one Smoothing)

V: size of vocabulary; N: size of corpus

Unigram: PMLE: P(wi) = C(wi)/N

PLaplace(wi) = (C(wi) + 1) / (N + V)

Bigram: PLaplace(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V)

n-gram: PLaplace(wi | wi-1 … wi-n+1) = (C(wi-n+1 … wi) + 1) / (C(wi-n+1 … wi-1) + V)
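The Laplace formulas above drop straight into code. A minimal sketch, with a made-up toy corpus and illustrative function names:

```python
from collections import Counter

def laplace_bigram(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    # PLaplace(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)

print(laplace_bigram("the", "cat", bigrams, unigrams, V))  # seen bigram
print(laplace_bigram("the", "sat", bigrams, unigrams, V))  # unseen bigram, still nonzero
```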

Page 22: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

BERP Corpus Bigrams
Original bigram probabilities

Page 23: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

BERP Smoothed Bigrams
Smoothed bigram probabilities from the BERP corpus

Page 24: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9

Page 25: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9

PMLE=

Page 26: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9

PMLE=9/10 = 0.9

PLAP=

Page 27: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9

PMLE=9/10 = 0.9

PLAP = (9+1)/(10+100K) ≈ 0.0001

Page 28: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9

PMLE=9/10 = 0.9

PLAP = (9+1)/(10+100K) ≈ 0.0001

Too much probability mass ‘shaved off’ for zeroes

Too sharp a change in probabilities; problematic in practice

Page 29: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Add-δ Smoothing
Problem: Adding 1 moves too much probability mass

Page 30: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Add-δ Smoothing
Problem: Adding 1 moves too much probability mass

Proposal: Add smaller fractional mass δ

Padd-δ (wi|wi-1)

Page 31: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Add-δ Smoothing
Problem: Adding 1 moves too much probability mass

Proposal: Add smaller fractional mass δ

Padd-δ(wi|wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δV)

Issues:

Page 32: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Add-δ Smoothing
Problem: Adding 1 moves too much probability mass

Proposal: Add smaller fractional mass δ

Padd-δ(wi|wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δV)

Issues:
Need to pick δ
Still performs badly
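A minimal sketch of the add-δ variant, again with a made-up toy corpus; the δ values shown are only illustrative:

```python
from collections import Counter

def add_delta_bigram(w_prev, w, bigram_counts, unigram_counts, vocab_size, delta=0.1):
    # Padd-δ(w | w_prev) = (C(w_prev w) + δ) / (C(w_prev) + δV)
    return (bigram_counts[(w_prev, w)] + delta) / (unigram_counts[w_prev] + delta * vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)

# Smaller delta moves less mass onto unseen events than add-one does.
for d in (1.0, 0.1, 0.01):
    print(d, add_delta_bigram("the", "sat", bigrams, unigrams, V, delta=d))
```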

Page 33: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven’t

Page 34: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven’t

Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

Page 35: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven’t

Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

Notation: Nc is the frequency of frequency c, i.e. the number of n-grams which appear c times
N0: # of n-grams with count 0; N1: # of n-grams with count 1

Page 36: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Smoothing
Estimate the probability of things which occur c times with the probability of things which occur c+1 times: c* = (c+1) Nc+1 / Nc

Page 37: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Josh Goodman Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass

You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?

Slide adapted from Josh Goodman, Dan Jurafsky

Page 38: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Josh Goodman Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass

You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18

Assuming so, how likely is it that the next species is trout?

Slide adapted from Josh Goodman, Dan Jurafsky

Page 39: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Josh Goodman Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass

You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18

Assuming so, how likely is it that the next species is trout? It must be less than 1/18

Slide adapted from Josh Goodman, Dan Jurafsky
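A small sketch that works through the Good-Turing arithmetic on the catch above (the code and variable names are illustrative): the mass reserved for unseen species is N1/N, and a count-1 species is re-estimated with c* = (c+1) Nc+1 / Nc.

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())               # 18 fish
Nc = Counter(catch.values())          # frequency of frequencies: N1 = 3, N2 = 1, ...

p_unseen = Nc[1] / N                  # mass for new species: 3/18
c_star_1 = (1 + 1) * Nc[2] / Nc[1]    # revised count for count-1 species: 2 * 1/3
p_trout = c_star_1 / N                # ~0.037, less than the MLE 1/18 ~ 0.056

print(p_unseen, p_trout)
```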

Page 40: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

GT Fish Example

Page 41: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Bigram Frequencies of Frequencies and GT Re-estimates

Page 42: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Good-Turing Smoothing
N-gram counts to conditional probability

c* from the GT estimate

Page 43: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Backoff and Interpolation
Another really useful source of knowledge

If we are estimating the trigram p(z|x,y) but count(xyz) is zero

Use info from:

Page 44: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Backoff and Interpolation
Another really useful source of knowledge

If we are estimating the trigram p(z|x,y) but count(xyz) is zero

Use info from: the bigram p(z|y)

Or even: the unigram p(z)

Page 45: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Backoff and Interpolation
Another really useful source of knowledge

If we are estimating the trigram p(z|x,y) but count(xyz) is zero

Use info from: the bigram p(z|y)

Or even: the unigram p(z)

How to combine this trigram, bigram, unigram info in a valid fashion?

Page 46: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Backoff vs. Interpolation
Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram

Page 47: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Backoff vs. Interpolation
Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram

Interpolation: always mix all three

Page 48: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Interpolation
Simple interpolation

Page 49: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Interpolation
Simple interpolation: Pinterp(wi | wi-2 wi-1) = λ1 P(wi | wi-2 wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with the λs summing to 1

Lambdas conditional on context: Intuition: Higher weight on more frequent n-grams

Page 50: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

How to Set the Lambdas?
Use a held-out, or development, corpus

Choose the lambdas which maximize the probability of some held-out data:
Fix the N-gram probabilities
Then search for lambda values that, when plugged into the previous equation, give the largest probability for the held-out set
Can use EM to do this search
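A minimal sketch of simple interpolation plus a brute-force search for the lambdas on held-out data. The slides suggest EM; a plain grid search is used here only for illustration, and the toy corpora and function names are assumptions:

```python
from collections import Counter
from itertools import product
import math

def train(tokens):
    """Return an interpolated probability function over MLE unigram/bigram/trigram estimates."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p(x, y, z, lams):
        l1, l2, l3 = lams
        p_uni = uni[z] / N
        p_bi = bi[(y, z)] / uni[y] if uni[y] else 0.0
        p_tri = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
        return l1 * p_uni + l2 * p_bi + l3 * p_tri

    return p

def tune_lambdas(p, heldout, step=0.1):
    """Grid-search (l1, l2, l3) summing to 1 that maximize held-out log probability."""
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1 - l1 - l2
        if l3 < -1e-9:
            continue
        ll, ok = 0.0, True
        for x, y, z in zip(heldout, heldout[1:], heldout[2:]):
            prob = p(x, y, z, (l1, l2, max(l3, 0.0)))
            if prob <= 0:
                ok = False
                break
            ll += math.log(prob)
        if ok and ll > best_ll:
            best, best_ll = (l1, l2, max(l3, 0.0)), ll
    return best

p = train("the cat sat on the mat the cat ran".split())
print(tune_lambdas(p, "the mat sat on the cat".split()))
```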

Page 51: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Katz Backoff

Page 52: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Katz Backoff
Note: We used P* (discounted probabilities) and α weights on the backoff values

Why not just use regular MLE estimates?

Page 53: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Katz Backoff
Note: We used P* (discounted probabilities) and α weights on the backoff values

Why not just use regular MLE estimates? Consider the sum over all wi in an n-gram context

What if we also back off to a lower-order n-gram for unseen events?

Page 54: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Katz Backoff
Note: We used P* (discounted probabilities) and α weights on the backoff values

Why not just use regular MLE estimates? Consider the sum over all wi in an n-gram context

If we also back off to a lower-order n-gram for unseen events, the distribution gets too much probability mass: it sums to more than 1

Page 55: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Katz Backoff
Note: We used P* (discounted probabilities) and α weights on the backoff values

Why not just use regular MLE estimates? Consider the sum over all wi in an n-gram context

If we also back off to a lower-order n-gram for unseen events, the distribution gets too much probability mass: it sums to more than 1

Solution:
Use P* discounts to save mass for lower-order n-grams
Apply α weights so the backed-off estimates sum to exactly the amount saved
Details in section 4.7.1
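A minimal sketch of this bookkeeping for bigrams. It is not the exact Katz formulation: a fixed absolute discount stands in for the Good-Turing discount, but it shows how the saved mass and the α weight keep each context summing to 1. Names and the toy corpus are illustrative:

```python
from collections import Counter

def backoff_bigram_lm(tokens, discount=0.5):
    """Backoff sketch: discounted P* for seen bigrams, alpha(v) * P(w) for unseen ones."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    p_uni = {w: c / N for w, c in unigrams.items()}

    def prob(v, w):
        if bigrams[(v, w)] > 0:
            return (bigrams[(v, w)] - discount) / unigrams[v]      # discounted P*
        seen_after_v = {u for (a, u) in bigrams if a == v}
        saved = discount * len(seen_after_v) / unigrams[v]         # mass saved in context v
        unseen_mass = sum(p_uni[u] for u in unigrams if u not in seen_after_v)
        alpha = saved / unseen_mass                                # spread saved mass over backoff
        return alpha * p_uni[w]

    return prob

corpus = "the cat sat on the mat the cat ran".split()
p = backoff_bigram_lm(corpus)
print(p("the", "cat"))                              # seen bigram, discounted MLE
print(p("the", "sat"))                              # unseen bigram, backed off to unigram
print(sum(p("the", w) for w in set(corpus)))        # ~1.0: mass saved equals mass redistributed
```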

Page 56: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Toolkits
Two major language modeling toolkits:

SRILM
Cambridge-CMU toolkit

Page 57: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Toolkits
Two major language modeling toolkits:

SRILM
Cambridge-CMU toolkit

Publicly available, with similar functionality
Training: create a language model from a text file
Decoding: compute the perplexity/probability of a text

Page 58: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

OOV words: the <UNK> word
Out Of Vocabulary = OOV words

Page 59: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

OOV words: the <UNK> word
Out Of Vocabulary = OOV words

We don’t use GT smoothing for these

Page 60: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

OOV words: the <UNK> word
Out Of Vocabulary = OOV words

We don’t use GT smoothing for these, because GT assumes we know the number of unseen events

Instead: create an unknown word token <UNK>

Page 61: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

OOV words: the <UNK> word
Out Of Vocabulary = OOV words

We don’t use GT smoothing for these, because GT assumes we know the number of unseen events

Instead: create an unknown word token <UNK>
Training of <UNK> probabilities:

Create a fixed lexicon L of size V
At the text normalization phase, any training word not in L is changed to <UNK>
Now we train its probabilities like a normal word

Page 62: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

OOV words: the <UNK> word
Out Of Vocabulary = OOV words

We don’t use GT smoothing for these, because GT assumes we know the number of unseen events

Instead: create an unknown word token <UNK>
Training of <UNK> probabilities:

Create a fixed lexicon L of size V
At the text normalization phase, any training word not in L is changed to <UNK>
Now we train its probabilities like a normal word

At decoding time, if the input is text: use <UNK> probabilities for any word not seen in training
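A minimal sketch of the <UNK> normalization step described above; the lexicon-building rule (keep the most frequent words) and all names are illustrative assumptions:

```python
from collections import Counter

def build_lexicon(train_tokens, max_size):
    """Fixed lexicon L: here, the max_size most frequent training words."""
    return {w for w, _ in Counter(train_tokens).most_common(max_size)}

def normalize(tokens, lexicon):
    """Replace any token outside L with the <UNK> symbol."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the cat sat on the mat".split()
L = build_lexicon(train, max_size=4)
print(normalize(train, L))                  # rare training words become <UNK>
print(normalize("the dog sat".split(), L))  # unseen test word -> <UNK>
```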

Page 63: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Google N-Gram Release

Page 64: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Google N-Gram Release serve as the incoming 92

serve as the incubator 99

serve as the independent 794

serve as the index 223

serve as the indication 72

serve as the indicator 120

serve as the indicators 45

serve as the indispensable 111

serve as the indispensible 40

serve as the individual 234

Page 65: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Google Caveat
Remember the lesson about test sets and training sets: test sets should be similar to the training set (drawn from the same distribution) for the probabilities to be meaningful.

So... The Google corpus is fine if your application deals with arbitrary English text on the Web.

If not, then a smaller domain-specific corpus is likely to yield better results.

Page 66: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Class-Based Language Models

Variant of n-gram models using classes or clusters

Page 67: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Class-Based Language Models

Variant of n-gram models using classes or clusters

Motivation: Sparseness
Flight app.: P(ORD|to), P(JFK|to), ... P(airport_name|to)

Relate probability of n-gram to word classes & class ngram

Page 68: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Class-Based Language Models

Variant of n-gram models using classes or clusters

Motivation: Sparseness
Flight app.: P(ORD|to), P(JFK|to), ... P(airport_name|to)

Relate probability of n-gram to word classes & class ngram

IBM clustering: assume each word belongs to a single class
P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci)

Learn by MLE from data

Where do classes come from?

Page 69: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Class-Based Language Models

Variant of n-gram models using classes or clusters

Motivation: Sparseness
Flight app.: P(ORD|to), P(JFK|to), ... P(airport_name|to)

Relate probability of n-gram to word classes & class ngram

IBM clustering: assume each word belongs to a single class
P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci)

Learn by MLE from data

Where do classes come from?
Hand-designed for the application (e.g. ATIS)
Automatically induced clusters from a corpus
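A minimal sketch of the IBM-style class bigram above, with hypothetical hand-designed classes for a toy flight domain; all names and counts are illustrative:

```python
from collections import Counter

def class_bigram_lm(tokens, word2class):
    """IBM-style class bigram sketch: P(wi|wi-1) ≈ P(ci|ci-1) * P(wi|ci), all MLE."""
    classes = [word2class[w] for w in tokens]
    class_uni = Counter(classes)
    class_bi = Counter(zip(classes, classes[1:]))
    word_uni = Counter(tokens)

    def prob(w_prev, w):
        c_prev, c = word2class[w_prev], word2class[w]
        p_class = class_bi[(c_prev, c)] / class_uni[c_prev]   # P(ci | ci-1)
        p_word = word_uni[w] / class_uni[c]                   # P(wi | ci)
        return p_class * p_word

    return prob

word2class = {"fly": "VERB", "to": "TO", "ORD": "AIRPORT", "JFK": "AIRPORT"}
p = class_bigram_lm("fly to ORD fly to JFK".split(), word2class)
print(p("to", "JFK"))   # benefits from all AIRPORT counts observed after "to"
```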

Page 71: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

LM Adaptation
Challenge: Need an LM for a new domain, but have little in-domain data

Page 72: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

LM Adaptation
Challenge: Need an LM for a new domain, but have little in-domain data

Intuition: Much of language is pretty general
Can build from a ‘general’ LM + in-domain data

Page 73: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

LM Adaptation
Challenge: Need an LM for a new domain, but have little in-domain data

Intuition: Much of language is pretty general
Can build from a ‘general’ LM + in-domain data

Approach: LM adaptation
Train on a large domain-independent corpus
Adapt with a small in-domain data set

What large corpus?

Page 74: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

LM Adaptation
Challenge: Need an LM for a new domain, but have little in-domain data

Intuition: Much of language is pretty general
Can build from a ‘general’ LM + in-domain data

Approach: LM adaptation
Train on a large domain-independent corpus
Adapt with a small in-domain data set

What large corpus? Web counts! E.g. Google n-grams

Page 75: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Incorporating Longer Distance Context

Why use longer context?

Page 76: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Incorporating Longer Distance Context

Why use longer context?
N-grams are an approximation
Model size
Sparseness

Page 77: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Incorporating Longer Distance Context

Why use longer context?
N-grams are an approximation
Model size
Sparseness

What sorts of information in longer context?

Page 78: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Incorporating Longer Distance Context

Why use longer context?
N-grams are an approximation
Model size
Sparseness

What sorts of information in longer context?
Priming
Topic
Sentence type
Dialogue act
Syntax

Page 79: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Long Distance LMs
Bigger n!

284M words: <= 6-grams improve; 7-20 no better

Page 80: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Long Distance LMs
Bigger n!

284M words: <= 6-grams improve; 7-20 no better

Cache n-gram:
Intuition (priming): a word used previously is more likely to be used again

Incrementally create a ‘cache’ unigram model on the test corpus
Mix with the main n-gram LM

Page 81: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Long Distance LMs
Bigger n!

284M words: <= 6-grams improve; 7-20 no better

Cache n-gram:
Intuition (priming): a word used previously is more likely to be used again

Incrementally create a ‘cache’ unigram model on the test corpus
Mix with the main n-gram LM

Topic models:
Intuition: text is about some topic, so on-topic words are more likely

P(w|h) ~ Σt P(w|t)P(t|h)

Page 82: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Long Distance LMs
Bigger n!

284M words: <= 6-grams improve; 7-20 no better

Cache n-gram:
Intuition (priming): a word used previously is more likely to be used again

Incrementally create a ‘cache’ unigram model on the test corpus
Mix with the main n-gram LM

Topic models:
Intuition: text is about some topic, so on-topic words are more likely

P(w|h) ~ Σt P(w|t)P(t|h)

Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams
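A minimal sketch of the cache idea above: mix a unigram cache over the text seen so far with a static main LM. The uniform main LM, the mixing weight, and all names are illustrative assumptions:

```python
from collections import Counter

def cache_mixed_prob(word, history, main_prob, lam=0.9):
    """P(w | h) = lam * P_main(w) + (1 - lam) * C_cache(w) / |cache|,
    where the cache is a unigram model over the text seen so far."""
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return lam * main_prob(word) + (1 - lam) * p_cache

# Toy main LM: uniform over a small vocabulary (an assumption for the sketch).
vocab = ["the", "cat", "sat", "on", "mat"]
main = lambda w: 1 / len(vocab)

history = "the cat sat on the".split()
print(cache_mixed_prob("the", history, main))   # boosted: "the" already appeared twice
print(cache_mixed_prob("mat", history, main))   # not yet in the cache
```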

Page 83: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Language Models
N-gram models:

Finite approximation of an infinite context history

Issues: Zeroes and other sparseness

Strategies:
Smoothing: add-one, add-δ, Good-Turing, etc.
Use partial n-grams: interpolation, backoff

Refinements: class, cache, topic, trigger LMs

Page 84: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Most commonly used modern smoothing technique

Intuition: improving backoff
I can’t see without my reading……

Compare P(Francisco|reading) vs P(glasses|reading)

Page 85: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Most commonly used modern smoothing technique

Intuition: improving backoff
I can’t see without my reading……

Compare P(Francisco|reading) vs P(glasses|reading)
P(Francisco|reading) backs off to P(Francisco)

Page 86: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Most commonly used modern smoothing technique

Intuition: improving backoff
I can’t see without my reading……

Compare P(Francisco|reading) vs P(glasses|reading)
P(Francisco|reading) backs off to P(Francisco)
Even though P(glasses|reading) > 0, the high unigram frequency of Francisco can make the backed-off estimate exceed P(glasses|reading)

Page 87: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Most commonly used modern smoothing technique

Intuition: improving backoff
I can’t see without my reading……

Compare P(Francisco|reading) vs P(glasses|reading)
P(Francisco|reading) backs off to P(Francisco)
Even though P(glasses|reading) > 0, the high unigram frequency of Francisco can make the backed-off estimate exceed P(glasses|reading)
However, Francisco appears in few contexts, glasses in many

Page 88: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Most commonly used modern smoothing technique

Intuition: improving backoff
I can’t see without my reading……

Compare P(Francisco|reading) vs P(glasses|reading)
P(Francisco|reading) backs off to P(Francisco)
Even though P(glasses|reading) > 0, the high unigram frequency of Francisco can make the backed-off estimate exceed P(glasses|reading)
However, Francisco appears in few contexts, glasses in many

Interpolate based on # of contexts

Words seen in more contexts, more likely to appear in others

Page 89: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Modeling diversity of contexts

Continuation probability

Page 90: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Modeling diversity of contexts

Continuation probability

Backoff:

Page 91: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Kneser-Ney Smoothing
Modeling diversity of contexts

Continuation probability: PCONT(w) = |{v : C(v w) > 0}| / |{(v′, w′) : C(v′ w′) > 0}|

Backoff: PKN(w|v) = P*(w|v) if C(v w) > 0, else α(v) PCONT(w)

Interpolation: PKN(w|v) = max(C(v w) - d, 0)/C(v) + λ(v) PCONT(w), with λ(v) = d · |{w : C(v w) > 0}| / C(v)
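A minimal sketch of interpolated Kneser-Ney for bigrams, following the continuation-probability and interpolation formulas above; the discount value and the toy sentence are illustrative:

```python
from collections import Counter, defaultdict

def interpolated_kneser_ney(tokens, d=0.75):
    """PKN(w|v) = max(C(v w) - d, 0)/C(v) + lambda(v) * PCONT(w),
    PCONT(w)  = |{v : C(v w) > 0}| / (number of distinct bigram types),
    lambda(v) = d * |{w : C(v w) > 0}| / C(v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])       # counts of words used as bigram histories
    left_contexts = defaultdict(set)       # distinct words seen before w
    followers = defaultdict(set)           # distinct words seen after v
    for v, w in bigrams:
        left_contexts[w].add(v)
        followers[v].add(w)
    n_types = len(bigrams)

    def prob(v, w):
        p_cont = len(left_contexts[w]) / n_types
        lam = d * len(followers[v]) / histories[v]
        return max(bigrams[(v, w)] - d, 0) / histories[v] + lam * p_cont

    return prob

p = interpolated_kneser_ney("i can not see without my reading glasses".split())
print(p("reading", "glasses"))     # seen bigram plus continuation mass
print(p("reading", "francisco"))   # never seen in any context: continuation probability is 0
```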

Page 92: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Issues
Relative frequency

Typically compute the count of the sequence, then divide by the count of its prefix

Corpus sensitivity
Shakespeare vs. Wall Street Journal

Very unnatural

N-grams
Unigrams capture little; bigrams capture collocations; trigrams capture phrases

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Page 93: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Additional Issues in Good-Turing

General approach: the estimate c* for count c depends on Nc+1

What if Nc+1 = 0? More zero-count problems
Not uncommon: e.g. in the fish example, there are no species with count 4

Page 94: Smoothing N-gram Language Models Shallow Processing Techniques for NLP Ling570 October 24, 2011.

Modifications
Simple Good-Turing

Compute the Nc bins, then smooth the Nc values to replace zeroes
Fit a linear regression in log space:

log(Nc) = a + b log(c)

What about large c’s?
Should be reliable
Assume c* = c if c is large, e.g. c > k (Katz: k = 5)

Typically combined with other interpolation/backoff
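A minimal sketch of the log-space regression and the large-c cutoff described above; it simplifies Gale and Sampson's full Simple Good-Turing procedure, and the example frequency-of-frequency table is made up:

```python
import math

def simple_gt_counts(Nc, k=5):
    """Fit log(Nc) = a + b*log(c) by least squares, use the smoothed Nc (never zero)
    in c* = (c+1) * Nc+1 / Nc, and keep c* = c for reliable large counts (c > k)."""
    cs = sorted(Nc)
    xs = [math.log(c) for c in cs]
    ys = [math.log(Nc[c]) for c in cs]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    smoothed = lambda c: math.exp(a + b * math.log(c))   # fitted Nc, defined for every c

    def c_star(c):
        if c > k:
            return float(c)
        return (c + 1) * smoothed(c + 1) / smoothed(c)

    return c_star

# Hypothetical frequency-of-frequency table with a gap at c = 4 (no items seen 4 times).
c_star = simple_gt_counts({1: 10, 2: 5, 3: 3, 5: 1})
print(c_star(1), c_star(3))   # c* for c = 3 uses the smoothed N4 instead of a raw zero
```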