Chapter 6: Statistical Inference: n-gram Models over Sparse Data
TDM Seminar, Jonathan Henke
http://www.sims.berkeley.edu/~jhenke/Tdm/TDM-Ch6.ppt
Slide set modified slightly by Juggy for teaching a class on NLP using the same book: http://www.csee.wvu.edu/classes/nlp/Spring_2007/
Modified slides are marked with a symbol.
• “Markov Assumption” – word is affected only by its “prior local context” (last few words)
Possible Applications:
• OCR / Voice recognition – resolve ambiguity
• Spelling correction
• Machine translation
• Confirming the author of a newly discovered work
• “Shannon game”
“Shannon Game”
• Claude E. Shannon. “Prediction and Entropy of Printed English”, Bell System Technical Journal 30:50-64. 1951.
• Predict the next word, given (n-1) previous words
• Determine probability of different sequences by examining training corpus
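A minimal sketch of this setup in Python (the corpus and names below are illustrative, not from the slides): collect (n−1)-word contexts from a training corpus and rank candidate next words by relative frequency.

```python
from collections import Counter, defaultdict

def train_ngram_counts(tokens, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    """Shannon game: rank candidate next words by relative frequency."""
    continuations = counts[tuple(context)]
    total = sum(continuations.values())
    return [(w, c / total) for w, c in continuations.most_common()]

# Toy corpus, just to show the interface.
tokens = "the large green tree stood by the large green car".split()
model = train_ngram_counts(tokens, n=3)
print(predict_next(model, ["large", "green"]))  # [('tree', 0.5), ('car', 0.5)]
```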
Forming Equivalence Classes (Bins)
• “n-gram” = sequence of n words
  – bigram
  – trigram
  – four-gram
• Task at hand: P(w_n | w_1, …, w_{n-1})
Reliability vs. Discrimination
“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?
Reliability vs. Discrimination
• larger n: more information about the context of the specific instance (greater discrimination)
• smaller n: more instances in training data, better statistical estimates (more reliability)
Selecting an n
Vocabulary (V) = 20,000 words
n             Number of bins
2 (bigrams)   20,000 × 19,999 ≈ 400 million
3 (trigrams)  20,000 × 19,999 × 19,998 ≈ 8 trillion
4 (4-grams)   ≈ 1.6 × 10^17
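A quick back-of-the-envelope check of these bin counts (assuming, as the products above suggest, bins are ordered sequences of distinct words):

```python
V = 20_000
for n in (2, 3, 4):
    bins = 1
    for k in range(n):
        bins *= V - k          # V * (V-1) * ... for n positions
    print(n, f"{bins:.2g}")    # 2 -> 4e+08, 3 -> 8e+12, 4 -> 1.6e+17
```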
Statistical Estimators
• Given the observed training data…
• How do you develop a model (probability distribution) to predict future events?
P(w_n | w_1 … w_{n-1}) = P(w_1 … w_n) / P(w_1 … w_{n-1})
Maximum Likelihood Estimation (MLE)
• Example: 10 training instances of “comes across”
  – 8 of them were followed by “as”
  – 1 followed by “a”
  – 1 followed by “more”
• MLE estimates:
  – P(as | comes across) = 0.8
  – P(a | comes across) = 0.1
  – P(more | comes across) = 0.1
  – P(x | comes across) = 0 for any other word x
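A hedged sketch of the same computation (the function name and counts dictionary are illustrative):

```python
def p_mle(next_word, continuation_counts):
    """MLE: P(w | context) = C(context, w) / C(context)."""
    total = sum(continuation_counts.values())
    return continuation_counts.get(next_word, 0) / total

comes_across = {"as": 8, "a": 1, "more": 1}  # the 10 training instances above
print(p_mle("as", comes_across))   # 0.8
print(p_mle("the", comes_across))  # 0.0 -- unseen continuations get zero probability
```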
Statistical Estimators
Example:
Corpus: five Jane Austen novels
N = 617,091 words
V = 14,585 unique words
Task: predict the next word of the trigram “inferior to ________”
from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”
“Smoothing”
• Develop a model which decreases the probability of seen events and reserves some probability mass for previously unseen n-grams
• a.k.a. “Discounting methods”
• “Validation” – Smoothing methods which utilize a second batch of test data.
LaPlace’s Law (adding one)
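For reference, the add-one estimate, with C(·) the count of the n-gram in the training data, N the total number of training n-grams, and B the number of bins:

P_Lap(w_1 … w_n) = (C(w_1 … w_n) + 1) / (N + B)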
Lidstone’s Law
P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ)
P = probability of a specific n-gram
C = count of that n-gram in the training data
N = total number of n-grams in the training data
B = number of “bins” (possible n-grams)
λ = small positive number

M.L.E.: λ = 0
LaPlace’s Law: λ = 1
Jeffreys-Perks Law: λ = ½
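A minimal sketch of Lidstone’s Law as a function (the numbers plugged in below come from the “was not” example on the following slides; λ = 1 gives LaPlace, λ = 0.5 gives Jeffreys-Perks):

```python
def p_lidstone(count, total, bins, lam):
    """Lidstone's Law: (C + lambda) / (N + B * lambda)."""
    return (count + lam) / (total + bins * lam)

# C("was not") = 608, C("was") = 9409, B = 14589 word types
print(p_lidstone(608, 9409, 14589, 0.0))  # MLE            ~0.065
print(p_lidstone(608, 9409, 14589, 0.5))  # Jeffreys-Perks ~0.036
print(p_lidstone(608, 9409, 14589, 1.0))  # LaPlace        ~0.025
```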
Expected Likelihood Estimation

Rank    Word      MLE     ELE
1       not       0.065   0.036
2       a         0.052   0.030
3       the       0.033   0.019
4       to        0.031   0.017
…
1482    inferior  0       0.00003
“was” appeared 9,409 times
“not” appeared after “was” 608 times
Total # of word types = 14,589
MLE = 608 / 9409 = 0.065
ELE = (608 + 0.5) / (9409 + 14589 × 0.5) = 0.036
The new estimate has been discounted by roughly half.
Jeffreys-Perks Law
Objections to Lidstone’s Law
• Need an a priori way to determine λ.
• Predicts all unseen events to be equally likely
• Gives probability estimates linear in the M.L.E. frequency
Smoothing
• Lidstone’s Law (incl. LaPlace’s Law and Jeffreys-Perks Law): modifies the observed counts
• Other methods: modify probabilities.
Held-Out Estimator
• How much of the probability distribution should be “held out” to allow for previously unseen events?
• Validate by holding out part of the training data.
• How often do events unseen in the training data occur in the validation data? (e.g., to choose λ for the Lidstone model)
Held-Out Estimator
T_r = Σ C_2(w_1 … w_n), summed over all n-grams with C_1(w_1 … w_n) = r

P_ho(w_1 … w_n) = T_r / (N_r · N), where r = C_1(w_1 … w_n)

C_1(w_1 … w_n) = frequency of w_1 … w_n in the training data
C_2(w_1 … w_n) = frequency of w_1 … w_n in the held-out data
N_r = number of n-grams with frequency r in the training text
T_r = total number of times that all n-grams which appeared r times in the training text appeared in the held-out data
Average frequency of those n-grams in the held-out data = T_r / N_r
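A minimal sketch of the held-out computation under the notation above (function and argument names are mine; heldout_total is taken as the number of n-gram tokens in the held-out text, so the probability mass left over is what gets assigned to unseen n-grams):

```python
def held_out_probs(train_counts, heldout_counts, heldout_total):
    """Held-out estimator: P_ho(ngram) = T_r / (N_r * N), with r = training count.
    Only covers n-grams seen in training; the remaining mass is for unseen ones."""
    by_r = {}
    for ngram, r in train_counts.items():
        by_r.setdefault(r, []).append(ngram)
    prob_for_r = {}
    for r, ngrams in by_r.items():
        N_r = len(ngrams)                                    # types seen r times in training
        T_r = sum(heldout_counts.get(g, 0) for g in ngrams)  # their total held-out count
        prob_for_r[r] = T_r / (N_r * heldout_total)
    return {g: prob_for_r[r] for g, r in train_counts.items()}
```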
Testing Models
• Hold out ~ 5 – 10% for testing
• Hold out ~ 10% for validation (smoothing)
• For testing: useful to test on multiple sets of data, and report the variance of the results.
  – Are results (good or bad) just the result of chance?
Cross-Validation(a.k.a. deleted estimation)
• Use the data for both training and validation
• Divide the training data into two parts, A and B:
  (1) Train on A, validate on B → Model 1
  (2) Train on B, validate on A → Model 2
• Combine the two models into a final model
Cross-Validation
Two estimates:

P_ho^01(w_1 … w_n) = T_r^01 / (N_r^0 · N)
P_ho^10(w_1 … w_n) = T_r^10 / (N_r^1 · N)

N_r^a = number of n-grams occurring r times in the a-th part of the training set
T_r^ab = total number of those n-grams found in the b-th part

Combined estimate (arithmetic mean):

P_ho(w_1 … w_n) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1))
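A sketch of the pooled (deleted-estimation) version of the same computation (names are mine; N is the normalizer from the formula above):

```python
def deleted_estimate(counts_a, counts_b, N):
    """Deleted estimation: pool T_r and N_r from both directions.
    counts_a / counts_b map n-grams to their frequencies in part A / part B.
    Returns r -> probability assigned to each n-gram seen r times."""
    def stats(train, heldout):
        by_r = {}
        for g, r in train.items():
            N_r, T_r = by_r.get(r, (0, 0))
            by_r[r] = (N_r + 1, T_r + heldout.get(g, 0))
        return by_r
    ab, ba = stats(counts_a, counts_b), stats(counts_b, counts_a)
    probs = {}
    for r in set(ab) | set(ba):
        N0, T01 = ab.get(r, (0, 0))
        N1, T10 = ba.get(r, (0, 0))
        probs[r] = (T01 + T10) / (N * (N0 + N1))
    return probs
```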
Good-Turing Estimator
r* = “adjusted frequency”
Nr = number of n-gram-types which occur r times
E(Nr) = “expected value”
E(N_{r+1}) < E(N_r). Typically the adjustment is applied only for r < some constant k, since N_{r+1} is 0 when r is the maximum observed frequency.
r* = (r + 1) · E(N_{r+1}) / E(N_r)

P_GT(w_1 … w_n) = r* / N
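A minimal sketch of the adjusted-count computation, using the observed counts of counts directly in place of the expectations E(N_r) (a common simplification; the cutoff k is illustrative):

```python
from collections import Counter

def good_turing(ngram_counts, k=5):
    """Good-Turing: r* = (r + 1) * N_{r+1} / N_r for small r; large r left as-is.
    Also returns N_1 / N, the probability mass reserved for unseen n-grams."""
    count_of_counts = Counter(ngram_counts.values())   # N_r
    N = sum(ngram_counts.values())                     # total n-gram tokens
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if r < k and count_of_counts.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]
        else:
            adjusted[ngram] = r
    unseen_mass = count_of_counts.get(1, 0) / N
    return adjusted, unseen_mass
```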
Count of counts in Austen corpus
Good-Turing Estimates for Austen Corpus
• N_1 = number of bigrams seen exactly once in the training data = 138,741
• N = 617,091 [number of words in the Austen corpus]
• N_1 / N = 0.2248 [probability mass reserved for unseen bigrams using the Good-Turing approach]
• Space of bigrams is the vocabulary squared: 14,585² = 212,722,225
• Total # of distinct bigrams seen in the training set: 199,252
• Probability estimate for each unseen bigram = 0.2248 / (14,585² − 199,252) ≈ 1.058 × 10⁻⁹
Discounting Methods
First, determine held-out probability
• Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant
• Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion
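Hedged sketches of the two schemes (delta and alpha are illustrative constants; how the freed-up mass is redistributed over unseen n-grams is a separate choice):

```python
def absolute_discount(count, total, delta=0.5):
    """Absolute discounting: subtract a small constant from each observed count."""
    return max(count - delta, 0) / total

def linear_discount(count, total, alpha=0.9):
    """Linear discounting: scale every observed probability by the same proportion."""
    return alpha * (count / total)
```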
Combining Estimators
(Sometimes a trigram model is best, sometimes a bigram model is best, and sometimes a unigram model is best.)
• How can you develop a model to utilize different length n-grams as appropriate?
Simple Linear Interpolation (a.k.a. finite mixture models; a.k.a. deleted interpolation)
• weighted average of unigram, bigram, and trigram probabilities
P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-1}, w_{n-2})

where the λ_i are non-negative and sum to 1
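A minimal sketch, assuming unigram/bigram/trigram estimators are already available (names and weights are illustrative; in practice the weights are tuned on held-out data, e.g. by EM):

```python
def p_interpolated(w, context, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Weighted average of unigram, bigram, and trigram estimates.
    lambdas must be non-negative and sum to 1."""
    l1, l2, l3 = lambdas
    w2, w1 = context                    # context = (w_{n-2}, w_{n-1})
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w2, w1)
```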
Katz’s Backing-Off
• Use the n-gram probability when there is enough training data
  – (when the adjusted count > k; k usually = 0 or 1)
• If not, “back-off” to the (n-1)-gram probability
• (Repeat as needed)
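A simplified back-off sketch (real Katz back-off also discounts the higher-order counts and applies a normalizing alpha so the distribution sums to one; both are omitted here, and the counts layout is illustrative):

```python
def p_backoff(w, context, counts, k=0):
    """Use the longest context whose full n-gram count exceeds k; otherwise
    back off to the shorter context. `counts` maps word tuples of any length
    (including the empty tuple, for the total token count) to frequencies."""
    history = tuple(context) + (w,)
    if counts.get(history, 0) > k or not context:
        denom = counts.get(tuple(context), 0)
        return counts.get(history, 0) / denom if denom else 0.0
    return p_backoff(w, list(context)[1:], counts, k)
```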
Problems with Backing-Off
• If bigram w1 w2 is common
• but trigram w1 w2 w3 is unseen
• may be a meaningful gap, rather than a gap due to chance and scarce data
  – i.e., a “grammatical null”
• May not want to back-off to lower-order probability