Naïve Bayes, Maxent and Neural Models
CMSC 473/673, UMBC
Some slides adapted from 3SLP
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Probabilistic Classification
Discriminatively trained classifier: directly model the posterior p(class | data).
Generatively trained classifier: model the posterior with Bayes rule, p(class | data) ∝ p(data | class) p(class).
Posterior Classification/Decoding: maximum a posteriori (MAP)
Noisy Channel Model Decoding
Posterior Decoding: Probabilistic Text Classification
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language Identification
Sentiment analysis
…
p(class | observed data) = p(observed data | class) p(class) / p(observed data), where:
p(observed data | class) is the class-based likelihood (language model)
p(class) is the prior probability of the class
p(observed data) is the observation likelihood (averaged over all classes)
Noisy Channel Model
What I want to tell you: "sports"
What you actually see: "The Os lost again…"
Decode: hypothesized intents, e.g. "sad stories", "sports"
Rerank: reweight according to what's likely → "sports"
Noisy Channel
Machine translation
Speech-to-text
Spelling correction
Text normalization
Part-of-speech tagging
Morphological analysis
Image captioning
…
Possible (clean) output y, observed (noisy) text x:
argmax_y p(y | x) = argmax_y p(x | y) p(y)
p(y): (clean) language model
p(x | y): observation (noisy) likelihood, the translation/decode model
Use Logarithms
Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
                         Actually Correct      Actually Incorrect
Selected/guessed         True Positive (TP)    False Positive (FP)
Not selected/guessed     False Negative (FN)   True Negative (TN)
A combined measure: F
Weighted (harmonic) average of Precision & Recall: F_β = (1 + β²) P R / (β² P + R)
Balanced F1 measure: β = 1, giving F1 = 2 P R / (P + R)
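A minimal sketch of these metrics in Python (the confusion-matrix counts are hypothetical):

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)   # % of selected items that are correct
    recall = tp / (tp + fn)      # % of correct items that are selected
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# With 8 true positives, 2 false positives, 4 false negatives (beta=1 gives F1):
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
```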
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
The Bag of Words Representation
Bag of Words Representation
γ(document) = c: the classifier maps a bag-of-words representation of the document to a class, e.g. the word counts
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
Naïve Bayes Classifier
Start with Bayes Rule: p(label | text) ∝ p(text | label) p(label)
Q: Are we doing discriminative training or generative training?
A: Generative training.
Naïve Bayes Classifier
Adopt the naïve bag-of-words representation: p(text | label) = ∏_i p(w_i | label), one factor per word (token) w_i.
Assume position doesn't matter.
Assume the feature probabilities are independent given the class.
Multinomial Naïve Bayes: Learning
From the training corpus, extract the Vocabulary.
Calculate the P(c_j) terms: for each c_j in C, let docs_j = all docs with class c_j; P(c_j) = |docs_j| / (total # of docs).
Calculate the P(w_k | c_j) terms: let Text_j = a single doc containing all of docs_j; for each word w_k in the Vocabulary, let n_k = # of occurrences of w_k in Text_j; then p(w_k | c_j) = n_k / |Text_j|, a class unigram LM.

Brill and Banko (2001): with enough data, the classifier may not matter.
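The recipe above can be sketched as follows; the toy corpus and the add-one smoothing of the class unigram LM are illustrative assumptions, not from the slides:

```python
from collections import Counter
import math

# Multinomial Naive Bayes training: class priors plus a class unigram LM.
def train_nb(docs):                      # docs: list of (word list, label)
    vocab = {w for words, _ in docs for w in words}
    prior, word_counts, totals = {}, {}, {}
    for words, c in docs:
        prior[c] = prior.get(c, 0) + 1
        word_counts.setdefault(c, Counter()).update(words)
    for c in prior:
        prior[c] /= len(docs)
        totals[c] = sum(word_counts[c].values())
    def log_p(w, c):                     # add-one smoothed p(w | c)
        return math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
    return prior, log_p

def classify(words, prior, log_p):
    return max(prior, key=lambda c: math.log(prior[c]) + sum(log_p(w, c) for w in words))

docs = [(["fun", "couple", "love"], "pos"), (["fast", "furious", "shoot"], "neg"),
        (["couple", "fly", "fast", "fun"], "pos"), (["furious", "shoot", "shoot"], "neg")]
prior, log_p = train_nb(docs)
print(classify(["fun", "love"], prior, log_p))  # pos
```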
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
But if, as in the previous slides
We use only word features
we use all of the words in the text (not a subset)
Then
Naïve Bayes has an important similarity to language modeling
Naïve Bayes as a Language Model (Sec. 13.2.1)
Which class assigns the higher probability to s = "I love this fun film"?

        Positive Model   Negative Model
I       0.1              0.2
love    0.1              0.001
this    0.01             0.01
fun     0.05             0.005
film    0.1              0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1
5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9
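A sketch of this comparison, using the slide's unigram tables and summing log-probabilities (as the "Use Logarithms" slide advises):

```python
import math

# Two class unigram LMs for s = "I love this fun film".
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def log_prob(sentence, model):
    # Sum of logs instead of a product of small probabilities.
    return sum(math.log(model[w]) for w in sentence)

s = ["I", "love", "this", "fun", "film"]
print(math.exp(log_prob(s, pos)))  # ~5e-7
print(math.exp(log_prob(s, neg)))  # ~1e-9
```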
Summary: Naïve Bayes is Not So Naïve
Very Fast, low storage requirements
Robust to Irrelevant Features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn’t Without Issue
Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data? (automated, more principled)
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Connections to Other Techniques
Log-Linear Models are also known as:
(Multinomial) logistic regression / softmax regression (as statistical regression)
Maximum Entropy models (MaxEnt) (based in information theory)
a form of Generalized Linear Models
viewed as Discriminative Naïve Bayes
very shallow (sigmoidal) neural nets (to be cool today :) )
Maxent Models for Classification: Discriminatively or Generatively Trained
Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes rule.
Maximum Entropy (Log-linear) Models can be either:
discriminatively trained: classify in one go
generatively trained: learn to model language
Document Classification
ATTACKThree people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region .
p( | )ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
• # killed:
• Type:
• Perp:
shot ATTACK
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously woundedas a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
We need to score the different combinations.
Score and Combine Our Possibilities
score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)
COMBINE → posterior probability of ATTACK
Are all of these uncorrelated?
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities
score(ATTACK, doc) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 1

To turn the combined score into something probability-like, apply a SNAP function: SNAP(score(ATTACK, doc)).
Maxent Modeling
p(ATTACK | doc) ∝ some non-negative function of the score.
What function operates on any real number and is never less than 0? f(x) = exp(x)
p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …)
Maxent Modeling
Learn the scores (but we'll declare what combinations should be looked at):
p(ATTACK | doc) ∝ exp(weight1 * occurs1(fatally shot, ATTACK) + weight2 * occurs2(seriously wounded, ATTACK) + weight3 * occurs3(Shining Path, ATTACK) + …)
Maxent Modeling: Feature Functions
Feature functions help extract useful features (characteristics) of the data.
Generally templated; often binary-valued (0 or 1), but can be real-valued.

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK)
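A sketch of the templated binary case; the closure-based `make_occurs` helper is a hypothetical name, not from the slides:

```python
# A template stamps out one binary feature function per (target, type) pair.
def make_occurs(tgt, typ):
    def occurs(target, label):
        # 1 if both the target phrase and the label match the template, else 0.
        return 1 if target == tgt and label == typ else 0
    return occurs

f1 = make_occurs("fatally shot", "ATTACK")
f2 = make_occurs("seriously wounded", "ATTACK")

print(f1("fatally shot", "ATTACK"))       # 1
print(f1("fatally shot", "BUSINESS"))     # 0
print(f2("seriously wounded", "ATTACK"))  # 1
```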
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp(weight1 * applies1(fatally shot, ATTACK) + weight2 * applies2(seriously wounded, ATTACK) + weight3 * applies3(Shining Path, ATTACK) + …)
Q: How do we define Z?

Normalization for Classification
Z = Σ over labels x of exp(θ ⋅ f(x, y))
p(x | y) ∝ exp(θ ⋅ f(x, y)): classify doc y with label x in one go.
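A minimal sketch of classification with this normalization; the weights and feature keys below are made up for illustration:

```python
import math

# Maxent classification: p(label | doc) = exp(theta . f) / Z, with Z
# summing exp-scores over all labels.
def maxent_probs(feats_by_label, theta):
    scores = {lab: sum(theta.get(k, 0.0) * v for k, v in feats.items())
              for lab, feats in feats_by_label.items()}
    m = max(scores.values())                       # shift for numerical stability
    exps = {lab: math.exp(s - m) for lab, s in scores.items()}
    Z = sum(exps.values())
    return {lab: e / Z for lab, e in exps.items()}

theta = {("fatally shot", "ATTACK"): 2.0, ("fatally shot", "BUSINESS"): -1.0}
feats = {"ATTACK": {("fatally shot", "ATTACK"): 1},
         "BUSINESS": {("fatally shot", "BUSINESS"): 1}}
probs = maxent_probs(feats, theta)
print(probs)  # ATTACK gets ~0.95
```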
Normalization for Language Model
A general class-based (X) language model of doc y: here Z must sum over all possible documents, which can be significantly harder in the general case.
Simplifying assumption: maxent n-grams! Normalize over the vocabulary one word at a time.
Understanding Conditioning
Is this a good language model? (no)
Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 11
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
A probabilistic model pθ(x | y) plus an objective (given observations).
Objective = Full Likelihood?
These values can have very small magnitude ➔ underflow
Differentiating this product could be a pain
Logarithms
(0, 1] ➔ (-∞, 0]
Products ➔ Sums
log(ab) = log(a) + log(b)
log(a/b) = log(a) – log(b)
Inverse of exp
log(exp(x)) = x
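A quick demonstration of why this matters: multiplying many small probabilities underflows to 0.0 in floating point, while the equivalent sum of logs stays representable:

```python
import math

# 80 factors of 1e-5: the true product is 1e-400, far below the smallest
# positive double, so the running product underflows to exactly 0.0.
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (underflow)

# The sum of logs is a perfectly ordinary negative number.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # ≈ -921.03
```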
Log-Likelihood
F(θ) = Σ_i log pθ(x_i | y_i) = Σ_i [θ ⋅ f(x_i, y_i) − log Z(y_i)]
Wide range of (negative) numbers; sums are more stable than products.
Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Differentiating this becomes nicer (even though Z depends on θ).
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
How will we optimize F(θ)?
Calculus: find the maximizer θ* where the derivative F'(θ) of F wrt θ is zero.
Example: F(x) = -(x-2)²
Differentiate: F'(x) = -2x + 4
Solve F'(x) = 0: x = 2
Common Derivative Rules
F(θ) vs. θ, with maximizer θ* and derivative F'(θ) of F wrt θ.
What if you can't find the roots? Follow the derivative.
Set t = 0; pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

Gradient = Multi-variable derivative: K-dimensional input, K-dimensional output.
Gradient Ascent: the same iteration, with the gradient ∇F(θt) in place of F'(θt).
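The loop above, run on the earlier example F(x) = -(x-2)² with F'(x) = -2x + 4; the fixed scaling factor ρ is an illustrative assumption:

```python
# Follow-the-derivative ascent: theta_{t+1} = theta_t + rho * F'(theta_t).
def grad_ascent(grad, theta0, rho=0.1, iters=100):
    theta = theta0
    for _ in range(iters):
        theta = theta + rho * grad(theta)
    return theta

# F(x) = -(x-2)^2 has derivative F'(x) = -2x + 4, maximized at x = 2.
theta_star = grad_ascent(lambda x: -2 * x + 4, theta0=0.0)
print(theta_star)  # ≈ 2.0
```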
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Expectations
E[X] = Σ_x x · p(x). Example: number of pieces of candy won on a die roll of 1–6.
Fair die: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Loaded die: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
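The two computations as code, using exact fractions:

```python
from fractions import Fraction as F

# Expected value E[X] = sum_x x * p(x) for a pmf given as {outcome: prob}.
def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

fair = {x: F(1, 6) for x in range(1, 7)}                    # uniform die
loaded = {1: F(1, 2), **{x: F(1, 10) for x in range(2, 7)}} # favors 1

print(float(expectation(fair)))    # 3.5
print(float(expectation(loaded)))  # 2.5
```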
Log-Likelihood (recap)
Wide range of (negative) numbers; sums are more stable.
Differentiating this becomes nicer (even though Z depends on θ). Call the log-likelihood F(θ).
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature f_k in the training data, and
the total value the current model pθ expects for feature f_k (summing over the examples y_i and, within each, over all labels x').
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 6
Log-Likelihood Gradient Derivation
Use the (calculus) chain rule: ∂/∂θ [log g(h(θ))] = (∂ log g / ∂h) ⋅ (∂h / ∂θ).
Applied to F(θ), each example i contributes its observed features f(x_i, y_i) minus an expectation in which the scalar p(x' | y_i) weights the vector of feature functions f(x', y_i).
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Gradient Optimization
Set t = 0; pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} f_k(x_i, y') p(y' | x_i)

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
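A sketch of this gradient (observed minus model-expected feature totals) for a toy maxent classifier; the features, labels, and data are illustrative assumptions:

```python
import math

# Posterior p(y | x) under a log-linear model with feature map f and weights theta.
def p_label(x, labels, f, theta):
    scores = [sum(theta[k] * v for k, v in f(x, y).items()) for y in labels]
    m = max(scores)                              # stability shift
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return {y: e / Z for y, e in zip(labels, exps)}

# dF/dtheta_k = sum_i f_k(x_i, y_i) - sum_i sum_{y'} f_k(x_i, y') p(y' | x_i)
def gradient(data, labels, f, theta):
    grad = {k: 0.0 for k in theta}
    for x, y in data:
        for k, v in f(x, y).items():             # observed feature totals
            grad[k] += v
        p = p_label(x, labels, f, theta)
        for y2 in labels:                        # expected feature totals
            for k, v in f(x, y2).items():
                grad[k] -= p[y2] * v
    return grad

def f(x, y):                                     # one binary feature per (word, label)
    return {(w, y): 1.0 for w in x}

labels = ["pos", "neg"]
theta = {(w, y): 0.0 for w in ["fun", "bad"] for y in labels}
data = [(["fun"], "pos"), (["bad"], "neg")]
g = gradient(data, labels, f, theta)
print(g[("fun", "pos")])  # 0.5 = 1 observed - 0.5 expected (uniform model at theta=0)
```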
Preventing Extreme Values
Naïve Bayes: extreme values are 0 probabilities (addressed by smoothing).
Log-linear models: extreme values are large θ values (addressed by regularization).
(Squared) L2 Regularization: maximize F(θ) − (λ/2) ‖θ‖₂², penalizing large weights.
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 8
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Revisiting the SNAP Function
SNAP = exp, and exp followed by normalization is the softmax: softmax(z)_j = exp(z_j) / Σ_k exp(z_k).
N-gram Language Models
Predict the next word w_i given some context (w_{i-3}, w_{i-2}, w_{i-1}); compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
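A count-based sketch with a toy corpus (the corpus is illustrative, not from the slides):

```python
from collections import Counter

# p(w_i | context) proportional to count(context, w_i), from 4-gram counts.
corpus = "the dog ran . the dog sat . the cat sat".split()
counts = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))

def next_word_probs(context):          # context: tuple of 3 previous words
    cands = {gram[3]: n for gram, n in counts.items() if gram[:3] == context}
    total = sum(cands.values())
    return {w: n / total for w, n in cands.items()}

print(next_word_probs((".", "the", "dog")))  # {'sat': 1.0}
```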
Maxent Language Models
Predict the next word w_i given some context; compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
Neural Language Models
Predict the next word w_i given some context; compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ 𝒇(w_{i-3}, w_{i-2}, w_{i-1}))
Can we learn the feature function(s) for just the context? Can we learn word-specific weights (by type)?
Create/use "distributed representations" of the context words: e_{i-3}, e_{i-2}, e_{i-1}.
Combine these representations (e.g., via a matrix-vector product) into C = f.
Score the combination against per-word weights θ_{w_i} (output embeddings e_w).
"A Neural Probabilistic Language Model," Bengio et al. (2003)
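A minimal forward-pass sketch in the spirit of Bengio et al. (2003); the sizes, tanh nonlinearity, and random parameters are illustrative assumptions:

```python
import numpy as np

# Embed 3 context words, combine with a matrix-vector product, score the
# result against per-word output weights, and softmax over the vocabulary.
rng = np.random.default_rng(0)
V, d, h = 10, 4, 5                     # vocab size, embedding dim, hidden dim
E = rng.normal(size=(V, d))            # input embeddings e_w
W = rng.normal(size=(h, 3 * d))        # combiner for the concatenated context
theta = rng.normal(size=(V, h))        # per-word output weights theta_w

def next_word_dist(w3, w2, w1):
    context = np.concatenate([E[w3], E[w2], E[w1]])
    C = np.tanh(W @ context)           # combined context representation f(context)
    scores = theta @ C                 # theta_w . f(context) for every word w
    scores -= scores.max()             # numerical stability
    p = np.exp(scores)
    return p / p.sum()                 # softmax over the vocabulary

p = next_word_dist(1, 2, 3)
print(p.shape)                         # (10,); entries sum to ~1
```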
Baselines:

LM Name              N-gram   Params.       Test Ppl.
Interpolation        3        ---           336
Kneser-Ney backoff   3        ---           323
Kneser-Ney backoff   5        ---           321
Class-based backoff  3        500 classes   312
Class-based backoff  5        500 classes   312

NPLM results:

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252
"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)