Naïve Bayes, Maxent and Neural Models
CMSC 473/673, UMBC
Some slides adapted from 3SLP


Jul 19, 2020

Transcript
Page 1: Naïve Bayes, Maxent and Neural Models · Naïve Bayes (NB) classification Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language Maximum Entropy

Naïve Bayes, Maxent and Neural Models

CMSC 473/673

UMBC

Some slides adapted from 3SLP

Page 2

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the “naïve” assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 3

Probabilistic Classification

Discriminatively trained classifier

Generatively trained classifier

Directly model the posterior

Model the posterior with Bayes rule

Posterior Classification/Decoding: maximum a posteriori (MAP)

Noisy Channel Model Decoding

Page 4

Posterior Decoding: Probabilistic Text Classification

Assigning subject categories, topics, or genres

Spam detection

Authorship identification

Age/gender identification

Language Identification

Sentiment analysis

p(class | observed data) ∝ p(observed data | class) · p(class)

p(observed data | class): class-based likelihood (language model)
p(class): prior probability of class
p(observed data): observation likelihood (averaged over all classes)

Page 5

Noisy Channel Model

what I want to tell you: “sports”

what you actually see: “The Os lost again…”

Decode: hypothesized intents (“sad stories”, “sports”)

Rerank: reweight according to what’s likely → “sports”

Page 6

Noisy Channel

Machine translation

Speech-to-text

Spelling correction

Text normalization

Part-of-speech tagging

Morphological analysis

Image captioning

possible (clean) output

observed (noisy) text

(clean) language model

observation (noisy) likelihood

translation/decode model

Page 7

Use Logarithms

Page 8

Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

                         Actually Correct       Actually Incorrect
Selected/guessed         True Positive (TP)     False Positive (FP)
Not selected/guessed     False Negative (FN)    True Negative (TN)
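In code, all three metrics fall directly out of the confusion-matrix counts. A minimal sketch (the counts in the example calls are made up):

```python
def accuracy(tp, fp, fn, tn):
    # fraction of all items classified correctly
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # fraction of selected (guessed-positive) items that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of truly correct items that were selected
    return tp / (tp + fn)

# toy counts: 8 TP, 2 FP, 4 FN, 86 TN
print(accuracy(8, 2, 4, 86))   # 0.94
print(precision(8, 2))         # 0.8
print(recall(8, 4))            # ≈ 0.667
```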

Page 9

A combined measure: F

Weighted (harmonic) average of Precision (P) and Recall (R):

F_β = (1 + β²) · P · R / (β² · P + R)

Balanced F1 measure (β = 1): F1 = 2 · P · R / (P + R)
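The weighted harmonic mean can be sketched as a small helper; the name `f_beta` and the toy inputs are illustrative, not from the slides:

```python
def f_beta(precision, recall, beta=1.0):
    # weighted harmonic mean of precision and recall;
    # beta > 1 favors recall, beta < 1 favors precision
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.5))  # balanced F1 ≈ 0.615
```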

Page 10

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the “naïve” assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 11

The Bag of Words Representation

Page 12

The Bag of Words Representation

Page 13

The Bag of Words Representation


Page 14

Bag of Words Representation

γ(document) = c

seen 2
sweet 1
whimsical 1
recommend 1
happy 1
… …

classifier
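A bag-of-words representation is just a word-count multiset, which `collections.Counter` gives directly (whitespace tokenization here is a simplifying assumption):

```python
from collections import Counter

def bag_of_words(text):
    # position is discarded: a document becomes an unordered bag of word counts
    return Counter(text.lower().split())

bow = bag_of_words("I have seen it and seen it again")
print(bow["seen"])  # 2
```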

Page 15

Naïve Bayes Classifier

Start with Bayes Rule

label text

Q: Are we doing discriminative training or generative training?

Page 16

Naïve Bayes Classifier

Start with Bayes Rule

label text

Q: Are we doing discriminative training or generative training?

A: generative training

Page 17

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: model each word independently (a product over word tokens i)

label each word (token)

Page 18

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: model each word independently (a product over word tokens i)

Assume position doesn’t matter

label each word (token)

Page 19

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: model each word independently (a product over word tokens i)

Assume position doesn’t matter

Assume the feature probabilities are independent given the class: p(x₁, …, xₙ | c) = ∏ᵢ p(xᵢ | c)

label each word (token)
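Under these assumptions, classification reduces to an arg max over classes of the log-prior plus summed per-word log-likelihoods. A toy sketch (the probabilities below are made up for illustration):

```python
import math

# made-up toy model for illustration
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"fun": 0.05, "boring": 0.01},
    "neg": {"fun": 0.005, "boring": 0.1},
}

def nb_classify(words):
    # naive assumption: words are independent given the class, so the
    # log-posterior (up to a constant) is log P(c) + sum_i log P(w_i | c)
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
    return max(priors, key=log_score)

print(nb_classify(["fun"]))     # pos
print(nb_classify(["boring"]))  # neg
```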

Page 20

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Page 21

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Calculate P(cj) terms:
For each cj in C do
  docsj = all docs with class = cj
  P(cj) = |docsj| / (total # of documents)

Page 22

Brill and Banko (2001)

With enough data, the classifier may not matter

Page 23

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Calculate P(cj) terms:
For each cj in C do
  docsj = all docs with class = cj

Calculate P(wk | cj) terms:
Textj = single doc containing all of docsj
For each word wk in Vocabulary
  nk = # of occurrences of wk in Textj
  p(wk | cj) = (nk + 1) / (n + |Vocabulary|)    (add-1-smoothed class unigram LM; n = # of word tokens in Textj)
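The procedure above can be sketched directly; the add-1 smoothing matches the slide's class unigram LM, while the helper name `train_nb` and the toy corpus are illustrative:

```python
from collections import Counter

def train_nb(docs):
    # docs: list of (word_list, class_label) pairs
    # returns class priors and add-1-smoothed class unigram LMs
    vocab = {w for words, _ in docs for w in words}
    classes = {c for _, c in docs}
    priors, cond = {}, {}
    for c in classes:
        class_docs = [words for words, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)
        counts = Counter(w for words in class_docs for w in words)  # Textj counts
        n = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, cond

priors, cond = train_nb([(["good", "fun"], "pos"), (["bad"], "neg")])
print(priors["pos"])        # 0.5
print(cond["pos"]["good"])  # (1+1)/(2+3) = 0.4
```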

Page 24

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature

But if, as in the previous slides

We use only word features

we use all of the words in the text (not a subset)

Then

Naïve Bayes has an important similarity to language modeling

Page 25

Naïve Bayes as a Language Model

Sec.13.2.1

        Positive Model   Negative Model
I          0.1              0.2
love       0.1              0.001
this       0.01             0.01
fun        0.05             0.005
film       0.1              0.1

Page 26

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s = “I love this fun film”

Sec. 13.2.1

        Positive Model   Negative Model
I          0.1              0.2
love       0.1              0.001
this       0.01             0.01
fun        0.05             0.005
film       0.1              0.1

Page 27

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

        Positive Model   Negative Model
I          0.1              0.2
love       0.1              0.001
this       0.01             0.01
fun        0.05             0.005
film       0.1              0.1

s = “I love this fun film”

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1

Sec. 13.2.1

Page 28

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

        Positive Model   Negative Model
I          0.1              0.2
love       0.1              0.001
this       0.01             0.01
fun        0.05             0.005
film       0.1              0.1

s = “I love this fun film”

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1

5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9

Sec. 13.2.1
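The two products can be checked numerically:

```python
from math import prod

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".split()
p_pos = prod(pos[w] for w in s)   # 5e-07
p_neg = prod(neg[w] for w in s)   # 1e-09
print(p_pos > p_neg)              # True: the positive class model wins
```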

Page 29

Summary: Naïve Bayes is Not So Naïve

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Page 30

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data? (automated, more principled)

Page 31

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the “naïve” assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 32

Connections to Other Techniques

Log-Linear Models are known in different communities as:

(Multinomial) logistic regression
Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets

(variously viewed as statistical regression, as a form of GLM, as based in information theory, or as what's cool today :) )

Page 33

Maxent Models for Classification: Discriminatively or Generatively Trained

Discriminatively trained classifier

Generatively trained classifier

Directly model the posterior

Model the posterior with Bayes rule

Page 34

Maximum Entropy (Log-linear) Models

discriminatively trained: classify in one go

Page 35

Maximum Entropy (Log-linear) Models

generatively trained: learn to model language

Page 36

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc)

Page 37

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Document Classification

ATTACK

• # killed:

• Type:

• Perp:

shot ATTACK

Page 38

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Page 43

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

Page 44

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

are all of these uncorrelated?

…scorek(department, ATTACK)

Page 45

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

Page 46

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score(doc, ATTACK) =

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

Page 47

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 1

Page 48

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

Page 49

What function…

operates on any real number?

is never less than 0?

Page 50

What function…

operates on any real number?

is never less than 0?

f(x) = exp(x)

Page 51

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

Page 52

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) +
                       score2(seriously wounded, ATTACK) +
                       score3(Shining Path, ATTACK) + …)

Page 53

Maxent Modeling

Learn the scores (but we’ll declare which combinations should be looked at)

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) +
                       score2(seriously wounded, ATTACK) +
                       score3(Shining Path, ATTACK) + …)

Page 54

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight1 * occurs1(fatally shot, ATTACK) +
                       weight2 * occurs2(seriously wounded, ATTACK) +
                       weight3 * occurs3(Shining Path, ATTACK) + …)
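The weighted-sum-then-exp score can be sketched with made-up numbers (the feature names follow the slide; the weight values are invented for illustration):

```python
import math

# invented weights for three binary feature functions paired with the label ATTACK
weights = {"fatally shot": 1.2, "seriously wounded": 0.9, "Shining Path": 2.0}
occurs  = {"fatally shot": 1,   "seriously wounded": 1,   "Shining Path": 1}  # all fire

score = sum(weights[f] * occurs[f] for f in weights)  # weighted sum of features
unnormalized = math.exp(score)  # ∝ p(ATTACK | doc), before dividing by Z
print(score)  # 4.1
```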

Page 55

exp(weight1 * occurs1(fatally shot, ATTACK) +
    weight2 * occurs2(seriously wounded, ATTACK) +
    weight3 * occurs3(Shining Path, ATTACK) + …)

Maxent Modeling: Feature Functions

Feature functions help extract useful features (characteristics) of the data

Generally templated

Often binary-valued (0 or 1), but can be real-valued

occurs_{target,type}(fatally shot, ATTACK) =
  1, if target == fatally shot and type == ATTACK
  0, otherwise

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

p(ATTACK | doc) ∝

binary

Page 56

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data

Generally templated

Often binary-valued (0 or 1), but can be real-valued

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) =
  1, if target == fatally shot and type == ATTACK
  0, otherwise

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) =
  log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued: ???

Page 57

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data

Generally templated

Often binary-valued (0 or 1), but can be real-valued

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) =
  1, if target == fatally shot and type == ATTACK
  0, otherwise

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) =
  log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)
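The different flavors of feature function can be sketched as small helpers (the function names and hard-coded target/label are illustrative):

```python
import math

def occurs_binary(target, label):
    # templated binary feature: value 1 only for the (fatally shot, ATTACK) pair
    return 1 if target == "fatally shot" and label == "ATTACK" else 0

def occurs_log_prob(p_target_given_label):
    # non-templated real-valued feature: log-probability of the target under the class LM
    return math.log(p_target_given_label)

def occurs_count(text, target):
    # non-templated count-valued feature: number of occurrences of the target in the text
    return text.count(target)

print(occurs_binary("fatally shot", "ATTACK"))  # 1
print(occurs_binary("fatally shot", "SALE"))    # 0
```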

Page 58

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) = (1/Z) exp(weight1 * applies1(fatally shot, ATTACK) +
                            weight2 * applies2(seriously wounded, ATTACK) +
                            weight3 * applies3(Shining Path, ATTACK) + …)

Q: How do we define Z?

Page 59

Normalization for Classification

p(x | y) ∝ exp(θ ⋅ f(x, y))    classify doc y with label x in one go

Z = Σ_{label x} exp(weight1 * occurs1(fatally shot, x) +
                    weight2 * occurs2(seriously wounded, x) +
                    weight3 * occurs3(Shining Path, x) + …)
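Dividing each exponentiated score by Z is exactly a softmax over the labels. A sketch with invented scores:

```python
import math

def posterior(scores):
    # scores: {label x: theta · f(x, y)}; Z sums exp(score) over all labels
    exps = {x: math.exp(s) for x, s in scores.items()}
    Z = sum(exps.values())
    return {x: e / Z for x, e in exps.items()}

p = posterior({"ATTACK": 2.0, "NOT-ATTACK": 0.5})  # toy scores
print(sum(p.values()))  # 1.0 (up to float rounding): it's a proper distribution
```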

Page 60

Normalization for Language Model

general class-based (X) language model of doc y

Page 61

Normalization for Language Model

Can be significantly harder in the general case

general class-based (X) language model of doc y

Page 62

Normalization for Language Model

Can be significantly harder in the general case

Simplifying assumption: maxent n-grams!

general class-based (X) language model of doc y

Page 63

Understanding Conditioning

Is this a good language model?

Page 64

Understanding Conditioning

Is this a good language model?

Page 65

Understanding Conditioning

Is this a good language model? (no)

Page 66

Understanding Conditioning

Is this a good posterior classifier? (no)

Page 67

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 11

Page 68

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the “naïve” assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 69

pθ(x | y): probabilistic model

objective (given observations)

Page 70

Objective = Full Likelihood?

These values can have very small magnitude ➔ underflow

Differentiating this product could be a pain

Page 71

Logarithms

(0, 1] ➔ (-∞, 0]

Products ➔ Sums

log(ab) = log(a) + log(b)

log(a/b) = log(a) – log(b)

Inverse of exp

log(exp(x)) = x
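These identities are easy to spot-check numerically:

```python
import math

a, b = 0.003, 0.0007  # small probabilities in (0, 1]

assert math.isclose(math.log(a * b), math.log(a) + math.log(b))  # products become sums
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))  # quotients become differences
assert math.isclose(math.log(math.exp(2.5)), 2.5)                # log inverts exp
assert math.log(a) < 0                                           # (0, 1] maps into (-inf, 0]
```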

Page 72

Log-Likelihood

Differentiating this becomes nicer (even though Z depends on θ)

Wide range of (negative) numbers

Sums are more stable

Products ➔ Sums

log(ab) = log(a) + log(b)

log(a/b) = log(a) – log(b)

Page 73

Log-Likelihood

Wide range of (negative) numbers

Sums are more stable

Differentiating this becomes nicer (even though Z depends on θ)

Inverse of exp: log(exp(x)) = x

Page 74

Log-Likelihood

Wide range of (negative) numbers

Sums are more stable

= F(θ)

Page 75

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the “naïve” assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 76

How will we optimize F(θ)?

Calculus

Page 77

[Plot: F(θ) as a function of θ]

Page 78

[Plot: F(θ) vs. θ, with maximum at θ*]

Page 79

[Plot: F(θ) and its derivative F’(θ) with respect to θ; maximum at θ*]

Page 80

Example

F’(x) = -2x + 4

F(x) = -(x-2)2

differentiate

Solve F’(x) = 0

x = 2

Page 81:

Common Derivative Rules

Page 82–88:

[Plot: F(θ) over θ, its derivative F′(θ), and the maximizer θ*; successive iterates θ0, θ1, θ2, θ3 with values y0, y1, y2, y3 and derivatives g0, g1, g2 climbing toward θ*]

What if you can't find the roots? Follow the derivative.

Set t = 0. Pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F′(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt · gt
5. Set t += 1
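The follow-the-derivative loop can be sketched in a few lines. It maximizes the slides' running example F(x) = −(x − 2)², whose derivative is F′(x) = −2x + 4; the fixed scaling factor ρ and starting point are illustrative choices, not from the slides.

```python
def gradient_ascent(F_prime, theta0, rho=0.1, iters=100):
    """Follow the derivative uphill: theta_{t+1} = theta_t + rho * F'(theta_t)."""
    theta = theta0
    for _ in range(iters):
        g = theta_next = None          # (clarity only)
        g = F_prime(theta)             # step 2: derivative at the current theta
        theta = theta + rho * g        # step 4: move in the direction of ascent
    return theta

# Running example: F(x) = -(x-2)^2, so F'(x) = -2x + 4, maximized at x = 2.
F = lambda x: -(x - 2) ** 2
F_prime = lambda x: -2 * x + 4
theta_star = gradient_ascent(F_prime, theta0=0.0)
```

With a constant step size this converges geometrically here; in practice ρt is often decayed over time.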

Page 89:

Gradient = Multi-variable derivative

K-dimensional input

K-dimensional output

Page 90:

Gradient Ascent


Page 96:

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 97:

Expectations

[Figure: uniform distribution over number of pieces of candy, outcomes 1–6]

E[X] = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

Page 98:

Expectations

[Figure: skewed distribution over number of pieces of candy, outcomes 1–6]

E[X] = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5
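Both expectations can be checked directly; the weights below copy the slides' fair and loaded distributions.

```python
def expectation(dist):
    """E[X] = sum over outcomes x of p(x) * x for a discrete distribution."""
    return sum(p * x for x, p in dist.items())

fair = {x: 1 / 6 for x in range(1, 7)}                          # uniform die
loaded = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}

e_fair = expectation(fair)       # 3.5
e_loaded = expectation(loaded)   # 2.5
```

Shifting probability mass toward 1 pulls the expectation down from 3.5 to 2.5, exactly as on the slides.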


Page 101:

Log-Likelihood

Wide range of (negative) numbers

Sums are more stable

Differentiating this becomes nicer (even though Z depends on θ)

= F(θ)

Page 102–104:

Log-Likelihood Gradient

Each component k is the difference between:
the total value of feature fk in the training data, and
the total value the current model pθ expects for feature fk

Page 105:

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 6

Page 106–107:

Log-Likelihood Gradient Derivation

Page 108:

Log-Likelihood Gradient Derivation

Use the (calculus) chain rule:

∂/∂θ log g(h(θ)) = (∂ log g/∂h) · (∂h/∂θ)

Here p(x′ | yi) is a scalar and h(θ) is a vector of functions.

Page 109:

Log-Likelihood Gradient Derivation

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?

Page 110:

Gradient Optimization

Set t = 0. Pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F′(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt · gt
5. Set t += 1

∂F/∂θk = Σi fk(xi, yi) − Σi Σy′ fk(xi, y′) p(y′ | xi)
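Each gradient component is the observed feature total minus the model-expected feature total, and can be computed directly from the formula. The two-feature, two-label dataset below is invented purely for illustration.

```python
import math

def posterior(theta, f, x, labels):
    """Log-linear posterior: p(y | x) proportional to exp(theta . f(x, y))."""
    scores = {y: math.exp(sum(t * v for t, v in zip(theta, f(x, y)))) for y in labels}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

def gradient(theta, f, data, labels, K):
    """dF/dtheta_k = sum_i f_k(x_i, y_i) - sum_i sum_y' f_k(x_i, y') p(y' | x_i)."""
    grad = [0.0] * K
    for x, y in data:
        post = posterior(theta, f, x, labels)
        for k in range(K):
            grad[k] += f(x, y)[k]                                    # observed total
            grad[k] -= sum(post[y2] * f(x, y2)[k] for y2 in labels)  # expected total
    return grad

# Hypothetical example: indicator features pairing a word with a label.
labels = ["pos", "neg"]
f = lambda x, y: [1.0 if (x == "good" and y == "pos") else 0.0,
                  1.0 if (x == "bad" and y == "neg") else 0.0]
data = [("good", "pos"), ("bad", "neg")]
g = gradient([0.0, 0.0], f, data, labels, K=2)
```

At θ = 0 the posterior is uniform, so each component is 1 (observed) minus 0.5 (expected): the gradient pushes both weights up.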

Page 111:

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?

Page 112–114:

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities.
Log-linear models: extreme values are large θ values → regularization.

Page 115:

(Squared) L2 Regularization
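The slide's formula did not survive extraction; a standard form of the squared-L2-regularized objective, and its effect on the gradient, is:

```latex
F_{\text{reg}}(\theta) = \sum_i \log p_\theta(y_i \mid x_i) \;-\; \frac{\lambda}{2}\lVert\theta\rVert_2^2
\qquad
\frac{\partial F_{\text{reg}}}{\partial \theta_k} = \frac{\partial F}{\partial \theta_k} - \lambda\,\theta_k
```

The penalty shrinks each θk toward 0, which is exactly how large θ values are kept in check.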

Page 116:

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 8

Page 117:

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Page 118–119:

Revisiting the SNAP Function

softmax
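The softmax itself, a "soft" version of argmax, is a one-liner: exponentiate and normalize. Subtracting the max before exponentiating is a standard numerical-stability trick, not something from the slide.

```python
import math

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

Larger scores get larger probabilities, but every entry stays strictly positive and the outputs sum to 1.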

Page 120–122:

N-gram Language Models

predict the next word wi given some context wi−3, wi−2, wi−1…

compute beliefs about what is likely:

p(wi | wi−3, wi−2, wi−1) ∝ count(wi−3, wi−2, wi−1, wi)
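The count-and-normalize estimate above can be sketched with a toy corpus (the sentence below is invented for illustration).

```python
from collections import Counter

def train_ngram(tokens, n=4):
    """Collect counts of n-grams; p(w | context) is proportional to count(context + (w,))."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def p_next(counts, context, w):
    """Conditional probability of w given an (n-1)-word context."""
    num = counts[tuple(context) + (w,)]
    den = sum(c for ng, c in counts.items() if ng[:-1] == tuple(context))
    return num / den if den else 0.0

corpus = "the cat sat on the mat the cat sat on the rug".split()
counts = train_ngram(corpus, n=4)
p = p_next(counts, ("sat", "on", "the"), "mat")   # "mat" and "rug" each follow once
```

Summing matching counts in the denominator is the normalization hidden inside the "∝" on the slide.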

Page 123:

Maxent Language Models

predict the next word wi given some context wi−3, wi−2, wi−1…

compute beliefs about what is likely:

p(wi | wi−3, wi−2, wi−1) ∝ softmax(θ · f(wi−3, wi−2, wi−1, wi))

Page 124:

Neural Language Models

predict the next word wi given some context wi−3, wi−2, wi−1…

compute beliefs about what is likely:

p(wi | wi−3, wi−2, wi−1) ∝ softmax(θ · f(wi−3, wi−2, wi−1, wi))

can we learn the feature function(s)?

Page 125:

Neural Language Models

p(wi | wi−3, wi−2, wi−1) ∝ softmax(θwi · f(wi−3, wi−2, wi−1))

can we learn the feature function(s) for just the context?

can we learn word-specific weights (by type)?

Page 126–128:

Neural Language Models

p(wi | wi−3, wi−2, wi−1) ∝ softmax(θwi · f(wi−3, wi−2, wi−1))

create/use "distributed representations": embed each context word as a vector ei−3, ei−2, ei−1 (and each candidate word as ew)

combine these representations into f via a matrix–vector product with C

score each candidate word with its weights θwi

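A minimal forward pass in the spirit of this architecture (embed, combine, score, softmax) can be sketched as follows; the dimensions and random weights are made up for illustration, and the real model is trained rather than randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, ctx = 10, 4, 5, 3          # vocab size, embedding dim, hidden dim, context length

E = rng.normal(size=(V, d))         # word embeddings ("distributed representations")
C = rng.normal(size=(h, ctx * d))   # combines the concatenated context embeddings
theta = rng.normal(size=(V, h))     # per-word output weights theta_{w}

def nplm_probs(context_ids):
    """p(w_i | context) = softmax over all words w of theta_w . f(context)."""
    e = np.concatenate([E[i] for i in context_ids])  # e_{i-3}, e_{i-2}, e_{i-1}
    f = np.tanh(C @ e)                               # matrix-vector product + nonlinearity
    scores = theta @ f                               # one score per vocabulary word
    exps = np.exp(scores - scores.max())             # stable softmax
    return exps / exps.sum()

probs = nplm_probs([1, 2, 3])
```

Unlike the count-based model, the "features" f here are learned jointly with the embeddings and output weights by gradient ascent on the log-likelihood.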

Page 130–132:

"A Neural Probabilistic Language Model," Bengio et al. (2003)

Baselines:

LM Name              N-gram  Params.      Test Ppl.
Interpolation        3       ---          336
Kneser-Ney backoff   3       ---          323
Kneser-Ney backoff   5       ---          321
Class-based backoff  3       500 classes  312
Class-based backoff  5       500 classes  312

NPLM:

N-gram  Word Vector Dim.  Hidden Dim.  Mix with non-neural LM  Ppl.
5       60                50           No                      268
5       60                50           Yes                     257
5       30                100          No                      276
5       30                100          Yes                     252

"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)