Naïve Bayes, Maxent and Neural Models
CMSC 473/673, UMBC
Some slides adapted from 3SLP
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Probabilistic Classification
Discriminatively trained classifier: directly model the posterior p(class | data).
Generatively trained classifier: model the posterior with Bayes rule, p(class | data) ∝ p(data | class) p(class).
Posterior Classification/Decoding: maximum a posteriori (MAP)
Noisy Channel Model Decoding
Posterior Decoding: Probabilistic Text Classification
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language Identification
Sentiment analysis
…
p(class | observed data) = p(observed data | class) p(class) / p(observed data), where:
p(observed data | class) is the class-based likelihood (language model)
p(class) is the prior probability of the class
p(observed data) is the observation likelihood (averaged over all classes)
Noisy Channel Model
What I want to tell you: "sports"
What you actually see: "The Os lost again…"
Decode: hypothesized intents, e.g. "sad stories", "sports"
Rerank: reweight according to what's likely → "sports"
Noisy Channel
Machine translation
Speech-to-text
Spelling correction
Text normalization
Part-of-speech tagging
Morphological analysis
Image captioning
…
Possible (clean) output y, observed (noisy) text x:
argmax_y p(y | x) = argmax_y p(x | y) p(y)
p(y): (clean) language model
p(x | y): observation (noisy) likelihood, the translation/decode model
Use Logarithms
Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
                         Actually Correct      Actually Incorrect
Selected/guessed         True Positive (TP)    False Positive (FP)
Not selected/guessed     False Negative (FN)   True Negative (TN)
A combined measure: F
Weighted (harmonic) average of Precision & Recall: F_β = (1 + β²) P R / (β² P + R)
Balanced F1 measure: β = 1, giving F1 = 2 P R / (P + R)
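A minimal sketch of these metrics in Python (the confusion-matrix counts are hypothetical):

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)   # % of selected items that are correct
    recall = tp / (tp + fn)      # % of correct items that are selected
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# With 8 true positives, 2 false positives, 4 false negatives (beta=1 gives F1):
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
```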
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
The Bag of Words Representation
Bag of Words Representation
γ(document) = c: the classifier maps a bag-of-words representation of the document to a class, e.g. the word counts
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
Naïve Bayes Classifier
Start with Bayes Rule: p(label | text) ∝ p(text | label) p(label)
Q: Are we doing discriminative training or generative training?
A: Generative training.
Naïve Bayes Classifier
Adopt the naïve bag-of-words representation: p(text | label) = ∏_i p(w_i | label), one factor per word (token) w_i.
Assume position doesn't matter.
Assume the feature probabilities are independent given the class.
Multinomial Naïve Bayes: Learning
From the training corpus, extract the Vocabulary.
Calculate the P(c_j) terms: for each c_j in C, let docs_j = all docs with class c_j; P(c_j) = |docs_j| / (total # of docs).
Calculate the P(w_k | c_j) terms: let Text_j = a single doc containing all of docs_j; for each word w_k in the Vocabulary, let n_k = # of occurrences of w_k in Text_j; then p(w_k | c_j) = n_k / |Text_j|, a class unigram LM.

Brill and Banko (2001): with enough data, the classifier may not matter.
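The recipe above can be sketched as follows; the toy corpus and the add-one smoothing of the class unigram LM are illustrative assumptions, not from the slides:

```python
from collections import Counter
import math

# Multinomial Naive Bayes training: class priors plus a class unigram LM.
def train_nb(docs):                      # docs: list of (word list, label)
    vocab = {w for words, _ in docs for w in words}
    prior, word_counts, totals = {}, {}, {}
    for words, c in docs:
        prior[c] = prior.get(c, 0) + 1
        word_counts.setdefault(c, Counter()).update(words)
    for c in prior:
        prior[c] /= len(docs)
        totals[c] = sum(word_counts[c].values())
    def log_p(w, c):                     # add-one smoothed p(w | c)
        return math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
    return prior, log_p

def classify(words, prior, log_p):
    return max(prior, key=lambda c: math.log(prior[c]) + sum(log_p(w, c) for w in words))

docs = [(["fun", "couple", "love"], "pos"), (["fast", "furious", "shoot"], "neg"),
        (["couple", "fly", "fast", "fun"], "pos"), (["furious", "shoot", "shoot"], "neg")]
prior, log_p = train_nb(docs)
print(classify(["fun", "love"], prior, log_p))  # pos
```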
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
But if, as in the previous slides
We use only word features
we use all of the words in the text (not a subset)
Then
Naïve Bayes has an important similarity to language modeling
Naïve Bayes as a Language Model (Sec. 13.2.1)
Which class assigns the higher probability to s = "I love this fun film"?

        Positive Model   Negative Model
I       0.1              0.2
love    0.1              0.001
this    0.01             0.01
fun     0.05             0.005
film    0.1              0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1
5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9
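A sketch of this comparison, using the slide's unigram tables and summing log-probabilities (as the "Use Logarithms" slide advises):

```python
import math

# Two class unigram LMs for s = "I love this fun film".
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def log_prob(sentence, model):
    # Sum of logs instead of a product of small probabilities.
    return sum(math.log(model[w]) for w in sentence)

s = ["I", "love", "this", "fun", "film"]
print(math.exp(log_prob(s, pos)))  # ~5e-7
print(math.exp(log_prob(s, neg)))  # ~1e-9
```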
Summary: Naïve Bayes is Not So Naïve
Very Fast, low storage requirements
Robust to Irrelevant Features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn’t Without Issue
Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data? (automated, more principled)
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Connections to Other Techniques
Log-Linear Models are also known as:
(Multinomial) logistic regression / softmax regression (as statistical regression)
Maximum Entropy models (MaxEnt) (based in information theory)
a form of Generalized Linear Models
viewed as Discriminative Naïve Bayes
very shallow (sigmoidal) neural nets (to be cool today :) )
Maxent Models for Classification: Discriminatively or Generatively Trained
Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes rule.
Maximum Entropy (Log-linear) Models can be either:
discriminatively trained: classify in one go
generatively trained: learn to model language
Document Classification
ATTACKThree people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region .
p( | )ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
• # killed:
• Type:
• Perp:
shot ATTACK
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously woundedas a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
Document Classification
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
ATTACK
Three people have beenfatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region.
We need to score the different combinations.
Score and Combine Our Possibilities
score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)
COMBINE → posterior probability of ATTACK
Are all of these uncorrelated?
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities
score(ATTACK, doc) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 1

To turn the combined score into something probability-like, apply a SNAP function: SNAP(score(ATTACK, doc)).
Maxent Modeling
p(ATTACK | doc) ∝ some non-negative function of the score.
What function operates on any real number and is never less than 0? f(x) = exp(x)
p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …)
Maxent Modeling
Learn the scores (but we'll declare what combinations should be looked at):
p(ATTACK | doc) ∝ exp(weight1 * occurs1(fatally shot, ATTACK) + weight2 * occurs2(seriously wounded, ATTACK) + weight3 * occurs3(Shining Path, ATTACK) + …)
Maxent Modeling: Feature Functions
Feature functions help extract useful features (characteristics) of the data.
Generally templated; often binary-valued (0 or 1), but can be real-valued.

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK)
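A sketch of the templated binary case; the closure-based `make_occurs` helper is a hypothetical name, not from the slides:

```python
# A template stamps out one binary feature function per (target, type) pair.
def make_occurs(tgt, typ):
    def occurs(target, label):
        # 1 if both the target phrase and the label match the template, else 0.
        return 1 if target == tgt and label == typ else 0
    return occurs

f1 = make_occurs("fatally shot", "ATTACK")
f2 = make_occurs("seriously wounded", "ATTACK")

print(f1("fatally shot", "ATTACK"))       # 1
print(f1("fatally shot", "BUSINESS"))     # 0
print(f2("seriously wounded", "ATTACK"))  # 1
```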
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp(weight1 * applies1(fatally shot, ATTACK) + weight2 * applies2(seriously wounded, ATTACK) + weight3 * applies3(Shining Path, ATTACK) + …)
Q: How do we define Z?

Normalization for Classification
Z = Σ over labels x of exp(θ ⋅ f(x, y))
p(x | y) ∝ exp(θ ⋅ f(x, y)): classify doc y with label x in one go.
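A minimal sketch of classification with this normalization; the weights and feature keys below are made up for illustration:

```python
import math

# Maxent classification: p(label | doc) = exp(theta . f) / Z, with Z
# summing exp-scores over all labels.
def maxent_probs(feats_by_label, theta):
    scores = {lab: sum(theta.get(k, 0.0) * v for k, v in feats.items())
              for lab, feats in feats_by_label.items()}
    m = max(scores.values())                       # shift for numerical stability
    exps = {lab: math.exp(s - m) for lab, s in scores.items()}
    Z = sum(exps.values())
    return {lab: e / Z for lab, e in exps.items()}

theta = {("fatally shot", "ATTACK"): 2.0, ("fatally shot", "BUSINESS"): -1.0}
feats = {"ATTACK": {("fatally shot", "ATTACK"): 1},
         "BUSINESS": {("fatally shot", "BUSINESS"): 1}}
probs = maxent_probs(feats, theta)
print(probs)  # ATTACK gets ~0.95
```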
Normalization for Language Model
A general class-based (X) language model of doc y: here Z must sum over all possible documents, which can be significantly harder in the general case.
Simplifying assumption: maxent n-grams! Normalize over the vocabulary one word at a time.
Understanding Conditioning
Is this a good language model? (no)
Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 11
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
A probabilistic model pθ(x | y) plus an objective (given observations).
Objective = Full Likelihood?
These values can have very small magnitude ➔ underflow
Differentiating this product could be a pain
Logarithms
(0, 1] ➔ (-∞, 0]
Products ➔ Sums
log(ab) = log(a) + log(b)
log(a/b) = log(a) – log(b)
Inverse of exp
log(exp(x)) = x
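A quick demonstration of why this matters: multiplying many small probabilities underflows to 0.0 in floating point, while the equivalent sum of logs stays representable:

```python
import math

# 80 factors of 1e-5: the true product is 1e-400, far below the smallest
# positive double, so the running product underflows to exactly 0.0.
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (underflow)

# The sum of logs is a perfectly ordinary negative number.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # ≈ -921.03
```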
Log-Likelihood
F(θ) = Σ_i log pθ(x_i | y_i) = Σ_i [θ ⋅ f(x_i, y_i) − log Z(y_i)]
Wide range of (negative) numbers; sums are more stable than products.
Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Differentiating this becomes nicer (even though Z depends on θ).
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
How will we optimize F(θ)?
Calculus: find the maximizer θ* where the derivative F'(θ) of F wrt θ is zero.
Example: F(x) = -(x-2)²
Differentiate: F'(x) = -2x + 4
Solve F'(x) = 0: x = 2
Common Derivative Rules
F(θ) vs. θ, with maximizer θ* and derivative F'(θ) of F wrt θ.
What if you can't find the roots? Follow the derivative.
Set t = 0; pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

Gradient = Multi-variable derivative: K-dimensional input, K-dimensional output.
Gradient Ascent: the same iteration, with the gradient ∇F(θt) in place of F'(θt).
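The loop above, run on the earlier example F(x) = -(x-2)² with F'(x) = -2x + 4; the fixed scaling factor ρ is an illustrative assumption:

```python
# Follow-the-derivative ascent: theta_{t+1} = theta_t + rho * F'(theta_t).
def grad_ascent(grad, theta0, rho=0.1, iters=100):
    theta = theta0
    for _ in range(iters):
        theta = theta + rho * grad(theta)
    return theta

# F(x) = -(x-2)^2 has derivative F'(x) = -2x + 4, maximized at x = 2.
theta_star = grad_ascent(lambda x: -2 * x + 4, theta0=0.0)
print(theta_star)  # ≈ 2.0
```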
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Expectations
E[X] = Σ_x x · p(x). Example: number of pieces of candy won on a die roll of 1–6.
Fair die: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Loaded die: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
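The two computations as code, using exact fractions:

```python
from fractions import Fraction as F

# Expected value E[X] = sum_x x * p(x) for a pmf given as {outcome: prob}.
def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

fair = {x: F(1, 6) for x in range(1, 7)}                    # uniform die
loaded = {1: F(1, 2), **{x: F(1, 10) for x in range(2, 7)}} # favors 1

print(float(expectation(fair)))    # 3.5
print(float(expectation(loaded)))  # 2.5
```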
Log-Likelihood (recap)
Wide range of (negative) numbers; sums are more stable.
Differentiating this becomes nicer (even though Z depends on θ). Call the log-likelihood F(θ).
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature f_k in the training data, and
the total value the current model pθ expects for feature f_k (summing over the examples y_i and, within each, over all labels x').
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 6
Log-Likelihood Gradient Derivation
Use the (calculus) chain rule: ∂/∂θ [log g(h(θ))] = (∂ log g / ∂h) ⋅ (∂h / ∂θ).
Applied to F(θ), each example i contributes its observed features f(x_i, y_i) minus an expectation in which the scalar p(x' | y_i) weights the vector of feature functions f(x', y_i).
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Gradient Optimization
Set t = 0; pick a starting value θt.
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} f_k(x_i, y') p(y' | x_i)

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
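A sketch of this gradient (observed minus model-expected feature totals) for a toy maxent classifier; the features, labels, and data are illustrative assumptions:

```python
import math

# Posterior p(y | x) under a log-linear model with feature map f and weights theta.
def p_label(x, labels, f, theta):
    scores = [sum(theta[k] * v for k, v in f(x, y).items()) for y in labels]
    m = max(scores)                              # stability shift
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return {y: e / Z for y, e in zip(labels, exps)}

# dF/dtheta_k = sum_i f_k(x_i, y_i) - sum_i sum_{y'} f_k(x_i, y') p(y' | x_i)
def gradient(data, labels, f, theta):
    grad = {k: 0.0 for k in theta}
    for x, y in data:
        for k, v in f(x, y).items():             # observed feature totals
            grad[k] += v
        p = p_label(x, labels, f, theta)
        for y2 in labels:                        # expected feature totals
            for k, v in f(x, y2).items():
                grad[k] -= p[y2] * v
    return grad

def f(x, y):                                     # one binary feature per (word, label)
    return {(w, y): 1.0 for w in x}

labels = ["pos", "neg"]
theta = {(w, y): 0.0 for w in ["fun", "bad"] for y in labels}
data = [(["fun"], "pos"), (["bad"], "neg")]
g = gradient(data, labels, f, theta)
print(g[("fun", "pos")])  # 0.5 = 1 observed - 0.5 expected (uniform model at theta=0)
```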
Preventing Extreme Values
Naïve Bayes: extreme values are 0 probabilities (addressed by smoothing).
Log-linear models: extreme values are large θ values (addressed by regularization).
(Squared) L2 Regularization: maximize F(θ) − (λ/2) ‖θ‖₂², penalizing large weights.
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/
https://goo.gl/BQCdH9
Lesson 8
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model
Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)
Neural (language) models
Revisiting the SNAP Function
SNAP = exp, and exp followed by normalization is the softmax: softmax(z)_j = exp(z_j) / Σ_k exp(z_k).
N-gram Language Models
Predict the next word w_i given some context (w_{i-3}, w_{i-2}, w_{i-1}); compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
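A count-based sketch with a toy corpus (the corpus is illustrative, not from the slides):

```python
from collections import Counter

# p(w_i | context) proportional to count(context, w_i), from 4-gram counts.
corpus = "the dog ran . the dog sat . the cat sat".split()
counts = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))

def next_word_probs(context):          # context: tuple of 3 previous words
    cands = {gram[3]: n for gram, n in counts.items() if gram[:3] == context}
    total = sum(cands.values())
    return {w: n / total for w, n in cands.items()}

print(next_word_probs((".", "the", "dog")))  # {'sat': 1.0}
```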
Maxent Language Models
Predict the next word w_i given some context; compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
Neural Language Models
Predict the next word w_i given some context; compute beliefs about what is likely:
𝑝(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ 𝒇(w_{i-3}, w_{i-2}, w_{i-1}))
Can we learn the feature function(s) for just the context? Can we learn word-specific weights (by type)?
Create/use "distributed representations" of the context words: e_{i-3}, e_{i-2}, e_{i-1}.
Combine these representations (e.g., via a matrix-vector product) into C = f.
Score the combination against per-word weights θ_{w_i} (output embeddings e_w).
"A Neural Probabilistic Language Model," Bengio et al. (2003)
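A minimal forward-pass sketch in the spirit of Bengio et al. (2003); the sizes, tanh nonlinearity, and random parameters are illustrative assumptions:

```python
import numpy as np

# Embed 3 context words, combine with a matrix-vector product, score the
# result against per-word output weights, and softmax over the vocabulary.
rng = np.random.default_rng(0)
V, d, h = 10, 4, 5                     # vocab size, embedding dim, hidden dim
E = rng.normal(size=(V, d))            # input embeddings e_w
W = rng.normal(size=(h, 3 * d))        # combiner for the concatenated context
theta = rng.normal(size=(V, h))        # per-word output weights theta_w

def next_word_dist(w3, w2, w1):
    context = np.concatenate([E[w3], E[w2], E[w1]])
    C = np.tanh(W @ context)           # combined context representation f(context)
    scores = theta @ C                 # theta_w . f(context) for every word w
    scores -= scores.max()             # numerical stability
    p = np.exp(scores)
    return p / p.sum()                 # softmax over the vocabulary

p = next_word_dist(1, 2, 3)
print(p.shape)                         # (10,); entries sum to ~1
```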
Baselines:

LM Name              N-gram   Params.       Test Ppl.
Interpolation        3        ---           336
Kneser-Ney backoff   3        ---           323
Kneser-Ney backoff   5        ---           321
Class-based backoff  3        500 classes   312
Class-based backoff  5        500 classes   312

NPLM results:

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252
"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)