Page 1
©2006 Carlos Guestrin

Naïve Bayes & Logistic Regression

See class website:
- Mitchell's chapter (required)
- Ng & Jordan '02 (optional)
- Gradient ascent and extensions: Koller & Friedman Chapter 1.4

Today:
- Naïve Bayes (continued)
- Naïve Bayes with continuous variables
- Logistic Regression

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
January 30th, 2006
Page 2
Announcements

Recitations stay on Thursdays, 5–6:30pm in Wean 5409. This week: Naïve Bayes & Logistic Regression.

Extension for the first homework: due Wed., Feb. 8th, at the beginning of class. Mitchell's chapter is the most useful reading.

Go to the AI seminar: Tuesdays 3:30pm, Wean 5409, http://www.cs.cmu.edu/~aiseminar/ This week's seminar is very relevant to what we are covering in class.
Page 3
Classification

Learn h: X → Y
- X – features
- Y – target classes

Suppose you know P(Y|X) exactly; how should you classify?

Bayes classifier:
  h_Bayes(x) = arg max_y P(Y=y | X=x)

Why?
Page 4
Optimal classification

Theorem: the Bayes classifier h_Bayes is optimal! That is, for any classifier h:
  error_true(h_Bayes) ≤ error_true(h)

Proof:
Page 5
How hard is it to learn the optimal classifier?

Data = {(x^j, y^j) : j = 1, …, N}

How do we represent these? How many parameters?
- Prior, P(Y): suppose Y is composed of k classes → k − 1 parameters
- Likelihood, P(X|Y): suppose X is composed of n binary features → k(2^n − 1) parameters

Complex model → high variance with limited data!!!
Page 6
Conditional Independence

X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
  (∀ x, y, z) P(X=x | Y=y, Z=z) = P(X=x | Z=z)

e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Equivalent to:
  P(X, Y | Z) = P(X | Z) P(Y | Z)
Page 7
The Naïve Bayes assumption

Naïve Bayes assumption: features are independent given the class:
  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

More generally:
  P(X1, …, Xn | Y) = Π_i P(Xi | Y)

How many parameters now? Suppose X is composed of n binary features: just n·k likelihood parameters (one per feature per class), instead of k(2^n − 1).
Page 8
The Naïve Bayes Classifier

Given:
- prior P(Y)
- n conditionally independent features X given the class Y
- for each Xi, the likelihood P(Xi | Y)

Decision rule:
  y* = h_NB(x) = arg max_y P(y) Π_i P(xi | y)

If the assumption holds, NB is the optimal classifier!
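The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name and the dictionary-based parameterization are my own, and log-probabilities are used so products over many features don't underflow:

```python
import math

def nb_predict(x, prior, likelihood):
    """Naive Bayes decision rule: arg max_y P(y) * prod_i P(x_i | y).

    prior:      dict  y -> P(Y=y)
    likelihood: dict  (i, x_i, y) -> P(X_i = x_i | Y = y)
    Sums of logs replace the product to avoid numerical underflow.
    """
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, xi in enumerate(x):
            score += math.log(likelihood[(i, xi, y)])
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```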
Page 9
MLE for the parameters of NB

Given a dataset, let Count(A=a, B=b) ← number of examples where A=a and B=b.

MLE for NB is simply:
- Prior: P(Y=y) = Count(Y=y) / N
- Likelihood: P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
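The counting estimates above translate directly into code. A minimal sketch (the function name and data layout are my own, not from the lecture):

```python
from collections import Counter

def nb_mle(X, Y):
    """MLE for NB parameters by counting.
    Prior:      P(Y=y)         = Count(Y=y) / N
    Likelihood: P(X_i=x | Y=y) = Count(X_i=x, Y=y) / Count(Y=y)
    """
    N = len(Y)
    cy = Counter(Y)                      # Count(Y=y)
    cxy = Counter()                      # Count(X_i=x, Y=y)
    for x, y in zip(X, Y):
        for i, xi in enumerate(x):
            cxy[(i, xi, y)] += 1
    prior = {y: c / N for y, c in cy.items()}
    likelihood = {k: c / cy[k[2]] for k, c in cxy.items()}
    return prior, likelihood
```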
Page 10
Subtleties of NB classifier 1 – Violating the NB assumption

Usually, features are not conditionally independent:
  P(X1, …, Xn | Y) ≠ Π_i P(Xi | Y)

Thus, in NB, the actual probabilities P(Y|X) are often biased towards 0 or 1 (see Homework 1). Nonetheless, NB is the single most used classifier out there.

NB often performs well, even when the assumption is violated. [Domingos & Pazzani '96] discuss some conditions for good performance.
Page 11
Subtleties of NB classifier 2 – Insufficient training data

What if you never see a training instance where X1=a when Y=b? E.g., Y = SpamEmail and X1 = 'Enlargement':
  P(X1=a | Y=b) = 0

Thus, no matter what values X2, …, Xn take:
  P(Y=b | X1=a, X2, …, Xn) = 0

What now???
Page 12
MAP for Beta distribution

MAP: use the most likely parameter:
  θ̂ = arg max_θ P(θ | D)

A Beta prior is equivalent to extra thumbtack flips. As N → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important!
Page 13
Bayesian learning for NB parameters – a.k.a. smoothing

- Dataset of N examples
- Prior "distribution" Q(Xi, Y), Q(Y); m "virtual" examples
- MAP estimate of P(Xi|Y):
  P(Xi=xi | Y=y) = (Count(Xi=xi, Y=y) + m·Q(Xi=xi, Y=y)) / (Count(Y=y) + m·Q(Y=y))

Now, even if you never observe a feature/class combination, the posterior probability is never zero.
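The virtual-example idea is a one-liner in code. A minimal sketch (function and parameter names are my own; `q` plays the role of the prior probability mass Q assigns to this feature value, e.g. q = 0.5 for a uniform prior over a binary feature):

```python
def nb_map_likelihood(count_xy, count_y, m=1.0, q=0.5):
    """Smoothed (MAP) estimate with m 'virtual' examples from a prior Q:
       P(X_i=x | Y=y) = (Count(X_i=x, Y=y) + m*q) / (Count(Y=y) + m)
    Never zero, even when the feature/class pair was never observed.
    """
    return (count_xy + m * q) / (count_y + m)
```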
Page 14
Text classification

- Classify e-mails: Y = {Spam, NotSpam}
- Classify news articles: Y = {what is the topic of the article?}
- Classify webpages: Y = {Student, Professor, Project, …}

What about the features X? The text!
Page 15
Features X are the entire document – Xi is the ith word in the article
Page 16
NB for text classification

P(X|Y) is huge!!! An article has at least 1000 words: X = {X1, …, X1000}. Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.

The NB assumption helps a lot!!! P(Xi=xi | Y=y) is just the probability of observing word xi in a document on topic y.
Page 17
Bag of words model

Typical additional assumption – position in the document doesn't matter:
  P(Xi=xi | Y=y) = P(Xk=xi | Y=y)

"Bag of words" model – the order of words on the page is ignored. Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
Page 18
Bag of words model

The same sentence as an unordered bag of words:
in is lecture lecture next over person remember room sitting the the the to to up wake when you
Page 19
Bag of Words Approach – represent each document as a vector of word counts:
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
Page 20
NB with Bag of Words for text classification

Learning phase:
- Prior P(Y): count how many documents you have from each topic (+ prior)
- P(Xi|Y): for each topic, count how many times you saw each word in documents of this topic (+ prior)

Test phase: for each document, use the naïve Bayes decision rule.
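The learning and test phases above can be sketched end to end. This is a minimal illustration, not the course's code; the function names, the smoothing constant `m` (virtual counts), and the toy spam/ham setting are my own:

```python
import math
from collections import Counter, defaultdict

def train_nb_text(docs, labels, m=1.0):
    """Learning phase: count documents per topic (prior) and word
    occurrences per topic (likelihood), with m virtual counts per word."""
    n_docs = len(docs)
    doc_count = Counter(labels)
    word_count = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        word_count[y].update(doc)
        vocab.update(doc)
    V = len(vocab)
    prior = {y: doc_count[y] / n_docs for y in doc_count}

    def likelihood(w, y):
        total = sum(word_count[y].values())
        return (word_count[y][w] + m) / (total + m * V)

    return prior, likelihood, vocab

def classify_nb_text(doc, prior, likelihood, vocab):
    """Test phase: NB decision rule over the words in the document."""
    scores = {}
    for y, p_y in prior.items():
        s = math.log(p_y)
        for w in doc:
            if w in vocab:                 # skip out-of-vocabulary words
                s += math.log(likelihood(w, y))
        scores[y] = s
    return max(scores, key=scores.get)
```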
Page 21
Twenty News Groups results
Page 22
Learning curve for Twenty News Groups
Page 23
What if we have continuous Xi?

E.g., character recognition: Xi is the intensity at the ith pixel.

Gaussian Naïve Bayes (GNB):
  P(Xi=x | Y=y_k) = (1 / (σ_ik √(2π))) exp(−(x − µ_ik)² / (2σ_ik²))

Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of Xi (i.e., σ_k), or both (i.e., σ).
Page 24
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates, where x^j is the jth training example and δ(x)=1 if x is true, else 0:

  µ_ik = Σ_j δ(Y^j = y_k) X_i^j / Σ_j δ(Y^j = y_k)
  σ_ik² = Σ_j δ(Y^j = y_k) (X_i^j − µ_ik)² / Σ_j δ(Y^j = y_k)
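These per-class means and variances are just selective averages; the δ(·) indicator corresponds to filtering the examples of each class. A minimal sketch (function name and data layout are my own):

```python
import math

def gnb_mle(X, Y):
    """MLE for Gaussian NB: for each class k and feature i,
       mu_ik    = mean of X_i over the examples with Y = k
       sigma_ik = (population) std of X_i over those examples
    The class filter plays the role of the delta(Y^j = y_k) indicator.
    """
    params = {}
    for k in set(Y):
        rows = [x for x, y in zip(X, Y) if y == k]   # examples of class k
        n = len(rows)
        for i in range(len(rows[0])):
            vals = [r[i] for r in rows]
            mu = sum(vals) / n
            var = sum((v - mu) ** 2 for v in vals) / n
            params[(i, k)] = (mu, math.sqrt(var))
    return params
```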
Page 25
Example: GNB for classifying mental states

Brain imaging:
- ~1 mm resolution
- ~2 images per sec.
- 15,000 voxels/image
- non-invasive, safe
- measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response on the order of 10 sec)
[Mitchell et al.]
Page 26
Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Page 27
Gaussian Naïve Bayes: learned µ_voxel,word for P(BrainActivity | WordCategory = {People, Animal})
[Mitchell et al.]
Page 28
Learned Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 85% [Mitchell et al.]

People words vs. Animal words
Page 29
What you need to know about Naïve Bayes

- Types of learning problems: learning is (just) function approximation!
- Optimal decisions using the Bayes classifier
- Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, and why Bayesian estimation is important
- Text classification: the bag of words model
- Gaussian NB: features are still conditionally independent, and each feature has a Gaussian distribution given the class
Page 30
Generative v. Discriminative classifiers – Intuition

Want to learn h: X → Y
- X – features
- Y – target classes

Bayes optimal classifier – P(Y|X)

Generative classifier, e.g., Naïve Bayes:
- Assume some functional form for P(X|Y), P(Y)
- Estimate the parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y | X=x)
- This is a 'generative' model: computation of P(Y|X) is indirect, through Bayes rule, but it can generate a sample of the data, since P(X) = Σ_y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression:
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from training data
- This is the 'discriminative' model: it directly learns P(Y|X), but cannot generate a sample of the data, because P(X) is not available
Page 31
Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form: the sigmoid applied to a linear function of the data, z = w0 + Σ_i w_i x_i:

  P(Y=1 | x) = 1 / (1 + exp(−(w0 + Σ_i w_i x_i)))

The logistic function (or sigmoid): σ(z) = 1 / (1 + e^(−z))
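The sigmoid and the resulting conditional probability can be written directly. A minimal sketch (function names are my own, not from the lecture):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(x, w0, w):
    """P(Y=1 | X=x) = sigmoid(w0 + sum_i w_i x_i)."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
```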
Page 32
Understanding the sigmoid

[Three plots of 1/(1 + e^(−(w0 + w1·x))) over x ∈ [−6, 6], y ∈ [0, 1]:
- w0=0, w1=1: the standard sigmoid, crossing 0.5 at x=0
- w0=2, w1=1: the same shape shifted left, crossing 0.5 at x=−2
- w0=0, w1=0.5: a shallower slope, still crossing 0.5 at x=0]
Page 33
Logistic Regression – a linear classifier

[Plot of the sigmoid over x ∈ [−6, 6]: the prediction crosses P(Y=1|x) = 0.5 exactly where w0 + Σ_i w_i x_i = 0, which defines a linear (hyperplane) decision boundary.]
Page 34
Very convenient!

Since P(Y=1|x) = 1 / (1 + exp(−(w0 + Σ_i w_i x_i))):
- w0 + Σ_i w_i x_i = 0 implies P(Y=1|x) = 0.5
- w0 + Σ_i w_i x_i > 0 implies predicting Y=1
- w0 + Σ_i w_i x_i < 0 implies predicting Y=0

→ a linear classification rule!
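The three implications above collapse into one sign test, so the classifier never needs to evaluate the sigmoid at all. A minimal sketch (function name is my own):

```python
def lr_classify(x, w0, w):
    """Linear decision rule for logistic regression:
    P(Y=1|x) > 0.5  iff  w0 + w.x > 0, so just test the sign.
    The decision boundary is the hyperplane w0 + w.x = 0."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```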
Page 35
Logistic regression more generally

Logistic regression in the more general case, where Y ∈ {y1, …, yR}: learn R−1 sets of weights.

For k < R:
  P(Y=y_k | x) = exp(w_k0 + Σ_i w_ki x_i) / (1 + Σ_{j<R} exp(w_j0 + Σ_i w_ji x_i))

For k = R (normalization, so no weights for this class):
  P(Y=y_R | x) = 1 / (1 + Σ_{j<R} exp(w_j0 + Σ_i w_ji x_i))

Features can be discrete or continuous!
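The R-class formulas above, with the Rth class serving as the unweighted reference, can be sketched as follows (a minimal illustration; the function name and argument layout are my own):

```python
import math

def multiclass_lr(x, W, b):
    """Multiclass logistic regression with R classes and R-1 weight sets:
       P(Y=y_k | x) ∝ exp(b_k + w_k . x)  for k < R
       P(Y=y_R | x) ∝ 1                   (reference class, no weights)
    W: list of R-1 weight vectors; b: list of R-1 intercepts.
    Returns the probability vector over all R classes."""
    scores = [b_k + sum(w_i * x_i for w_i, x_i in zip(w_k, x))
              for w_k, b_k in zip(W, b)]
    exps = [math.exp(s) for s in scores] + [1.0]   # reference class
    Z = sum(exps)                                  # normalization
    return [e / Z for e in exps]
```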
Page 36
Logistic regression v. Naïve Bayes

Consider learning f: X → Y, where:
- X is a vector of real-valued features <X1, …, Xn>
- Y is boolean

We could use a Gaussian Naïve Bayes classifier:
- assume all Xi are conditionally independent given Y
- model P(Xi | Y=y_k) as Gaussian N(µ_ik, σ_i)
- model P(Y) as Bernoulli(θ, 1−θ)

What does that imply about the form of P(Y|X)?
Page 37
The answer, derived over the next few slides: P(Y|X) takes exactly the logistic form. Cool!!!!
Page 38
Derive form for P(Y|X) for continuous Xi
Page 39
Ratio of class-conditional probabilities
Page 40
Derive form for P(Y|X) for continuous Xi
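The derivation on these slides was worked on the board; what follows is a reconstruction of the standard argument (as in Mitchell's chapter), under the GNB assumptions stated earlier: class-conditionals N(µ_ik, σ_i) with class-independent variances and a Bernoulli(θ) prior.

```latex
P(Y=1\mid X)
 = \frac{P(Y=1)\,P(X\mid Y=1)}{P(Y=1)\,P(X\mid Y=1)+P(Y=0)\,P(X\mid Y=0)}
 = \frac{1}{1+\exp\!\Big(\ln\tfrac{1-\theta}{\theta}
   +\sum_i \ln\tfrac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}\Big)}

% Ratio of class-conditional Gaussians with shared variance \sigma_i^2:
\ln\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}
 = \frac{(x_i-\mu_{i1})^2-(x_i-\mu_{i0})^2}{2\sigma_i^2}
 = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,x_i
   +\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}

% The exponent is linear in x, so P(Y=1|X) has exactly the logistic form:
P(Y=1\mid X)=\frac{1}{1+\exp\!\big(-(w_0+\textstyle\sum_i w_i x_i)\big)},\qquad
w_i=\frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},\qquad
w_0=\ln\frac{\theta}{1-\theta}+\sum_i\frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
```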
Page 41
Gaussian Naïve Bayes v. Logistic Regression

- Set of Gaussian Naïve Bayes parameters: θ, µ_ik, σ_i
- Set of Logistic Regression parameters: w0, w1, …, wn

Representation equivalence – but only in a special case!!! (GNB with class-independent variances)

But what's the difference??? LR makes no assumptions about P(X|Y) in learning – the loss function differs!!! Optimizing different functions → obtaining different solutions.
Page 42
Loss functions: Likelihood v. Conditional Likelihood

Generative (Naïve Bayes) loss function – the data likelihood:
  ln P(D | w) = Σ_j ln P(x^j, y^j | w)

Discriminative models cannot compute P(x^j | w)! But the discriminative (logistic regression) loss function is the conditional data likelihood:
  ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w)

It doesn't waste effort learning P(X) – it focuses on P(Y|X), all that matters for classification.
Page 43
Expressing Conditional Log Likelihood

  l(w) = Σ_j ln P(y^j | x^j, w)
       = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ]
Page 44
Maximizing Conditional Log Likelihood

  l(w) = Σ_j [ y^j (w0 + Σ_i w_i x_i^j) − ln(1 + exp(w0 + Σ_i w_i x_i^j)) ]

- Good news: l(w) is a concave function of w → no locally optimal solutions
- Bad news: no closed-form solution to maximize l(w)
- Good news: concave functions are easy to optimize
Page 45
Optimizing a concave function – Gradient ascent

The conditional likelihood for logistic regression is concave → find the optimum with gradient ascent.

Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient:
  ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]
Update rule:
  w ← w + η ∇_w l(w), with learning rate η > 0
Page 46
Maximize Conditional Log Likelihood: Gradient ascent

Gradient ascent algorithm: iterate until the change is < ε. For all i, repeat:
  w_i ← w_i + η Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ]
(taking x_0^j = 1 so the same update covers the intercept w0)
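The update rule above can be sketched as a plain batch-gradient-ascent loop. This is a minimal illustration, not the course's code; the function name, fixed iteration count in place of the ε-based stopping test, and default η are my own choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent_lr(X, Y, eta=0.1, iters=2000):
    """Batch gradient ascent on the conditional log-likelihood:
       w_i <- w_i + eta * sum_j x_i^j (y^j - P(Y=1 | x^j, w))
    The objective is concave, so this heads to the global optimum.
    The intercept w0 is updated separately (its 'feature' is always 1)."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(iters):
        g0, g = 0.0, [0.0] * n
        for x, y in zip(X, Y):
            p = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p                      # y^j - P(Y=1 | x^j, w)
            g0 += err
            for i in range(n):
                g[i] += err * x[i]
        w0 += eta * g0
        for i in range(n):
            w[i] += eta * g[i]
    return w0, w
```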
Page 47
That's all M(C)LE. How about MAP?

One common approach is to define a prior on w: a normal distribution with zero mean and identity covariance. This "pushes" the parameters towards zero.

This corresponds to regularization: it helps avoid very large weights and overfitting. You will explore this in your homework; more on this later in the semester.

MAP estimate:
  w* = arg max_w ln [ P(w) Π_j P(y^j | x^j, w) ]
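The only change the Gaussian prior makes to the update is an extra shrinkage term; a minimal sketch of one regularized step (function and parameter names are my own; `grad` is the unregularized gradient from the MLE update, `lam` the prior's strength):

```python
def map_update(w, grad, eta=0.1, lam=0.01):
    """One MAP (regularized) gradient-ascent step under a zero-mean
    Gaussian prior on w:
       w_i <- w_i + eta * (grad_i - lam * w_i)
    The extra -lam*w_i term pushes the weights toward zero."""
    return [wi + eta * (gi - lam * wi) for wi, gi in zip(w, grad)]
```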
Page 48
Gradient of M(C)AP

With a zero-mean Gaussian prior P(w) ∝ exp(−(λ/2) Σ_i w_i²):
  ∂l(w)/∂w_i = −λ w_i + Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ]
Page 49
MLE vs. MAP

Maximum conditional likelihood estimate:
  w_i ← w_i + η Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ]

Maximum conditional a posteriori estimate:
  w_i ← w_i + η { −λ w_i + Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ] }
Page 50
What you should know about Logistic Regression (LR)

- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
- In general, NB and LR make different assumptions. NB: features independent given the class → an assumption on P(X|Y). LR: a functional form for P(Y|X), no assumption on P(X|Y).
- LR is a linear classifier: the decision rule is a hyperplane.
- LR is optimized by conditional likelihood: no closed-form solution; concave → global optimum with gradient ascent; maximum conditional a posteriori corresponds to regularization.
Page 51
Acknowledgements

Some of the material in this presentation is courtesy of Tom Mitchell.