Crash Course on Machine Learning
Dec 24, 2015
Several slides from Luke Zettlemoyer, Carlos Guestrin and Ben Taskar
Typical Paradigms of Recognition
[Pipeline diagram: Feature Computation → Model]
Visual Recognition – Identification: Is this your car? (Classification)
Visual Recognition – Verification: Is this a car? (Classification)
Visual Recognition – Classification: Is there a car in this picture? (Classification)
Visual Recognition – Detection: Where is the car in this picture? (Classification, Structure Learning)
Visual Recognition – Activity Recognition: What is he doing? (Classification)
Visual Recognition – Pose Estimation (Regression, Structure Learning)
Visual Recognition – Object Categorization: e.g. sky, tree, car, person, bicycle, horse, road (Structure Learning)
Visual Recognition – Segmentation: e.g. person, sky, tree, car (Classification, Structure Learning)
What kind of problems?
• Classification
  – Generative vs. discriminative
  – Supervised, unsupervised, semi-supervised, weakly supervised
  – Linear, nonlinear
  – Ensemble methods
  – Probabilistic
• Regression
  – Linear regression
  – Structured output regression
• Structure Learning
  – Graphical models
  – Margin-based approaches
Let’s play with probability for a bit
Remembering simple stuff
Thumbtack & Probabilities
• P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.:
  – Independent events
  – Identically distributed according to a Binomial distribution
• Sequence D of H Heads and T Tails:
  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)
Maximum Likelihood Estimation
• Data: Observed set D of H Heads and T Tails
• Hypothesis: Binomial distribution
• Learning: finding θ is an optimization problem
  – What's the objective function?
• MLE: Choose θ to maximize the probability of D
Parameter learning
• Set derivative to zero, and solve!
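Carrying that step out for the thumbtack likelihood P(D | θ) = θ^H (1 − θ)^T (a worked sketch using the H, T counts defined above):

  ln P(D | θ) = H ln θ + T ln(1 − θ)
  d/dθ ln P(D | θ) = H/θ − T/(1 − θ) = 0   ⇒   θ̂_MLE = H / (H + T)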
But, how many flips do I need?
• 3 heads and 2 tails.
• θ = 3/5, I can prove it!
• What if I flipped 30 heads and 20 tails?
• Same answer, I can prove it!
• What's better?
• Umm… the more the merrier???
[Plot: Prob. of Mistake vs. N (exponential decay!)]
A bound (from Hoeffding’s inequality)
• For N = H + T flips and the MLE θ̂ = H/N:
• Let θ* be the true parameter; for any ε > 0:
  P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)
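A quick numeric sketch of what this bound costs in practice: solving 2e^(−2Nε²) ≤ δ for N (the function and variable names below are illustrative, not from the slides):

```python
import math

def flips_needed(eps, delta):
    """Smallest N such that 2*exp(-2*N*eps**2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. to get |theta_hat - theta*| < 0.1 with probability at least 0.95:
print(flips_needed(eps=0.1, delta=0.05))   # -> 185
```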
What if I have prior beliefs?
• Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• Rather than estimating a single θ, we obtain a distribution over possible values of θ:
  in the beginning → observe flips (e.g. {tails, tails}) → after observations
How to use the Prior
• Use Bayes rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
  (posterior = data likelihood × prior / normalization)
• Or equivalently:
  P(θ | D) ∝ P(D | θ) P(θ)
• Also, for uniform priors (P(θ) ∝ 1), this reduces to the MLE objective
Beta prior distribution – P(θ)
• Prior: P(θ) = Beta(βH, βT) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)
• Likelihood function: P(D | θ) = θ^H (1 − θ)^T
• Posterior: P(θ | D) ∝ θ^(H + βH − 1) (1 − θ)^(T + βT − 1), i.e. Beta(H + βH, T + βT)
MAP for Beta distribution
• MAP: use the most likely parameter, i.e. the mode of the posterior:
  θ̂ = argmaxθ P(θ | D) = (H + βH − 1) / (H + T + βH + βT − 2)
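A minimal sketch of this MAP estimate in code, assuming the Beta(βH, βT) prior above (the default pseudo-counts below are illustrative):

```python
def map_theta(heads, tails, beta_h=2.0, beta_t=2.0):
    """Mode of the Beta(heads + beta_h, tails + beta_t) posterior (MAP estimate)."""
    return (heads + beta_h - 1.0) / (heads + tails + beta_h + beta_t - 2.0)

print(map_theta(3, 2))     # ~0.571: pulled toward 0.5 by the prior
print(map_theta(30, 20))   # ~0.596: with more data the prior matters less
```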
What about continuous variables?
We like Gaussians because
• An affine transformation (multiplying by a scalar and adding a constant) of a Gaussian is Gaussian:
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• The sum of (independent) Gaussians is Gaussian:
  – X ~ N(μX, σ²X),  Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)
• Easy to differentiate
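A quick empirical check of the two closure properties (purely illustrative, using NumPy sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=1_000_000)        # X ~ N(2, 3^2)

y = 4.0 * x + 1.0                               # Y = 4X + 1  =>  N(4*2+1, 4^2*3^2) = N(9, 144)
print(y.mean(), y.var())                        # ~9, ~144

z = x + rng.normal(1.0, 2.0, size=1_000_000)    # independent sum  =>  N(2+1, 9+4) = N(3, 13)
print(z.mean(), z.var())                        # ~3, ~13
```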
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: μ
  – Variance: σ

  i   | Exam Score xi
  ----|--------------
  0   | 85
  1   | 95
  2   | 100
  3   | 12
  …   | …
  99  | 89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x1, …, xN}:
  P(D | μ, σ) = Πi (1 / (σ√(2π))) e^(−(xi − μ)² / (2σ²))
• Log-likelihood of data:
  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi (xi − μ)² / (2σ²)
MLE for mean of a Gaussian
• What's the MLE for the mean? Set the derivative of the log-likelihood w.r.t. μ to zero and solve:
  μ̂ = (1/N) Σi xi
MLE for variance
• Again, set the derivative to zero and solve:
  σ̂² = (1/N) Σi (xi − μ̂)²
Learning Gaussian parameters
• MLE:  μ̂ = (1/N) Σi xi,   σ̂² = (1/N) Σi (xi − μ̂)²
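A minimal sketch of these estimates on the exam-score example; only the scores visible in the table above are used, the rest of the data is not shown on the slide:

```python
import numpy as np

scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])   # the scores visible in the table
mu_mle = scores.mean()                               # (1/N) * sum_i x_i
var_mle = ((scores - mu_mle) ** 2).mean()            # (1/N) * sum_i (x_i - mu)^2, the biased MLE
print(mu_mle, var_mle)
```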
MAP
• Conjugate priors– Mean: Gaussian prior– Variance: Wishart Distribution
• Prior for mean: μ ~ N(η, λ²) for some hyperparameters η, λ
Supervised Learning: find f
• Given: Training set {(xi, yi) | i = 1 … n}
• Find: A good approximation to f : X → Y
• What is x? What is y?
Simple Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0–9
• Setup:
  – Get a large collection of example images, each labeled with a digit
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future digit images
• Features: ?
  [Example images labeled 0, 1, 2, 1, and a new one labeled “??”]
  Screw you, I want to use pixels :D
Let's take a probabilistic approach!!!
• Can we directly estimate the data distribution P(X,Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes
  – Likelihood, P(X|Y):
    • Suppose X is composed of n binary features
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z:
  (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Equivalent to:  P(X, Y | Z) = P(X | Z) P(Y | Z)
Naïve Bayes
• Naïve Bayes assumption: features are independent given the class:
  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
• More generally:
  P(X1 … Xn | Y) = Πi P(Xi | Y)
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X1 … Xn given the class Y
  – For each Xi, the likelihood P(Xi | Y)
• Decision rule:
  y* = argmaxy P(y) Πi P(xi | y)
  [Graphical model: class node Y with children X1, X2, …, Xn]
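A minimal sketch of this decision rule in code, working with log-probabilities to avoid underflow (the data structures here are assumptions, not from the slides):

```python
import math

def nb_predict(x, prior, likelihood):
    """x: tuple of feature values; prior[y] = P(Y=y);
    likelihood[y][i][v] = P(X_i = v | Y = y)."""
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y) + sum(math.log(likelihood[y][i][v])
                                    for i, v in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```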
A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9
Naïve Bayes for Digits (Binary Inputs)
• Simple version:
  – One feature Fij for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector (one binary value per grid position)
  – Here: lots of features, each is binary valued
• Naïve Bayes model:
  P(Y | F0,0 … Fn,n) ∝ P(Y) Πi,j P(Fi,j | Y)
• Are the features independent given class?
• What do we need to learn?
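A sketch of the on/off feature map described above, thresholding pixel intensities at 0.5 (assumes intensities already lie in [0, 1]):

```python
import numpy as np

def pixel_features(image):
    """image: 2-D array of intensities in [0, 1] -> flat binary feature vector F_ij."""
    return (np.asarray(image) > 0.5).astype(int).ravel()

print(pixel_features([[0.9, 0.2], [0.0, 0.7]]))   # [1 0 0 1]
```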
Example Distributions

  Y | P(Y) | P(Fi,j = on | Y), pixel A | P(Fi,j = on | Y), pixel B
  --|------|---------------------------|---------------------------
  1 | 0.1  | 0.01                      | 0.05
  2 | 0.1  | 0.05                      | 0.01
  3 | 0.1  | 0.05                      | 0.90
  4 | 0.1  | 0.30                      | 0.80
  5 | 0.1  | 0.80                      | 0.90
  6 | 0.1  | 0.90                      | 0.90
  7 | 0.1  | 0.05                      | 0.25
  8 | 0.1  | 0.60                      | 0.85
  9 | 0.1  | 0.50                      | 0.60
  0 | 0.1  | 0.80                      | 0.80

  (“pixel A” and “pixel B” stand in for the two grid positions pictured on the slide)
MLE for the parameters of NB
• Given dataset:
  – Count(A = a, B = b): number of examples where A = a and B = b
• MLE for discrete NB, simply the relative frequencies:
  – Prior:      P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)
  – Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
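The counting estimates above as a minimal sketch, over a hypothetical dataset of (x, y) pairs with discrete feature tuples x:

```python
from collections import Counter, defaultdict

def nb_mle(data):
    """data: list of (x, y) pairs, x a tuple of discrete feature values.
    Returns relative-frequency estimates of P(Y=y) and P(X_i=v | Y=y)."""
    class_counts = Counter(y for _, y in data)
    prior = {y: c / len(data) for y, c in class_counts.items()}
    feature_counts = defaultdict(Counter)            # (y, i) -> counts of values v
    for x, y in data:
        for i, v in enumerate(x):
            feature_counts[(y, i)][v] += 1
    likelihood = {(y, i, v): c / class_counts[y]
                  for (y, i), counts in feature_counts.items()
                  for v, c in counts.items()}
    return prior, likelihood
```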
Violating the NB assumption
• Usually, features are not conditionally independent:
  P(X1 … Xn | Y) ≠ Πi P(Xi | Y)
• NB often performs well, even when the assumption is violated
• [Domingos & Pazzani ’96] discuss some conditions for good performance
Smoothing
2 wins!!
Does this happen in vision?
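The usual recipe behind the smoothing idea above is Laplace (add-k) smoothing of the NB counts; the exact form is not shown on the slide, so the following is only a hedged sketch:

```python
def smoothed_likelihood(count_xy, count_y, num_values, k=1.0):
    """Laplace-smoothed P(X_i = x | Y = y):
    (Count(X_i=x, Y=y) + k) / (Count(Y=y) + k * num_values)."""
    return (count_xy + k) / (count_y + k * num_values)

# A feature value never seen with class y no longer gets probability 0:
print(smoothed_likelihood(0, 100, num_values=2))   # ~0.0098 instead of 0.0
```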
NB & Bag of words model
What about real Features?
• What if we have continuous Xi?
  E.g., character recognition: Xi is the intensity of the ith pixel
• Gaussian Naïve Bayes (GNB):
  P(Xi = x | Y = yk) = (1 / (σik √(2π))) e^(−(x − μik)² / (2σik²))
• Sometimes we assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)
Estimating Parameters
• Maximum likelihood estimates:
  – Mean:     μ̂ik = (1 / Nk) Σj δ(y^j = yk) xi^j
  – Variance: σ̂ik² = (1 / Nk) Σj δ(y^j = yk) (xi^j − μ̂ik)²
  where Nk = Σj δ(y^j = yk), x^j is the jth training example, and δ(x) = 1 if x is true, else 0
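A minimal sketch of those per-class estimates; the indicator δ becomes a boolean row mask (array shapes are assumptions):

```python
import numpy as np

def gnb_fit(X, y):
    """X: (num_examples, num_features) real-valued; y: (num_examples,) class labels.
    Returns per-class feature means and variances (MLE, i.e. 1/N_k)."""
    mu, sigma2 = {}, {}
    for y_k in np.unique(y):
        X_k = X[y == y_k]                  # rows j with delta(y^j = y_k) = 1
        mu[y_k] = X_k.mean(axis=0)
        sigma2[y_k] = X_k.var(axis=0)      # np.var defaults to 1/N (the MLE)
    return mu, sigma2
```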
What you need to know about Naïve Bayes
• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How we learn it
  – Why Bayesian estimation is important
  – Bag of words model
• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class
• Optimal decision using the Bayes Classifier
Another probabilistic approach!!!
• Naïve Bayes: directly estimate the data distribution P(X,Y)!
  – challenging due to the size of the distribution!
  – make the Naïve Bayes assumption: only need P(Xi | Y)!
• But wait, we classify according to:
  – maxY P(Y | X)
• Why not learn P(Y | X) directly?