Crash Course on Machine Learning
Dec 24, 2015
Several slides from Luke Zettlemoyer, Carlos Guestrin and Ben Taskar
Typical Paradigms of Recognition
[Pipeline diagram: Feature Computation → Model]
Visual Recognition – Identification: Is this your car? (Classification)
Visual Recognition – Verification: Is this a car? (Classification)
Visual Recognition – Classification: Is there a car in this picture? (Classification)
Visual Recognition – Detection: Where is the car in this picture? (Classification, Structure Learning)
Visual Recognition – Activity Recognition: What is he doing? (Classification)
Visual Recognition – Pose Estimation (Regression, Structure Learning)
Visual Recognition – Object Categorization: e.g. sky, tree, car, person, bicycle, horse, road (Structure Learning)
Visual Recognition – Segmentation: e.g. person, sky, tree, car (Classification, Structure Learning)
What kind of problems?
• Classification
  – Generative vs. discriminative
  – Supervised, unsupervised, semi-supervised, weakly supervised
  – Linear, nonlinear
  – Ensemble methods
  – Probabilistic
• Regression
  – Linear regression
  – Structured output regression
• Structure Learning
  – Graphical models
  – Margin-based approaches
Let’s play with probability for a bit
Remembering simple stuff
Thumbtack & Probabilities
• P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.:
  – Independent events
  – Identically distributed according to a Binomial distribution
• Sequence D of H Heads and T Tails:
  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)
Maximum Likelihood Estimation
• Data: Observed set D of H Heads and T Tails
• Hypothesis: Binomial distribution
• Learning: finding θ is an optimization problem
  – What's the objective function?
• MLE: Choose θ to maximize the probability of D
Parameter learning
• Set derivative to zero, and solve!
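Carrying that step out for the thumbtack likelihood P(D | θ) = θ^H (1 − θ)^T (a worked sketch using the H, T counts defined above):

  ln P(D | θ) = H ln θ + T ln(1 − θ)
  d/dθ ln P(D | θ) = H/θ − T/(1 − θ) = 0   ⇒   θ̂_MLE = H / (H + T)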
But, how many flips do I need?
• 3 heads and 2 tails.
• θ = 3/5, I can prove it!
• What if I flipped 30 heads and 20 tails?
• Same answer, I can prove it!
• What's better?
• Umm… the more the merrier???
[Plot: Prob. of Mistake vs. N (exponential decay!)]
A bound (from Hoeffding’s inequality)
• For N = H + T flips and the MLE θ̂ = H/N:
• Let θ* be the true parameter; for any ε > 0:
  P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)
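A quick numeric sketch of what this bound costs in practice: solving 2e^(−2Nε²) ≤ δ for N (the function and variable names below are illustrative, not from the slides):

```python
import math

def flips_needed(eps, delta):
    """Smallest N such that 2*exp(-2*N*eps**2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. to get |theta_hat - theta*| < 0.1 with probability at least 0.95:
print(flips_needed(eps=0.1, delta=0.05))   # -> 185
```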
What if I have prior beliefs?
• Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• Rather than estimating a single θ, we obtain a distribution over possible values of θ:
  in the beginning → observe flips (e.g. {tails, tails}) → after observations
How to use the Prior
• Use Bayes rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
  (posterior = data likelihood × prior / normalization)
• Or equivalently:
  P(θ | D) ∝ P(D | θ) P(θ)
• Also, for uniform priors (P(θ) ∝ 1), this reduces to the MLE objective
Beta prior distribution – P(θ)
• Prior: P(θ) = Beta(βH, βT) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)
• Likelihood function: P(D | θ) = θ^H (1 − θ)^T
• Posterior: P(θ | D) ∝ θ^(H + βH − 1) (1 − θ)^(T + βT − 1), i.e. Beta(H + βH, T + βT)
MAP for Beta distribution
• MAP: use the most likely parameter, i.e. the mode of the posterior:
  θ̂ = argmaxθ P(θ | D) = (H + βH − 1) / (H + T + βH + βT − 2)
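A minimal sketch of this MAP estimate in code, assuming the Beta(βH, βT) prior above (the default pseudo-counts below are illustrative):

```python
def map_theta(heads, tails, beta_h=2.0, beta_t=2.0):
    """Mode of the Beta(heads + beta_h, tails + beta_t) posterior (MAP estimate)."""
    return (heads + beta_h - 1.0) / (heads + tails + beta_h + beta_t - 2.0)

print(map_theta(3, 2))     # ~0.571: pulled toward 0.5 by the prior
print(map_theta(30, 20))   # ~0.596: with more data the prior matters less
```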
What about continuous variables?
We like Gaussians because
• An affine transformation (multiplying by a scalar and adding a constant) of a Gaussian is Gaussian:
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• The sum of (independent) Gaussians is Gaussian:
  – X ~ N(μX, σ²X),  Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)
• Easy to differentiate
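A quick empirical check of the two closure properties (purely illustrative, using NumPy sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=1_000_000)        # X ~ N(2, 3^2)

y = 4.0 * x + 1.0                               # Y = 4X + 1  =>  N(4*2+1, 4^2*3^2) = N(9, 144)
print(y.mean(), y.var())                        # ~9, ~144

z = x + rng.normal(1.0, 2.0, size=1_000_000)    # independent sum  =>  N(2+1, 9+4) = N(3, 13)
print(z.mean(), z.var())                        # ~3, ~13
```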
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: μ
  – Variance: σ

  i   | Exam Score xi
  ----|--------------
  0   | 85
  1   | 95
  2   | 100
  3   | 12
  …   | …
  99  | 89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x1, …, xN}:
  P(D | μ, σ) = Πi (1 / (σ√(2π))) e^(−(xi − μ)² / (2σ²))
• Log-likelihood of data:
  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi (xi − μ)² / (2σ²)
MLE for mean of a Gaussian
• What's the MLE for the mean? Set the derivative of the log-likelihood w.r.t. μ to zero and solve:
  μ̂ = (1/N) Σi xi
MLE for variance
• Again, set the derivative to zero and solve:
  σ̂² = (1/N) Σi (xi − μ̂)²
Learning Gaussian parameters
• MLE:  μ̂ = (1/N) Σi xi,   σ̂² = (1/N) Σi (xi − μ̂)²
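A minimal sketch of these estimates on the exam-score example; only the scores visible in the table above are used, the rest of the data is not shown on the slide:

```python
import numpy as np

scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])   # the scores visible in the table
mu_mle = scores.mean()                               # (1/N) * sum_i x_i
var_mle = ((scores - mu_mle) ** 2).mean()            # (1/N) * sum_i (x_i - mu)^2, the biased MLE
print(mu_mle, var_mle)
```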
MAP
• Conjugate priors– Mean: Gaussian prior– Variance: Wishart Distribution
• Prior for mean: μ ~ N(η, λ²) for some hyperparameters η, λ
Supervised Learning: find f
• Given: Training set {(xi, yi) | i = 1 … n}
• Find: A good approximation to f : X → Y
• What is x? What is y?
Simple Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0–9
• Setup:
  – Get a large collection of example images, each labeled with a digit
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future digit images
• Features: ?
  [Example images labeled 0, 1, 2, 1, and a new one labeled “??”]
  Screw you, I want to use pixels :D
Let's take a probabilistic approach!!!
• Can we directly estimate the data distribution P(X,Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes
  – Likelihood, P(X|Y):
    • Suppose X is composed of n binary features
Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z:
  (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Equivalent to:  P(X, Y | Z) = P(X | Z) P(Y | Z)
Naïve Bayes
• Naïve Bayes assumption: features are independent given the class:
  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
• More generally:
  P(X1 … Xn | Y) = Πi P(Xi | Y)
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X1 … Xn given the class Y
  – For each Xi, the likelihood P(Xi | Y)
• Decision rule:
  y* = argmaxy P(y) Πi P(xi | y)
  [Graphical model: class node Y with children X1, X2, …, Xn]
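A minimal sketch of this decision rule in code, working with log-probabilities to avoid underflow (the data structures here are assumptions, not from the slides):

```python
import math

def nb_predict(x, prior, likelihood):
    """x: tuple of feature values; prior[y] = P(Y=y);
    likelihood[y][i][v] = P(X_i = v | Y = y)."""
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y) + sum(math.log(likelihood[y][i][v])
                                    for i, v in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```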
A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9
Naïve Bayes for Digits (Binary Inputs)
• Simple version:
  – One feature Fij for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector (one binary value per grid position)
  – Here: lots of features, each is binary valued
• Naïve Bayes model:
  P(Y | F0,0 … Fn,n) ∝ P(Y) Πi,j P(Fi,j | Y)
• Are the features independent given class?
• What do we need to learn?
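A sketch of the on/off feature map described above, thresholding pixel intensities at 0.5 (assumes intensities already lie in [0, 1]):

```python
import numpy as np

def pixel_features(image):
    """image: 2-D array of intensities in [0, 1] -> flat binary feature vector F_ij."""
    return (np.asarray(image) > 0.5).astype(int).ravel()

print(pixel_features([[0.9, 0.2], [0.0, 0.7]]))   # [1 0 0 1]
```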
Example Distributions

  Y | P(Y) | P(Fi,j = on | Y), pixel A | P(Fi,j = on | Y), pixel B
  --|------|---------------------------|---------------------------
  1 | 0.1  | 0.01                      | 0.05
  2 | 0.1  | 0.05                      | 0.01
  3 | 0.1  | 0.05                      | 0.90
  4 | 0.1  | 0.30                      | 0.80
  5 | 0.1  | 0.80                      | 0.90
  6 | 0.1  | 0.90                      | 0.90
  7 | 0.1  | 0.05                      | 0.25
  8 | 0.1  | 0.60                      | 0.85
  9 | 0.1  | 0.50                      | 0.60
  0 | 0.1  | 0.80                      | 0.80

  (“pixel A” and “pixel B” stand in for the two grid positions pictured on the slide)
MLE for the parameters of NB
• Given dataset:
  – Count(A = a, B = b): number of examples where A = a and B = b
• MLE for discrete NB, simply the relative frequencies:
  – Prior:      P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)
  – Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
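The counting estimates above as a minimal sketch, over a hypothetical dataset of (x, y) pairs with discrete feature tuples x:

```python
from collections import Counter, defaultdict

def nb_mle(data):
    """data: list of (x, y) pairs, x a tuple of discrete feature values.
    Returns relative-frequency estimates of P(Y=y) and P(X_i=v | Y=y)."""
    class_counts = Counter(y for _, y in data)
    prior = {y: c / len(data) for y, c in class_counts.items()}
    feature_counts = defaultdict(Counter)            # (y, i) -> counts of values v
    for x, y in data:
        for i, v in enumerate(x):
            feature_counts[(y, i)][v] += 1
    likelihood = {(y, i, v): c / class_counts[y]
                  for (y, i), counts in feature_counts.items()
                  for v, c in counts.items()}
    return prior, likelihood
```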
Violating the NB assumption
• Usually, features are not conditionally independent:
  P(X1 … Xn | Y) ≠ Πi P(Xi | Y)
• NB often performs well, even when the assumption is violated
• [Domingos & Pazzani ’96] discuss some conditions for good performance
Smoothing
2 wins!!
Does this happen in vision?
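The usual recipe behind the smoothing idea above is Laplace (add-k) smoothing of the NB counts; the exact form is not shown on the slide, so the following is only a hedged sketch:

```python
def smoothed_likelihood(count_xy, count_y, num_values, k=1.0):
    """Laplace-smoothed P(X_i = x | Y = y):
    (Count(X_i=x, Y=y) + k) / (Count(Y=y) + k * num_values)."""
    return (count_xy + k) / (count_y + k * num_values)

# A feature value never seen with class y no longer gets probability 0:
print(smoothed_likelihood(0, 100, num_values=2))   # ~0.0098 instead of 0.0
```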
NB & Bag of words model
What about real Features?
• What if we have continuous Xi?
  E.g., character recognition: Xi is the intensity of the ith pixel
• Gaussian Naïve Bayes (GNB):
  P(Xi = x | Y = yk) = (1 / (σik √(2π))) e^(−(x − μik)² / (2σik²))
• Sometimes we assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)
Estimating Parameters
• Maximum likelihood estimates:
  – Mean:     μ̂ik = (1 / Nk) Σj δ(y^j = yk) xi^j
  – Variance: σ̂ik² = (1 / Nk) Σj δ(y^j = yk) (xi^j − μ̂ik)²
  where Nk = Σj δ(y^j = yk), x^j is the jth training example, and δ(x) = 1 if x is true, else 0
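A minimal sketch of those per-class estimates; the indicator δ becomes a boolean row mask (array shapes are assumptions):

```python
import numpy as np

def gnb_fit(X, y):
    """X: (num_examples, num_features) real-valued; y: (num_examples,) class labels.
    Returns per-class feature means and variances (MLE, i.e. 1/N_k)."""
    mu, sigma2 = {}, {}
    for y_k in np.unique(y):
        X_k = X[y == y_k]                  # rows j with delta(y^j = y_k) = 1
        mu[y_k] = X_k.mean(axis=0)
        sigma2[y_k] = X_k.var(axis=0)      # np.var defaults to 1/N (the MLE)
    return mu, sigma2
```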
What you need to know about Naïve Bayes
• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How we learn it
  – Why Bayesian estimation is important
  – Bag of words model
• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class
• Optimal decision using the Bayes Classifier
Another probabilistic approach!!!
• Naïve Bayes: directly estimate the data distribution P(X,Y)!
  – challenging due to the size of the distribution!
  – make the Naïve Bayes assumption: only need P(Xi | Y)!
• But wait, we classify according to:
  – maxY P(Y | X)
• Why not learn P(Y | X) directly?