Crash Course on Machine Learning
Several slides from Luke Zettlemoyer, Carlos Guestrin and Ben Taskar

Transcript
Page 1

Crash Course on Machine Learning

Several slides from Luke Zettlemoyer, Carlos Guestrin and Ben Taskar

Page 2

Typical Paradigms of Recognition

Feature Computation

Model

Page 3

Visual Recognition: Identification

Is this your car?

Classification

Page 4

Visual Recognition: Verification

Is this a car?

Classification

Page 5

Visual Recognition: Classification

Is there a car in this picture?

Classification

Page 6

Visual Recognition: Detection

Where is the car in this picture?

Classification

Structure Learning

Page 7

Visual Recognition: Activity Recognition

What is he doing?

Classification

Page 8

Visual Recognition: Pose Estimation

Regression

Structure Learning

Page 9

Visual Recognition: Object Categorization

Sky

Tree

Car

Person

Bicycle

Horse

Person

Road

Structure Learning

Page 10

Visual Recognition: Segmentation

Person

Sky

Tree

Car

Classification

Structure Learning

Page 11

What kind of problems?

• Classification
  – Generative vs. discriminative
  – Supervised, unsupervised, semi-supervised, weakly supervised
  – Linear, nonlinear
  – Ensemble methods
  – Probabilistic

• Regression
  – Linear regression
  – Structured output regression

• Structure Learning
  – Graphical models
  – Margin-based approaches

Page 15

Let’s play with probability for a bit

Remembering simple stuff

Page 16

Thumbtack & Probabilities

• P(Heads) = θ, P(Tails) = 1 − θ

• Flips are i.i.d.:
  – Independent events
  – Identically distributed according to a Binomial distribution

• Sequence D of H heads and T tails:

  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)

Page 17

Maximum Likelihood Estimation

• Data: observed set D of H heads and T tails
• Hypothesis: Binomial distribution
• Learning: finding θ is an optimization problem
  – What’s the objective function?

• MLE: choose θ to maximize the probability of D:

  θ̂ = argmaxθ P(D | θ) = argmaxθ ln P(D | θ)

Page 18

Parameter learning

• Set derivative to zero, and solve!

  ln P(D | θ) = ln [θ^H (1 − θ)^T] = H ln θ + T ln(1 − θ)

  d/dθ ln P(D | θ) = H/θ − T/(1 − θ) = 0  ⇒  θ̂MLE = H / (H + T)
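As a sanity check on the closed form above, here is a minimal sketch (not from the slides; the 0/1 data encoding and the grid search are illustrative choices) that maximizes the log-likelihood numerically and compares it to H / (H + T):

```python
import numpy as np

# Hypothetical flip data: 1 = heads, 0 = tails (3 heads, 2 tails).
flips = np.array([1, 1, 1, 0, 0])
H, T = flips.sum(), len(flips) - flips.sum()

def log_likelihood(theta):
    """ln P(D | theta) = H ln(theta) + T ln(1 - theta) for i.i.d. flips."""
    return H * np.log(theta) + T * np.log(1 - theta)

# Numerical MLE: evaluate on a grid and take the argmax.
grid = np.linspace(0.001, 0.999, 999)
theta_hat = grid[np.argmax(log_likelihood(grid))]

print(theta_hat)    # ~0.6
print(H / (H + T))  # closed-form MLE: 3/5
```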

Page 19

But how many flips do I need?

• 3 heads and 2 tails: θ̂ = 3/5, I can prove it!
• What if I flipped 30 heads and 20 tails?
• Same answer, I can prove it!
• What’s better?
• Umm… the more the merrier???

Page 20

A bound (from Hoeffding’s inequality)

[Plot: probability of mistake vs. N, decaying exponentially ("Exponential Decay!")]

• For N = H + T flips, let θ* be the true parameter; then for any ε > 0:

  P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)
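A quick simulation (a sketch, not from the slides; θ*, ε, and the trial counts are arbitrary choices) shows the empirical mistake probability falling under the Hoeffding bound as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, eps, trials = 0.3, 0.1, 100_000

for n in [10, 50, 100, 500]:
    # `trials` independent datasets of n Bernoulli(theta_star) flips each.
    flips = rng.random((trials, n)) < theta_star
    theta_hat = flips.mean(axis=1)
    p_mistake = np.mean(np.abs(theta_hat - theta_star) >= eps)
    bound = 2 * np.exp(-2 * n * eps**2)
    print(f"N={n:4d}  empirical={p_mistake:.4f}  bound={bound:.4f}")
```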

Page 21

What if I have prior beliefs?

• Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?

• Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Panels: prior in the beginning → observe flips, e.g. {tails, tails} → posterior after observations]

Page 22

How to use a Prior

• Use Bayes’ rule:

  P(θ | D) = P(D | θ) P(θ) / P(D)

  (posterior = data likelihood × prior / normalization)

• Or equivalently:

  P(θ | D) ∝ P(D | θ) P(θ)

• Also, for uniform priors (P(θ) ∝ 1), this reduces to the MLE objective:

  P(θ | D) ∝ P(D | θ)

Page 23

Beta prior distribution – P(θ)

  P(θ) = θ^(βH−1) (1 − θ)^(βT−1) / B(βH, βT)  ~  Beta(βH, βT)

• Likelihood function:

  P(D | θ) = θ^H (1 − θ)^T

• Posterior:

  P(θ | D) ∝ θ^(H+βH−1) (1 − θ)^(T+βT−1)  ~  Beta(H + βH, T + βT)

Page 24

MAP for Beta distribution

• MAP: use the most likely parameter (the posterior mode):

  θ̂ = argmaxθ P(θ | D) = (H + βH − 1) / (H + T + βH + βT − 2)
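A minimal sketch of the conjugate update and the MAP estimate (the prior strength Beta(3, 3) and the {tails, tails} data are illustrative choices, not from the slides):

```python
from scipy.stats import beta

# Prior encoding "close to 50-50" with modest strength (assumed): Beta(3, 3).
beta_H, beta_T = 3, 3

# Observed flips, e.g. {tails, tails} as on the earlier slide.
H, T = 0, 2

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
posterior = beta(beta_H + H, beta_T + T)

# MAP = posterior mode = (H + beta_H - 1) / (H + T + beta_H + beta_T - 2).
theta_map = (H + beta_H - 1) / (H + T + beta_H + beta_T - 2)

print(theta_map)         # 1/3: pulled toward 0.5 by the prior (the MLE would be 0)
print(posterior.mean())  # posterior mean, an alternative point estimate
```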

Page 25

What about continuous variables?

Page 26

We like Gaussians because…

• An affine transformation (multiplying by a scalar and adding a constant) of a Gaussian is Gaussian
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)

• A sum of independent Gaussians is Gaussian
  – X ~ N(μX, σ²X)
  – Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)

• Easy to differentiate
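Both closure properties are easy to check empirically; a small sketch (the particular parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 3.0, 0.5, 1.0
n = 1_000_000

x = rng.normal(mu, sigma, size=n)
y = a * x + b                          # affine transform of a Gaussian
print(y.mean(), y.std())               # ~ a*mu + b = 2.0, |a|*sigma = 1.5

z = x + rng.normal(1.0, 2.0, size=n)   # sum of independent Gaussians
print(z.mean(), z.var())               # ~ 2 + 1 = 3, 3^2 + 2^2 = 13
```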

Page 27

Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

• Learn parameters
  – Mean: μ
  – Variance: σ²

  i     Exam score
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89

Page 28

MLE for a Gaussian

• Prob. of i.i.d. samples D = {x1, …, xN}:

  P(D | μ, σ) = Πi (1 / (σ√(2π))) e^(−(xi − μ)² / (2σ²))

• Log-likelihood of the data:

  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi (xi − μ)² / (2σ²)

Page 29

MLE for the mean of a Gaussian

• What’s the MLE for the mean? Set the derivative to zero:

  ∂/∂μ ln P(D | μ, σ) = Σi (xi − μ) / σ² = 0  ⇒  μ̂MLE = (1/N) Σi xi

Page 30

MLE for variance

• Again, set the derivative to zero:

  ∂/∂σ ln P(D | μ, σ) = −N/σ + Σi (xi − μ)² / σ³ = 0  ⇒  σ̂²MLE = (1/N) Σi (xi − μ̂)²

Page 31

Learning Gaussian parameters

• MLE:

  μ̂MLE = (1/N) Σi xi
  σ̂²MLE = (1/N) Σi (xi − μ̂)²

• Note: the MLE for the variance is biased; the unbiased estimator divides by N − 1 instead of N.
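In code the two estimates are one-liners; a sketch using the few exam scores visible in the table above (purely illustrative data):

```python
import numpy as np

scores = np.array([85, 95, 100, 12, 89], dtype=float)

mu_mle = scores.mean()                     # (1/N) sum x_i
var_mle = np.mean((scores - mu_mle) ** 2)  # (1/N) sum (x_i - mu)^2, the biased MLE

print(mu_mle, var_mle)
print(np.var(scores))          # same 1/N convention as the MLE
print(np.var(scores, ddof=1))  # unbiased 1/(N-1) estimate
```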

Page 32

MAP

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for the mean: a Gaussian, μ ~ N(η, ν²)

Page 33

Supervised Learning: find f

• Given: training set {(xi, yi) | i = 1 … n}
• Find: a good approximation to f : X → Y

• What is x? What is y?

Page 34

Simple Example: Digit Recognition

• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  – Get a large collection of example images, each labeled with a digit
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future digit images

• Features: ?

[Example digit images labeled 0, 1, 2, 1, ??]

“Screw you, I want to use pixels :D”

Page 35

Let’s take a probabilistic approach!!!

• Can we directly estimate the data distribution P(X, Y)?

• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes
  – Likelihood, P(X | Y):
    • Suppose X is composed of n binary features
    • (A full joint over n binary features needs on the order of k · 2^n parameters, far too many to estimate directly.)

Page 36

Conditional Independence

• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:

  (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

• Equivalent to:

  P(X, Y | Z) = P(X | Z) P(Y | Z)

Page 37

Naïve Bayes

• Naïve Bayes assumption: features are independent given the class:

  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

• More generally:

  P(X1 … Xn | Y) = Πi P(Xi | Y)

Page 38

The Naïve Bayes Classifier

• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, the likelihood P(Xi | Y)

• Decision rule:

  y* = argmaxy P(y) Πi P(xi | y)

[Graphical model: class node Y with children X1, X2, …, Xn]
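A minimal sketch of that decision rule for binary features (the toy numbers below are made up for illustration; log space avoids underflow when n is large):

```python
import numpy as np

def nb_predict(x, prior, likelihood):
    """Naive Bayes decision rule, y* = argmax_y P(y) prod_i P(x_i | y),
    computed in log space for numerical stability.

    prior:      (K,) array, P(Y = y)
    likelihood: (K, n) array, P(X_i = 1 | Y = y) for binary features
    x:          (n,) binary feature vector
    """
    log_post = np.log(prior) + (
        x * np.log(likelihood) + (1 - x) * np.log(1 - likelihood)
    ).sum(axis=1)
    return int(np.argmax(log_post))

# Toy usage: 2 classes, 3 binary features.
prior = np.array([0.5, 0.5])
likelihood = np.array([[0.9, 0.2, 0.1],
                       [0.1, 0.8, 0.7]])
print(nb_predict(np.array([1, 0, 0]), prior, likelihood))  # -> class 0
```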

Page 39

A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9

Page 40

Naïve Bayes for Digits (Binary Inputs)

• Simple version:
  – One feature Fij for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector (one binary value per grid position)
  – Here: lots of features, each binary valued

• Naïve Bayes model:

  P(Y | F0,0 … Fn,n) ∝ P(Y) Πi,j P(Fi,j | Y)

• Are the features independent given the class?
• What do we need to learn?
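The feature map itself is just a threshold; a tiny sketch (the 16×16 size and the random input are placeholders):

```python
import numpy as np

# Hypothetical 16x16 grayscale image with intensities in [0, 1].
image = np.random.rand(16, 16)

# One binary feature per grid position <i,j>: on iff intensity > 0.5.
features = (image > 0.5).astype(int).ravel()  # shape (256,)
```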

Page 41

Example Distributions

  Y    P(Y)    P(Fa = on | Y)    P(Fb = on | Y)
  1    0.1     0.01              0.05
  2    0.1     0.05              0.01
  3    0.1     0.05              0.90
  4    0.1     0.30              0.80
  5    0.1     0.80              0.90
  6    0.1     0.90              0.90
  7    0.1     0.05              0.25
  8    0.1     0.60              0.85
  9    0.1     0.50              0.60
  0    0.1     0.80              0.80

  (Uniform class prior, plus class-conditional “on” probabilities for two example pixel features, labeled Fa and Fb here.)

Page 42

MLE for the parameters of NB

• Given a dataset:
  – Count(A = a, B = b): the number of examples where A = a and B = b

• MLE for discrete NB is simply:
  – Prior:      P(Y = y) = Count(Y = y) / N
  – Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
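The counting estimates above, as a sketch for binary features (the array shapes are assumptions, not from the slides):

```python
import numpy as np

def nb_mle(X, y, k):
    """Plain-count MLE for discrete NB with binary features.

    X: (m, n) binary feature matrix; y: (m,) labels in {0, ..., k-1}.
    Prior:      P(Y=c)         = Count(Y=c) / m
    Likelihood: P(X_i=1 | Y=c) = Count(X_i=1, Y=c) / Count(Y=c)
    """
    m, n = X.shape
    prior = np.bincount(y, minlength=k) / m
    # Note: zero counts here yield probability exactly 0; see smoothing below.
    likelihood = np.vstack([X[y == c].mean(axis=0) for c in range(k)])
    return prior, likelihood
```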

Page 43

Violating the NB assumption

• Usually, features are not conditionally independent:

  P(X1 … Xn | Y) ≠ Πi P(Xi | Y)

• NB often performs well, even when the assumption is violated
  – [Domingos & Pazzani ’96] discuss some conditions for good performance

Page 44

Smoothing

[Figure comparing two estimates: “2 wins!!”]

• Does this happen in vision?
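Zero counts are the usual motivation for smoothing: any feature value never seen with a class gets probability 0 and vetoes that class outright. One standard fix is Laplace / add-α smoothing (a common scheme, not necessarily the slides’ exact one), which pads every count:

```python
import numpy as np

def nb_mle_smoothed(X, y, k, alpha=1.0):
    """Add-alpha (Laplace) smoothing for discrete NB with binary features:

    P(Y=c)         = (Count(Y=c) + alpha) / (m + k*alpha)
    P(X_i=1 | Y=c) = (Count(X_i=1, Y=c) + alpha) / (Count(Y=c) + 2*alpha)

    so no feature/class combination ever gets probability exactly 0.
    """
    m, n = X.shape
    prior = (np.bincount(y, minlength=k) + alpha) / (m + k * alpha)
    likelihood = np.vstack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in range(k)
    ])
    return prior, likelihood
```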

Page 45

NB & Bag of words model

Page 46

What about real features?

• What if we have continuous Xi?
  E.g., character recognition: Xi is the intensity at the ith pixel

• Gaussian Naïve Bayes (GNB):

  P(Xi = x | Y = yk) = (1 / (σik√(2π))) e^(−(x − μik)² / (2σik²))

• Sometimes we assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)

Page 47

Estimating Parameters

Maximum likelihood estimates:

• Mean:

  μ̂ik = Σj δ(yj = yk) xij / Σj δ(yj = yk)

• Variance:

  σ̂²ik = Σj δ(yj = yk) (xij − μ̂ik)² / Σj δ(yj = yk)

  (xij is the ith feature of the jth training example; δ(x) = 1 if x is true, else 0)
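Those per-class, per-feature estimates in code, as a sketch (the shapes and label encoding are assumptions):

```python
import numpy as np

def gnb_fit(X, y, k):
    """MLE for Gaussian NB: one mean and variance per (class, feature) pair.

    X: (m, n) real-valued features; y: (m,) labels in {0, ..., k-1}.
    mu[c, i]  = mean of feature i over examples with y = c
    var[c, i] = 1/N_c variance of feature i over those examples (the MLE)
    """
    mu = np.vstack([X[y == c].mean(axis=0) for c in range(k)])
    var = np.vstack([X[y == c].var(axis=0) for c in range(k)])  # ddof=0 -> MLE
    return mu, var
```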

Page 48

What you need to know about Naïve Bayes

• Naïve Bayes classifier
  – What’s the assumption
  – Why we use it
  – How do we learn it
  – Why Bayesian estimation is important
  – Bag of words model

• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class

• Optimal decision using the Bayes classifier

Page 49

Another probabilistic approach!!!

• Naïve Bayes: directly estimate the data distribution P(X, Y)!
  – challenging due to the size of the distribution!
  – make the Naïve Bayes assumption: only need P(Xi | Y)!

• But wait, we classify according to:
  – maxY P(Y | X)

• Why not learn P(Y | X) directly?