Pattern Classification & Decision Theory
Transcript
Page 1:

Pattern Classification & Decision Theory

Page 2:

How are we doing on the pass sequence?

• Bayesian regression and estimation enables us to track the man in the striped shirt based on labeled data

• Can we track the man in the white shirt? Not very well.

[Figure: tracking results, with the hand-labeled horizontal coordinate t plotted against the feature x]

Regression fails to identify that there really are two classes of solution

Page 3:

Decision theory

• We wish to classify an input x as belonging to one of K classes, where class k is denoted $\mathcal{C}_k$

• Example: Buffalo digits, 10 classes, 16x16 images, each image x is a 256-dimensional vector, $x_m \in [0,1]$

Page 4:

Decision theory

• We wish to classify an input x as belonging to one of K classes, where class k is denoted $\mathcal{C}_k$

• Partition the input space into regions $\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_K$, so that if $x \in \mathcal{R}_k$, our classifier predicts class $\mathcal{C}_k$

• How should we choose the partition?

• Suppose x is presented with probability p(x) and the distribution over the class labels given x is $p(\mathcal{C}_k|x)$

• Then, $p(\text{correct}) = \sum_k \int_{x \in \mathcal{R}_k} p(x)\, p(\mathcal{C}_k|x)\, dx$

• This is maximized by assigning each x to the region whose class maximizes $p(\mathcal{C}_k|x)$
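A minimal sketch of this decision rule in code; the posterior values below are invented purely for illustration:

```python
import numpy as np

# Bayes-optimal decision rule: assign each input to the class with the
# largest posterior p(C_k | x). These posteriors are made-up numbers.
posteriors = np.array([
    [0.7, 0.2, 0.1],   # p(C_k | x_1), K = 3 classes
    [0.1, 0.3, 0.6],   # p(C_k | x_2)
])

predicted = np.argmax(posteriors, axis=1)
print(predicted)  # [0 2]: each x falls in the region of its most probable class
```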

Page 5:

Three approaches to pattern classification

1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label

2. Discriminative and probabilistic
– For each class k, learn the probability model $p(\mathcal{C}_k|x)$
– Use this probability to classify a new input x

[Figure: a discriminant function f(x) mapping inputs to class labels 0, 1, 2, 3]

Page 6:

Three approaches to pattern classification

3. Generative
– For each class k, learn the generative probability model $p(x|\mathcal{C}_k)$
– Also, learn the class probabilities $p(\mathcal{C}_k)$
– Use Bayes’ rule to classify a new input:

$p(\mathcal{C}_k|x) = \dfrac{p(x|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(x)}$, where $p(x) = \sum_j p(x|\mathcal{C}_j)\,p(\mathcal{C}_j)$
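A quick numeric illustration of this rule; the class likelihoods and priors below are made up:

```python
import numpy as np

# Bayes' rule for classification: p(C_k|x) is proportional to p(x|C_k) p(C_k).
likelihoods = np.array([0.02, 0.10, 0.05])  # p(x|C_k) for K = 3 classes (invented)
priors = np.array([0.5, 0.3, 0.2])          # p(C_k) (invented)

joint = likelihoods * priors
posterior = joint / joint.sum()             # divide by p(x) = sum_j p(x|C_j) p(C_j)
print(posterior, "-> predict class", np.argmax(posterior))  # [0.2 0.6 0.2] -> class 1
```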

Page 7:

Three approaches to pattern classification

1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label

2. Discriminative and probabilistic
– For each class k, learn the probability model $p(\mathcal{C}_k|x)$
– Use this probability to classify a new input x

[Figure: a discriminant function f(x) mapping inputs to class labels 0, 1, 2, 3]

Page 8:

Can we use regression to learn discriminant functions?

[Figure: a discriminant function f(x) mapping inputs to class labels 0, 1, 2, 3]

Page 9:

Can we use regression to learn discriminant functions?

[Figure: two discriminant functions f(x) fit by regression to class labels 0, 1, 2, 3]

• What do the classification regions look like?
• Is there any sense in which squared error is an appropriate cost function for classification?
• We should be careful not to interpret integer-valued class labels as ordinal targets for regression

Page 10:

The one-of-K representation

• For K > 2 classes, each class is represented by a binary vector t with a 1 indicating the class:

$t = (1, 0, 0, \ldots, 0)^\top$ for class 1, $t = (0, 1, 0, \ldots, 0)^\top$ for class 2, and so on

• K regression problems: learn $y_k(x) = \sum_m w_{km} x_m$ to predict the kth target $t_k$

• To classify x, pick the class k with the largest $y_k(x)$
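A minimal sketch of this scheme on synthetic data, assuming a linear model per class and solving the K least-squares problems jointly (the class centers are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 300
labels = rng.integers(0, K, size=N)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # hypothetical class means
X = rng.normal(size=(N, 2)) + centers[labels]
X = np.hstack([np.ones((N, 1)), X])        # prepend a bias feature

T = np.eye(K)[labels]                      # one-of-K targets: row n has a 1 at position label_n
W, *_ = np.linalg.lstsq(X, T, rcond=None)  # solve the K regression problems at once
pred = np.argmax(X @ W, axis=1)            # pick class k with the largest y_k(x)
print("training accuracy:", np.mean(pred == labels))
```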

Page 11:

Let’s focus on binary classification

• Predict target $t \in \{0,1\}$ from input vector x

• Denote the mth input of training case n by $x_{nm}$

Page 12:

Classification boundary for linear regression

• Values of x where y(x) = 0.5 are ambiguous – these form the classification boundary

• For these points, $\sum_m w_m x_m = 0.5$

• If x is (M+1)-dimensional, this defines an M-dimensional hyperplane separating the classes

Page 13:

How well does linear regression work?

Works well in some cases, but there are two problems:

• Due to linearity, “extreme” x’s cause extreme y(x,w)’s
• Due to squared error, extreme y(x,w)’s dominate learning

[Figure: decision boundaries found by linear regression and by logistic regression (more later)]

Page 14:

Clipping off extreme y’s

• To clip off extremes of $\sum_m w_m x_m$, we can use a sigmoid function:

$y(x) = \sigma\big(\textstyle\sum_m w_m x_m\big)$, where $\sigma(a) = \dfrac{1}{1+e^{-a}}$

• Now, squared error won’t penalize extreme x’s

• y is now a non-linear function of w, so learning is harder

Page 15:

How the observed error propagates back to the parameters

$E(w) = \tfrac{1}{2} \sum_n \big( t_n - \sigma\big(\textstyle\sum_m w_m x_{nm}\big) \big)^2$

• The rate of change of E w.r.t. $w_m$ is

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, y_n (1-y_n)\, x_{nm}$, where $y_n = \sigma\big(\textstyle\sum_m w_m x_{nm}\big)$

– Useful fact: $\sigma'(a) = \sigma(a)(1 - \sigma(a))$

• Compare with linear regression:

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, x_{nm}$
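A small sketch that checks this gradient against a finite-difference estimate on toy data:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error(w, X, t):
    y = sigmoid(X @ w)
    return 0.5 * np.sum((t - y) ** 2)

# Check dE/dw_m = -sum_n (t_n - y_n) y_n (1 - y_n) x_nm against finite differences.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

y = sigmoid(X @ w)
analytic = -(X.T @ ((t - y) * y * (1 - y)))

eps = 1e-6
numeric = np.array([(error(w + eps * e, X, t) - error(w - eps * e, X, t)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric))  # True: the formula matches
```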

Page 16:

The effect of clipping

• Regression with sigmoid:

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, y_n (1-y_n)\, x_{nm}$

• Linear regression:

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, x_{nm}$

For these outliers, both $(t_n - y_n) \approx 0$ and $y_n(1-y_n) \approx 0$, so the outliers won’t hold back improvement of the boundary

Page 17:

Squared error for learning probabilities

• If t = 0 and y ≈ 1, y is moderately pulled down (gradient ≈ 1)

• If t = 0 and y ≈ 0, y is weakly pulled down (gradient ≈ 0)

[Figure: $E = \tfrac{1}{2}(t-y)^2$ plotted against y for t = 0, with dE/dy = 1 at y = 1 and dE/dy = 0 at y = 0]

Problems:
• Being certainly wrong is often undesirable
• Often, tiny differences between small probabilities count a lot

Page 18:

Three approaches to pattern classification

1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label

2. Discriminative and probabilistic
– For each class k, learn the probability model $p(\mathcal{C}_k|x)$
– Use this probability to classify a new input x

[Figure: a discriminant function f(x) mapping inputs to class labels 0, 1, 2, 3]

Page 19:

Logistic regression: Binary likelihood

• As before, we use: $y(x) = \sigma\big(\textstyle\sum_m w_m x_m\big)$, where $\sigma(a) = \dfrac{1}{1+e^{-a}}$

• Now, use the binary likelihood: $p(t|x) = y(x)^t (1-y(x))^{1-t}$

• Data log-likelihood:

$L = \sum_n t_n \ln \sigma\big(\textstyle\sum_m w_m x_{nm}\big) + (1-t_n) \ln\big(1-\sigma\big(\textstyle\sum_m w_m x_{nm}\big)\big)$

• Unlike linear regression, L is nonlinear in the w’s, so gradient-based optimizers are needed

Page 20:

Binary likelihood for learning probabilities

• If t = 0 and y ≈ 1, y is strongly pulled down (gradient → ∞)

• If t = 0 and y ≈ 0, y is moderately pulled down (gradient ≈ 1)

[Figure: for t = 0, the log-loss $E = -\ln(1-y)$ has dE/dy = 1 at y = 0 and dE/dy → ∞ as y → 1, whereas the squared error $E = \tfrac{1}{2}(t-y)^2$ has dE/dy = 0 at y = 0 and dE/dy = 1 at y = 1]
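A two-line numeric comparison of the pull each cost exerts on y for a t = 0 target:

```python
import numpy as np

# For t = 0: squared error E = y^2/2 has dE/dy = y (bounded by 1),
# while the log-loss E = -ln(1 - y) has dE/dy = 1/(1 - y) (unbounded).
y = np.array([0.01, 0.5, 0.99])
print("squared error dE/dy:", y)            # [0.01 0.5  0.99]
print("log-loss      dE/dy:", 1.0 / (1.0 - y))  # [~1.01 2.0 100.0]
```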

Page 21:

How the observed error propagates back to the parameters

$L = \sum_n t_n \ln \sigma\big(\textstyle\sum_m w_m x_{nm}\big) + (1-t_n) \ln\big(1-\sigma\big(\textstyle\sum_m w_m x_{nm}\big)\big)$

• The rate of change of L w.r.t. $w_m$ is

$\partial L/\partial w_m = \sum_n ( t_n - y_n )\, x_{nm}$, where $y_n = \sigma\big(\textstyle\sum_m w_m x_{nm}\big)$

• Compare with sigmoid plus squared error:

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, y_n (1-y_n)\, x_{nm}$

• Compare with linear regression:

$\partial E(w)/\partial w_m = -\sum_n ( t_n - y_n )\, x_{nm}$
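Putting the pieces together, a minimal sketch of logistic regression trained by gradient ascent on L; the data is synthetic, sampled from a hypothetical "true" weight vector:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradient ascent on the log-likelihood, using dL/dw_m = sum_n (t_n - y_n) x_nm.
rng = np.random.default_rng(2)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias + 2 features
true_w = np.array([-0.5, 2.0, -1.0])                       # invented "true" weights
t = (rng.random(N) < sigmoid(X @ true_w)).astype(float)    # sampled labels

w = np.zeros(3)
for _ in range(2000):
    y = sigmoid(X @ w)
    w += 0.5 * (X.T @ (t - y)) / N   # averaged gradient ascent step on L
print("learned w:", np.round(w, 2))  # should land roughly near true_w
```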

Page 22:

How well does logistic regression work?

[Figure: decision boundaries found by logistic regression and by linear regression]

Page 23:

Multiclass logistic regression

• Create one set of weights per class and define $y_k(x) = \sum_m w_{km} x_m$

• The K-class generalization of the sigmoid function is

$p(t|x) = \dfrac{\prod_k \exp(t_k\, y_k(x))}{\sum_k \exp(y_k(x))}$

which, for one-of-K targets t, is equivalent to

$p(\mathcal{C}_k|x) = \dfrac{\exp(y_k(x))}{\sum_j \exp(y_j(x))}$

• Learning: similar to logistic regression (see textbook)
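A minimal sketch of this K-class posterior; the per-class scores are invented, and the max-subtraction is a standard numerical-stability trick:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max cancels in the ratio and avoids overflow in exp.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

y = np.array([2.0, 0.5, -1.0])  # hypothetical scores y_k(x) for K = 3 classes
p = softmax(y)                  # p(C_k|x) = exp(y_k(x)) / sum_j exp(y_j(x))
print(p, p.sum())               # posteriors over the 3 classes; they sum to 1
```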

Page 24:

Three approaches to pattern classification

3. Generative
– For each class k, learn the generative probability model $p(x|\mathcal{C}_k)$
– Also, learn the class probabilities $p(\mathcal{C}_k)$
– Use Bayes’ rule to classify a new input:

$p(\mathcal{C}_k|x) = \dfrac{p(x|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(x)}$, where $p(x) = \sum_j p(x|\mathcal{C}_j)\,p(\mathcal{C}_j)$

Page 25:

Gaussian generative models

• We can assume each element of x is independent and Gaussian, given the class:

$p(x|\mathcal{C}_k) = \prod_m p(x_m|\mathcal{C}_k) = \prod_m \mathcal{N}(x_m \mid \mu_{km}, \sigma_{km}^2)$

[Figure: contour plots of the density with parameters $\mu_{k1}, \mu_{k2}, \sigma_{k1}^2, \sigma_{k2}^2$; the isotropic Gaussian has $\sigma_{k1}^2 = \sigma_{k2}^2$]

Page 26:

Learning a Buffalo digit classifier (5000 training cases)

• The generative ML estimates of $\mu_{km}$ and $\sigma_{km}^2$ are just the data means and variances:

Means: [figure: per-class mean images]

Variances (black = low variance, white = high variance): [figure: per-class variance images]

• The classes are equally frequent, so $p(\mathcal{C}_k) = 1/10$

• To classify a new input x, compute (in the log-domain!) $p(\mathcal{C}_k|x) \propto p(\mathcal{C}_k) \prod_m \mathcal{N}(x_m \mid \mu_{km}, \sigma_{km}^2)$
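A minimal sketch of this classifier in code; since the Buffalo digits are not reproduced here, random synthetic "images" stand in for the learned parameters, and the variance floor from the next slide is included:

```python
import numpy as np

# Generative Gaussian classifier: per-class means/variances, log-domain scoring.
rng = np.random.default_rng(3)
K, M = 10, 256
means = rng.random((K, M))                  # stand-ins for the ML estimates mu_km
variances = np.maximum(rng.random((K, M)) * 0.05, 0.01)  # variance floor (next slide)
log_prior = np.log(np.full(K, 1.0 / K))     # equally frequent classes

def classify(x):
    # log p(C_k) + sum_m log N(x_m | mu_km, var_km), compared across k
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances)
                            + (x - means) ** 2 / variances, axis=1)
    return np.argmax(log_prior + log_lik)

x = rng.random(M)  # a new 16x16 image flattened to 256 values in [0,1]
print("predicted class:", classify(x))
```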

Page 27:

A problem with the ML estimate

• Some pixels were constant across all training images within a class, so the ML estimate gives $\sigma^2 = 0$

• This causes numerical problems when evaluating Gaussian densities, but it is also an overfitting problem

• Common hack: add $\sigma_{\min}^2$ to all variances

• More principled approaches:
– Regularize $\sigma^2$
– Place a prior on $\sigma^2$ and use MAP
– Place a prior on $\sigma^2$ and use Bayesian learning

Page 28:

Classifying test data (5000 test cases)

• Adding $\sigma_{\min}^2 = 0.01$ to the variances, we obtain:

– Error rate on training set = 16.00% (std dev 0.5%)
– Error rate on test set = 16.72% (std dev 0.5%)

• Effect of the value of $\sigma_{\min}^2$ on error rates:

[Figure: training and test error rates plotted against $\log_{10} \sigma_{\min}^2$]

Page 29:

Full-covariance Gaussian models

• Let $x = \Lambda y$, where y is isotropic Gaussian and $\Lambda$ is an M x M rotation and scale matrix

• This generates a full-covariance Gaussian:

$p(x) = \dfrac{1}{(2\pi)^{M/2}\,|\Sigma|^{1/2}} \exp\!\big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\big)$

• Defining $\Sigma = (\Lambda^{-\top}\Lambda^{-1})^{-1}$, we obtain the above, where $\Sigma$ is the covariance matrix, $\Sigma_{jk} = \mathrm{COV}(x_j, x_k)$, and $|\Sigma|$ is its determinant
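A small sketch verifying the construction empirically: samples $x = \Lambda y$ acquire covariance $\Sigma = \Lambda\Lambda^\top$ (the rotation angle and scales below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 0.5
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
L = rotation @ np.diag([2.0, 0.5])   # Lambda: rotate and scale

y = rng.normal(size=(100_000, 2))    # isotropic Gaussian samples
x = y @ L.T                          # x = Lambda y for each sample

print(np.round(np.cov(x.T), 2))      # empirical covariance of x
print(np.round(L @ L.T, 2))          # Sigma = Lambda Lambda^T; should match closely
```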

Page 30:

Generative models easily induce non-linear decision boundaries

• The following three-class problem shows how three axis-aligned Gaussians can induce nonlinear decision boundaries

Page 31:

Generative models easily induce non-linear decision boundaries

• Two Gaussians can be used to account for “inliers” and “outliers”

Page 32:

How are we doing on the pass sequence?

• Bayesian regression and estimation enables us to track the man in the striped shirt based on labeled data

• Can we track the man in the white shirt? Not very well.

[Figure: tracking results, with the hand-labeled horizontal coordinate t plotted against the feature x]

Regression fails to identify that there really are two classes of solution

Page 33:

Using classification to improve tracking

The position of the man in the striped shirt can be used to classify the tracking mode of the man in the white shirt

[Figure: two panels: position of man in white shirt vs. feature, and position of man in striped shirt vs. feature]

Page 34:

Using classification to improve tracking

• $x_s$ and $t_s$ = feature and position of man in striped shirt

• $x_w$ and $t_w$ = feature and position of man in white shirt

• For man in white shirt, hand-label regions $\mathcal{C}_1$ and $\mathcal{C}_2$ and learn two trackers

[Figure: position of man in striped shirt $t_s$ vs. feature $x_s$, with $p(t_s|x_s)$ and $p(\mathcal{C}_k|t_s)$; position of man in white shirt $t_w$ vs. feature $x_w$, with the two trackers $p(t_w|x_w, \mathcal{C}_k)$]

Page 35:

Using classification to improve tracking

• The classifier can be obtained using the generative approach, where each class-conditional likelihood is a Gaussian

• Note: linear classifiers won’t work

[Figure: position of man in striped shirt $t_s$ vs. feature $x_s$, showing $p(t_s|x_s)$, the class-conditional Gaussians $p(t_s|\mathcal{C}_1)$ and $p(t_s|\mathcal{C}_2)$, and the posterior $p(\mathcal{C}_k|t_s)$]

Page 36:

Questions?

Page 37:

How are we doing on the pass sequence?

• We can now track both men, provided with:
– Hand-labeled coordinates of both men in 30 frames
– Hand-extracted features (stripe detector, white blob detector)
– Hand-labeled classes for the white-shirt tracker

• We have a framework for how to optimally make decisions and track the men

Page 38:

How are we doing on the pass sequence?

• We can now track both men, provided with:
– Hand-labeled coordinates of both men in 30 frames
– Hand-extracted features (stripe detector, white blob detector)
– Hand-labeled classes for the white-shirt tracker

• We have a framework for how to optimally make decisions and track the men

This takes too much time to do by hand!

Page 39:

Page 40:

Lecture 4 Appendix

Page 41:

Binary classification regions for linear regression

• $\mathcal{R}_1$ is defined by $y(x) \ge 0$, and vice versa for $\mathcal{R}_2$

• Values of x satisfying $y(x) = 0$ are on the decision boundary, which is a (D-1)-dimensional hyperplane
– w specifies the orientation of the decision hyperplane
– $-w_0/\lVert w\rVert$ specifies the distance from the hyperplane to the origin
– The distance from input x to the hyperplane is $y(x)/\lVert w\rVert$
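A small sketch of this geometry, assuming $y(x) = w^\top x + w_0$ with hypothetical values for w and $w_0$:

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical weights; ||w|| = 5
w0 = -5.0                  # hypothetical bias

def y(x):
    return w @ x + w0

origin_distance = -w0 / np.linalg.norm(w)  # distance from hyperplane to origin
x = np.array([2.0, 1.0])
point_distance = y(x) / np.linalg.norm(w)  # signed distance from x to hyperplane
print(origin_distance, point_distance)     # 1.0, 1.0
```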

Page 42:

K-ary classification regions for linear regression

• $x \in \mathcal{R}_k$ if $y_k(x) \ge y_j(x)$ for all $j \ne k$

• Each resulting classification region is contiguous and has linear boundaries

Page 43:

Fisher’s linear discriminant and least squares

• Fisher: viewing $y = w^\top x$ as a projection, pick w to maximize the distance between the projected class means, while also minimizing the projected variance of each class

• The same w is obtained using linear regression, by setting $t = N/N_1$ for class 1 and $t = -N/N_2$ for class 2, where $N_k$ is the number of training cases in class k

Page 44:

In what sense is logistic regression linear?

• The log-odds can be written thus:

$\ln \dfrac{y(x)}{1-y(x)} = \sum_m w_m x_m$

• Each input contributes linearly to the log-odds

Page 45:

Gaussian likelihoods and logistic regression

• For two classes, if their covariance matrices are equal, $\Sigma_1 = \Sigma_2 = \Sigma$, we can write the log-odds as

$\ln \dfrac{p(\mathcal{C}_1|x)}{p(\mathcal{C}_2|x)} = w^\top x + w_0$, where $w = \Sigma^{-1}(\mu_1 - \mu_2)$

• So every classifier built from equal-covariance Gaussian generative models has the form of a logistic regression classifier (the set of logistic regression classifiers contains the set of equal-covariance Gaussian classifiers)
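A sketch of the standard derivation behind this equivalence; the quadratic terms cancel because $\Sigma$ is shared by both classes:

```latex
% Posterior for class 1 under the two-class Gaussian model with shared \Sigma:
%   p(C_1|x) = p(x|C_1)p(C_1) / [ p(x|C_1)p(C_1) + p(x|C_2)p(C_2) ]
%            = \sigma(a), where a is the log-odds below.
\begin{align}
a &= \ln \frac{p(x|\mathcal{C}_1)\,p(\mathcal{C}_1)}{p(x|\mathcal{C}_2)\,p(\mathcal{C}_2)}
   = w^\top x + w_0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2), \\
w_0 &= -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
      + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2
      + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}.
\end{align}
```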

Page 46:

“Linear models” or classifiers with “linear boundaries”

Don’t be fooled:
• Such classifiers can be very hard to learn
• Such classifiers may have boundaries that are highly nonlinear in x (e.g., via basis functions)
• All this means is that in some space the boundaries are linear