Boosting and Ensembles
CS7616 Pattern Recognition – A. Bobick
Aaron Bobick, School of Interactive Computing

Transcript
Page 1:

Boosting and Ensembles CS7616 Pattern Recognition – A. Bobick

Aaron Bobick School of Interactive Computing

CS6716 Pattern Recognition Ensembles and Boosting (1)

Page 2:

Administrivia
• Chapter 10 of the Hastie book.
• Slides brought to you by Aarti Singh, Peter Orbanz, and friends.
• Slides posted… sorry that took so long…
• Final project discussion…

Page 3:

ENSEMBLES
• A randomly chosen hyperplane classifier has an expected error of 0.5 (i.e. 50%).

Page 4:

ENSEMBLES
• A randomly chosen hyperplane classifier has an expected error of 0.5 (i.e. 50%).

Page 5:

Ensembles
• Many random hyperplanes combined by majority vote: still 0.5.
• A single classifier slightly better than random: 0.5 + ε.
• What if we use m such classifiers and take a majority vote?

Page 6:

Voting
• Decision by majority vote:
  • m individuals (or classifiers) take a vote; m is an odd number.
  • They decide between two choices; one is correct, one is wrong.
  • After everyone has voted, a decision is made by simple majority.
• Note: for two-class classifiers f_1, …, f_m (with output ±1):

majority vote = sgn( Σ_{j=1}^{m} f_j )

Page 7:

Voting – likelihoods
• We make some simplifying assumptions:
  • Each individual makes the right choice with probability p ∈ (0, 1).
  • The votes are independent, i.e. stochastically independent when regarded as random outcomes.
• Given m voters (m odd), the probability that the majority makes the right choice is:

Pr(majority correct) = Σ_{j=(m+1)/2}^{m} [ m! / (j! (m−j)!) ] p^j (1−p)^{m−j}

• This result is known as Condorcet's jury theorem.
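
To make the theorem concrete, here is a small Python sketch (our own illustration; the function name majority_correct is made up) that evaluates this sum for a voter only slightly better than random, p = 0.55:

from math import comb

def majority_correct(p, m):
    # Probability that a majority of m independent voters,
    # each correct with probability p, is correct (m odd).
    return sum(comb(m, j) * p**j * (1 - p)**(m - j)
               for j in range((m + 1) // 2, m + 1))

for m in (1, 11, 101, 1001):
    print(m, round(majority_correct(0.55, m), 4))

The probability climbs toward 1 as m grows, exactly as the theorem predicts.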

Page 8:

Power of weak classifiers

Pr(majority correct) = Σ_{j=(m+1)/2}^{m} [ m! / (j! (m−j)!) ] p^j (1−p)^{m−j}

Page 9:

ENSEMBLE METHODS
• An ensemble method makes a prediction by combining the predictions of many classifiers into a single vote.
• The individual classifiers are usually required to perform only slightly better than random. For two classes, this means slightly more than 50% of the data are classified correctly. Such a classifier is called a weak learner.

Page 10:

ENSEMBLE METHODS
• From before: if the "weak learners" are random and independent, the prediction accuracy of the majority vote will increase with the number of weak learners.
• But, since the weak learners are all typically trained on the same training data, producing random, independent weak learners is difficult.
  • (See later for random forests.)
• Different ensemble methods (e.g. boosting, bagging, etc.) use different strategies to train and combine weak learners that behave relatively independently.

Page 11:

Making ensembles work
• Boosting (today)
  • After training each weak learner, the data is modified using weights.
  • Deterministic algorithm.
• Bagging (bootstrap aggregation, from earlier)
  • Each weak learner is trained on a random subset of the data.
• Random forests (later)
  • Bagging with tree classifiers as weak learners.
  • Uses an additional step to remove dimensions that carry little information.

Page 12:

Why boost weak learners?

Goal: automatically categorize the type of call requested (Collect, Calling card, Person-to-person, etc.)
• Easy to find "rules of thumb" that are "often" correct. E.g.: if 'card' occurs in the utterance, then predict 'calling card'.
• Hard to find a single highly accurate prediction rule.

Page 13:

Fighting the bias-variance tradeoff
• Simple (a.k.a. weak) learners, e.g. naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
  • Are good – low variance, don't usually overfit.
  • Are bad – high bias, can't solve hard learning problems.
• Can we make weak learners always good??? – No!!! But often yes…

Page 14:

Voting (Ensemble Methods)
• Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the input space.
• Output class: (weighted) vote of each classifier.
  – Classifiers that are most "sure" will vote with more conviction.
  – Classifiers will be most "sure" about a particular part of the space.
  – On average, do better than a single classifier!

H: X → Y, with Y ∈ {−1, +1};  H(x) = sign(∑_t α_t h_t(x))

[Figure: two weak classifiers h1(x) and h2(x), each confident (±1) on part of the space and uncertain ("?") elsewhere; their weighted combination H(x) = h1(x) + h2(x) resolves the uncertain regions.]

Page 15:

Voting (Ensemble Methods)
• Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the input space.
• Output class: (weighted) vote of each classifier.
  – Classifiers that are most "sure" will vote with more conviction.
  – Classifiers will be most "sure" about a particular part of the space.
  – On average, do better than a single classifier!
• But how do you:
  – force classifiers h_t to learn about different parts of the input space?
  – weight the votes of the different classifiers (the α_t)?

Page 16:

Boosting (Schapire, 1989)
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
• On each iteration t:
  – weight each training example by "how incorrectly" it was classified;
  – learn a weak hypothesis h_t;
  – and a strength for this hypothesis, α_t.
• Final classifier: H(x) = sign(Σ_t α_t h_t(x))
• Practically useful AND theoretically interesting.

Page 17:

Learning from weighted data
• Consider a weighted dataset:
  – D(i) = weight of the i-th training example (x_i, y_i).
  – Interpretations:
    • The i-th training example counts as D(i) examples.
    • If you were to "resample" the data, you would get more samples of "heavier" data points.
• Now, in all calculations, whenever used, the i-th training example counts as D(i) "examples" – e.g., in MLE redefine Count(Y = y) to be the weighted count:

Unweighted data:  Count(Y = y) = Σ_{i=1}^{m} 1(Y_i = y)
With weights D(i):  Count(Y = y) = Σ_{i=1}^{m} D(i) · 1(Y_i = y)
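
As a tiny illustration of the weighted count (plain Python/NumPy; the data is made up), the indicator is simply scaled by each example's weight:

import numpy as np

y = np.array([1, 1, -1, 1, -1])           # labels Y_i
D = np.array([0.1, 0.4, 0.2, 0.2, 0.1])   # example weights, summing to 1

print(np.sum(y == 1))        # unweighted count: each example counts once -> 3
print(np.sum(D * (y == 1)))  # weighted count: heavier examples count more -> 0.7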

Page 18:

Boosting – weak learners

Page 19:

AdaBoost.M1 (1)

Page 20:

AdaBoost.M1 (2)
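
The AdaBoost.M1 algorithm boxes on these two slides are images in the original. As a rough guide, a minimal Python/NumPy sketch of the algorithm with decision stumps as weak learners, using the common α_t = ½ ln((1 − ε_t)/ε_t) variant (all names here are ours, not from the slides):

import numpy as np

def fit_stump(X, y, w):
    # Exhaustively pick the single-feature threshold rule with the
    # lowest weighted error; a crude search, written for clarity.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best  # (weighted error, feature, threshold, sign)

def adaboost(X, y, rounds=50):
    # y must be in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # numerical guard
        alpha = 0.5 * np.log((1 - err) / err)  # strength of this hypothesis
        pred = s * np.where(X[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)      # upweight the mistakes
        w = w / w.sum()                        # renormalize to a distribution
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    f = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(f)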

Page 21:

Example sequence

Page 22:

Boosting: iteratively reweighting the training samples, giving higher weights to previously misclassified samples.

[Figure: decision boundary after 1, 2, 3, 4, 5, and 50 rounds of boosting. Source: ICCV09 tutorial, Tae-Kyun Kim, University of Cambridge.]

Page 23:

Not mysterious????

Page 24:

Minimizing a loss function

Page 25:

Forward stage additive model

Page 26:

Exponential loss (1)

Page 27:

Exponential loss (2) • At each node/level:

Page 28:

Page 29:

Analyzing training error

Training error of the final classifier is bounded by the exponential loss:

(1/m) Σ_{i=1}^{m} 1(H(x_i) ≠ y_i) ≤ (1/m) Σ_{i=1}^{m} exp(−y_i f(x_i)), where f(x) = Σ_t α_t h_t(x)

The exp loss is a convex upper bound on the 0/1 loss. If boosting can make the upper bound → 0, then the training error → 0.

[Figure: 0/1 loss and exp loss as functions of the margin y·f(x).]
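
A quick numeric sanity check of this bound (our own toy example with made-up margins):

import numpy as np

rng = np.random.default_rng(0)
margins = rng.normal(0.3, 1.0, 1000)   # pretend values of y_i * f(x_i)
print(np.mean(margins <= 0))           # 0/1 training error
print(np.mean(np.exp(-margins)))       # exp loss, always the larger number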

Page 30:

Why zero training error isn’t the end?!?!

Page 31:

Boosting results – Digit recognition [Schapire, 1989]
• Boosting is often, but not always:
  – Robust to overfitting.
  – Test set error decreases even after training error is zero.

[Figure: test error and training error vs. number of boosting rounds.]

Page 32:

Boosting and Logistic Regression

Logistic regression is equivalent to minimizing the log loss; boosting minimizes a similar (exponential) loss function! Both are smooth approximations of the 0/1 loss.

[Figure: 0/1 loss, exp loss, and log loss as functions of the margin y·f(x), where f is a weighted average of weak learners.]

Page 33:

Boosting and Logistic Regression

Logistic regression:
• Minimize log loss.
• Define f(x) = Σ_j w_j x_j, where the x_j are predefined features (a linear classifier).
• Jointly optimize over all weights w0, w1, w2, …

Boosting:
• Minimize exp loss.
• Define f(x) = Σ_t α_t h_t(x), where the h_t(x) are defined dynamically to fit the data (not a linear classifier).
• Weights α_t learned incrementally, one per iteration t.

Page 34:

Effect of Outliers
• Good: boosting can identify outliers, since it focuses on examples that are hard to categorize.
• Bad: too many outliers can degrade classification performance and dramatically increase the time to convergence.

Page 35:

Bagging [Breiman, 1996]
Related approach to combining classifiers:
1. Run independent weak learners on bootstrap replicates (sampled with replacement) of the training set.
2. Average/vote over the weak hypotheses.

Bagging vs. Boosting:
• Bagging resamples data points; boosting reweights data points (modifies their distribution), with the weight dependent on the classifier's accuracy.
• In bagging the weight of each classifier is the same; in boosting it depends on the classifier's accuracy.
• Bagging gives only variance reduction; boosting reduces both bias and variance – the learning rule becomes more complex with each iteration.

Page 36:

Example: AdaBoost test error (simulated data)
• Weak learners used are decision stumps.
• Combining many trees of depth 1 yields much better results than a single large tree.

Page 37:

SPAM DATA
• Tree classifier: 9.3% overall error rate.
• Boosting with decision stumps: 4.5%.
• Figure shows the feature selection results of boosting.

Page 38:

Best known boosting application… • Face detection – Viola/Jones
• But it's easy to forget that two things make this work:
  • Boosting – they used real AdaBoost.
  • Cascade architecture.

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Page 39:

Face detection

Searching for faces in images – two problems:
• Face detection: find the locations of all faces in an image. Two classes.
• Face recognition: identify a person depicted in an image by recognizing the face. One class per person to be identified + a background class (all other people).

"Face detection can be regarded as a solved problem." "Face recognition is not solved."

Face detection as a classification problem:
• Divide the image into patches.
• Classify each patch as "face" or "not face".

Page 40:

Face detection
• Basic idea: slide a window across the image and evaluate a face model at every location.

Page 41:

Viola–Jones Technique Overview
• Three major contributions/phases of the algorithm:
  • Feature extraction.
  • Learning using cascaded boosting and decision stumps.
  • Multi-scale detection algorithm.
• Feature extraction and feature evaluation:
  • Rectangular features are used; with a new image representation, their calculation is very fast.
• (First) classifier was actual AdaBoost.
  • Maybe the first demonstration to computer vision that a combination of simple classifiers is very effective.

Page 42:

Feature Extraction
• Features are extracted from sub-windows of a sample image.
  • The base size for a sub-window is 24 by 24 pixels.
  • Basic features are differences of sums of rectangles (white minus black below).
  • Each of the four feature types is scaled and shifted across all possible combinations.

Page 43:

Example

[Figure: source image and result.]

Page 44:

Key to feature computation: Integral images
• The "integral" image is a new image S(x, y), computed from I(x, y), such that

S(x, y) = Σ_{i=1}^{x} Σ_{j=1}^{y} I(i, j)

Page 45:

Fast Computation of Pixel Sums

MATLAB: ii = cumsum(cumsum(double(i)), 2);
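
For comparison, a Python/NumPy sketch (function names are ours) that builds the integral image the same way and then reads off any rectangle sum with four lookups in O(1):

import numpy as np

def integral_image(img):
    # S[x, y] = sum of img[0..x, 0..y], via two cumulative sums.
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(S, top, left, bottom, right):
    # Sum over img[top..bottom, left..right] (inclusive), by inclusion-exclusion.
    total = S[bottom, right]
    if top > 0:
        total -= S[top - 1, right]
    if left > 0:
        total -= S[bottom, left - 1]
    if top > 0 and left > 0:
        total += S[top - 1, left - 1]
    return total

img = np.arange(24 * 24, dtype=np.float64).reshape(24, 24)
S = integral_image(img)
assert rect_sum(S, 2, 3, 10, 7) == img[2:11, 3:8].sum()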

Page 46:

Feature selection
• For a 24×24 detection region, the number of possible rectangle features is ~160,000!
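
One way to see where that number comes from is to enumerate placements of the base shapes; a sketch (assuming the five usual Viola–Jones prototypes, so the exact total depends on which prototypes you count):

def count_rect_features(W=24, H=24):
    # Base shapes: two-rectangle (horizontal/vertical),
    # three-rectangle (horizontal/vertical), and four-rectangle.
    shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
    total = 0
    for bw, bh in shapes:
        for w in range(bw, W + 1, bw):        # scaled widths
            for h in range(bh, H + 1, bh):    # scaled heights
                total += (W - w + 1) * (H - h + 1)  # shifted positions
    return total

print(count_rect_features())  # 162336, i.e. the "~160,000" above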

Page 47:

Feature selection
• For a 24×24 detection region, the number of possible rectangle features is ~160,000!
• At test time, it is impractical to evaluate the entire feature set.
• Can we create a good classifier using just a small subset of all possible features?
• How do we select such a subset?
• No surprise: boosting!

Page 48:

Paul's slide: Boosting
• Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier.
• A weak learner need only do better than chance.
• Training consists of multiple boosting rounds:
  • During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners.
  • "Hardness" is captured by weights attached to training examples.

Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

Page 49:

Paul's Slide: Boosting vs. SVM
• Advantages of boosting:
  • Integrates classification with feature selection.
  • Complexity of training is linear instead of quadratic in the number of training examples.
  • Flexibility in the choice of weak learners and boosting scheme.
  • Testing is fast.
  • Easy to implement.
• Disadvantages:
  • Needs many training examples.
  • Often doesn't work as well as SVMs (especially for many-class problems).

Page 50:

Boosting for face detection
• Define weak learners based on rectangle features.
• For each round of boosting:
  • Evaluate each rectangle filter on each example.
  • Select the best threshold for each filter.
  • Select the best filter/threshold combination.
  • Reweight the examples.
• Computational complexity of learning: O(MNK), with M rounds, N examples, K features.

Page 51:

Boosting for face detection
• First two features selected by boosting: [figure]
• This feature combination can yield a 100% detection rate and a 50% false positive rate.

Page 52:

Boosting for face detection
• A 200-feature classifier can yield a 95% detection rate and a false positive rate of 1 in 14,084.
• Not good enough!

[Figure: receiver operating characteristic (ROC) curve.]

Page 53:

Challenges of face detection
• A sliding window detector must evaluate tens of thousands of location/scale combinations.
• Faces are rare: 0–10 per image.
  • For computational efficiency, we should try to spend as little time as possible on the non-face windows.
  • A megapixel image has ≈ 10^6 pixels and a comparable number of candidate face locations.
  • To avoid having a false positive in every image, our false positive rate has to be less than 10^−6.
• An unbalanced system, with the positive class very small:
  • A standard training algorithm can achieve a good error rate by classifying all data as negative.
  • The error rate will then be precisely the proportion of points in the positive class.

Page 54:

"Attentional" cascade
• We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows.
• A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on.
• A negative outcome at any point leads to the immediate rejection of the sub-window.

[Diagram: an image sub-window passes through Classifier 1, 2, 3, …; a "T" (true) output forwards it to the next classifier, eventually labeling it FACE, while an "F" (false) output at any stage immediately labels it NON-FACE.]
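
At test time the cascade is just an early-exit loop. A minimal sketch (Python; the stage representation as (score function, threshold) pairs is our own assumption):

def cascade_classify(x, stages):
    # stages: list of (score_fn, threshold) pairs, cheapest first.
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:   # this stage rejects: stop immediately
            return -1                 # NON-FACE
    return +1                         # survived every stage: FACE

Most sub-windows are rejected by the first cheap stages, which is where the speed comes from.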

Page 55:

Why does a cascade work?

We have to consider two rates:

false positive rate: FPR(f_j) = (# negative points classified as "+1") / (# negative training points at stage j)
detection rate: DR(f_j) = (# correctly classified positive points) / (# positive training points at stage j)

We want to achieve a low value of FPR(f) and a high value of DR(f).

Class imbalance:
• The number of faces classified as background is (size of face class) × (1 − DR(f)).
  • We would like to see a decently high detection rate, say 90%.
• The number of background patches classified as faces is (size of background class) × FPR(f).
  • Since the background class is huge, FPR(f) has to be very small to yield roughly the same number of errors in both classes. How small?

Page 56:

Why does a cascade work?
• Cascade detection rate: the rates of the overall cascade classifier f are the products of the per-stage rates, DR(f) = Π_j DR(f_j) and FPR(f) = Π_j FPR(f_j).
• Suppose we use a 10-stage cascade (k = 10), each DR(f_j) is 99%, and we permit an FPR(f_j) of 30%.
• We obtain DR(f) = 0.99^10 ≈ 0.90 and FPR(f) = 0.3^10 ≈ 6 × 10^−6.
• Since the cascade drives the false positive rate down exponentially in k, we can set each stage to have a very high DR and live with a fairly high per-stage FPR.
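
The arithmetic, as a one-line check:

k, dr, fpr = 10, 0.99, 0.30
print(dr ** k, fpr ** k)   # ≈ 0.904 and ≈ 5.9e-06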

Page 57:

Training the cascade
• Set target detection and false positive rates for each stage.
• Keep adding features to the current stage until its target rates have been met.
  • Need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error).
  • Test on a validation set.
• If the overall false positive rate is not low enough, then add another stage.
• Use false positives from the current stage as the negative training examples for the next stage.

Page 58:

Training the cascade
• Training procedure:
  1. User selects acceptable rates (FPR and DR) for each level of the cascade.
  2. At each level of the cascade:
     • Train a boosting classifier with the final AdaBoost threshold lowered to maximize detection (as opposed to minimizing total classification error).
     • Gradually increase the number of selected features until the overall rates are achieved.
     • Test on a validation set.
  3. If the overall false positive rate is not low enough, then add another stage.
• Use of training data. Each training step uses:
  • All positive examples (= faces).
  • Negative examples (= non-faces) that are misclassified at the previous cascade layer. (Plus more?)

Page 59:

Classifier cascades
• Training a cascade: use an imbalanced loss (very low false negative rate for each f_i).
  1. Train classifier f_1 on the entire training data set.
  2. Remove from the training set all x_i in the negative class which f_1 classifies correctly (and get some more negatives).
  3. On the smaller training set, train f_2.
  4. Continue…
  5. On the remaining data at the final stage, train f_k.
• Rapid classification with a cascade:
  • If any f_j classifies x as negative, f(x) = −1.
  • Only if all f_j classify x as positive, f(x) = +1.

Page 60:

The implemented system
• Training data:
  • 5000 faces, all frontal, rescaled to 24×24 pixels.
  • 300 million non-face sub-windows, drawn from 9500 non-face images.
  • Faces are normalized for scale and translation.
• Many variations: across individuals, illumination, pose.

Page 61:

System performance
• Training time: "weeks" on a 466 MHz Sun workstation.
• 38 layers, total of 6061 features.
• Average of 10 features evaluated per window on the test set.
• "On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds."
  • 15 Hz.
  • 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998).

Page 62:

Output of Face Detector on Test Images

Page 63:

Other detection tasks
• Facial feature localization
• Male vs. female
• Profile detection

Page 64:

Profile Detection

Page 65:

Profile Features

Page 66:

Beyond AdaBoost…

Page 67:

Why AdaBoost Works: (1) Minimizing a loss function

Page 68:

(2) Forward stage additive model

Page 69:

(3) Exponential Loss

Page 70:

(3) Exponential loss: an upper bound on classification error

Training error of the final classifier is bounded by the exponential loss:

(1/m) Σ_{i=1}^{m} 1(H(x_i) ≠ y_i) ≤ (1/m) Σ_{i=1}^{m} exp(−y_i f(x_i)), where f(x) = Σ_t α_t h_t(x)

The exp loss is a convex upper bound on the 0/1 loss. If boosting can make the upper bound → 0, then the training error → 0.

[Figure: 0/1 loss and exp loss vs. the margin y·f(x).]

Page 71:

Why Boosting Works? (Cont’d)

Source http://www.stat.ucl.ac.be/

Page 72:

Loss Function
• In Hastie it is "easy to show" that the minimizer of the expected exponential loss is

f*(x) = ½ log [ Pr(Y = 1 | x) / Pr(Y = −1 | x) ]

• So AdaBoost is estimating one-half the "log odds" of Pr(Y = 1 | x), so classifying when f is greater than zero makes sense. The above also implies

Pr(Y = 1 | x) = 1 / (1 + e^(−2 f*(x)))

Page 73:

A better loss function
• A different loss function can be derived from a different assumed form of the probability (a logit function in f): the binomial deviance

L(y, f(x)) = log(1 + e^(−2 y f(x)))

• It is minimized by the same f(x), but it is not the same function.

Page 74:

More on Loss Functions and Classification
• y·f(x) is called the margin.
• The classification rule implies that observations with positive margin, y_i f(x_i) > 0, were classified correctly, while the negative-margin ones were not.
• The decision boundary is given by f(x) = 0.
• The loss criterion should penalize negative margins more heavily than positive ones.
• But how much more…

Page 75:

Loss Functions for Two-Class Classification

The exponential loss goes very high as the margin gets bad. This makes AdaBoost less robust to mislabeled data or too many outliers.
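
To see the difference numerically, a short sketch comparing the two losses at a few margins (using the binomial deviance in the log(1 + e^(−2m)) form from the previous slide):

import numpy as np

m = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])   # margins y * f(x)
print(np.exp(-m))                  # exp loss: [54.6, 7.39, 1.0, 0.135, 0.018]
print(np.log(1 + np.exp(-2 * m)))  # deviance: [8.0, 4.02, 0.69, 0.018, 0.0003]

The exponential loss explodes for badly misclassified (very negative-margin) points, while the deviance grows only linearly.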

Page 76:

How to "fix" boosting?
• AdaBoost analytically minimizes exponential loss:
  • Clean equations.
  • Good performance in good cases.
• But exponential loss is sensitive to outliers and misclassified points.
• The "binomial deviance" loss function is better behaved.
• We should be able to boost any weak learner – like trees.
• Can we boost trees for binomial deviance?
  • Not analytically.
  • But we can numerically – "gradient boosting".
• And it can be improved by other tricks:
  • Stochastic sampling of the training points at each stage.
  • Regularization, or diminishing the effect of each stage.

Page 77:

Loss Function (Cont’d)

Source http://www.stat.ucl.ac.be/

Page 78:

Trees Reviewed (in Hastie notation)
• Trees partition the feature vector (joint predictor values) into disjoint regions R_j, j = 1, …, J, represented by the terminal nodes.
• A constant γ_j is assigned to each region, whether regression or classification.
• The predictive/classification rule: f(x) = γ_j for x ∈ R_j.
• The tree is T(x; Θ) = Σ_j γ_j I(x ∈ R_j), where Θ comprises the parameters of the splits R_j and the values γ_j.
• We want to minimize the loss:

Θ̂ = arg min_Θ Σ_{j=1}^{J} Σ_{x_i ∈ R_j} L(y_i, γ_j)

Page 79:

Boosting Trees
• Finding γ_j given R_j: this is easy.
• Finding R_j: this is difficult, so we typically approximate. We described the greedy top-down recursive partitioning algorithm.
• A boosted tree is a sum of such trees, f_M(x) = Σ_{m=1}^{M} T(x; Θ_m), where at each stage m we minimize the loss:

Θ̂_m = arg min_{Θ_m} Σ_{i=1}^{N} L(y_i, f_{m−1}(x_i) + T(x_i; Θ_m))

Page 80:

Boosting Trees (cont'd)
• Given the regions R_{j,m}, the correct γ_{j,m} is whatever minimizes the loss function:

γ̂_{j,m} = arg min_γ Σ_{x_i ∈ R_{j,m}} L(y_i, f_{m−1}(x_i) + γ)

• For exponential loss, if we restrict our trees to be weak learners outputting {−1, +1}, then this is exactly AdaBoost, and we train the same way.
• But if we want a different loss function, we need a numerical method.

Page 81:

Numerical Optimization
• The loss in using prediction f(x) for y is

L(f) = Σ_{i=1}^{N} L(y_i, f(x_i))

• The goal is to minimize L(f) with respect to f, where f is the sum of the trees. But ignore the sum-of-trees constraint for now; just think about the minimization. Here the parameters f are the values of the approximating function f(x_i) at each of the N data points x_i: f = (f(x_1), …, f(x_N)).
• Numerical optimization successively approximates the minimizer f̂ by steps: solve it as a sum of component vectors,

f_M = Σ_{m=0}^{M} h_m,   h_m ∈ ℝ^N

where f_0 = h_0 is the initial guess and each successive f_m is induced based on the current parameter vector f_{m−1}.

Page 82:

Steepest Descent
1. Choose h_m = −ρ_m g_m, where ρ_m is a scalar and g_m is the gradient of L(f) evaluated at f = f_{m−1}.
2. The step length ρ_m is the (line search) solution to

ρ_m = arg min_ρ L(f_{m−1} − ρ g_m)

3. The current solution is then updated:

f_m = f_{m−1} − ρ_m g_m

Page 83:

Gradient Boosting
• Forward stagewise boosting is also a very greedy algorithm.
• The tree predictions can be thought of as negative gradients.
• The only difficulty is that the tree components are not arbitrary:
  • They are constrained to be the predictions of a J_m-terminal-node decision tree, whereas the negative gradient is unconstrained steepest descent.
• Unfortunately, the gradient is only defined at the training data points and is not applicable to generalizing f_m(x) to new data.

Page 84:

Gradient boosting (cont'd)
• For classification, the components of the gradient are g_im = ∂L(y_i, f(x_i)) / ∂f(x_i), evaluated at f = f_{m−1}.
• We're going to approximate the gradient by finding a decision tree that's as close as possible to the gradient.

Page 85:

Gradient boosting
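
The algorithm box on this slide is an image in the original. As a rough sketch of the idea for squared-error regression, where the negative gradient is exactly the residual (names and the use of sklearn's tree fitter are our own simplifications):

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for the J-leaf tree fit

def gradient_boost(X, y, M=100, nu=0.1, depth=2):
    # Each tree is fit to the current residuals, i.e. the negative
    # gradient of the loss 1/2 * (y - f)^2 evaluated at f_{m-1}.
    f0 = y.mean()                        # f_0: best constant prediction
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residual = y - f                 # negative gradient at the data points
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        f = f + nu * tree.predict(X)     # shrunken update, 0 < nu < 1 (see "More tweaks")
        trees.append(tree)
    return f0, nu, trees

def gb_predict(model, X):
    f0, nu, trees = model
    return f0 + nu * sum(t.predict(X) for t in trees)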

Page 86:

Gradient boosting results

Page 87:

More tweaks
• If we fit the "best" tree possible, it will yield a tree that is deep, as if it were the last tree of the boosting ensemble.
  • Large trees permit interaction between elements.
  • Prevent this by either penalizing based upon tree size or, believe it or not, just letting J = 6.
• There is the question of how many stages: if we go too far, we can overfit (worse than AdaBoost?).
• Need "shrinkage" – from ML:

f_m(x) = f_{m−1}(x) + ν · Σ_j γ_{j,m} I(x ∈ R_{j,m}), with 0 < ν < 1

The text actually describes ν = 0.01.

Page 88:

Page 89:

Interpretation
• Single decision trees are often very interpretable.
• A linear combination of trees loses this important feature.
• We often want to learn the relative importance or contribution of each input variable in predicting the response.
• Define a measure of relevance for each predictor X_l, summed over the J−1 internal nodes of the tree.

Page 90:

Interpretation (cont'd)
• Relevance of a predictor X_l in a single tree (where v(t) is the feature selected at internal node t):

I_l²(T) = Σ_{t=1}^{J−1} î_t² · I(v(t) = l)

where î_t² is the measure of improvement from the split applied at node t.
• In a boosted tree, the squared relevance is averaged over the M trees:

I_l² = (1/M) Σ_{m=1}^{M} I_l²(T_m)

• Since these are relative, the maximum is conventionally set to 100.

Page 91:

Relevance

Page 92:

Illustration (California Housing)

Page 93:

Boosting Summary
• Combine weak classifiers to obtain a very strong classifier:
  – Weak classifier: slightly better than random on the training data.
  – Resulting very strong classifier: can eventually provide zero training error.
• AdaBoost algorithm.
• Boosting vs. Logistic Regression:
  – Similar loss functions.
  – LR is a single joint optimization vs. incrementally improving the classification in boosting.
• Boosting is very popular for applications:
  – Boosted decision stumps: easy to build a training system.
  – Very simple to implement and efficient.