Page 1: Boosting (Carnegie Mellon School of Computer Science, aarti/Class/10701/slides/Lecture10.pdf)

Boosting

Aarti Singh

Machine Learning 10-701/15-781

Oct 11, 2010

Slides Courtesy: Carlos Guestrin, Freund & Schapire

1

Can we make dumb learners smart?

Page 2:

Project Proposal Due Today!

2

Page 3:

Why boost weak learners?

Goal: Automatically categorize type of call requested

(Collect, Calling card, Person-to-person, etc.)

• Easy to find “rules of thumb” that are “often” correct.

E.g. If ‘card’ occurs in utterance, then predict ‘calling card’

• Hard to find a single highly accurate prediction rule.

3

Page 4:

• Simple (a.k.a. weak) learners e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)

Are good - Low variance, don’t usually overfit

Are bad - High bias, can’t solve hard learning problems

• Can we make weak learners always good???

– No!!! But often yes…

Fighting the bias-variance tradeoff

4

Page 5:

Voting (Ensemble Methods)

• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space

• Output class: (Weighted) vote of each classifier

– Classifiers that are most “sure” will vote with more conviction

– Classifiers will be most “sure” about a particular part of the space

– On average, do better than single classifier!

5

[Figure: two weak classifiers h_1(X) and h_2(X), each labeling part of the input space +1 or -1 and uncertain (?) elsewhere; combined, H(X) = h_1(X) + h_2(X) covers the whole space.]

H: X → Y, Y ∈ {-1, +1}

H(X) = sign(∑_t α_t h_t(X)), where the α_t are the weights
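As a concrete illustration of this weighted vote, here is a minimal sketch of my own (not code from the slides); the stumps h1, h2 and their weights are made up:

```python
import numpy as np

# Weighted-vote ensemble: H(X) = sign(sum_t alpha_t * h_t(X)),
# assuming each weak classifier outputs labels in {-1, +1}.
def weighted_vote(weak_classifiers, alphas, X):
    votes = sum(a * h(X) for a, h in zip(alphas, weak_classifiers))
    return np.sign(votes)

# Two hypothetical decision stumps on a single feature:
h1 = lambda X: np.where(X[:, 0] > 0.3, 1, -1)
h2 = lambda X: np.where(X[:, 0] > 0.7, 1, -1)
X = np.array([[0.1], [0.5], [0.9]])
print(weighted_vote([h1, h2], [0.8, 0.4], X))   # -> [-1.  1.  1.]
```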

Page 6:

Voting (Ensemble Methods)

• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space

• Output class: (Weighted) vote of each classifier

– Classifiers that are most “sure” will vote with more conviction

– Classifiers will be most “sure” about a particular part of the space

– On average, do better than single classifier!

• But how do you ???

– force classifiers ht to learn about different parts of the input space?

– weigh the votes (α_t) of different classifiers?

6

Page 7:

Boosting [Schapire’89]

• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote

• On each iteration t:

– weight each training example by how incorrectly it was classified

– Learn a weak hypothesis – ht

– A strength for this hypothesis – α_t

• Final classifier:

• Practically useful

• Theoretically interesting

7

H(X) = sign(∑_t α_t h_t(X))

Page 8:

Learning from weighted data

• Consider a weighted dataset

– D(i) – weight of the i-th training example (x_i, y_i)

– Interpretations:

• The i-th training example counts as D(i) examples

• If I were to “resample” data, I would get more samples of “heavier” data points

• Now, in all calculations, whenever used, the i-th training example counts as D(i) “examples”

– e.g., in MLE redefine Count(Y=y) to be weighted count

Unweighted data: Count(Y = y) = ∑_{i=1}^{m} 1(Y_i = y)

Weighted data, with weights D(i): Count(Y = y) = ∑_{i=1}^{m} D(i) 1(Y_i = y)

8
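As a concrete illustration (mine, not from the slides) of how the weighted count replaces the ordinary count in such an MLE-style calculation; the labels and weights below are made up:

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1])          # labels Y_i
D = np.array([0.1, 0.1, 0.4, 0.1, 0.3])  # example weights D(i), summing to 1

count_unweighted = np.sum(y == 1)        # Count(Y=1) = 3
count_weighted = np.sum(D * (y == 1))    # weighted Count(Y=1) = 0.3
print(count_unweighted, count_weighted)
```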

Page 9:

9

AdaBoost [Freund & Schapire’95]

[Algorithm box (figure) with annotations: the weights D_1(i) are initially equal; the weak learner h_t can be e.g. naïve Bayes or a decision stump; the coefficient α_t is positive (“magic”); the weight of point i is increased if h_t gets it wrong, i.e., y_i h_t(x_i) = -1 < 0.]

Page 10:

10

AdaBoost [Freund & Schapire’95]

[Same algorithm box as the previous slide, with one additional annotation: the weights over all points must sum to 1, ∑_i D_{t+1}(i) = 1.]

Page 11:

11

AdaBoost [Freund & Schapire’95]

[Same algorithm box, repeated.]

Page 12:

12

What α_t to choose for hypothesis h_t?

Weighted training error (does h_t get the i-th point wrong?):

ε_t = ∑_i D_t(i) 1(h_t(x_i) ≠ y_i)

Weight Update Rule:

D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t

ε_t = 0 if h_t perfectly classifies all weighted data points ⇒ α_t = ∞

ε_t = 1 if h_t is perfectly wrong (so -h_t is perfectly right) ⇒ α_t = -∞

ε_t = 0.5 ⇒ α_t = 0

[Freund & Schapire’95]
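Putting the pieces above together, here is a minimal AdaBoost sketch in Python (my own illustration, not code from the lecture). The `weak_learner(X, y, D)` argument is an assumed helper that fits any base classifier, e.g., a decision stump, to the weighted data and returns a function with outputs in {-1, +1}:

```python
import numpy as np

def train_adaboost(X, y, T, weak_learner):
    """AdaBoost: X is (m, d), y has entries in {-1, +1}.
    Returns the ensemble H(x) = sign(sum_t alpha_t * h_t(x))."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                      # initially equal weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # fit to weighted data
        pred = h(X)
        eps = np.sum(D * (pred != y))            # weighted training error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against eps = 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength alpha_t (choice derived on a later slide)
        D = D * np.exp(-alpha * y * pred)        # increase weight on mistakes
        D = D / D.sum()                          # weights must sum to 1
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda Xn: np.sign(sum(a * h(Xn) for a, h in zip(alphas, hypotheses)))
```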

Page 13:

Boosting Example (Decision Stumps)

13

Page 14:

14

Boosting Example (Decision Stumps)
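For the decision-stump case shown in this example, a matching weak learner might look like the following (again my own sketch, using the same hypothetical `weak_learner(X, y, D)` interface): it exhaustively searches one-feature threshold rules for the lowest weighted error.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Pick the (feature, threshold, polarity) stump with lowest weighted error."""
    best_params, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] > thresh, polarity, -polarity)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_params, best_err = (j, thresh, polarity), err
    j, thresh, polarity = best_params
    return lambda Xn: np.where(Xn[:, j] > thresh, polarity, -polarity)
```

Combined with the earlier sketch, `H = train_adaboost(X, y, T=10, weak_learner=stump_weak_learner)` gives the kind of boosted-stump classifier illustrated on these slides.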

Page 15:

Analysis reveals:

• What α_t to choose for hypothesis h_t?

ε_t – weighted training error

• If each weak learner h_t is slightly better than random guessing (ε_t < 0.5),

then the training error of AdaBoost decays exponentially fast in the number of rounds T.

15

Analyzing training error

Training Error

Page 16:

Training error of final classifier is bounded by:

(1/m) ∑_i 1(H(x_i) ≠ y_i) ≤ (1/m) ∑_i exp(-y_i f(x_i)), where f(x) = ∑_t α_t h_t(x) and H(x) = sign(f(x))

16

Analyzing training error

The exp loss is a convex upper bound on the 0/1 loss: if boosting can make the upper bound → 0, then the training error → 0.

[Plot: 0/1 loss and exp loss as functions of the margin y f(x).]
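A quick numerical check (mine, not from the slides) that the exp loss sits above the 0/1 loss at every margin value, using the convention that a margin y f(x) ≤ 0 counts as an error:

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)          # values of y * f(x)
zero_one = (margins <= 0).astype(float)      # 0/1 loss
exp_loss = np.exp(-margins)                  # exp loss
assert np.all(exp_loss >= zero_one)          # convex upper bound holds
```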

Page 17:

Training error of final classifier is bounded by:

(1/m) ∑_i 1(H(x_i) ≠ y_i) ≤ (1/m) ∑_i exp(-y_i f(x_i)) = ∏_t Z_t, where Z_t = ∑_i D_t(i) exp(-α_t y_i h_t(x_i))

Proof sketch: unroll the weight update rule to express D_{T+1}(i) as exp(-y_i f(x_i)) / (m ∏_t Z_t), then use the fact that the weights of all points add to 1.

17

Analyzing training error

Page 18:

Training error of final classifier is bounded by:

(1/m) ∑_i 1(H(x_i) ≠ y_i) ≤ ∏_t Z_t, where Z_t = ∑_i D_t(i) exp(-α_t y_i h_t(x_i))

18

Analyzing training error

If Z_t < 1, training error decreases exponentially (even though weak learners may not be good, ε_t ≈ 0.5).

[Plot: training error and its upper bound ∏_t Z_t as functions of the round t.]

Page 19:

Training error of final classifier is bounded by:

(1/m) ∑_i 1(H(x_i) ≠ y_i) ≤ ∏_t Z_t, where Z_t = ∑_i D_t(i) exp(-α_t y_i h_t(x_i))

If we minimize ∏_t Z_t, we minimize our training error.

We can tighten this bound greedily, by choosing α_t and h_t on each iteration to minimize Z_t.

19

What α_t to choose for hypothesis h_t?

Page 20:

We can minimize this bound by choosing α_t on each iteration to minimize Z_t.

For a boolean target function, this is accomplished by [Freund & Schapire ’97]:

α_t = (1/2) ln((1 - ε_t) / ε_t)

Proof: write Z_t = (1 - ε_t) e^{-α_t} + ε_t e^{α_t}, set dZ_t/dα_t = 0, and solve for α_t.

20

What α_t to choose for hypothesis h_t?
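A quick numerical sanity check (my own, not from the slides) that this closed form does minimize Z_t, using the expression Z_t(α) = (1 - ε_t) e^{-α} + ε_t e^{α} from the proof above:

```python
import numpy as np

for eps in (0.1, 0.3, 0.45):
    alphas = np.linspace(-3, 3, 100001)
    Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)   # Z_t as a function of alpha
    grid_argmin = alphas[np.argmin(Z)]
    closed_form = 0.5 * np.log((1 - eps) / eps)
    print(f"eps={eps:.2f}: grid argmin={grid_argmin:.3f}, closed form={closed_form:.3f}")
```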

Page 21:

(This slide continues the derivation that α_t = (1/2) ln((1 - ε_t) / ε_t) minimizes Z_t.)

21

What α_t to choose for hypothesis h_t?

Page 22:

Training error of final classifier is bounded by:

(1/m) ∑_i 1(H(x_i) ≠ y_i) ≤ ∏_t Z_t = ∏_t 2√(ε_t(1 - ε_t)) ≤ exp(-2 ∑_t (1/2 - ε_t)²)

22

Dumb classifiers made Smart

If each classifier is (at least slightly) better than random (ε_t < 0.5),

AdaBoost will achieve zero training error exponentially fast (in the number of rounds T)!!

(The exponent grows as ε_t moves away from 1/2.)

What about test error?

Page 23:

Boosting results – Digit recognition

• Boosting is often:

– Robust to overfitting

– Test set error decreases even after training error is zero

23

[Schapire, 1989]

but not always

[Plot: training and test error vs. number of boosting rounds on the digit recognition task.]

Page 24:

• T – number of boosting rounds

• d – VC dimension of weak learner, measures complexity of classifier

• m – number of training examples

24

Generalization Error Bounds

T small: training error (bias) large, complexity term (variance) small

T large: training error (bias) small, complexity term (variance) large

tradeoff between bias and variance

[Freund & Schapire’95]

Page 25:

25

Generalization Error Bounds

Boosting can overfit if T is large

This contradicts experimental results: boosting is often

– Robust to overfitting

– Test set error decreases even after training error is zero

Need better analysis tools – margin based bounds

[Freund & Schapire’95]

With high probability, the true error is bounded by the training error plus a complexity term of order √(T d / m).

Page 26:

26

Margin Based Bounds

Boosting increases the margin very aggressively since it concentrates on the hardest examples.

If the margin is large, more weak learners agree, and hence more rounds do not necessarily imply that the final classifier is getting more complex.

Bound is independent of number of rounds T!

Boosting can still overfit if margin is too small, weak learners are too complex or perform arbitrarily close to random guessing

[Schapire, Freund, Bartlett, Lee’98]

With high probability, the true error is bounded in terms of the fraction of training points with small margin plus a complexity term that is independent of T.

Page 27:

Boosting: Experimental Results

Comparison of C4.5 (decision trees) vs Boosting decision stumps (depth 1 trees)

C4.5 vs Boosting C4.5

27 benchmark datasets

27

[Freund & Schapire, 1996]

[Scatter plots of test error: C4.5 vs. boosting decision stumps, and C4.5 vs. boosting C4.5, over the 27 benchmark datasets.]

Page 28:

28

[Train and test error curves on four datasets where boosting overfits.]

Page 29:

Boosting and Logistic Regression

Logistic regression assumes:

P(Y = 1 | X) = 1 / (1 + exp(-f(X))), with f(X) = w_0 + ∑_j w_j X_j

And tries to maximize the data likelihood (i.i.d. samples):

∏_i P(Y = y_i | X = x_i)

Equivalent to minimizing the log loss:

∑_i ln(1 + exp(-y_i f(x_i)))

29

Page 30:

Logistic regression equivalent to minimizing log loss

30

Both smooth approximations of 0/1 loss!

Boosting and Logistic Regression

Boosting minimizes a similar loss function:

∑_i exp(-y_i f(x_i)), with f(x) = ∑_t α_t h_t(x) (a weighted average of weak learners)

[Plot: 0/1 loss, exp loss, and log loss as functions of the margin y f(x).]
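To make the comparison concrete, here is a small table of the three losses over a few margin values (my own illustration; the log loss is scaled by 1/ln 2 so that all three equal 1 at margin 0):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 5)                     # y * f(x)
zero_one = (margins <= 0).astype(float)
exp_loss = np.exp(-margins)
log_loss = np.log(1 + np.exp(-margins)) / np.log(2)     # scaled log loss
for m, z, e, l in zip(margins, zero_one, exp_loss, log_loss):
    print(f"margin={m:+.1f}  0/1={z:.0f}  exp={e:.2f}  log={l:.2f}")
```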

Page 31:

Logistic regression:

• Minimize log loss

• Define f(x) = w_0 + ∑_j w_j x_j

where the x_j are predefined features

(linear classifier)

• Jointly optimize over all weights w0, w1, w2…

Boosting:

• Minimize exp loss

• Define f(x) = ∑_t α_t h_t(x)

where the h_t(x) are defined dynamically to fit the data

(not a linear classifier)

• Weights α_t learned incrementally, one per iteration t

31

Boosting and Logistic Regression

Page 32:

32

Hard & Soft Decision

Weighted average of weak learners: f(X) = ∑_t α_t h_t(X)

Hard Decision / Predicted label: H(X) = sign(f(X))

Soft Decision (based on analogy with logistic regression): P(Y = 1 | X) = 1 / (1 + exp(-2 f(X)))
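A tiny helper matching the soft decision as reconstructed above (the factor 2 comes from the usual logistic-regression analogy for AdaBoost; treat it as an assumption of this sketch):

```python
import numpy as np

def soft_decision(f_x):
    """Map the boosted score f(x) = sum_t alpha_t h_t(x) to P(Y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))

print(soft_decision(np.array([-1.5, 0.0, 0.7])))   # approx. [0.05, 0.5, 0.80]
```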

Page 33:

33

Effect of Outliers

Good: can identify outliers, since boosting focuses on examples that are hard to categorize.

Bad: too many outliers can degrade classification performance and dramatically increase the time to convergence.

Page 34:

Bagging

34

Related approach to combining classifiers:

1. Run independent weak learners on bootstrap replicates (sample with replacement) of the training set

2. Average/vote over weak hypotheses

Bagging vs. Boosting

Bagging: resamples data points. Boosting: reweights data points (modifies their distribution).

Bagging: weight of each classifier is the same. Boosting: weight depends on the classifier’s accuracy.

Bagging: only variance reduction. Boosting: both bias and variance reduced (the learned rule becomes more complex with iterations).

[Breiman, 1996]
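For contrast with the AdaBoost sketch earlier, here is a minimal bagging sketch (mine, not from the slides), reusing the same hypothetical `weak_learner(X, y, D)` interface:

```python
import numpy as np

def train_bagging(X, y, T, weak_learner, seed=0):
    """Train T weak learners on bootstrap replicates and take an unweighted vote."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)       # bootstrap replicate (sample with replacement)
        D = np.full(m, 1.0 / m)                # no reweighting in bagging
        hypotheses.append(weak_learner(X[idx], y[idx], D))
    # Every classifier gets the same weight, unlike boosting's alpha_t.
    return lambda Xn: np.sign(sum(h(Xn) for h in hypotheses))
```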

Page 35:

Boosting Summary

• Combine weak classifiers to obtain very strong classifier

– Weak classifier – slightly better than random on training data

– Resulting very strong classifier – can eventually provide zero training error

• AdaBoost algorithm

• Boosting vs. Logistic Regression

– Similar loss functions

– Single optimization (LR) vs. incrementally improving classification (B)

• Most popular application of Boosting:

– Boosted decision stumps!

– Very simple to implement, very effective classifier

35