Ensemble Methods: Bagging and Boosting

Piyush Rai

Machine Learning (CS771A)

Oct 26, 2016

Some Simple Ensembles

Voting or Averaging of predictions of multiple pre-trained models

“Stacking”: Use predictions of multiple models as “features” to train a new model and use the new model to make predictions on test data
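As a rough illustration of these two ideas, here is a minimal sketch assuming scikit-learn is available; the synthetic data and the particular base models are arbitrary illustrative choices, not from the slides. (In practice, stacking would build the meta-features from held-out or cross-validated predictions rather than training-set predictions.)

```python
# Voting/averaging and stacking, sketched with scikit-learn (assumed available).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three "pre-trained" models.
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(max_depth=3),
          GaussianNB()]
for m in models:
    m.fit(X_tr, y_tr)

# Voting: majority vote of the individual predictions (labels are 0/1 here).
votes = np.stack([m.predict(X_te) for m in models])
vote_pred = (votes.mean(axis=0) >= 0.5).astype(int)

# Stacking: use the models' predictions as "features" for a new (meta) model.
meta_tr = np.stack([m.predict(X_tr) for m in models], axis=1)
meta_te = np.stack([m.predict(X_te) for m in models], axis=1)
meta_model = LogisticRegression().fit(meta_tr, y_tr)
stack_pred = meta_model.predict(meta_te)

print("voting accuracy:  ", (vote_pred == y_te).mean())
print("stacking accuracy:", (stack_pred == y_te).mean())
```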

Ensembles: Another Approach

Instead of training different models on the same data, train the same model multiple times on different data sets, and “combine” these “different” models

We can use some simple/weak model as the base model

How do we get multiple training data sets (in practice, we only have one data set at training time)?
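One standard answer, developed on the next slide, is bootstrap resampling: draw N examples with replacement from the one data set we have. A minimal sketch, assuming NumPy (the sizes N and M below are arbitrary):

```python
# Build M "different" training sets from a single data set by sampling with replacement.
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 5
data = rng.normal(size=(N, 3))                   # stand-in for the single training set

bootstrap_copies = []
for m in range(M):
    idx = rng.integers(0, N, size=N)             # N indices drawn with replacement
    bootstrap_copies.append(data[idx])
    unique_frac = len(np.unique(idx)) / N
    print(f"copy {m}: {unique_frac:.1%} of the original examples appear")  # roughly 63%
```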

Bagging

Bagging stands for Bootstrap Aggregation

Takes original data set D with N training examples

Creates M copies {D̃m}, m = 1, . . . , M

Each D̃m is generated from D by sampling with replacement

Each data set D̃m has the same number of examples as in data set D

These data sets are reasonably different from each other (since only about 63% of the original examples appear in any one of them)

Train models h1, . . . , hM using D̃1, . . . , D̃M , respectively

Use an averaged model h = (1/M) ∑_{m=1}^M hm as the final model

Useful for models with high variance and noisy data
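A from-scratch sketch of this recipe, assuming NumPy and scikit-learn; the regression data and the depth-3 tree base model are illustrative choices, not from the slides:

```python
# Bagging from scratch: bootstrap M data sets, train one model on each, average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N, M = 300, 25
X = np.sort(rng.uniform(-3, 3, size=(N, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=N)        # noisy targets

models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)                   # bootstrap sample D̃m
    h = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    models.append(h)

def bagged_predict(X_new):
    # Averaged model: h(x) = (1/M) * sum_m hm(x)
    return np.mean([h.predict(X_new) for h in models], axis=0)

print(bagged_predict(X[:5]))
```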

Bagging: illustration

Top: Original data, Middle: 3 models (from some model class) learned using three data sets chosen via bootstrapping, Bottom: averaged model

Random Forests

An ensemble of decision tree (DT) classifiers

Uses bagging on features (each DT will use a random set of features)

Given a total of D features, each DT uses √D randomly chosen features

Randomly chosen features make the different trees uncorrelated

All DTs usually have the same depth

Each DT will split the training data differently at the leaves

Prediction for a test example is made by voting on/averaging the predictions from all the DTs
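For reference, a minimal sketch using scikit-learn's RandomForestClassifier (an assumed dependency, not part of the slides); max_features="sqrt" encodes the √D rule. Note one small difference from the slide's description: scikit-learn draws a fresh random subset of features at each split rather than once per tree, but the spirit is the same.

```python
# Random forest with scikit-learn (assumed available); max_features="sqrt" limits each split
# to about sqrt(D) randomly chosen features, and prediction is a vote over all the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```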

Boosting

The basic idea

Take a weak learning algorithm

Only requirement: Should be slightly better than random

Turn it into an awesome one by making it focus on difficult cases

Most boosting algorithms follow these steps:

1 Train a weak model on some training data

2 Compute the error of the model on each training example

3 Give higher importance to examples on which the model made mistakes

4 Re-train the model using “importance weighted” training examples

5 Go back to step 2

The AdaBoost Algorithm

Given: Training data (x1, y1), . . . , (xN, yN) with yn ∈ {−1, +1}, ∀n

Initialize the weight of each example (xn, yn): D1(n) = 1/N, ∀n

For round t = 1 : T

Learn a weak classifier ht(x) → {−1, +1} using training data weighted as per Dt

Compute the weighted fraction of errors of ht on this training data:

εt = ∑_{n=1}^N Dt(n) 1[ht(xn) ≠ yn]

Set the “importance” of ht: αt = (1/2) log((1 − εt)/εt)   (gets larger as εt gets smaller)

Update the weight of each example:

Dt+1(n) ∝ Dt(n) × exp(−αt) if ht(xn) = yn (correct prediction: decrease weight)
Dt+1(n) ∝ Dt(n) × exp(+αt) if ht(xn) ≠ yn (incorrect prediction: increase weight)

i.e., Dt+1(n) ∝ Dt(n) exp(−αt yn ht(xn))

Normalize Dt+1 so that it sums to 1: Dt+1(n) = Dt+1(n) / ∑_{m=1}^N Dt+1(m)

Output the “boosted” final hypothesis H(x) = sign(∑_{t=1}^T αt ht(x))
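A compact from-scratch implementation of these steps, as a sketch: it assumes NumPy and uses scikit-learn depth-1 trees (decision stumps) as the weak learner; the synthetic data and T = 50 rounds are arbitrary choices.

```python
# AdaBoost from scratch, following the algorithm above.
# Assumes NumPy and scikit-learn; depth-1 trees (decision stumps) play the role of the weak learner ht.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}

N, T = len(y), 50
D = np.full(N, 1.0 / N)                          # D1(n) = 1/N
weak_learners, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = D @ (pred != y)                        # εt = Σn Dt(n) 1[ht(xn) ≠ yn]
    eps = np.clip(eps, 1e-12, 1 - 1e-12)         # guard against a perfect/useless weak learner
    alpha = 0.5 * np.log((1 - eps) / eps)        # αt = (1/2) log((1 − εt)/εt)
    D = D * np.exp(-alpha * y * pred)            # Dt+1(n) ∝ Dt(n) exp(−αt yn ht(xn))
    D = D / D.sum()                              # normalize so Dt+1 sums to 1
    weak_learners.append(h)
    alphas.append(alpha)

def H(X_new):
    # Boosted hypothesis: H(x) = sign(Σt αt ht(x))
    scores = sum(a * h.predict(X_new) for a, h in zip(alphas, weak_learners))
    return np.sign(scores)

print("training error:", np.mean(H(X) != y))
```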

AdaBoost: Example

Consider binary classification with 10 training examples

Initial weight distribution D1 is uniform (each point has equal weight = 1/10)

Each of our weak classifiers will be an axis-parallel linear classifier

After Round 1

Error rate of h1: ε1 = 0.3; weight of h1: α1 = (1/2) ln((1 − ε1)/ε1) = 0.42

Each misclassified point upweighted (weight multiplied by exp(α1))

Each correctly classified point downweighted (weight multiplied by exp(−α1))

After Round 2

Error rate of h2: ε2 = 0.21; weight of h2: α2 = (1/2) ln((1 − ε2)/ε2) = 0.65

Each misclassified point upweighted (weight multiplied by exp(α2))

Each correctly classified point downweighted (weight multiplied by exp(−α2))

After Round 3

Error rate of h3: ε3 = 0.14; weight of h3: α3 = (1/2) ln((1 − ε3)/ε3) = 0.92

Suppose we decide to stop after round 3

Our ensemble now consists of 3 classifiers: h1, h2, h3
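For concreteness, here is round 1's bookkeeping for this 10-point example written out in code (NumPy assumed). Which three points h1 misclassifies is read off the figure, so the indices below are placeholders:

```python
# Round-1 bookkeeping for the 10-point example: ε1 = 0.3 (three points misclassified),
# α1 = (1/2) ln((1 − ε1)/ε1) ≈ 0.42, then upweight the mistakes, downweight the rest, renormalize.
import numpy as np

D1 = np.full(10, 0.1)                        # uniform initial weights D1(n) = 1/10
miss = np.zeros(10, dtype=bool)
miss[:3] = True                              # placeholder: suppose h1 gets points 0, 1, 2 wrong

eps1 = D1[miss].sum()                        # 0.3
alpha1 = 0.5 * np.log((1 - eps1) / eps1)     # ≈ 0.42
D2 = D1 * np.exp(np.where(miss, alpha1, -alpha1))
D2 /= D2.sum()                               # normalize to sum to 1
print(alpha1)
print(D2)                                    # misclassified points now carry more weight
```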

Final Classifier

Final classifier is a weighted linear combination of all the classifiers

Classifier hi gets a weight αi

Multiple weak, linear classifiers combined to give a strong, nonlinear classifier

Another Example

Given: A nonlinearly separable data set

We want to use the Perceptron (a linear classifier) on this data

AdaBoost: Round 1

After round 1, our ensemble has 1 linear classifier (Perceptron)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 2

After round 2, our ensemble has 2 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 3

After round 3, our ensemble has 3 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 4

After round 4, our ensemble has 4 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 5

After round 5, our ensemble has 5 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 6

After round 6, our ensemble has 6 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 7

After round 7, our ensemble has 7 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

AdaBoost: Round 40

After round 40, our ensemble has 40 linear classifiers (Perceptrons)

Bottom figure: X axis is number of rounds, Y axis is training error

Boosted Decision Stumps = Linear Classifier

A decision stump (DS) is a tree with a single node (testing the value of a single feature, say the d-th feature)

Suppose each example x has D binary features {xd}, d = 1, . . . , D, with xd ∈ {0, 1}, and the label y is also binary, i.e., y ∈ {−1, +1}

The DS (assuming it tests the d-th feature) will predict the label as

h(x) = s (2xd − 1)   where s ∈ {−1, +1}

Suppose we have T such decision stumps h1, . . . , hT, testing feature numbers i1, . . . , iT, respectively, i.e., ht(x) = st (2x_{it} − 1)

The boosted hypothesis H(x) = sign(∑_{t=1}^T αt ht(x)) can be written as

H(x) = sign(∑_{t=1}^T αt st (2x_{it} − 1)) = sign(∑_{t=1}^T 2αt st x_{it} − ∑_{t=1}^T αt st) = sign(w⊤x + b)

where wd = ∑_{t: it=d} 2αt st and b = −∑_t αt st
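A small numerical check of this identity (a sketch with NumPy; the stumps, signs, and importance weights are generated at random rather than learned by AdaBoost):

```python
# Check that a weighted vote of decision stumps ht(x) = st * (2*x[it] - 1) equals a linear
# classifier sign(w.x + b) with wd = sum_{t: it=d} 2*αt*st and b = -sum_t αt*st.
# (NumPy assumed; stumps/signs/weights are random placeholders, not learned by AdaBoost.)
import numpy as np

rng = np.random.default_rng(0)
D_feat, T, n = 8, 20, 100
X = rng.integers(0, 2, size=(n, D_feat))          # binary features xd in {0, 1}
i = rng.integers(0, D_feat, size=T)               # feature tested by each stump, it
s = rng.choice([-1, 1], size=T)                   # sign of each stump, st
alpha = rng.uniform(0.1, 1.0, size=T)             # importance weights αt

# Boosted hypothesis: H(x) = sign(Σt αt st (2 x_{it} - 1))
H_boost = np.sign(sum(alpha[t] * s[t] * (2 * X[:, i[t]] - 1) for t in range(T)))

# Equivalent linear classifier.
w = np.zeros(D_feat)
for t in range(T):
    w[i[t]] += 2 * alpha[t] * s[t]
b = -np.sum(alpha * s)
H_linear = np.sign(X @ w + b)

print(np.array_equal(H_boost, H_linear))          # expect: True (the two agree on every example)
```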

Boosting: Some Comments

For AdaBoost, given each model’s error εt = 1/2 − γt, the training error consistently gets better with rounds:

train-error(Hfinal) ≤ exp(−2 ∑_{t=1}^T γt²)

Boosting algorithms can be shown to be minimizing a loss function

E.g., AdaBoost has been shown to be minimizing an exponential loss

L = ∑_{n=1}^N exp{−yn H(xn)}

where H(x) = sign(∑_{t=1}^T αt ht(x)), given weak base classifiers h1, . . . , hT

Boosting in general can perform badly if some examples are outliers
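As a quick illustration of the bound above, a sketch assuming NumPy; the edges γt are made-up values, chosen only to show the exponential decay:

```python
# Training-error bound for AdaBoost: train-error(H_final) ≤ exp(−2 Σt γt²),
# where γt = 1/2 − εt is the t-th weak learner's edge over random guessing.
import numpy as np

gammas = np.full(50, 0.1)                    # 50 rounds, each weak learner 10% better than chance
bound = np.exp(-2 * np.cumsum(gammas ** 2))  # bound after 1, 2, ..., 50 rounds
print(bound[[0, 9, 24, 49]])                 # decays exponentially with the number of rounds
```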

Bagging vs Boosting

No clear winner; usually depends on the data

Bagging is computationally more efficient than boosting (note that bagging can train the M models in parallel, boosting can’t)

Both reduce variance (and overfitting) by combining different models

The resulting model has higher stability as compared to the individual ones

Bagging usually can’t reduce the bias, boosting can (note that in boosting, the training error steadily decreases)

Bagging usually performs better than boosting if we don’t have a high bias and only want to reduce variance (i.e., if we are overfitting)
