
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Sachin Gupta and Nitish Lakhanpal

November 1, 2010


Introduction

Scenario: a gambler wants to let his fellow gamblers make bets on his behalf

He will wager a fixed sum of money on every race, but will apportion his money among his friends based on how well they do

Obviously, if he knew in advance which friend would do best, he would just give that friend all of the money

We want to create an allocation strategy that mimics the payoff he would have attained had he bet everything with the luckiest friend

The paper discusses a simple algorithm for solving these dynamic allocation problems


On-line Allocation Model

Agent A has N options or strategies to choose from

Number the strategies with integers 1, …, N

Each time step t = 1, 2, …, T:

Allocator A decides on a distribution p^t over the strategies: p_i^t ≥ 0 is the amount allocated to strategy i, with ∑_{i=1}^N p_i^t = 1

Each strategy i suffers a loss ℓ_i^t, which is decided by the environment

The loss suffered by A is ∑_i p_i^t · ℓ_i^t, i.e. the average loss of the strategies with respect to A's chosen allocation rule. This loss function is called the mixture loss
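The mixture loss above can be sketched in a few lines of Python (the allocation and loss values below are made-up illustrations, not from the paper):

```python
def mixture_loss(p, l):
    """Mixture loss sum_i p_i * l_i: the average loss of the strategies
    under allocation p (p is a distribution, each loss in [0, 1])."""
    assert abs(sum(p) - 1.0) < 1e-9 and all(pi >= 0 for pi in p)
    return sum(pi * li for pi, li in zip(p, l))

p = [0.5, 0.3, 0.2]   # hypothetical allocation over N = 3 strategies
l = [1.0, 0.0, 0.5]   # hypothetical losses chosen by the environment
print(mixture_loss(p, l))  # 0.5*1.0 + 0.3*0.0 + 0.2*0.5 = 0.6
```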


Bounds on Losses

The loss suffered by any strategy is bounded, so that without loss of generality ℓ_i^t ∈ [0, 1]

No assumption is made about the form of the loss vectors ℓ^t or the manner in which they are generated

The adversary's choice of ℓ^t may depend on the allocator's chosen mixture p^t

The goal of algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy

A attempts to minimize the net loss L_A − min_i L_i, where

L_A = ∑_{t=1}^T p^t · ℓ^t, the total cumulative loss suffered by algorithm A on the first T trials

L_i = ∑_{t=1}^T ℓ_i^t, the cumulative loss of strategy i
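As a toy illustration of the net loss (hypothetical play, not data from the paper), L_A, the per-strategy losses L_i, and L_A − min_i L_i can be computed directly:

```python
def net_loss(ps, ls):
    """Net loss L_A - min_i L_i for a sequence of allocations ps[t]
    and loss vectors ls[t] over T trials."""
    n = len(ls[0])
    # L_A: total mixture loss of the allocator over all trials
    L_A = sum(sum(p[i] * l[i] for i in range(n)) for p, l in zip(ps, ls))
    # L_i: cumulative loss of each individual strategy
    L = [sum(l[i] for l in ls) for i in range(n)]
    return L_A - min(L)

# Two trials, two strategies; strategy 2 never loses, so the uniform
# allocator suffers net loss 1.0 relative to it (made-up numbers).
print(net_loss(ps=[[0.5, 0.5], [0.5, 0.5]], ls=[[1.0, 0.0], [1.0, 0.0]]))
```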


Repeated n Decision Game

The Hedge algorithm can be represented as a repeated n-decision game in a very natural way.

Let there be payoff vectors p^t equivalent to our loss vectors ℓ^t

Let there be a decision space ∆ consisting of decisions x^t equivalent to our weight vectors w^t
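A minimal sketch of the exponential-weights decision rule underlying this game view: the decision places weight on strategy i proportional to e^{ε P_i}, where P_i is that strategy's cumulative payoff so far (ε and the payoff values below are hypothetical):

```python
import math

def hedge_decision(P, eps):
    """Decision x in the simplex with x_i proportional to exp(eps * P_i),
    where P_i is the cumulative payoff of strategy i so far."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)  # normalizing constant
    return [wi / Z for wi in w]

# Strategies with higher cumulative payoff receive more of the weight.
x = hedge_decision(P=[1.0, 3.0, 2.0], eps=0.5)
print([round(xi, 3) for xi in x])  # sums to 1; most weight on strategy with P_i = 3.0
```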


Proof of Regret Bound for Exponential Updating

We prove this in the repeated n-decision game framework (equivalent to the paper's formulation)

Lemma 1: For any ε > 0, the decision x^t of the Weighted-majority algorithm Hedge(ε) maximizes (1/ε)H(x) + p^1 · x + … + p^{t−1} · x over x ∈ ∆_n, where H(x) = ∑_i x_i log(1/x_i) is the entropy
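Lemma 1 can be sanity-checked numerically: the decision x* with x*_i ∝ e^{εP_i} should score at least as well on the entropy-regularized payoff (1/ε)H(x) + P · x as any other point of the simplex. A sketch with hypothetical ε and payoffs:

```python
import math, random

def objective(x, P, eps):
    """(1/eps) * H(x) + P . x, where H(x) = sum_i x_i log(1/x_i)."""
    H = sum(xi * math.log(1.0 / xi) for xi in x if xi > 0)
    return H / eps + sum(xi * Pi for xi, Pi in zip(x, P))

def hedge_decision(P, eps):
    w = [math.exp(eps * Pi) for Pi in P]
    return [wi / sum(w) for wi in w]

eps, P = 0.7, [0.2, 1.5, -0.4]           # hypothetical values
best = objective(hedge_decision(P, eps), P, eps)
random.seed(0)
for _ in range(1000):
    r = [random.random() for _ in P]
    x = [ri / sum(r) for ri in r]        # random point of the simplex
    assert objective(x, P, eps) <= best + 1e-9
print("Hedge's decision maximizes the objective on this sample")
```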


Proof of Lemma 1 (Regularity)

By definition, we have for any x ∈ ∆_n (writing P_i^t = p_i^1 + … + p_i^{t−1} for the cumulative payoff of strategy i),

(1/ε)H(x) + p^1 · x + … + p^{t−1} · x = ∑_i [ (1/ε) x_i log(1/x_i) + P_i^t x_i ]

By simple algebra, the above is equal to

∑_i (1/ε) ( x_i log(1/x_i) + ε x_i P_i^t ) = (1/ε) ∑_i x_i log( e^{ε P_i^t} / x_i )

For the x_i chosen by the algorithm, x_i = e^{ε P_i^t} / Z^t with Z^t = ∑_i e^{ε P_i^t}, the above expression is (1/ε) ∑_i x_i log Z^t = (1/ε) log Z^t, because ∑_i x_i = 1. Using Jensen's inequality (concavity of log), we have for any x ∈ ∆_n:

(1/ε) ∑_i x_i log( e^{ε P_i^t} / x_i ) ≤ (1/ε) log ∑_i x_i · ( e^{ε P_i^t} / x_i ) = (1/ε) log Z^t


Proof of Lemma 2 (Stability)

Lemma 2: For any ε, M > 0, t ≥ 1 and p^1, p^2, …, p^t ∈ [−M, M]^n,

|p^t · x^{t+1} − p^t · x^t| ≤ 4εM²

Proof

Note first that P_i^{t+1} − M ≤ P_i^t ≤ P_i^{t+1} + M, and hence

e^{ε P_i^{t+1}} e^{−εM} ≤ e^{ε P_i^t} ≤ e^{ε P_i^{t+1}} e^{εM}

Summing the right-hand inequality over i gives Z^t ≤ Z^{t+1} e^{εM}, which, combined with the left-hand inequality applied to the numerator, gives

x_i^t = e^{ε P_i^t} / Z^t ≥ e^{−2εM} · e^{ε P_i^{t+1}} / Z^{t+1} = e^{−2εM} x_i^{t+1}   ∀ i ≤ n

Finally, since e^{−s} ≥ 1 − s for s ≥ 0, we have that x_i^t ≥ (1 − 2εM) x_i^{t+1}
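The stability bound of Lemma 2 can likewise be checked on simulated payoffs: with decisions x^t ∝ e^{εP^t} and payoffs in [−M, M], each step should move the mixture payoff by at most 4εM². A sketch (ε, M, and the random payoff sequence are hypothetical):

```python
import math, random

def hedge_decision(P, eps):
    w = [math.exp(eps * Pi) for Pi in P]
    return [wi / sum(w) for wi in w]

eps, M, n = 0.05, 1.0, 4
bound = 4 * eps * M * M          # 4*eps*M^2 = 0.2
random.seed(1)
P = [0.0] * n                    # cumulative payoffs P_i
for t in range(200):
    x_t = hedge_decision(P, eps)
    p_t = [random.uniform(-M, M) for _ in range(n)]  # adversarial payoffs
    P = [Pi + pi for Pi, pi in zip(P, p_t)]
    x_next = hedge_decision(P, eps)
    drift = abs(sum(pi * (xn - xo) for pi, xn, xo in zip(p_t, x_next, x_t)))
    assert drift <= bound + 1e-9
print("|p^t . x^(t+1) - p^t . x^t| <= 4*eps*M^2 held on every step")
```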


Proof of Lemma 2 (Stability)

Let λ = 2εM. First, if λ > 1, the lemma is trivial because 4εM^2 > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume λ ∈ [0, 1]. Let z^t ∈ R^n be the unique vector such that

    x^t = (1 − λ) x^{t+1} + λ z^t

Then we claim that z^t ∈ Δ_n. The fact that z_i^t ≥ 0 follows directly from the bound x_i^t ≥ (1 − 2εM) x_i^{t+1} above. The fact that Σ_i z_i^t = 1 follows from Σ_i x_i^t = 1, Σ_i x_i^{t+1} = 1, and the fact that x^t is a convex combination of x^{t+1} and z^t


Proof of Lemma 2 (Stability)

Finally,

    x^t · p^t − x^{t+1} · p^t = −λ x^{t+1} · p^t + λ z^t · p^t

Since y · p ∈ [−M, M] for all y ∈ Δ_n, the magnitude of the above quantity is at most 2λM = 4εM^2, as required by the lemma
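The bound just proved is easy to check numerically. The sketch below (an illustration with made-up payoffs, not code from the paper) forms the exponential-weights distributions x^t and x^{t+1} from cumulative payoffs and verifies |p^t · x^{t+1} − p^t · x^t| ≤ 4εM^2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, eps = 10, 1.0, 0.05

# Cumulative payoffs before and after round t; each entry of p^t lies in [-M, M].
P_t = rng.uniform(-5, 5, size=n)
p_t = rng.uniform(-M, M, size=n)
P_next = P_t + p_t

def exp_weights(P, eps):
    # x_i proportional to e^{eps * P_i}; subtract the max for numerical stability.
    w = np.exp(eps * (P - P.max()))
    return w / w.sum()

x_t = exp_weights(P_t, eps)
x_next = exp_weights(P_next, eps)

gap = abs(p_t @ x_next - p_t @ x_t)
assert gap <= 4 * eps * M**2   # Lemma 2
```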


Regret Bound

Theorem: For any n, T ≥ 1, M ≥ 0, and for any p^1, p^2, ..., p^T ∈ [−M, M]^n, the weighted-majority algorithm achieves

    regret ≤ (1/T) Σ_{t=1}^T 4εM^2 + log(n)/(Tε) = 4εM^2 + log(n)/(Tε)

Proof

The stability-regularization lemma, combined with Lemma 1, implies that

    regret ≤ (1/T) Σ_{t=1}^T |x^{t+1} · p^t − x^t · p^t| + (1/T) max_{x, x' ∈ Δ_n} |(1/ε) H(x) − (1/ε) H(x')|


Regret Bound

Since we have shown 0 ≤ H(x) ≤ log(n), we have (1/T) max_{x, x' ∈ Δ_n} |(1/ε) H(x) − (1/ε) H(x')| ≤ log(n)/(Tε). Lemma 2 bounds the "stability" term. Putting these together gives

    regret ≤ (1/T) Σ_{t=1}^T 4εM^2 + log(n)/(Tε) = 4εM^2 + log(n)/(Tε)
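As an empirical sanity check on the theorem, one can run the exponential-weights update x^t ∝ e^{εP^t} against random bounded payoffs and compare the realized average regret with 4εM^2 + log(n)/(Tε). This is a hypothetical sketch, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, M, eps = 5, 2000, 1.0, 0.05

P = np.zeros(n)        # cumulative payoffs P^t
alg_payoff = 0.0
for t in range(T):
    w = np.exp(eps * (P - P.max()))
    x = w / w.sum()                # x^t built from payoffs seen so far
    p = rng.uniform(-M, M, n)      # payoff vector p^t
    alg_payoff += x @ p
    P += p

regret = P.max() / T - alg_payoff / T    # best fixed decision vs. the algorithm
bound = 4 * eps * M**2 + np.log(n) / (T * eps)
assert regret <= bound
```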


Applications

The framework is quite general and can be applied to a number of learning problems

Take a decision space Δ, a space of outcomes Ω, and a bounded loss function λ : Δ × Ω → [0, 1]

At each time step t, the learning algorithm selects a decision δ^t from Δ and receives an outcome ω^t from Ω, suffering loss λ(δ^t, ω^t). In general, we allow the learner to select a distribution D^t over the space of decisions and suffer the expected loss of a decision randomly selected according to D^t

Expected loss: Λ(D^t, ω^t) = E_{δ∼D^t}[λ(δ, ω^t)]

Assuming the learner has access to a set of N experts, we can decide on an appropriate distribution D^t

At each time step t, each expert i produces its own distribution E_i^t on Δ and suffers loss Λ(E_i^t, ω^t)

The learner needs to combine the distributions produced by the experts so as to suffer an expected loss "not much worse" than that of the best expert
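In code, the framework reduces to a loss matrix and a distribution over decisions, with Λ(D, ω) a weighted average. A minimal sketch (the matrix and names are illustrative assumptions, not from the paper):

```python
import numpy as np

# loss[d, w] = λ(δ_d, ω_w) ∈ [0, 1]: three decisions, two outcomes (made up).
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0],
                 [0.5, 0.5]])

def expected_loss(D, omega):
    # Λ(D, ω) = E_{δ∼D}[λ(δ, ω)]
    return D @ loss[:, omega]

D = np.array([1/3, 1/3, 1/3])   # uniform distribution over decisions
val = expected_loss(D, 0)
assert abs(val - 0.5) < 1e-9    # (0 + 1 + 0.5) / 3
```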


Applications

Run the algorithm Hedge(β), treating each expert as a strategy

Hedge(β) produces a distribution p^t on the set of experts that is used to construct the mixture distribution D^t = Σ_{i=1}^N p_i^t E_i^t

For any outcome ω^t, the loss suffered by Hedge(β) is then Λ(D^t, ω^t) = Σ_{i=1}^N p_i^t Λ(E_i^t, ω^t)

If we define ℓ_i^t = Λ(E_i^t, ω^t), then the loss suffered by the learner is p^t · ℓ^t

It follows that for any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β), if used as described above, is at most

    Σ_{t=1}^T Λ(D^t, ω^t) ≤ min_i Σ_{t=1}^T Λ(E_i^t, ω^t) + √(2L ln N) + ln N

where L ≤ T is an assumed bound on the expected loss of the best expert and β = g(L / ln N)
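The Hedge(β) update itself is tiny: multiply each expert's weight by β^{ℓ_i^t} and renormalize. The sketch below is hypothetical (random losses; β chosen as g(L/ln N) with L = T, assuming g(z) = 1/(1 + √(2/z))), and it checks the learner's cumulative loss against the best expert plus the stated overhead:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 8, 1000
# β = g(L / ln N) with L = T and g(z) = 1 / (1 + sqrt(2/z))
beta = 1 / (1 + np.sqrt(2 * np.log(N) / T))

w = np.ones(N)                  # initial expert weights
learner_loss = 0.0
expert_loss = np.zeros(N)

for t in range(T):
    p = w / w.sum()                  # distribution p^t over the experts
    ell = rng.uniform(0.0, 1.0, N)   # ell_i = Λ(E_i^t, ω^t) ∈ [0, 1]
    learner_loss += p @ ell          # loss of the mixture D^t
    expert_loss += ell
    w *= beta ** ell                 # Hedge(β) multiplicative update

overhead = np.sqrt(2 * T * np.log(N)) + np.log(N)
assert learner_loss <= expert_loss.min() + overhead
```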


Rock, Paper, Scissors

The loss function λ can represent an arbitrary matrix game like Rock, Paper, Scissors

Here Δ = Ω = {R, P, S}

Loss function λ(δ, ω):

             ω
            R     P     S
    δ  R   1/2    1     0
       P    0    1/2    1
       S    1     0    1/2

δ represents the learner's play, ω is the adversary's play

λ(δ, ω) is the learner's loss: 1 if he loses, 0 if he wins, 1/2 if he ties

Cumulative loss is simply the expected number of losses in a series of rounds of game play

Results of the study show that the paper's algorithm has expected loss nearly identical to that of the best expert
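To make the game concrete, the following sketch (an illustrative simulation, not the paper's experiment) runs Hedge over three pure experts (always-R, always-P, always-S) against an adversary who favors Rock, and checks that the learner's loss rate approaches that of the best expert, always-Paper:

```python
import numpy as np

rng = np.random.default_rng(3)
R, P, S = 0, 1, 2
# loss[delta, omega]: 1 if the learner loses, 0 if he wins, 1/2 if he ties
loss = np.array([[0.5, 1.0, 0.0],
                 [0.0, 0.5, 1.0],
                 [1.0, 0.0, 0.5]])

T, N = 2000, 3
beta = 1 / (1 + np.sqrt(2 * np.log(N) / T))   # β = g(T / ln N)
w = np.ones(N)
learner, experts = 0.0, np.zeros(N)

for t in range(T):
    pdist = w / w.sum()
    omega = rng.choice(3, p=[0.6, 0.2, 0.2])   # adversary favors Rock
    ell = loss[:, omega]                       # loss of each pure expert
    learner += pdist @ ell
    experts += ell
    w *= beta ** ell

assert experts.argmin() == P            # always-Paper is the best expert
assert learner / T <= experts.min() / T + 0.1
```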


Boosting

Bias-Variance Tradeoff

Expected Error of Learner = Variance + Squared Bias = Sensitivity of Prediction + Systematic Error

Low Bias → High Variance

Ex: 1st-order curves vs. 9th-order curves


Page 81: A Decision-Theoretic Generalization of On-Line … Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting Sachin Gupta and Nitish Lakhanpal November 1,

Boosting

Can we reduce variance without increasing bias?

Use committees

Bias

Average performance at least as good as average performance of members

Variance

Spurious patterns outvoted
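The "spurious patterns outvoted" claim can be checked with a simple counting argument: if committee members err independently, the majority vote errs far less often than any single member. A small sketch (the member accuracy and committee sizes are illustrative assumptions):

```python
from math import comb

def majority_accuracy(p, T):
    """Probability that a majority of T independent members, each correct
    with probability p, votes for the right answer (T odd)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

single = 0.7                           # assumed accuracy of each member
print(majority_accuracy(single, 11))   # a committee of 11 beats any member
print(majority_accuracy(single, 101))  # larger committees do even better
```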


Boosting

Intuition

Break the training set up into samples focusing on the “hardest” parts of the space

Forces weak learners to generate new hypotheses → Fewer mistakes are made on these parts


Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs a hypothesis with error at most ε with probability 1 − δ.

A strong learner is a classifier that is arbitrarily well correlated with the true classification.

Weak Learning

Same as above, except that ε ≥ 1/2 − γ, where γ > 0 or decreases as 1/p for a polynomial p

Intuitively, a weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing)
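A weak learner in this sense is easy to exhibit. A sketch (the overlapping Gaussian classes and the threshold value are assumptions for illustration): a single threshold on heavily overlapping classes labels examples only slightly better than coin-flipping.

```python
import random

random.seed(1)

# Two heavily overlapping classes: label-1 examples are drawn with a mean
# only slightly above label-0 examples, so no threshold separates them well.
labels = [random.randint(0, 1) for _ in range(5000)]
data = [(random.gauss(0.2 if y == 1 else 0.0, 1.0), y) for y in labels]

def weak_stump(x, threshold=0.1):
    """Predict 1 when x exceeds the threshold."""
    return 1 if x > threshold else 0

accuracy = sum(weak_stump(x) == y for x, y in data) / len(data)
print(f"weak learner accuracy: {accuracy:.3f}")   # slightly above 1/2
```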


Boosting

The Algorithm

Given a distribution over the training set, calculate weights over the examples

Compute a normalized distribution

Feed the normalized distribution to a weak learning algorithm

Generate a new set of weights

Output a hypothesis combining the outputs of the weak hypotheses by a weighted majority vote
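These steps can be sketched end to end. A minimal toy implementation (the 1-D dataset, the decision-stump weak learner, and the round count are our own illustrative choices, not the paper's): each round normalizes the weights, trains the best stump on the distribution, multiplies the weight of each correctly classified example by β_t, and finally combines the stumps by weighted majority vote.

```python
import math

# Toy 1-D training set (assumed for illustration): the target concept is
# "x lies in an outer third", which no single threshold stump can match.
X = list(range(10))
y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
N = len(X)

def stump(t, d):
    """Threshold classifier: predict 1 if x <= t (d=+1) or if x > t (d=-1)."""
    return lambda x: int((x <= t) if d == 1 else (x > t))

def weighted_error(h, p):
    return sum(pi for pi, xi, yi in zip(p, X, y) if h(xi) != yi)

def weak_learner(p):
    """Return the stump with the lowest weighted error under p."""
    t, d = min(((t, d) for t in range(10) for d in (1, -1)),
               key=lambda td: weighted_error(stump(*td), p))
    h = stump(t, d)
    return h, weighted_error(h, p)

w = [1.0 / N] * N                  # initial weights over the examples
hyps, alphas, eps_list = [], [], []
for _ in range(3):                 # three boosting rounds suffice here
    total = sum(w)
    p = [wi / total for wi in w]   # normalized distribution
    h, eps = weak_learner(p)       # feed it to the weak learning algorithm
    beta = eps / (1 - eps)
    w = [wi * (beta if h(xi) == yi else 1.0)   # shrink correct examples,
         for wi, xi, yi in zip(w, X, y)]       # leaving the "hard" ones heavy
    hyps.append(h)
    alphas.append(math.log(1 / beta))
    eps_list.append(eps)

def final_hypothesis(x):
    """Weighted majority vote of the weak hypotheses."""
    return int(sum(a * h(x) for a, h in zip(alphas, hyps))
               >= 0.5 * sum(alphas))

mistakes = sum(final_hypothesis(xi) != yi for xi, yi in zip(X, y))
print(f"round errors: {[round(e, 3) for e in eps_list]}, "
      f"training mistakes of the vote: {mistakes}")
```

On this dataset the vote reaches zero training mistakes even though every individual stump errs on at least 18% of the weighted examples.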


Boosting

Rules for generating a new set of weights

Calculate the error of the weak hypothesis:

ε_t = ∑_{i=1}^{N} p_i^t |h_t(x_i) − y_i|

Set β_t = ε_t / (1 − ε_t)

Calculate the new weights as:

w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
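Plugging small numbers (assumed for illustration) into these rules shows their effect: the misclassified example's share of the next distribution grows relative to the rest.

```python
# Three examples under a uniform distribution; the weak hypothesis h_t
# misclassifies only the third one (values assumed for illustration).
p = [1/3, 1/3, 1/3]                    # p_i^t
mistakes = [0, 0, 1]                   # |h_t(x_i) - y_i|
eps = sum(pi * m for pi, m in zip(p, mistakes))   # weighted error
beta = eps / (1 - eps)
w_next = [pi * beta ** (1 - m) for pi, m in zip(p, mistakes)]
q = [wi / sum(w_next) for wi in w_next]           # next distribution
print(eps, beta, q)
```

The error is 1/3, so β_t = 1/2: the two correct examples are halved while the mistake keeps its weight, and after normalization the mistake carries half the probability mass.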


Relationship Between Boosting and Weighted Majority

Dual relationship between Hedge and AdaBoost

“There is a direct mapping or reduction of the boosting problem to the on-line allocation problem”

The examples used in AdaBoost can be treated as strategies and the weak hypotheses can be treated as trials, giving us a rehashing of AdaBoost in the terminology of Hedge

We also need to reverse the loss function, because in AdaBoost our loss is large if we encounter a bad prediction, while in Hedge it is large if we encounter a good prediction

Loss is large for bad predictions in AdaBoost in order to focus on “hard” problems
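The reversed loss makes the two updates literally the same expression. A tiny check (the β and weight values are assumed for illustration): feeding Hedge the loss ℓ_i = 1 − |h_t(x_i) − y_i|, which is large for a *good* prediction, reproduces AdaBoost's reweighting rule exactly.

```python
beta = 0.4                        # a value of beta_t, assumed for illustration
w = [0.2, 0.5, 0.3]               # current example weights (assumed)
mistakes = [1, 0, 1]              # |h_t(x_i) - y_i| for each example

# Hedge: multiply each weight by beta ** loss, with the reversed loss.
hedge_losses = [1 - m for m in mistakes]
via_hedge = [wi * beta ** l for wi, l in zip(w, hedge_losses)]

# AdaBoost: w_i^{t+1} = w_i^t * beta_t ** (1 - |h_t(x_i) - y_i|).
via_adaboost = [wi * beta ** (1 - m) for wi, m in zip(w, mistakes)]
print(via_hedge, via_adaboost)
```

Misclassified examples suffer zero Hedge loss, keep their weight, and so dominate the next round — exactly the "focus on hard examples" behavior described above.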


Boosting — Critique

Not enough data

Can’t produce something out of nothing

Base learner too weak

Committee may not be able to come up with a good hypothesis

Base learner too strong

Individual learner may overfit on its own

Susceptibility to noisy data

May work too hard to find patterns in outliers


Boosting — A Bayesian Perspective

In a Bayesian framework with examples being generated from some distribution, we are given a set of binary hypotheses $h_t$, $t \in \{1, \ldots, T\}$, with the goal of combining them in an optimal way. We should predict the label $y$ with the highest likelihood.

Predict 1 if

$\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)]$

and predict 0 if the reverse inequality holds.

Assuming that the event $h_t(x) \neq y$ is conditionally independent of $y$ and of the sequence $h_1(x), \ldots, h_T(x)$, we can use Bayes' rule to rewrite the selection criterion: predict 1 if

$\Pr[y = 1] \prod_{t : h_t(x) = 0} \varepsilon_t \prod_{t : h_t(x) = 1} (1 - \varepsilon_t) \;>\; \Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \varepsilon_t) \prod_{t : h_t(x) = 1} \varepsilon_t$

and 0 otherwise, where $\varepsilon_t = \Pr[h_t(x) \neq y]$.

Adding in the trivial hypothesis $h_0$, which always predicts 1, replacing $\Pr[y = 0]$ with $\varepsilon_0$, and taking the log of both sides yields a rule identical to the combination rule generated by AdaBoost.
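As a numerical sanity check, the sketch below (plain Python; the function names are ours, not the paper's) evaluates the Bayes-optimal rule in both its product form and its log form, in which each hypothesis votes with weight $\log((1 - \varepsilon_t)/\varepsilon_t)$, i.e. AdaBoost's $\log(1/\beta_t)$. The two forms should always agree.

```python
import math

def bayes_predict_product(prior1, preds, eps):
    """Bayes-optimal prediction via the product rule on the slide.

    prior1 = Pr[y = 1]; preds[t] = h_t(x) in {0, 1}; eps[t] = Pr[h_t(x) != y].
    """
    lhs = prior1        # evidence for y = 1
    rhs = 1 - prior1    # evidence for y = 0
    for h, e in zip(preds, eps):
        if h == 0:
            lhs *= e        # h_t erred if y = 1
            rhs *= 1 - e    # h_t was right if y = 0
        else:
            lhs *= 1 - e
            rhs *= e
    return 1 if lhs > rhs else 0

def bayes_predict_log(prior1, preds, eps):
    """Same rule after taking logs: a weighted vote with AdaBoost-style weights."""
    # Start from the prior log-odds; each h_t then votes for its predicted
    # label with weight log((1 - eps_t) / eps_t) = log(1 / beta_t).
    score = math.log(prior1 / (1 - prior1))
    for h, e in zip(preds, eps):
        w = math.log((1 - e) / e)
        score += w if h == 1 else -w
    return 1 if score > 0 else 0
```

With a single accurate hypothesis ($\varepsilon = 0.1$) and a uniform prior, both forms simply follow that hypothesis, as expected.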


Page 137: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Multiclass Extensions of AdaBoost

We modify our setup so that the set of labels is now $Y = \{1, 2, \ldots, k\}$.

Correspondingly, our weak learners return hypotheses $h_t : X \to Y$.

We still start with a set of weights over the examples and obtain a hypothesis based on this distribution from the weak learner.

We calculate the error of the hypothesis as

$\varepsilon_t = \sum_{i=1}^{N} p_i^t \, [[h_t(x_i) \neq y_i]]$

where $[[z]]$ is 1 if $z$ holds and 0 otherwise.

Let $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

The weights are updated as

$w_i^{t+1} = w_i^t \, \beta_t^{1 - [[h_t(x_i) \neq y_i]]}$

so that correctly classified examples are down-weighted by a factor of $\beta_t$.

The final hypothesis is

$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [[h_t(x) = y]]$
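The update rules above can be sketched directly in code. This is an illustrative sketch under our own assumptions, not the paper's pseudocode: the "weak learner" here is simply assumed to pick the lowest-weighted-error hypothesis from a fixed candidate pool, and all names are ours.

```python
import math

def boost_multiclass(examples, labels, hyp_pool, T):
    """Multiclass boosting sketch following the slide's update rules.

    hyp_pool is a hypothetical list of candidate hypotheses h: x -> label;
    the weak learner picks the pool member with the lowest weighted error.
    """
    N = len(examples)
    w = [1.0 / N] * N                       # initial weights
    chosen, betas = [], []
    for _ in range(T):
        s = sum(w)
        p = [wi / s for wi in w]            # normalize to a distribution p^t

        # Weak learner: minimize eps_t = sum_i p_i^t [[h(x_i) != y_i]].
        def werr(h):
            return sum(pi for pi, x, y in zip(p, examples, labels) if h(x) != y)

        h = min(hyp_pool, key=werr)
        eps = werr(h)
        if eps == 0 or eps >= 0.5:          # the analysis needs eps_t < 1/2
            break
        beta = eps / (1 - eps)
        # w_i^{t+1} = w_i^t * beta_t^{1 - [[h(x_i) != y_i]]}:
        # correctly classified examples are down-weighted by beta_t.
        w = [wi * (beta if h(x) == y else 1.0)
             for wi, x, y in zip(w, examples, labels)]
        chosen.append(h)
        betas.append(beta)

    def h_final(x):
        # arg max over labels y of sum_t log(1/beta_t) [[h_t(x) = y]]
        votes = {}
        for h, b in zip(chosen, betas):
            votes[h(x)] = votes.get(h(x), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)

    return h_final
```

For example, on six points with three labels and a pool of three imperfect threshold rules (each with weighted error 1/3 on the uniform distribution), three rounds of boosting suffice to classify the training set perfectly.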


Page 146: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Notes on the Multiclass Extensions of AdaBoost

This extension has no way of forcing the weak learner to discriminate between labels that are especially hard to distinguish

More sophisticated extensions exist, some of which are discussed in the paper

One can also approach the problem by converting it into many binary problems, boosting on each of those, and then patching the results together
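A minimal sketch of that binary reduction (one-vs-rest), assuming a hypothetical `train_binary_scorer` interface that, for each relabeled binary subproblem, returns a real-valued confidence function; prediction then takes the arg max over the per-class confidences:

```python
def one_vs_rest(examples, labels, classes, train_binary_scorer):
    """One-vs-rest reduction sketch (hypothetical interface, not the paper's).

    train_binary_scorer(xs, ys) is assumed to return a function
    x -> confidence that the binary label is 1. One scorer is trained per
    class c on labels relabeled as [[y == c]]; the results are patched
    together by taking the most confident class.
    """
    scorers = {
        c: train_binary_scorer(examples, [1 if y == c else 0 for y in labels])
        for c in classes
    }

    def predict(x):
        return max(classes, key=lambda c: scorers[c](x))

    return predict
```

Any binary learner fits this interface; in particular, each `train_binary_scorer` could itself be a boosted committee whose confidence is its weighted vote margin.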


Page 150: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Summary

The Hedge algorithm and its applications

The AdaBoost algorithm, its relationship to Hedge, its relationship to Bayesian statistics, and its multiclass extensions
