
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Sachin Gupta and Nitish Lakhanpal

November 1, 2010


Introduction

Scenario: a gambler wants to let his fellow gamblers make bets on his behalf

He will wager a fixed sum of money on every race, but will apportion his money among his friends based on how well they do

Obviously, if he knew in advance which friend would do best, he would just give that friend all of the money

We want to create an allocation strategy that mimics the payoff he would have attained had he bet everything with the luckiest friend

The paper discusses a simple algorithm for solving these dynamic allocation problems


On-line Allocation Model

Agent A has N options or strategies to choose from

Number the strategies with integers 1, …, N

Each time step t = 1, 2, …, T:

Allocator A decides on a distribution p^t over the strategies: p_i^t ≥ 0 is the amount allocated to strategy i, with ∑_{i=1}^N p_i^t = 1

Each strategy i suffers a loss ℓ_i^t, which is decided by the environment

The loss suffered by A is ∑_i p_i^t · ℓ_i^t, i.e. the average loss of the strategies with respect to A's chosen allocation rule. This loss function is called the mixture loss
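The mixture loss above can be sketched in a few lines of Python (the allocation and loss values below are made-up illustrations, not from the paper):

```python
def mixture_loss(p, l):
    """Mixture loss sum_i p_i * l_i: the average loss of the strategies
    under allocation p (p is a distribution, each loss in [0, 1])."""
    assert abs(sum(p) - 1.0) < 1e-9 and all(pi >= 0 for pi in p)
    return sum(pi * li for pi, li in zip(p, l))

p = [0.5, 0.3, 0.2]   # hypothetical allocation over N = 3 strategies
l = [1.0, 0.0, 0.5]   # hypothetical losses chosen by the environment
print(mixture_loss(p, l))  # 0.5*1.0 + 0.3*0.0 + 0.2*0.5 = 0.6
```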


Bounds on Losses

The loss suffered by any strategy is bounded, so that without loss of generality ℓ_i^t ∈ [0, 1]

No assumption is made about the form of the loss vectors ℓ^t or the manner in which they are generated

The adversary's choice of ℓ^t may depend on the allocator's chosen mixture p^t

The goal of algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy

A attempts to minimize the net loss L_A − min_i L_i, where

L_A = ∑_{t=1}^T p^t · ℓ^t, the total cumulative loss suffered by algorithm A on the first T trials

L_i = ∑_{t=1}^T ℓ_i^t, the cumulative loss of strategy i
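As a toy illustration of the net loss (hypothetical play, not data from the paper), L_A, the per-strategy losses L_i, and L_A − min_i L_i can be computed directly:

```python
def net_loss(ps, ls):
    """Net loss L_A - min_i L_i for a sequence of allocations ps[t]
    and loss vectors ls[t] over T trials."""
    n = len(ls[0])
    # L_A: total mixture loss of the allocator over all trials
    L_A = sum(sum(p[i] * l[i] for i in range(n)) for p, l in zip(ps, ls))
    # L_i: cumulative loss of each individual strategy
    L = [sum(l[i] for l in ls) for i in range(n)]
    return L_A - min(L)

# Two trials, two strategies; strategy 2 never loses, so the uniform
# allocator suffers net loss 1.0 relative to it (made-up numbers).
print(net_loss(ps=[[0.5, 0.5], [0.5, 0.5]], ls=[[1.0, 0.0], [1.0, 0.0]]))
```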


Repeated n Decision Game

The Hedge algorithm can be represented as a repeated n-decision game in a very natural way.

Let there be payoff vectors p^t equivalent to our loss vectors ℓ^t

Let there be a decision space ∆ consisting of decisions x^t equivalent to our weight vectors w^t
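A minimal sketch of the exponential-weights decision rule underlying this game view: the decision places weight on strategy i proportional to e^{ε P_i}, where P_i is that strategy's cumulative payoff so far (ε and the payoff values below are hypothetical):

```python
import math

def hedge_decision(P, eps):
    """Decision x in the simplex with x_i proportional to exp(eps * P_i),
    where P_i is the cumulative payoff of strategy i so far."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)  # normalizing constant
    return [wi / Z for wi in w]

# Strategies with higher cumulative payoff receive more of the weight.
x = hedge_decision(P=[1.0, 3.0, 2.0], eps=0.5)
print([round(xi, 3) for xi in x])  # sums to 1; most weight on strategy with P_i = 3.0
```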


Proof of Regret Bound for Exponential Updating

We prove this in the repeated n-decision game framework (equivalent to the paper's formulation)

Lemma 1: For any ε > 0, the decision x^t of the Weighted-majority algorithm Hedge(ε) maximizes (1/ε)H(x) + p^1 · x + … + p^{t−1} · x over x ∈ ∆_n, where H(x) = ∑_i x_i log(1/x_i) is the entropy
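Lemma 1 can be sanity-checked numerically: the decision x* with x*_i ∝ e^{εP_i} should score at least as well on the entropy-regularized payoff (1/ε)H(x) + P · x as any other point of the simplex. A sketch with hypothetical ε and payoffs:

```python
import math, random

def objective(x, P, eps):
    """(1/eps) * H(x) + P . x, where H(x) = sum_i x_i log(1/x_i)."""
    H = sum(xi * math.log(1.0 / xi) for xi in x if xi > 0)
    return H / eps + sum(xi * Pi for xi, Pi in zip(x, P))

def hedge_decision(P, eps):
    w = [math.exp(eps * Pi) for Pi in P]
    return [wi / sum(w) for wi in w]

eps, P = 0.7, [0.2, 1.5, -0.4]           # hypothetical values
best = objective(hedge_decision(P, eps), P, eps)
random.seed(0)
for _ in range(1000):
    r = [random.random() for _ in P]
    x = [ri / sum(r) for ri in r]        # random point of the simplex
    assert objective(x, P, eps) <= best + 1e-9
print("Hedge's decision maximizes the objective on this sample")
```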


Proof of Lemma 1 (Regularity)

By definition, we have for any x ∈ ∆_n (writing P_i^t = p_i^1 + … + p_i^{t−1} for the cumulative payoff of strategy i),

(1/ε)H(x) + p^1 · x + … + p^{t−1} · x = ∑_i [ (1/ε) x_i log(1/x_i) + P_i^t x_i ]

By simple algebra, the above is equal to

∑_i (1/ε) ( x_i log(1/x_i) + ε x_i P_i^t ) = (1/ε) ∑_i x_i log( e^{ε P_i^t} / x_i )

For the x_i chosen by the algorithm, x_i = e^{ε P_i^t} / Z^t with Z^t = ∑_i e^{ε P_i^t}, the above expression is (1/ε) ∑_i x_i log Z^t = (1/ε) log Z^t, because ∑_i x_i = 1. Using Jensen's inequality (concavity of log), we have for any x ∈ ∆_n:

(1/ε) ∑_i x_i log( e^{ε P_i^t} / x_i ) ≤ (1/ε) log ∑_i x_i · ( e^{ε P_i^t} / x_i ) = (1/ε) log Z^t


Proof of Lemma 2 (Stability)

Lemma 2: For any ε, M > 0, t ≥ 1 and p^1, p^2, …, p^t ∈ [−M, M]^n,

|p^t · x^{t+1} − p^t · x^t| ≤ 4εM²

Proof

Note first that P_i^{t+1} − M ≤ P_i^t ≤ P_i^{t+1} + M, and hence

e^{ε P_i^{t+1}} e^{−εM} ≤ e^{ε P_i^t} ≤ e^{ε P_i^{t+1}} e^{εM}

Summing the right-hand inequality over i gives Z^t ≤ Z^{t+1} e^{εM}, which, combined with the left-hand inequality applied to the numerator, gives

x_i^t = e^{ε P_i^t} / Z^t ≥ e^{−2εM} · e^{ε P_i^{t+1}} / Z^{t+1} = e^{−2εM} x_i^{t+1}   ∀ i ≤ n

Finally, since e^{−s} ≥ 1 − s for s ≥ 0, we have that x_i^t ≥ (1 − 2εM) x_i^{t+1}
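The stability bound of Lemma 2 can likewise be checked on simulated payoffs: with decisions x^t ∝ e^{εP^t} and payoffs in [−M, M], each step should move the mixture payoff by at most 4εM². A sketch (ε, M, and the random payoff sequence are hypothetical):

```python
import math, random

def hedge_decision(P, eps):
    w = [math.exp(eps * Pi) for Pi in P]
    return [wi / sum(w) for wi in w]

eps, M, n = 0.05, 1.0, 4
bound = 4 * eps * M * M          # 4*eps*M^2 = 0.2
random.seed(1)
P = [0.0] * n                    # cumulative payoffs P_i
for t in range(200):
    x_t = hedge_decision(P, eps)
    p_t = [random.uniform(-M, M) for _ in range(n)]  # adversarial payoffs
    P = [Pi + pi for Pi, pi in zip(P, p_t)]
    x_next = hedge_decision(P, eps)
    drift = abs(sum(pi * (xn - xo) for pi, xn, xo in zip(p_t, x_next, x_t)))
    assert drift <= bound + 1e-9
print("|p^t . x^(t+1) - p^t . x^t| <= 4*eps*M^2 held on every step")
```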


Proof of Lemma 2 (Stability)

Let λ = 2εM. First, if λ > 1, the lemma is trivial because 4εM^2 > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume λ ∈ [0, 1]. Let z^t ∈ R^n be the unique vector such that

    x^t = (1 − λ) x^{t+1} + λ z^t

Then we claim that z^t ∈ Δ_n. The fact that z_i^t ≥ 0 follows directly from the bound x_i^t ≥ (1 − 2εM) x_i^{t+1} above. The fact that Σ_i z_i^t = 1 follows from Σ_i x_i^t = 1, Σ_i x_i^{t+1} = 1, and the fact that x^t is a convex combination of x^{t+1} and z^t


Proof of Lemma 2 (Stability)

Finally,

    x^t · p^t − x^{t+1} · p^t = −λ x^{t+1} · p^t + λ z^t · p^t

Since y · p ∈ [−M, M] for all y ∈ Δ_n, the magnitude of the above quantity is at most 2λM = 4εM^2, as required by the lemma
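The bound just proved is easy to check numerically. The sketch below (an illustration with made-up payoffs, not code from the paper) forms the exponential-weights distributions x^t and x^{t+1} from cumulative payoffs and verifies |p^t · x^{t+1} − p^t · x^t| ≤ 4εM^2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, eps = 10, 1.0, 0.05

# Cumulative payoffs before and after round t; each entry of p^t lies in [-M, M].
P_t = rng.uniform(-5, 5, size=n)
p_t = rng.uniform(-M, M, size=n)
P_next = P_t + p_t

def exp_weights(P, eps):
    # x_i proportional to e^{eps * P_i}; subtract the max for numerical stability.
    w = np.exp(eps * (P - P.max()))
    return w / w.sum()

x_t = exp_weights(P_t, eps)
x_next = exp_weights(P_next, eps)

gap = abs(p_t @ x_next - p_t @ x_t)
assert gap <= 4 * eps * M**2   # Lemma 2
```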


Regret Bound

Theorem: For any n, T ≥ 1, M ≥ 0, and for any p^1, p^2, ..., p^T ∈ [−M, M]^n, the weighted-majority algorithm achieves

    regret ≤ (1/T) Σ_{t=1}^T 4εM^2 + log(n)/(Tε) = 4εM^2 + log(n)/(Tε)

Proof

The stability-regularization lemma, combined with Lemma 1, implies that

    regret ≤ (1/T) Σ_{t=1}^T |x^{t+1} · p^t − x^t · p^t| + (1/T) max_{x, x' ∈ Δ_n} |(1/ε) H(x) − (1/ε) H(x')|


Regret Bound

Since we have shown 0 ≤ H(x) ≤ log(n), we have (1/T) max_{x, x' ∈ Δ_n} |(1/ε) H(x) − (1/ε) H(x')| ≤ log(n)/(Tε). Lemma 2 bounds the "stability" term. Putting these together gives

    regret ≤ (1/T) Σ_{t=1}^T 4εM^2 + log(n)/(Tε) = 4εM^2 + log(n)/(Tε)
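As an empirical sanity check on the theorem, one can run the exponential-weights update x^t ∝ e^{εP^t} against random bounded payoffs and compare the realized average regret with 4εM^2 + log(n)/(Tε). This is a hypothetical sketch, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, M, eps = 5, 2000, 1.0, 0.05

P = np.zeros(n)        # cumulative payoffs P^t
alg_payoff = 0.0
for t in range(T):
    w = np.exp(eps * (P - P.max()))
    x = w / w.sum()                # x^t built from payoffs seen so far
    p = rng.uniform(-M, M, n)      # payoff vector p^t
    alg_payoff += x @ p
    P += p

regret = P.max() / T - alg_payoff / T    # best fixed decision vs. the algorithm
bound = 4 * eps * M**2 + np.log(n) / (T * eps)
assert regret <= bound
```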


Applications

The framework is quite general and can be applied to a number of learning problems

Take a decision space Δ, a space of outcomes Ω, and a bounded loss function λ : Δ × Ω → [0, 1]

At each time step t, the learning algorithm selects a decision δ^t from Δ and receives an outcome ω^t from Ω, suffering loss λ(δ^t, ω^t). In general, we allow the learner to select a distribution D^t over the space of decisions and suffer the expected loss of a decision randomly selected according to D^t

Expected loss: Λ(D^t, ω^t) = E_{δ∼D^t}[λ(δ, ω^t)]

Assuming the learner has access to a set of N experts, we can decide on an appropriate distribution D^t

At each time step t, each expert i produces its own distribution E_i^t on Δ and suffers loss Λ(E_i^t, ω^t)

The learner needs to combine the distributions produced by the experts so as to suffer an expected loss "not much worse" than that of the best expert
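In code, the framework reduces to a loss matrix and a distribution over decisions, with Λ(D, ω) a weighted average. A minimal sketch (the matrix and names are illustrative assumptions, not from the paper):

```python
import numpy as np

# loss[d, w] = λ(δ_d, ω_w) ∈ [0, 1]: three decisions, two outcomes (made up).
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0],
                 [0.5, 0.5]])

def expected_loss(D, omega):
    # Λ(D, ω) = E_{δ∼D}[λ(δ, ω)]
    return D @ loss[:, omega]

D = np.array([1/3, 1/3, 1/3])   # uniform distribution over decisions
val = expected_loss(D, 0)
assert abs(val - 0.5) < 1e-9    # (0 + 1 + 0.5) / 3
```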


Applications

Run the algorithm Hedge(β), treating each expert as a strategy

Hedge(β) produces a distribution p^t on the set of experts that is used to construct the mixture distribution D^t = Σ_{i=1}^N p_i^t E_i^t

For any outcome ω^t, the loss suffered by Hedge(β) is then Λ(D^t, ω^t) = Σ_{i=1}^N p_i^t Λ(E_i^t, ω^t)

If we define ℓ_i^t = Λ(E_i^t, ω^t), then the loss suffered by the learner is p^t · ℓ^t

It follows that for any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β), if used as described above, is at most

    Σ_{t=1}^T Λ(D^t, ω^t) ≤ min_i Σ_{t=1}^T Λ(E_i^t, ω^t) + √(2L ln N) + ln N

where L ≤ T is an assumed bound on the expected loss of the best expert and β = g(L / ln N)
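The Hedge(β) update itself is tiny: multiply each expert's weight by β^{ℓ_i^t} and renormalize. The sketch below is hypothetical (random losses; β chosen as g(L/ln N) with L = T, assuming g(z) = 1/(1 + √(2/z))), and it checks the learner's cumulative loss against the best expert plus the stated overhead:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 8, 1000
# β = g(L / ln N) with L = T and g(z) = 1 / (1 + sqrt(2/z))
beta = 1 / (1 + np.sqrt(2 * np.log(N) / T))

w = np.ones(N)                  # initial expert weights
learner_loss = 0.0
expert_loss = np.zeros(N)

for t in range(T):
    p = w / w.sum()                  # distribution p^t over the experts
    ell = rng.uniform(0.0, 1.0, N)   # ell_i = Λ(E_i^t, ω^t) ∈ [0, 1]
    learner_loss += p @ ell          # loss of the mixture D^t
    expert_loss += ell
    w *= beta ** ell                 # Hedge(β) multiplicative update

overhead = np.sqrt(2 * T * np.log(N)) + np.log(N)
assert learner_loss <= expert_loss.min() + overhead
```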


Rock, Paper, Scissors

The loss function λ can represent an arbitrary matrix game like Rock, Paper, Scissors

Here Δ = Ω = {R, P, S}

Loss function λ(δ, ω):

             ω
            R     P     S
    δ  R   1/2    1     0
       P    0    1/2    1
       S    1     0    1/2

δ represents the learner's play, ω is the adversary's play

λ(δ, ω) is the learner's loss: 1 if he loses, 0 if he wins, 1/2 if he ties

Cumulative loss is simply the expected number of losses in a series of rounds of game play

Results of the study show that the paper's algorithm has expected loss nearly identical to that of the best expert
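To make the game concrete, the following sketch (an illustrative simulation, not the paper's experiment) runs Hedge over three pure experts (always-R, always-P, always-S) against an adversary who favors Rock, and checks that the learner's loss rate approaches that of the best expert, always-Paper:

```python
import numpy as np

rng = np.random.default_rng(3)
R, P, S = 0, 1, 2
# loss[delta, omega]: 1 if the learner loses, 0 if he wins, 1/2 if he ties
loss = np.array([[0.5, 1.0, 0.0],
                 [0.0, 0.5, 1.0],
                 [1.0, 0.0, 0.5]])

T, N = 2000, 3
beta = 1 / (1 + np.sqrt(2 * np.log(N) / T))   # β = g(T / ln N)
w = np.ones(N)
learner, experts = 0.0, np.zeros(N)

for t in range(T):
    pdist = w / w.sum()
    omega = rng.choice(3, p=[0.6, 0.2, 0.2])   # adversary favors Rock
    ell = loss[:, omega]                       # loss of each pure expert
    learner += pdist @ ell
    experts += ell
    w *= beta ** ell

assert experts.argmin() == P            # always-Paper is the best expert
assert learner / T <= experts.min() / T + 0.1
```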


Boosting

Bias-Variance Tradeoff

Expected Error of Learner = Variance + Squared Bias = Sensitivity of Prediction + Systematic Error

Low Bias → High Variance

Ex: 1st-order curves vs. 9th-order curves


Page 81: A Decision-Theoretic Generalization of On-Line … Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting Sachin Gupta and Nitish Lakhanpal November 1,

Boosting

Can we reduce variance without increasing bias?

Use committees

Bias

Average performance at least as good as average performance of members

Variance

Spurious patterns outvoted
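The "spurious patterns outvoted" claim can be checked with a simple counting argument: if committee members err independently, the majority vote errs far less often than any single member. A small sketch (the member accuracy and committee sizes are illustrative assumptions):

```python
from math import comb

def majority_accuracy(p, T):
    """Probability that a majority of T independent members, each correct
    with probability p, votes for the right answer (T odd)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

single = 0.7                           # assumed accuracy of each member
print(majority_accuracy(single, 11))   # a committee of 11 beats any member
print(majority_accuracy(single, 101))  # larger committees do even better
```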


Boosting

Intuition

Break the training set up into samples focusing on the “hardest” parts of the space

Forces weak learners to generate new hypotheses → Fewer mistakes are made on these parts


Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs a hypothesis with error at most ε with probability 1 − δ.

A strong learner is a classifier that is arbitrarily well correlated with the true classification.

Weak Learning

Same as above, except that ε ≥ 1/2 − γ, where γ > 0 or decreases as 1/p for a polynomial p

Intuitively, a weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing)
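A weak learner in this sense is easy to exhibit. A sketch (the overlapping Gaussian classes and the threshold value are assumptions for illustration): a single threshold on heavily overlapping classes labels examples only slightly better than coin-flipping.

```python
import random

random.seed(1)

# Two heavily overlapping classes: label-1 examples are drawn with a mean
# only slightly above label-0 examples, so no threshold separates them well.
labels = [random.randint(0, 1) for _ in range(5000)]
data = [(random.gauss(0.2 if y == 1 else 0.0, 1.0), y) for y in labels]

def weak_stump(x, threshold=0.1):
    """Predict 1 when x exceeds the threshold."""
    return 1 if x > threshold else 0

accuracy = sum(weak_stump(x) == y for x, y in data) / len(data)
print(f"weak learner accuracy: {accuracy:.3f}")   # slightly above 1/2
```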


Boosting

The Algorithm

Given a distribution over the training set, calculate weights over the examples

Compute a normalized distribution

Feed the normalized distribution to a weak learning algorithm

Generate a new set of weights

Output a hypothesis combining the outputs of the weak hypotheses by a weighted majority vote
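These steps can be sketched end to end. A minimal toy implementation (the 1-D dataset, the decision-stump weak learner, and the round count are our own illustrative choices, not the paper's): each round normalizes the weights, trains the best stump on the distribution, multiplies the weight of each correctly classified example by β_t, and finally combines the stumps by weighted majority vote.

```python
import math

# Toy 1-D training set (assumed for illustration): the target concept is
# "x lies in an outer third", which no single threshold stump can match.
X = list(range(10))
y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
N = len(X)

def stump(t, d):
    """Threshold classifier: predict 1 if x <= t (d=+1) or if x > t (d=-1)."""
    return lambda x: int((x <= t) if d == 1 else (x > t))

def weighted_error(h, p):
    return sum(pi for pi, xi, yi in zip(p, X, y) if h(xi) != yi)

def weak_learner(p):
    """Return the stump with the lowest weighted error under p."""
    t, d = min(((t, d) for t in range(10) for d in (1, -1)),
               key=lambda td: weighted_error(stump(*td), p))
    h = stump(t, d)
    return h, weighted_error(h, p)

w = [1.0 / N] * N                  # initial weights over the examples
hyps, alphas, eps_list = [], [], []
for _ in range(3):                 # three boosting rounds suffice here
    total = sum(w)
    p = [wi / total for wi in w]   # normalized distribution
    h, eps = weak_learner(p)       # feed it to the weak learning algorithm
    beta = eps / (1 - eps)
    w = [wi * (beta if h(xi) == yi else 1.0)   # shrink correct examples,
         for wi, xi, yi in zip(w, X, y)]       # leaving the "hard" ones heavy
    hyps.append(h)
    alphas.append(math.log(1 / beta))
    eps_list.append(eps)

def final_hypothesis(x):
    """Weighted majority vote of the weak hypotheses."""
    return int(sum(a * h(x) for a, h in zip(alphas, hyps))
               >= 0.5 * sum(alphas))

mistakes = sum(final_hypothesis(xi) != yi for xi, yi in zip(X, y))
print(f"round errors: {[round(e, 3) for e in eps_list]}, "
      f"training mistakes of the vote: {mistakes}")
```

On this dataset the vote reaches zero training mistakes even though every individual stump errs on at least 18% of the weighted examples.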


Boosting

Rules for generating a new set of weights

Calculate the error of the weak hypothesis:

ε_t = ∑_{i=1}^{N} p_i^t |h_t(x_i) − y_i|

Set β_t = ε_t / (1 − ε_t)

Calculate the new weights as:

w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
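Plugging small numbers (assumed for illustration) into these rules shows their effect: the misclassified example's share of the next distribution grows relative to the rest.

```python
# Three examples under a uniform distribution; the weak hypothesis h_t
# misclassifies only the third one (values assumed for illustration).
p = [1/3, 1/3, 1/3]                    # p_i^t
mistakes = [0, 0, 1]                   # |h_t(x_i) - y_i|
eps = sum(pi * m for pi, m in zip(p, mistakes))   # weighted error
beta = eps / (1 - eps)
w_next = [pi * beta ** (1 - m) for pi, m in zip(p, mistakes)]
q = [wi / sum(w_next) for wi in w_next]           # next distribution
print(eps, beta, q)
```

The error is 1/3, so β_t = 1/2: the two correct examples are halved while the mistake keeps its weight, and after normalization the mistake carries half the probability mass.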


Relationship Between Boosting and Weighted Majority

Dual relationship between Hedge and AdaBoost

“There is a direct mapping or reduction of the boosting problem to the on-line allocation problem”

The examples used in AdaBoost can be treated as strategies and the weak hypotheses can be treated as trials, giving us a rehashing of AdaBoost in the terminology of Hedge

We also need to reverse the loss function, because in AdaBoost our loss is large if we encounter a bad prediction, while in Hedge it is large if we encounter a good prediction

Loss is large for bad predictions in AdaBoost in order to focus on “hard” problems
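The reversed loss makes the two updates literally the same expression. A tiny check (the β and weight values are assumed for illustration): feeding Hedge the loss ℓ_i = 1 − |h_t(x_i) − y_i|, which is large for a *good* prediction, reproduces AdaBoost's reweighting rule exactly.

```python
beta = 0.4                        # a value of beta_t, assumed for illustration
w = [0.2, 0.5, 0.3]               # current example weights (assumed)
mistakes = [1, 0, 1]              # |h_t(x_i) - y_i| for each example

# Hedge: multiply each weight by beta ** loss, with the reversed loss.
hedge_losses = [1 - m for m in mistakes]
via_hedge = [wi * beta ** l for wi, l in zip(w, hedge_losses)]

# AdaBoost: w_i^{t+1} = w_i^t * beta_t ** (1 - |h_t(x_i) - y_i|).
via_adaboost = [wi * beta ** (1 - m) for wi, m in zip(w, mistakes)]
print(via_hedge, via_adaboost)
```

Misclassified examples suffer zero Hedge loss, keep their weight, and so dominate the next round — exactly the "focus on hard examples" behavior described above.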


Boosting — Critique

Not enough data

Can’t produce something out of nothing

Base learner too weak

Committee may not be able to come up with a good hypothesis

Base learner too strong

Individual learner may overfit on its own

Susceptibility to noisy data

May work too hard to find patterns in outliers


Boosting — A Bayesian Perspective

In a Bayesian framework with examples being generated from some distribution, we are given a set of binary hypotheses $h_t$, $t \in \{1, \ldots, T\}$, with the goal of combining them in an optimal way. We should predict the label $y$ with the highest likelihood.

Predict 1 if

$\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)]$

and predict 0 if the reverse inequality holds.

Assuming that the event $h_t(x) \neq y$ is conditionally independent of $y$ and of the sequence $h_1(x), \ldots, h_T(x)$, we can use Bayes' rule to rewrite the selection criterion: predict 1 if

$\Pr[y = 1] \prod_{t : h_t(x) = 0} \varepsilon_t \prod_{t : h_t(x) = 1} (1 - \varepsilon_t) \;>\; \Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \varepsilon_t) \prod_{t : h_t(x) = 1} \varepsilon_t$

and 0 otherwise, where $\varepsilon_t = \Pr[h_t(x) \neq y]$.

Adding in the trivial hypothesis $h_0$, which always predicts 1, replacing $\Pr[y = 0]$ with $\varepsilon_0$, and taking the log of both sides yields a rule identical to the combination rule generated by AdaBoost.
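As a numerical sanity check, the sketch below (plain Python; the function names are ours, not the paper's) evaluates the Bayes-optimal rule in both its product form and its log form, in which each hypothesis votes with weight $\log((1 - \varepsilon_t)/\varepsilon_t)$, i.e. AdaBoost's $\log(1/\beta_t)$. The two forms should always agree.

```python
import math

def bayes_predict_product(prior1, preds, eps):
    """Bayes-optimal prediction via the product rule on the slide.

    prior1 = Pr[y = 1]; preds[t] = h_t(x) in {0, 1}; eps[t] = Pr[h_t(x) != y].
    """
    lhs = prior1        # evidence for y = 1
    rhs = 1 - prior1    # evidence for y = 0
    for h, e in zip(preds, eps):
        if h == 0:
            lhs *= e        # h_t erred if y = 1
            rhs *= 1 - e    # h_t was right if y = 0
        else:
            lhs *= 1 - e
            rhs *= e
    return 1 if lhs > rhs else 0

def bayes_predict_log(prior1, preds, eps):
    """Same rule after taking logs: a weighted vote with AdaBoost-style weights."""
    # Start from the prior log-odds; each h_t then votes for its predicted
    # label with weight log((1 - eps_t) / eps_t) = log(1 / beta_t).
    score = math.log(prior1 / (1 - prior1))
    for h, e in zip(preds, eps):
        w = math.log((1 - e) / e)
        score += w if h == 1 else -w
    return 1 if score > 0 else 0
```

With a single accurate hypothesis ($\varepsilon = 0.1$) and a uniform prior, both forms simply follow that hypothesis, as expected.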


Page 137: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Multiclass Extensions of AdaBoost

We modify our setup so that the set of labels is now $Y = \{1, 2, \ldots, k\}$.

Correspondingly, our weak learners return hypotheses $h_t : X \to Y$.

We still start with a set of weights over the examples and obtain a hypothesis based on this distribution from the weak learner.

We calculate the error of the hypothesis as

$\varepsilon_t = \sum_{i=1}^{N} p_i^t \, [[h_t(x_i) \neq y_i]]$

where $[[z]]$ is 1 if $z$ holds and 0 otherwise.

Let $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

The weights are updated as

$w_i^{t+1} = w_i^t \, \beta_t^{1 - [[h_t(x_i) \neq y_i]]}$

so that correctly classified examples are down-weighted by a factor of $\beta_t$.

The final hypothesis is

$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [[h_t(x) = y]]$
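The update rules above can be sketched directly in code. This is an illustrative sketch under our own assumptions, not the paper's pseudocode: the "weak learner" here is simply assumed to pick the lowest-weighted-error hypothesis from a fixed candidate pool, and all names are ours.

```python
import math

def boost_multiclass(examples, labels, hyp_pool, T):
    """Multiclass boosting sketch following the slide's update rules.

    hyp_pool is a hypothetical list of candidate hypotheses h: x -> label;
    the weak learner picks the pool member with the lowest weighted error.
    """
    N = len(examples)
    w = [1.0 / N] * N                       # initial weights
    chosen, betas = [], []
    for _ in range(T):
        s = sum(w)
        p = [wi / s for wi in w]            # normalize to a distribution p^t

        # Weak learner: minimize eps_t = sum_i p_i^t [[h(x_i) != y_i]].
        def werr(h):
            return sum(pi for pi, x, y in zip(p, examples, labels) if h(x) != y)

        h = min(hyp_pool, key=werr)
        eps = werr(h)
        if eps == 0 or eps >= 0.5:          # the analysis needs eps_t < 1/2
            break
        beta = eps / (1 - eps)
        # w_i^{t+1} = w_i^t * beta_t^{1 - [[h(x_i) != y_i]]}:
        # correctly classified examples are down-weighted by beta_t.
        w = [wi * (beta if h(x) == y else 1.0)
             for wi, x, y in zip(w, examples, labels)]
        chosen.append(h)
        betas.append(beta)

    def h_final(x):
        # arg max over labels y of sum_t log(1/beta_t) [[h_t(x) = y]]
        votes = {}
        for h, b in zip(chosen, betas):
            votes[h(x)] = votes.get(h(x), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)

    return h_final
```

For example, on six points with three labels and a pool of three imperfect threshold rules (each with weighted error 1/3 on the uniform distribution), three rounds of boosting suffice to classify the training set perfectly.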


Page 146: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Notes on the Multiclass Extensions of AdaBoost

This extension has no way of forcing the weak learner to discriminate between labels that are especially hard to distinguish

More sophisticated extensions exist, some of which are discussed in the paper

One can also approach the problem by converting it into many binary problems, boosting on each of those, and then patching the results together
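A minimal sketch of that binary reduction (one-vs-rest), assuming a hypothetical `train_binary_scorer` interface that, for each relabeled binary subproblem, returns a real-valued confidence function; prediction then takes the arg max over the per-class confidences:

```python
def one_vs_rest(examples, labels, classes, train_binary_scorer):
    """One-vs-rest reduction sketch (hypothetical interface, not the paper's).

    train_binary_scorer(xs, ys) is assumed to return a function
    x -> confidence that the binary label is 1. One scorer is trained per
    class c on labels relabeled as [[y == c]]; the results are patched
    together by taking the most confident class.
    """
    scorers = {
        c: train_binary_scorer(examples, [1 if y == c else 0 for y in labels])
        for c in classes
    }

    def predict(x):
        return max(classes, key=lambda c: scorers[c](x))

    return predict
```

Any binary learner fits this interface; in particular, each `train_binary_scorer` could itself be a boosted committee whose confidence is its weighted vote margin.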


Page 150: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Summary

The Hedge algorithm and its applications

The AdaBoost algorithm, its relationship to Hedge, its relationship to Bayesian statistics, and its multiclass extensions
