The Multi-Armed Bandit Problem
Sumeet Katariya
Electrical and Computer Engineering
December 7, 2013
Slides: homepages.cae.wisc.edu/~sumeet/files/banditsslides.pdf
Page 1: Title

Page 2: Outline
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 3: A/B Testing

Page 4: Exploration vs. Exploitation
Scientist view: explore new ideas.
Businessman view: exploit the best idea found so far.

Page 5: Terminology
Pulling an arm = making a choice (which ad/color to display)
Reward/regret = measure of success (user click, item purchase)

Page 6: Problem Formulation
K arms 1, …, K.
Arm i has reward distribution ν_i(x), x ∈ [0, 1], with mean μ_i. Think Bernoulli(p_i).
The ν_i are unknown.
Finite time horizon (number of arm pulls) n.
At time t, the player chooses arm I_t ∈ {1, …, K}; the environment returns reward g_{I_t, t} ~ ν_{I_t}.
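The formulation above can be sketched as a tiny simulator. This is illustrative code in Python; the class name and interface are my own, not from the slides:

```python
import random

class BernoulliBandit:
    """K arms; arm i pays reward 1 with probability p_i (the slides' Bernoulli(p_i)).
    The means are hidden from the player; only pulls reveal rewards."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # the unknown mu_i
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        """Return reward g_{I_t, t} ~ Bernoulli(p_arm)."""
        return 1 if self.rng.random() < self.means[arm] else 0

bandit = BernoulliBandit([0.1, 0.1, 0.1, 0.1, 0.9])
rewards = [bandit.pull(4) for _ in range(1000)]   # repeatedly pull the best arm
```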


Page 11: Definitions
Define i* = argmax_{i=1,…,K} μ_i and μ* = max_{i=1,…,K} μ_i.
Δ_i = μ* − μ_i,   T_i(n) = Σ_{t=1}^n 1{I_t = i}.
Cumulative regret: R̂_n = Σ_{t=1}^n g_{i*, t} − Σ_{t=1}^n g_{I_t, t}.
Objective
Find the best arm.
Minimize the expected regret:
R_n = E R̂_n = n μ* − E Σ_{i=1}^K T_i(n) μ_i = Σ_{i=1}^K Δ_i E T_i(n)
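The regret decomposition above can be checked numerically. The arm means and pull counts below are hypothetical, chosen only to illustrate the identity:

```python
# Check: n*mu* - sum_i T_i(n)*mu_i  equals  sum_i Delta_i * T_i(n)
means = [0.1, 0.1, 0.1, 0.1, 0.9]     # hypothetical mu_i
mu_star = max(means)
deltas = [mu_star - m for m in means]  # Delta_i = mu* - mu_i

pulls = [100, 50, 30, 20, 800]         # hypothetical T_i(n)
n = sum(pulls)

pseudo_regret = n * mu_star - sum(t * m for t, m in zip(pulls, means))
gap_form = sum(d * t for d, t in zip(deltas, pulls))
assert abs(pseudo_regret - gap_form) < 1e-9
```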


Page 15: Outline
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 16: Clarification
Objectively vs. subjectively best options
Objectively best: the option that is truly best (as known to an oracle).
Subjectively best: the option that has performed best in the past.
Exploitation vs. exploration
Exploitation: choose the subjectively best arm.
Exploration: choose anything else.


Page 18: ε-Greedy Algorithm
Strategy = ε · Scientist + (1 − ε) · Businessman
At each time t:
With probability 1 − ε, pick the subjectively best arm.
With probability ε, pick an arm uniformly at random (each arm is chosen with probability ε/K).
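The strategy above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the function name, interface, and the two-arm demo means are my own:

```python
import random

def epsilon_greedy(pull, k, n, eps=0.1, seed=0):
    """With prob. 1-eps exploit the empirically (subjectively) best arm;
    with prob. eps explore uniformly, so each arm is picked with prob. eps/k.
    `pull(arm)` returns a reward in [0, 1]."""
    rng = random.Random(seed)
    counts = [0] * k        # T_i(t): pulls of each arm
    values = [0.0] * k      # mu_hat_i: empirical mean reward of each arm
    history = []
    for _ in range(n):
        if rng.random() < eps:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda i: values[i])  # exploit
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]    # incremental mean
        history.append(arm)
    return counts, values, history

means = [0.1, 0.9]          # hypothetical two-arm example
_rng = random.Random(1)
counts, values, _ = epsilon_greedy(
    lambda a: 1 if _rng.random() < means[a] else 0, k=2, n=2000, eps=0.1, seed=2)
```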

Page 19: Probability of Selecting Best Arm
5 Bernoulli arms with reward probabilities 0.1, 0.1, 0.1, 0.1, 0.9
[Figure: "Accuracy of the Epsilon Greedy Algorithm" — probability of selecting the best arm vs. time (0 to 250), one curve per ε ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.]
ε = 0.1 (Businessman): learns slowly, does well at the end.
ε = 0.5 (Scientist): learns quickly, doesn't exploit at the end.

Page 20: Theoretical Guarantee
Weakness: constant ε. Solution: annealing.
Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002)
Let Δ = min_{i: Δ_i > 0} Δ_i and consider ε_t = min(6K / (Δ² t), 1).
When t ≥ 6K/Δ², the probability of choosing a suboptimal arm i is bounded by C / (Δ² t), for some constant C > 0.
As a consequence, E[T_i(n)] ≤ (C/Δ²) log n and
R_n ≤ Σ_{i: Δ_i > 0} (C Δ_i / Δ²) log n → logarithmic regret.
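The annealing schedule from the theorem can be sketched as below. Note Δ is unknown in practice, so this is illustrative only; real deployments tune a decaying schedule rather than plugging in Δ:

```python
def annealed_epsilon(t, k, delta):
    """Exploration probability from the theorem: eps_t = min(6K / (Delta^2 * t), 1).
    t is 1-based; delta is the smallest positive gap Delta."""
    return min(6.0 * k / (delta ** 2 * t), 1.0)

# Early on the schedule always explores (eps_t = 1);
# once t >= 6K/Delta^2 the exploration rate decays like 1/t.
k, delta = 5, 0.8
threshold = 6 * k / delta ** 2
```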


Page 23: Weakness of ε-Greedy
Exploration is insensitive to relative performance levels:
Two arms with rewards 0.9 and 0.1.
Two arms with rewards 0.15 and 0.1.
Solution: the Softmax algorithm.

Page 24: Softmax Algorithm
Idea:
P(arm 1) = μ̂₁ / (μ̂₁ + μ̂₂),   P(arm 2) = μ̂₂ / (μ̂₁ + μ̂₂)
Variant:
P(arm 1) = e^{μ̂₁/T} / (e^{μ̂₁/T} + e^{μ̂₂/T}),   P(arm 2) = e^{μ̂₂/T} / (e^{μ̂₁/T} + e^{μ̂₂/T})
T → ∞: pure exploration.
T → 0: pure exploitation.
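The temperature-controlled variant above can be sketched as follows (illustrative code; the max-subtraction trick for numerical stability is mine, not from the slides, and does not change the probabilities):

```python
import math

def softmax_probs(mu_hat, temperature):
    """Softmax arm probabilities: P(arm i) = exp(mu_hat_i / T) / sum_j exp(mu_hat_j / T)."""
    scaled = [m / temperature for m in mu_hat]
    mx = max(scaled)                     # subtract max before exp for stability
    exps = [math.exp(s - mx) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# High T: nearly uniform (pure exploration); low T: mass concentrates on the best arm.
p_explore = softmax_probs([0.15, 0.10], temperature=100.0)   # ~[0.5, 0.5]
p_exploit = softmax_probs([0.15, 0.10], temperature=0.001)   # ~[1.0, 0.0]
```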


Page 26: Weakness of Softmax
Doesn't use confidence:
p̂₁ = 0.15 after 100 plays, p̂₂ = 0.1 after 100 plays.
p̂₁ = 0.15 after 100K plays, p̂₂ = 0.1 after 100K plays.
Solution: the UCB (Upper Confidence Bound) algorithm.

Page 27: UCB Algorithm
Optimism in the face of uncertainty.
At time t, construct the most optimistic estimate for each arm:
V_{i,t−1} = μ̂_{i,t−1} + √(2 log t / T_i(t−1))
Play the arm with the maximum upper bound, i.e. play I_t ∈ argmax_{i ∈ {1,…,K}} V_{i,t−1}.
Proof based on Hoeffding's inequality.
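A minimal UCB1 sketch of the rule above, assuming rewards in [0, 1]; the function name, interface, and two-arm demo are my own:

```python
import math
import random

def ucb1(pull, k, n):
    """Play each arm once, then at each step play the arm maximizing
    V_i = mu_hat_i + sqrt(2 * log(t) / T_i).  `pull(arm)` returns a reward."""
    counts = [0] * k
    values = [0.0] * k
    for arm in range(k):                 # initialization: one pull per arm
        counts[arm] = 1
        values[arm] = float(pull(arm))
    for t in range(k + 1, n + 1):
        ucb = [values[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(k)]
        arm = max(range(k), key=lambda i: ucb[i])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean
    return counts, values

means = [0.1, 0.9]                       # hypothetical two-arm example
_rng = random.Random(0)
counts, values = ucb1(lambda a: 1 if _rng.random() < means[a] else 0, k=2, n=2000)
```

Unlike softmax, the bonus term shrinks as T_i grows, so an arm estimated from 100K plays gets far less exploration than one estimated from 100 plays.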


Page 31: Results
[Figure: "Accuracy of the UCB1 Algorithm" — probability of selecting the best arm vs. time (0 to 250).]

Page 32: Theoretical Guarantee
UCB regret bound (Auer, Cesa-Bianchi, Fischer, 2002)
R_n ≤ [8 Σ_{i: μ_i < μ*} (log n / Δ_i)] + (1 + π²/3)(Σ_{i=1}^K Δ_i)
Lower bound (Lai and Robbins, 1985)
The asymptotic total regret is at least logarithmic in the number of steps:
lim_{n→∞} R_n / log n ≥ Σ_{i: Δ_i > 0} Δ_i / KL(ν_i || ν*)
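For Bernoulli arms the KL term in the lower bound is easy to evaluate. The helper below is my own illustrative code, using the five-arm example from the earlier plot:

```python
import math

def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), the divergence in the Lai-Robbins bound."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

# Lai-Robbins constant: sum over suboptimal arms of Delta_i / KL(nu_i || nu*)
means = [0.1, 0.1, 0.1, 0.1, 0.9]        # Bernoulli arm means
mu_star = max(means)
constant = sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in means if m < mu_star)
```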


Page 34: Comparison
[Figure: "Accuracy of Different Algorithms" — probability of selecting the best arm vs. time (0 to 250), one curve each for Annealing ε-Greedy, UCB1, and Annealing Softmax.]

Page 35: Summary
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 36: References
White, John. Bandit Algorithms for Website Optimization. O'Reilly, 2012.
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.