The Multi-Armed Bandit Problem
Sumeet Katariya
Electrical and Computer Engineering
December 7, 2013
Slides: homepages.cae.wisc.edu/~sumeet/files/banditsslides.pdf
Page 1: Title

Page 2: Outline
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 3: A/B Testing

Page 4: Exploration vs. Exploitation
Scientist view: explore new ideas.
Businessman view: exploit the best idea found so far.

Page 5: Terminology
Pulling an arm = making a choice (which ad/color to display)
Reward/regret = measure of success (user click, item purchase)

Page 6: Problem Formulation
K arms 1, …, K.
Arm i has reward distribution ν_i(x), x ∈ [0, 1], with mean μ_i. Think Bernoulli(p_i).
The ν_i are unknown.
Finite time horizon (number of arm pulls) n.
At time t, the player chooses arm I_t ∈ {1, …, K}; the environment returns reward g_{I_t, t} ~ ν_{I_t}.
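The formulation above can be sketched as a tiny simulator. This is illustrative code in Python; the class name and interface are my own, not from the slides:

```python
import random

class BernoulliBandit:
    """K arms; arm i pays reward 1 with probability p_i (the slides' Bernoulli(p_i)).
    The means are hidden from the player; only pulls reveal rewards."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # the unknown mu_i
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        """Return reward g_{I_t, t} ~ Bernoulli(p_arm)."""
        return 1 if self.rng.random() < self.means[arm] else 0

bandit = BernoulliBandit([0.1, 0.1, 0.1, 0.1, 0.9])
rewards = [bandit.pull(4) for _ in range(1000)]   # repeatedly pull the best arm
```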


Page 11: Definitions
Define i* = argmax_{i=1,…,K} μ_i and μ* = max_{i=1,…,K} μ_i.
Δ_i = μ* − μ_i,   T_i(n) = Σ_{t=1}^n 1{I_t = i}.
Cumulative regret: R̂_n = Σ_{t=1}^n g_{i*, t} − Σ_{t=1}^n g_{I_t, t}.
Objective
Find the best arm.
Minimize the expected regret:
R_n = E R̂_n = n μ* − E Σ_{i=1}^K T_i(n) μ_i = Σ_{i=1}^K Δ_i E T_i(n)
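The regret decomposition above can be checked numerically. The arm means and pull counts below are hypothetical, chosen only to illustrate the identity:

```python
# Check: n*mu* - sum_i T_i(n)*mu_i  equals  sum_i Delta_i * T_i(n)
means = [0.1, 0.1, 0.1, 0.1, 0.9]     # hypothetical mu_i
mu_star = max(means)
deltas = [mu_star - m for m in means]  # Delta_i = mu* - mu_i

pulls = [100, 50, 30, 20, 800]         # hypothetical T_i(n)
n = sum(pulls)

pseudo_regret = n * mu_star - sum(t * m for t, m in zip(pulls, means))
gap_form = sum(d * t for d, t in zip(deltas, pulls))
assert abs(pseudo_regret - gap_form) < 1e-9
```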


Page 15: Outline
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 16: Clarification
Objectively vs. subjectively best options
Objectively best: the option that is truly best (as known to an oracle).
Subjectively best: the option that has performed best in the past.
Exploitation vs. exploration
Exploitation: choose the subjectively best arm.
Exploration: choose anything else.


Page 18: ε-Greedy Algorithm
Strategy = ε · Scientist + (1 − ε) · Businessman
At each time t:
With probability 1 − ε, pick the subjectively best arm.
With probability ε, pick an arm uniformly at random (each arm is chosen with probability ε/K).
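The strategy above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the function name, interface, and the two-arm demo means are my own:

```python
import random

def epsilon_greedy(pull, k, n, eps=0.1, seed=0):
    """With prob. 1-eps exploit the empirically (subjectively) best arm;
    with prob. eps explore uniformly, so each arm is picked with prob. eps/k.
    `pull(arm)` returns a reward in [0, 1]."""
    rng = random.Random(seed)
    counts = [0] * k        # T_i(t): pulls of each arm
    values = [0.0] * k      # mu_hat_i: empirical mean reward of each arm
    history = []
    for _ in range(n):
        if rng.random() < eps:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda i: values[i])  # exploit
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]    # incremental mean
        history.append(arm)
    return counts, values, history

means = [0.1, 0.9]          # hypothetical two-arm example
_rng = random.Random(1)
counts, values, _ = epsilon_greedy(
    lambda a: 1 if _rng.random() < means[a] else 0, k=2, n=2000, eps=0.1, seed=2)
```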

Page 19: Probability of Selecting Best Arm
5 Bernoulli arms with reward probabilities 0.1, 0.1, 0.1, 0.1, 0.9
[Figure: "Accuracy of the Epsilon Greedy Algorithm" — probability of selecting the best arm vs. time (0 to 250), one curve per ε ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.]
ε = 0.1 (Businessman): learns slowly, does well at the end.
ε = 0.5 (Scientist): learns quickly, doesn't exploit at the end.

Page 20: Theoretical Guarantee
Weakness: constant ε. Solution: annealing.
Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002)
Let Δ = min_{i: Δ_i > 0} Δ_i and consider ε_t = min(6K / (Δ² t), 1).
When t ≥ 6K/Δ², the probability of choosing a suboptimal arm i is bounded by C / (Δ² t), for some constant C > 0.
As a consequence, E[T_i(n)] ≤ (C/Δ²) log n and
R_n ≤ Σ_{i: Δ_i > 0} (C Δ_i / Δ²) log n → logarithmic regret.
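The annealing schedule from the theorem can be sketched as below. Note Δ is unknown in practice, so this is illustrative only; real deployments tune a decaying schedule rather than plugging in Δ:

```python
def annealed_epsilon(t, k, delta):
    """Exploration probability from the theorem: eps_t = min(6K / (Delta^2 * t), 1).
    t is 1-based; delta is the smallest positive gap Delta."""
    return min(6.0 * k / (delta ** 2 * t), 1.0)

# Early on the schedule always explores (eps_t = 1);
# once t >= 6K/Delta^2 the exploration rate decays like 1/t.
k, delta = 5, 0.8
threshold = 6 * k / delta ** 2
```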


Page 23: Weakness of ε-Greedy
Exploration is insensitive to relative performance levels:
Two arms with rewards 0.9 and 0.1.
Two arms with rewards 0.15 and 0.1.
Solution: the Softmax algorithm.

Page 24: Softmax Algorithm
Idea:
P(arm 1) = μ̂₁ / (μ̂₁ + μ̂₂),   P(arm 2) = μ̂₂ / (μ̂₁ + μ̂₂)
Variant:
P(arm 1) = e^{μ̂₁/T} / (e^{μ̂₁/T} + e^{μ̂₂/T}),   P(arm 2) = e^{μ̂₂/T} / (e^{μ̂₁/T} + e^{μ̂₂/T})
T → ∞: pure exploration.
T → 0: pure exploitation.
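The temperature-controlled variant above can be sketched as follows (illustrative code; the max-subtraction trick for numerical stability is mine, not from the slides, and does not change the probabilities):

```python
import math

def softmax_probs(mu_hat, temperature):
    """Softmax arm probabilities: P(arm i) = exp(mu_hat_i / T) / sum_j exp(mu_hat_j / T)."""
    scaled = [m / temperature for m in mu_hat]
    mx = max(scaled)                     # subtract max before exp for stability
    exps = [math.exp(s - mx) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# High T: nearly uniform (pure exploration); low T: mass concentrates on the best arm.
p_explore = softmax_probs([0.15, 0.10], temperature=100.0)   # ~[0.5, 0.5]
p_exploit = softmax_probs([0.15, 0.10], temperature=0.001)   # ~[1.0, 0.0]
```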


Page 26: Weakness of Softmax
Doesn't use confidence:
p̂₁ = 0.15 after 100 plays, p̂₂ = 0.1 after 100 plays.
p̂₁ = 0.15 after 100K plays, p̂₂ = 0.1 after 100K plays.
Solution: the UCB (Upper Confidence Bound) algorithm.

Page 27: UCB Algorithm
Optimism in the face of uncertainty.
At time t, construct the most optimistic estimate for each arm:
V_{i,t−1} = μ̂_{i,t−1} + √(2 log t / T_i(t−1))
Play the arm with the maximum upper bound, i.e. play I_t ∈ argmax_{i ∈ {1,…,K}} V_{i,t−1}.
Proof based on Hoeffding's inequality.
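A minimal UCB1 sketch of the rule above, assuming rewards in [0, 1]; the function name, interface, and two-arm demo are my own:

```python
import math
import random

def ucb1(pull, k, n):
    """Play each arm once, then at each step play the arm maximizing
    V_i = mu_hat_i + sqrt(2 * log(t) / T_i).  `pull(arm)` returns a reward."""
    counts = [0] * k
    values = [0.0] * k
    for arm in range(k):                 # initialization: one pull per arm
        counts[arm] = 1
        values[arm] = float(pull(arm))
    for t in range(k + 1, n + 1):
        ucb = [values[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(k)]
        arm = max(range(k), key=lambda i: ucb[i])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean
    return counts, values

means = [0.1, 0.9]                       # hypothetical two-arm example
_rng = random.Random(0)
counts, values = ucb1(lambda a: 1 if _rng.random() < means[a] else 0, k=2, n=2000)
```

Unlike softmax, the bonus term shrinks as T_i grows, so an arm estimated from 100K plays gets far less exploration than one estimated from 100 plays.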


Page 31: Results
[Figure: "Accuracy of the UCB1 Algorithm" — probability of selecting the best arm vs. time (0 to 250).]

Page 32: Theoretical Guarantee
UCB regret bound (Auer, Cesa-Bianchi, Fischer, 2002)
R_n ≤ [8 Σ_{i: μ_i < μ*} (log n / Δ_i)] + (1 + π²/3)(Σ_{i=1}^K Δ_i)
Lower bound (Lai and Robbins, 1985)
The asymptotic total regret is at least logarithmic in the number of steps:
lim_{n→∞} R_n / log n ≥ Σ_{i: Δ_i > 0} Δ_i / KL(ν_i || ν*)
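For Bernoulli arms the KL term in the lower bound is easy to evaluate. The helper below is my own illustrative code, using the five-arm example from the earlier plot:

```python
import math

def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), the divergence in the Lai-Robbins bound."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

# Lai-Robbins constant: sum over suboptimal arms of Delta_i / KL(nu_i || nu*)
means = [0.1, 0.1, 0.1, 0.1, 0.9]        # Bernoulli arm means
mu_star = max(means)
constant = sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in means if m < mu_star)
```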


Page 34: Comparison
[Figure: "Accuracy of Different Algorithms" — probability of selecting the best arm vs. time (0 to 250), one curve each for Annealing ε-Greedy, UCB1, and Annealing Softmax.]

Page 35: Summary
1 Motivation
2 Mathematical Model
3 Algorithms: ε-Greedy Algorithm, Softmax Algorithm, Upper Confidence Bound Algorithm

Page 36: References
White, John. Bandit Algorithms for Website Optimization. O'Reilly, 2012.
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.