The UCB Algorithm

Paper: Finite-time Analysis of the Multiarmed Bandit Problem, by Auer, Cesa-Bianchi, Fischer, Machine Learning 47, 2002

Presented by Markus Enzenberger. Go Seminar, University of Alberta. March 14, 2007


Introduction
  Multiarmed Bandit Problem
  Regret
  Lai and Robbins (1985)
  In this Paper

Main Results
  Theorem 1 (UCB1)
  Theorem 2 (UCB2)
  Theorem 3 (εn-GREEDY)
  Theorem 4 (UCB1-NORMAL)
  Independence Assumptions

Experiments
  UCB1-TUNED
  Setup
  Best value for α
  Summary of Results
  Comparison on Distribution 11

Conclusions


Multiarmed Bandit Problem

- Example of the exploration vs. exploitation dilemma
- K independent gambling machines (armed bandits)
- Each machine has an unknown stationary probability distribution for generating the reward
- Observed rewards when playing machine i: $X_{i,1}, X_{i,2}, \ldots$
- Policy A chooses the next machine based on the previous sequence of plays and rewards
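
The setting above can be sketched as a small simulation. This is only an illustration under stated assumptions: Bernoulli rewards are one possible choice of stationary distribution, and the names `BernoulliBandit`, `run`, and the `policy(history, num_machines)` signature are hypothetical, not taken from the paper.

```python
import random


class BernoulliBandit:
    """K independent machines; machine i pays 1 with fixed probability means[i]."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # stationary and unknown to the policy
        self.rng = random.Random(seed)  # seeded so that runs are reproducible

    @property
    def num_machines(self):
        return len(self.means)

    def play(self, i):
        """Return the next reward of machine i (0.0 or 1.0)."""
        return 1.0 if self.rng.random() < self.means[i] else 0.0


def run(policy, bandit, n):
    """Let `policy` pick a machine for each of n plays; return the play counts."""
    history = []                        # (machine, reward) pairs observed so far
    counts = [0] * bandit.num_machines
    for _ in range(n):
        i = policy(history, bandit.num_machines)
        history.append((i, bandit.play(i)))
        counts[i] += 1
    return counts
```

A policy here is any function of the previous plays and rewards, for example a round-robin baseline `lambda history, k: len(history) % k`.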


Regret

Regret of policy A after n plays:

$$\mu^* n - \sum_{j=1}^{K} \mu_j \, E[T_j(n)]$$

$\mu_i$: expectation of machine i
$\mu^*$: expectation of the optimal machine
$T_j(n)$: number of times machine j was played during the first n plays

The regret is the expected loss after n plays due to the fact that the policy does not always play the optimal machine.
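
As a small worked example of this definition, the sketch below evaluates the regret from the machine expectations and the expected play counts. The function names are hypothetical. The second function uses the equivalent form "sum over machines of (gap to the best machine) times (expected plays)", which follows because the play counts sum to n.

```python
def expected_regret(means, expected_counts):
    """mu* n - sum_j mu_j E[T_j(n)], where n is the total number of plays."""
    n = sum(expected_counts)
    best = max(means)
    return best * n - sum(m * t for m, t in zip(means, expected_counts))


def regret_from_gaps(means, expected_counts):
    """Equivalent form: sum_j (mu* - mu_j) E[T_j(n)]."""
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, expected_counts))
```

With expectations 0.9 and 0.6 and expected play counts 80 and 20, both forms give a regret of about 6: only the 20 plays of the worse machine contribute, each losing 0.3 in expectation.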


Lai and Robbins (1985)

Policy for a class of reward distributions (including normal, Bernoulli, Poisson) with regret asymptotically bounded by the logarithm of n:

$$E[T_j(n)] \le \frac{\ln(n)}{D(p_j \| p^*)} \qquad (n \to \infty)$$

$D(p_j \| p^*)$: Kullback-Leibler divergence between the reward densities

- This is the best possible regret
- Policy computes an upper confidence index for each machine
- Needs the entire sequence of rewards for each machine
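
To give a feeling for the size of this bound, the sketch below evaluates it for Bernoulli rewards, where the Kullback-Leibler divergence has a closed form. Restricting to Bernoulli machines is an assumption made for the illustration, and the function names are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """D(p || q) for Bernoulli rewards with success probabilities p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_plays(p_j, p_star, n):
    """Asymptotic value ln(n) / D(p_j || p*) for plays of suboptimal machine j."""
    return math.log(n) / kl_bernoulli(p_j, p_star)
```

For a suboptimal machine with success probability 0.6 against an optimal one with 0.9, the divergence is roughly 0.31, so the bound allows only about 22 plays of the worse machine among the first 1000.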


In this Paper

- Show policies with logarithmic regret uniformly over time
- Policies are simple and efficient
- Notation: $\Delta_i := \mu^* - \mu_i$


Theorem 1 (UCB1)

Policy with finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support
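
The policy itself is not reproduced in this transcript. As described in the Auer et al. paper, UCB1 plays each machine once and then always plays the machine maximizing the sample mean plus sqrt(2 ln n / n_j), where n is the number of plays so far and n_j the number of plays of machine j. The following is a minimal sketch under the assumption that rewards lie in [0, 1]; the function name and the `pull(j)` callback are hypothetical.

```python
import math


def ucb1(pull, num_machines, n):
    """Run UCB1 for n plays; pull(j) must return a reward in [0, 1]."""
    counts = [0] * num_machines  # n_j: number of plays of machine j
    sums = [0.0] * num_machines  # total reward obtained from machine j
    for t in range(1, n + 1):
        if t <= num_machines:
            j = t - 1            # initialization: play each machine once
        else:
            # t - 1 plays have been made so far; pick the largest upper bound
            j = max(
                range(num_machines),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t - 1) / counts[i]),
            )
        sums[j] += pull(j)
        counts[j] += 1
    return counts
```

Only the play counts and reward sums are stored per machine, so the policy does not need the entire reward sequence.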


- $E[T_j(n)] \le \frac{8}{\Delta_j^2} \ln(n)$, worse than Lai and Robbins
- $D(p_j \| p^*) \ge 2\Delta_j^2$ with best possible constant 2
  → UCB2 brings the main constant arbitrarily close to $\frac{1}{2\Delta_j^2}$


Theorem 2 (UCB2)

More complicated version of UCB1 with better constants for the bound on regret.


- First term brings the constant arbitrarily close to $\frac{1}{2\Delta_j^2}$ for small $\alpha$
- $c_\alpha \to \infty$ as $\alpha \to 0$
- Let $\alpha = \alpha_n$ slowly decrease
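
The UCB2 policy is not reproduced in this transcript. As described in the Auer et al. paper, it plays the selected machine in epochs: with τ(r) = ⌈(1+α)^r⌉, the machine maximizing the sample mean plus sqrt((1+α) ln(e n / τ(r_j)) / (2 τ(r_j))) is played τ(r_j + 1) − τ(r_j) times, and then its epoch counter r_j is incremented. The sketch below follows that description with a fixed α; the function names and the `pull(j)` callback are hypothetical, and rewards are assumed to lie in [0, 1].

```python
import math


def tau(r, alpha):
    """Epoch boundary tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1.0 + alpha) ** r)


def bonus(n, r, alpha):
    """Confidence radius a_{n,r} after n plays, for a machine in epoch r."""
    t = tau(r, alpha)
    return math.sqrt((1.0 + alpha) * math.log(math.e * n / t) / (2.0 * t))


def ucb2(pull, num_machines, n, alpha=0.001):
    """Run UCB2 for n plays; pull(j) must return a reward in [0, 1]."""
    counts = [0] * num_machines
    sums = [0.0] * num_machines
    epochs = [0] * num_machines  # r_j: number of epochs played by machine j
    played = 0

    def play(j):
        nonlocal played
        sums[j] += pull(j)
        counts[j] += 1
        played += 1

    for j in range(min(num_machines, n)):  # initialization: play each once
        play(j)
    while played < n:
        j = max(
            range(num_machines),
            key=lambda i: sums[i] / counts[i] + bonus(played, epochs[i], alpha),
        )
        length = tau(epochs[j] + 1, alpha) - tau(epochs[j], alpha)
        for _ in range(min(length, n - played)):  # epoch may have length 0
            play(j)
        epochs[j] += 1
    return counts
```

For small α many early epochs have length zero, so the loop only advances the epoch counter until τ grows; this is why the sketch allows empty epochs instead of forcing at least one play.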


Theorem 3 (εn-GREEDY)

Similar result for the ε-greedy heuristic (ε needs to go to 0; constant ε has linear regret).


- For c large enough, the bound is of order $c/(d^2 n) + o(1/n)$ → logarithmic bound on regret
- Bound on instantaneous regret
- Need to know a lower bound d on the difference in expectation between the best and second-best machine
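
The policy behind these bullets can be sketched as follows. In the Auer et al. paper the exploration probability at play n is ε_n = min{1, cK/(d² n)}: with probability ε_n a random machine is played, otherwise the machine with the highest current average reward. The function names, the `pull(j)` callback, and the convention that an unplayed machine has average 0 are assumptions made for this illustration.

```python
import random


def epsilon_n(n, c, d, num_machines):
    """Exploration probability min{1, cK / (d^2 n)} at play n."""
    return min(1.0, c * num_machines / (d * d * n))


def epsilon_n_greedy(pull, num_machines, n, c, d, seed=0):
    """Run epsilon_n-greedy for n plays and return the play counts."""
    rng = random.Random(seed)
    counts = [0] * num_machines
    sums = [0.0] * num_machines

    def average(i):
        # assumption: a machine that was never played has average reward 0
        return sums[i] / counts[i] if counts[i] else 0.0

    for t in range(1, n + 1):
        if rng.random() < epsilon_n(t, c, d, num_machines):
            j = rng.randrange(num_machines)           # explore uniformly
        else:
            j = max(range(num_machines), key=average)  # exploit best average
        sums[j] += pull(j)
        counts[j] += 1
    return counts
```

Because ε_n decays like 1/n, the expected number of exploration plays grows only logarithmically, which matches the logarithmic regret bound above.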


Theorem 4 (UCB1-NORMAL)

Index-based policy with logarithmically bounded finite-time regret for normally distributed rewards with unknown mean and variance.


- Like UCB1, but since the kind of distribution is known, the sample variance is used to estimate the variance of the distribution
- Proof depends on bounds for the tails of the χ² and Student distributions, which were only verified numerically
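
The selection rule is not reproduced in this transcript. As described in the Auer et al. paper, at play n UCB1-NORMAL first plays any machine with fewer than ⌈8 ln n⌉ plays, and otherwise maximizes the sample mean plus sqrt(16 · (q_j − n_j x̄_j²)/(n_j − 1) · ln(n − 1)/n_j), where q_j is the sum of squared rewards of machine j. The sketch below follows that description; the function name is hypothetical, and requiring at least two plays per machine (so that the sample variance is defined) is an assumption added here.

```python
import math


def ucb1_normal_choice(counts, sums, sums_sq):
    """Choose the next machine from plays n_j, reward sums and squared sums."""
    n = sum(counts) + 1  # index of the upcoming play
    # assumption: at least 2 plays so the sample variance below is defined
    min_plays = max(2, math.ceil(8.0 * math.log(n)))
    for j, n_j in enumerate(counts):
        if n_j < min_plays:
            return j     # forced exploration of under-sampled machines

    def index(j):
        n_j = counts[j]
        mean = sums[j] / n_j
        # sample variance, clipped at 0 to absorb rounding errors
        var = max(0.0, (sums_sq[j] - n_j * mean * mean) / (n_j - 1))
        return mean + math.sqrt(16.0 * var * math.log(n - 1) / n_j)

    return max(range(len(counts)), key=index)
```

A machine with a lower sample mean can still be chosen when its sample variance is large, since the confidence radius scales with the estimated standard deviation.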


Independence Assumptions

- Theorems 1–3 also hold for rewards that are not independent across machines: $X_{i,s}$ and $X_{j,t}$ might be dependent for any s, t and $i \ne j$
- The rewards of a single machine do not need to be independent and identically distributed. Weaker assumption: $E[X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}] = \mu_i$ for all $1 \le t \le n$


UCB1-TUNED

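Fine-tuned version of UCB1 taking the measured variance into account (no proven regret bounds).

Upper confidence bound on the variance of machine j, after s plays of machine j during the first t plays:

$$V_j(s) := \left( \frac{1}{s} \sum_{\tau=1}^{s} X_{j,\tau}^2 \right) - \bar{X}_{j,s}^2 + \sqrt{\frac{2 \ln t}{s}}$$

Replace the upper confidence bound $\sqrt{2 \ln(n) / n_j}$ in UCB1 by

$$\sqrt{\frac{\ln n}{n_j} \min\{1/4,\, V_j(n_j)\}}$$

1/4 is an upper bound on the variance of a Bernoulli random variable.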
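
As an illustration of the tuned index, the sketch below follows the definitions in the Auer et al. paper: the variance bound is the sample variance of machine j plus sqrt(2 ln n / n_j), and the confidence radius is sqrt((ln n / n_j) · min{1/4, variance bound}). The function names and the argument layout (play count, reward sum, sum of squared rewards, total plays) are hypothetical.

```python
import math


def variance_ucb(n_j, sum_j, sum_sq_j, n):
    """Upper confidence bound on the variance of machine j after n total plays."""
    mean = sum_j / n_j
    return sum_sq_j / n_j - mean * mean + math.sqrt(2.0 * math.log(n) / n_j)


def ucb1_tuned_index(n_j, sum_j, sum_sq_j, n):
    """Sample mean plus the tuned confidence radius."""
    v = min(0.25, variance_ucb(n_j, sum_j, sum_sq_j, n))
    return sum_j / n_j + math.sqrt(math.log(n) / n_j * v)
```

Since the factor min{1/4, V_j} is at most 1/4, the tuned radius is always smaller than the UCB1 radius sqrt(2 ln n / n_j), so the policy explores less, and less still for machines whose rewards vary little.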


Setup

Distributions


Best value for α

- UCB2 is relatively insensitive to the choice of α, as long as α is small
- Use fixed α = 0.001


Summary of Results

- An optimally tuned εn-GREEDY almost always performs best
- Performance of a poorly tuned εn-GREEDY degrades rapidly
- In most cases UCB1-TUNED performs comparably to a well-tuned εn-GREEDY
- UCB1-TUNED is not sensitive to the variances of the machines
- UCB2 performs similarly to UCB1-TUNED, but always slightly worse


Comparison on Distribution 11


Conclusions

- Simple, efficient policies for the bandit problem on any set of reward distributions with known bounded support, with uniform logarithmic regret
- Based on upper confidence bounds (with the exception of εn-GREEDY)
- Robust with respect to the introduction of moderate dependencies
- Many extensions of this work are possible
  - Generalize to non-stationary problems
  - Based on Gittins allocation indices (needs preliminary knowledge or learning of the indices)
