Taming the monster: A fast and simple algorithm for contextual bandits
PRESENTED BY Satyen Kale
Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire
Dec 15, 2015
Learning to interact: example #1
Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes treatment.
3. Patient’s health responds (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.
Learning to interact: example #2
Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to content/ads (e.g., click, “like”).
Goal: choose content/ads that yield desired user behavior.
Contextual bandit setting (i.i.d. version)
Set X of contexts/features and K possible actions.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions a_t that yield high reward.
Contextual setting: use features x_t to choose good actions a_t.
Bandit setting: r_t(a) for a ≠ a_t is not observed.
→ Exploration vs. exploitation
Learning objective and difficulties
No single action is good in all situations – need to exploit context.
Policy class Π: set of functions (“policies”) from X → [K]
(e.g., advice of experts, linear classifiers, neural networks).
Regret (i.e., relative performance) to policy class Π (see the formula below):
… a strong benchmark if Π contains a policy with high reward.
Difficulties: feedback on action only informs about subset of policies;
explicit bookkeeping is computationally infeasible when Π is large.
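Spelled out, the regret referred to above (a standard definition for this setting, using the protocol's notation) is:

\[
  \mathrm{Regret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr) \;-\; \sum_{t=1}^{T} r_t(a_t).
\]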
Arg max oracle (AMO)
Given fully-labeled data (x_1, r_1), …, (x_t, r_t), AMO returns the policy in Π with the highest total reward on that data (see the formula below).
Abstraction for efficient search of policy class Π.
In practice: implement using standard heuristics (e.g., convex relaxations, backprop) used in cost-sensitive multiclass learning algorithms.
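In symbols, the oracle's output referred to above is (just the definition of an arg max over Π, with the notation from the protocol):

\[
  \mathrm{AMO}\bigl((x_1, r_1), \dots, (x_t, r_t)\bigr) \;=\; \arg\max_{\pi \in \Pi} \sum_{s=1}^{t} r_s\bigl(\pi(x_s)\bigr).
\]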
Our results
New fast and simple algorithm for contextual bandits:
• Optimal regret bound (up to log factors).
• Amortized calls to argmax oracle (AMO) per round: far fewer than one (see below).
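Concretely, the two guarantees are of the following form (up to logarithmic factors; |Π| is the number of policies; this summarizes the headline bounds of the accompanying paper):

\[
  \mathrm{Regret}(T) = \tilde{O}\!\left(\sqrt{K T \ln |\Pi|}\right),
  \qquad
  \#\,\text{AMO calls over } T \text{ rounds} = \tilde{O}\!\left(\sqrt{K T / \ln |\Pi|}\right),
\]

so the amortized number of AMO calls per round is roughly √(K / (T ln|Π|)), far below one.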
Comparison to previous work:
• [Thompson’33]: no general analysis.
• [ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
• [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
• [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.
Note: Exp4 also works in adversarial setting.
Rest of this talk
1. Action distributions, reward estimates via inverse probability weights [oldies but goodies]
2. Algorithm for finding policy distributions that balance exploration/exploitation [new]
3. Warm-start / epoch trick [new]
Basic algorithm structure (same as Exp4)
Start with initial distribution Q_1 over policies Π.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.
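As a concrete illustration of this skeleton, a minimal Python sketch (toy choices only: the constant-policy class, the context distribution, and the uniform smoothing parameter mu are illustrative, and the policy-distribution update in step 4 is left as a stub rather than an actual solver):

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3                     # number of actions
    T = 1000                  # number of rounds
    # Toy policy class: K constant policies (policy a always plays action a).
    policies = [lambda x, a=a: a for a in range(K)]
    N = len(policies)

    Q = np.full(N, 1.0 / N)   # initial distribution Q_1 over policies
    mu = 0.05                 # uniform smoothing / minimum exploration (toy choice)

    total_reward = 0.0
    for t in range(T):
        # 0. Nature draws (x_t, r_t) from a toy distribution D.
        x_t = rng.normal(size=5)
        r_t = rng.uniform(size=K)

        # 2a. Project Q_t onto an action distribution p_t, mixed with uniform exploration.
        p_t = np.zeros(K)
        for q, pi in zip(Q, policies):
            p_t[pi(x_t)] += q
        p_t = (1 - K * mu) * p_t + mu

        # 2b/3. Draw a_t from p_t and observe only r_t(a_t) (bandit feedback).
        a_t = rng.choice(K, p=p_t)
        total_reward += r_t[a_t]

        # 4. Compute Q_{t+1}; the real algorithm solves an optimization problem here (stub).
        # Q = update(Q, x_t, a_t, r_t[a_t], p_t[a_t])

    print("average reward:", total_reward / T)

The stubbed-out step 4 is where algorithms differ: Exp4 performs multiplicative-weight updates over Π, while the new algorithm instead solves the optimization problem (OP) described next.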
Inverse probability weighting (old trick)
Importance-weighted estimate of reward from round t:
Unbiased, and has range & variance bounded by 1/p_t(a).
Can estimate total reward and regret of any policy:
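Written out (the standard inverse-probability-weighted, or IPS, estimator; the names \widehat{\mathrm{Reward}}_t and \widehat{\mathrm{Reg}}_t are introduced here for later reference):

\[
  \hat{r}_t(a) \;=\; \frac{r_t(a_t)\,\mathbf{1}\{a = a_t\}}{p_t(a_t)},
  \qquad
  \widehat{\mathrm{Reward}}_t(\pi) \;=\; \sum_{s=1}^{t} \hat{r}_s\bigl(\pi(x_s)\bigr),
  \qquad
  \widehat{\mathrm{Reg}}_t(\pi) \;=\; \max_{\pi' \in \Pi} \widehat{\mathrm{Reward}}_t(\pi') \;-\; \widehat{\mathrm{Reward}}_t(\pi).
\]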
Constructing policy distributions
Optimization problem (OP):
Find policy distribution Q such that:
Low estimated regret (LR) – “exploitation”
Low estimation variance (LV) – “exploration”
Theorem: If we obtain the policy distributions Q_t by solving (OP), then with high probability the regret after T rounds matches the optimal bound up to log factors (sketched below).
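Schematically, (OP) asks for Q satisfying constraints of roughly the following form (a sketch only: constants are elided, \widehat{\mathrm{Reg}}_t denotes the per-round average of the estimated regret introduced above, and μ_t is a smoothing parameter of order √(ln|Π| / (K t)); the precise statement is in the paper):

\[
\begin{aligned}
\text{(LR)}:\quad & \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi) \;\le\; O(K \mu_t), \\
\text{(LV)}:\quad & \forall\, \pi \in \Pi: \;\; \widehat{\mathbb{E}}_x\!\left[\frac{1}{Q^{\mu_t}(\pi(x) \mid x)}\right] \;\le\; 2K + \frac{\widehat{\mathrm{Reg}}_t(\pi)}{O(\mu_t)},
\end{aligned}
\]

where \(Q^{\mu}(a \mid x) = (1 - K\mu)\sum_{\pi:\,\pi(x)=a} Q(\pi) + \mu\) is the smoothed action distribution and \(\widehat{\mathbb{E}}_x\) is the empirical expectation over the observed contexts. With this choice of μ_t, the high-probability regret bound is of order √(KT ln|Π|) up to log factors.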
Feasibility
Feasibility of (OP): implied by minimax argument.
Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
Coordinate descent algorithm
INPUT: initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN
    UPDATE Q(π) = Q(π) + α.
  ELSE
    RETURN Q.
Above, both 0 < c < 1 and α have closed form expressions.
(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)
Claim: can check whether (LV) is violated, and find a violating policy π, by making one AMO call per iteration.
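A minimal Python sketch of this loop for a small finite policy class (illustrative only: the shrink factor c, the step size α, and the (LR)/(LV) thresholds below are placeholders rather than the paper's closed-form expressions, and a brute-force search over Π stands in for the single AMO call):

    import numpy as np

    def smoothed_probs(Q, policies, x, K, mu):
        """Smoothed action distribution Q^mu(. | x) induced by policy weights Q."""
        p = np.full(K, mu)
        for q, pi in zip(Q, policies):
            p[pi(x)] += (1 - K * mu) * q
        return p

    def coordinate_descent(reg_hat, contexts, policies, K, mu, max_iters=1000):
        """Sketch of the (OP) solver: rescale Q when (LR) is violated, bump the
        weight of a violating policy when (LV) is violated, stop when neither holds."""
        N = len(policies)
        Q = np.zeros(N)                  # sub-distribution: may sum to < 1
        lr_budget = 2.0 * K * mu         # illustrative (LR) threshold
        alpha = mu                       # illustrative step size (not the paper's closed form)
        c = 0.9                          # illustrative shrink factor (not the paper's closed form)
        for _ in range(max_iters):
            # (LR) check: estimated regret of the mixture must stay small.
            if Q @ reg_hat > lr_budget:
                Q = c * Q
                continue
            # (LV) check: empirical inverse-probability (variance proxy) vs. its budget.
            # A real implementation finds a violating policy with ONE AMO call;
            # here we brute-force over the small finite class.
            violations = []
            for i, pi in enumerate(policies):
                inv_prob = np.mean([1.0 / smoothed_probs(Q, policies, x, K, mu)[pi(x)]
                                    for x in contexts])
                violations.append(inv_prob - (2 * K + reg_hat[i] / mu))
            worst = int(np.argmax(violations))
            if violations[worst] > 0:
                Q[worst] += alpha
            else:
                return Q                 # both constraints hold (for these toy thresholds)
        return Q

    # Toy usage: 3 actions, constant policies, random contexts and regret estimates.
    rng = np.random.default_rng(0)
    K = 3
    policies = [lambda x, a=a: a for a in range(K)]
    contexts = [rng.normal(size=4) for _ in range(50)]
    reg_hat = rng.uniform(0.0, 0.2, size=K)
    Q = coordinate_descent(reg_hat, contexts, policies, K, mu=0.05)
    print("policy weights:", Q, "sum:", round(Q.sum(), 3))

In the real algorithm the inner search for a violating policy is done with one AMO call on suitably reweighted data, which is what keeps the method tractable when Π is huge.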
Iteration bound for coordinate descent
# steps of coordinate descent is bounded, which also bounds the sparsity of Q.
Analysis via a potential function argument.
Warm-start
If we warm-start coordinate descent (initializing with Q_t to get Q_{t+1}), then the total number of coordinate descent iterations needed over all T rounds is small.
Caveat: need one AMO call/round to even check if (OP) is solved.
Epoch trick
Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption).
Therefore the same Q_t can be used for O(t) more rounds!
Epoching: split the T rounds into epochs; solve (OP) once per epoch.
Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …
Total of O(log T) updates, so overall # AMO calls unchanged (up to log factors).
Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …
Total of O(T^{1/2}) updates, each requiring AMO calls on average.
Experiments
Algorithm       | Epsilon-greedy | Bagging | Linear UCB | “Online Cover” | [Supervised]
Loss            | 0.095          | 0.059   | 0.128      | 0.053          | 0.051
Time (seconds)  | 22             | 339     | 212000     | 17             | 6.9
Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss.
“Online Cover” = variant with stateful AMO.