Statistical Learning in Operations Management
David Simchi-Levi
Executive Summary
➢Strategic Intent: Develop solutions to leading-edge problems for Lab partners through research that brings together data, modeling, and analysis to achieve industry-leading improvements in business performance.
➢Cross-Industry: Oil/Gas, Retail, Financial Services, Government, Insurance, Airlines, Industrial Equipment, Software
➢Global footprint: NA, EU, Asia, LA
Supply Chain Resiliency
Price Optimization
Personalized Offering
Supply Chain Digitization
Online Resource Allocation
Inventory, Transportation & Procurement Optimization
Online Learning
No data is available at the beginning of the process.
Data is generated on the fly according to some unknown model and the decisions made by the platform.
Objective: Design algorithms that maximize the accumulated reward, i.e., achieve low regret. Regret = optimal accumulated reward of a clairvoyant − collected accumulated reward.
[Diagram: in each round, the learner receives a feature $x_t$ (generated by nature), optimizes and makes a decision $a_t$ (generated by the learner), observes a reward $r_t$ generated by nature according to the unknown model $f^*(x_t, a_t)$, and learns. $f^*$ is the ground-truth reward function.]
Offline Learning
The entire data set (of i.i.d. samples) is available at the beginning.
The decision maker cannot adapt decisions to the new data.
Training data set (i.i.d.): $(x_1, a_1; r_1), (x_2, a_2; r_2), \cdots, (x_n, a_n; r_n) \sim \mathcal{D}$
[Diagram: training data set → offline learning algorithms → predictive function $\hat{f}(x, a)$]
Objective: Design algorithms that, with limited data, generate $\hat{f}$ so that with high probability $\hat{f}$ has low error compared to the ground truth $f^*$. Estimation error $= \mathbb{E}_{(x,a;r) \sim \mathcal{D}}\big[\ell\big(\hat{f}(x,a), f^*(x,a)\big)\big]$; this is the MSE when $\ell$ is the square loss, i.e., the average squared difference between the estimated values and the actual values.
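To make this concrete, here is a minimal sketch of an offline learning step using scikit-learn: fit a predictor on i.i.d. $(x, a; r)$ samples and measure its squared-error gap to the ground truth on fresh samples. The data-generating function `f_star` and the ridge model are hypothetical illustrations, not from the talk.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical ground-truth reward function f*(x, a), for illustration only.
def f_star(x, a):
    return 0.5 * x[0] + 0.3 * x[1] * (a == 1) + 0.2 * (a == 2)

# Generate n i.i.d. samples (x_i, a_i; r_i) ~ D: random contexts, random
# actions, noisy rewards with conditional mean f*(x, a).
n, K = 2000, 3
X = rng.normal(size=(n, 2))
A = rng.integers(0, K, size=n)
R = np.array([f_star(x, a) for x, a in zip(X, A)]) + 0.1 * rng.normal(size=n)

# Offline learning algorithm: least-squares fit over a linear class,
# encoding the action as one-hot features interacted with the context.
def featurize(X, A):
    onehot = np.eye(K)[A]
    inter = np.hstack([X * onehot[:, [k]] for k in range(K)])
    return np.hstack([onehot, inter])

model = Ridge(alpha=1.0).fit(featurize(X, A), R)

# Estimation error: average squared gap between f_hat and f* on fresh samples.
X2 = rng.normal(size=(n, 2))
A2 = rng.integers(0, K, size=n)
f_hat = model.predict(featurize(X2, A2))
f_true = np.array([f_star(x, a) for x, a in zip(X2, A2)])
print("estimation error (MSE vs f*):", np.mean((f_hat - f_true) ** 2))
```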
The Interplay between Online and Offline Learning
• Reducing Online Learning to Offline Learning: D. Simchi-Levi and Y. Xu (2020). Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability.
• Online Learning with Offline Data: J. Bu, D. Simchi-Levi, and Y. Xu (2019). Online Pricing with Offline Data: Phase Transition and Inverse Square Law.
Part I: Talk Outline
• Motivation and Research Question
• Technical Hurdles and Our Contribution
• The Algorithm and Theory
• Computational Experiments
A General Contextual Bandit Model
➢For round $t = 1, \cdots, T$:
• Nature generates a random context $x_t$ according to a fixed unknown distribution $D_X$
• Learner observes $x_t$ and makes a decision $a_t \in \{1, \ldots, K\}$
• Nature generates a random reward $r_t(a_t) \in [0,1]$ according to an unknown distribution with conditional mean $\mathbb{E}\big[r_t(a_t) \mid x_t = x, a_t = a\big] = f^*(x, a)$
➢We call $f^*$ the ground-truth reward function; $f^* \in F$
➢Regret: the total reward loss compared with a clairvoyant who knows $f^*$
➢In statistical learning, people use a function class $F$ to approximate $f^*$. Examples of $F$:
• Linear class / high-dimensional linear class / generalized linear models
• Non-parametric class / reproducing kernel Hilbert space (RKHS)
• Regression trees
• Neural networks
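To fix ideas, here is a minimal simulation sketch of the interaction protocol above, including the regret accounting. The environment `f_star` and the uniformly random placeholder policy are hypothetical illustrations, not part of the talk; a real learner would choose $a_t$ based on $x_t$ and past data.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 10_000, 3

# Hypothetical ground-truth reward function f*(x, a) with values in [0, 1].
def f_star(x, a):
    return np.clip(0.5 + 0.4 * x[0] * (a == 1) - 0.2 * (a == 2), 0.0, 1.0)

regret = 0.0
for t in range(T):
    x_t = rng.normal(size=2)                       # context drawn from D_X
    a_t = rng.integers(0, K)                       # placeholder policy
    mean = f_star(x_t, a_t)
    r_t = rng.binomial(1, mean)                    # bandit feedback: only r_t(a_t)
    best = max(f_star(x_t, a) for a in range(K))   # clairvoyant's expected reward
    regret += best - mean                          # expected per-round regret

print("cumulative regret of the placeholder policy:", regret)
```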
Why is the problem important and challenging?
➢Contextual bandits combine statistical learning and decision making under uncertainty
➢Contextual bandits capture two essential features of sequential decision making under uncertainty
• Bandit feedback: for each context $x_t$, the learner only observes the reward for her chosen action $a_t$; no other rewards are observed
• The learner faces a trade-off between exploration and exploitation
• Heterogeneity: the effectiveness of each action depends on the context
• The context space is huge: it is not clear how to learn across contexts for a general function class
Literature on Contextual Bandits
➢Algorithms:
• Upper Confidence Bounds (Filippi et al. 2010, Rigollet and Zeevi 2010, Abbasi-Yadkori et al. 2011, Chu et al. 2011, Li et al. 2017, …)
• Thompson Sampling (Agrawal and Goyal 2013, Russo et al. 2018, …)
• Exponential Weighting (Auer et al. 2002, McMahan and Streeter 2009, Beygelzimer et al. 2011, …)
• Oracle-based (Dudik et al. 2011, Agarwal et al. 2014, Foster et al. 2018, Foster and Rakhlin 2020, …)
• Many others …
➢Applications:
• Recommender systems (Li et al. 2010, Agarwal et al. 2016, …)
• Ride-hailing platforms (Chen et al. 2019, …)
• Dynamic pricing (Ferreira et al. 2018, …)
• Healthcare (Tewari and Murphy 2017, Bastani and Bayati 2020, …)
Relevance to Operations
➢Product recommendation:
• $K$ products
• $T$ customers arriving in a sequential manner. Each customer has a feature $x_t$ describing gender, age, shopping history, device type, etc.
• The task is to recommend a product $a_t$ (based on $x_t$) that generates revenue as high as possible
• The revenue distribution is unknown, with its conditional mean $f^*(x_t, a_t)$ to be learned
➢Personalized medicine:
• $K$ treatments / dose levels
• $T$ patients arriving in a sequential manner. Each patient has a feature $x_t$ describing her demographics, diagnosis, genes, etc.
• The task is to pick a personalized treatment (or dose level) $a_t$ (based on $x_t$) that is as effective as possible
• The efficacy is random and unknown, with the efficacy rate $f^*(x_t, a_t)$ to be learned
The Challenge
➢We are interested in contextual bandits with a general function class $F$
➢Realizability assumption: $f^* \in F$
➢Statistical challenge: How can we achieve the optimal regret for any general function class?
➢Computational challenge: How can we make the algorithm computationally efficient?
➢Classical contextual bandit approaches fail to simultaneously address the above two challenges in practice, as they typically
• Become statistically suboptimal for general $F$ (e.g., UCB variants and Thompson Sampling)
• Become computationally intractable for large $F$ (e.g., exponential weighting, elimination-based methods)
Research Question
➢Observation: Given a general function class $F$, the statistical and computational aspects of "offline regression" are well studied in ML.
➢Specifically, given i.i.d. offline data, advances in ML enable us to find a predictor $\hat{f}$ such that
• (statistically) $\hat{f}$ achieves low estimation error: support vector machines, random forests, boosting, neural nets, …
• (computationally) $\hat{f}$ can be efficiently computed: gradient descent methods
➢Can we reduce general contextual bandits to general offline regression?
➢Given $F$ and an offline regression oracle, e.g., a least-squares regression oracle
$$\arg\min_{f \in F} \sum_{t=1}^{n} \big(f(x_t, a_t) - r_t(a_t)\big)^2$$
or its regularized counterparts (e.g., Ridge and Lasso),
Challenge: Design a contextual bandit algorithm such that
• (statistically) it achieves the optimal regret whenever the offline regression oracle attains the optimal estimation error
• (computationally) it requires no more computation than calling the offline regression oracle
➢An open problem mentioned in Agarwal et al. (2012), Foster et al. (2018), Foster and Rakhlin (2020)
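A minimal sketch of what calling such a least-squares oracle over collected $(x_t, a_t, r_t)$ tuples can look like, assuming a linear class with a one-hot action encoding. The encoding and the helper name are illustrative choices, not from the talk; a regularized counterpart is a one-line swap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

def offline_regression_oracle(X, A, R, K, model=None):
    """Least-squares oracle: arg min over a linear class F of
    sum_t (f(x_t, a_t) - r_t(a_t))^2, with the action encoded as
    one-hot features interacted with the context."""
    model = model if model is not None else LinearRegression()
    onehot = np.eye(K)[A]
    feats = np.hstack([onehot] + [X * onehot[:, [k]] for k in range(K)])
    model.fit(feats, R)

    def f_hat(x, a):
        one = np.zeros(K)
        one[a] = 1.0
        row = np.concatenate([one] + [x * one[k] for k in range(K)])
        return float(model.predict(row.reshape(1, -1))[0])

    return f_hat

# Regularized counterparts are a drop-in swap, e.g.:
# f_hat = offline_regression_oracle(X, A, R, K, model=Ridge(alpha=1.0))
# f_hat = offline_regression_oracle(X, A, R, K, model=Lasso(alpha=0.01))
```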
Talk Outline
• Motivation and Research Question
• Technical Hurdles and Our Contribution
• The Algorithm and Theory
• Computational Experiments
Why is the research question so challenging?
➢Two key challenges for reducing contextual bandits to offline regression:
• 1. Statistical difficulties associated with confidence bounds
• 2. Statistical difficulties associated with analyzing dependent actions
1. Stat. difficulties with confidence bounds
➢Many classical contextual bandit algorithms, e.g., UCB and Thompson Sampling, only work with certain parametric models
➢This is because they usually rely on effective confidence bounds constructed for each $(x, a)$ pair
➢While this is possible for a simple class $F$, like the linear class, it is impossible for a general $F$
➢Foster et al. (2018) propose a computationally efficient confidence-bounds-based algorithm using an offline regression oracle
• The algorithm only has statistical guarantees under some strong distributional assumptions
2. Stat. difficulties with analyzing dependent actions
➢Translating offline estimation error guarantees to contextual bandits is a challenge
➢This is because the data collected in the learning process is not i.i.d.
• The action distribution in later rounds depends on the data from previous rounds
➢Recently, Foster and Rakhlin (2020) developed an optimal and efficient algorithm for contextual bandits assuming access to an online regression oracle
➢The online regression oracle provides statistical guarantees for an arbitrary data sequence, possibly generated by an (adaptive) adversary
➢Computationally efficient algorithms for the required online regression oracle are only known for specific function classes
• Lack of efficient algorithms for many natural function classes, e.g., sparse linear class, Hölder classes, neural networks, …
Our Contribution
➢We provide the first optimal and efficient black-box reduction from general contextual bandits to offline regression
• The algorithm is simpler and faster than existing approaches to general contextual bandits
• The design of the algorithm builds on Abe and Long (1999), Agarwal et al. (2014), and Foster and Rakhlin (2020)
• The analysis of the algorithm is highly non-trivial and reveals surprising connections between several historical approaches to contextual bandits
• Any advances in offline regression immediately translate to contextual bandits, statistically and computationally
Our Contribution
➢Our algorithm's computational complexity is much better than that of existing algorithms for a complicated $F$
➢FALCON's computational complexity is equivalent to solving a few offline regression problems
What Does "Monster" Refer To?
➢In the contextual bandit literature, "monster" refers to algorithms that require a huge amount of computation
➢Dudik M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011). Efficient optimal learning for contextual bandits.
• These authors refer to their paper as the "Monster Paper"
• Optimal regret, but requires "a monster amount of computation"
➢Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014). Taming the Monster: A fast and simple algorithm for contextual bandits.
• Optimal regret with reduced computational cost
• Requires using an offline classification oracle
➢This paper: Bypassing the Monster
• Under a weak "realizability" assumption: $f^* \in F$
Talk Outline
• Motivation and Research Question
• Technical Hurdles and Our Contribution
• The Algorithm and Theory
• Computational Experiments
The Algorithm
➢Three components:
• An epoch schedule (to exponentially save computation)
• Greedy calls to the offline regression oracle (to obtain a reward predictor)
• A sampling rule (a randomized algorithm over actions) determined by the predictor and an epoch-varying learning rate (to make decisions)
• The sampling rule was introduced by Abe and Long (1999) and adopted in Foster and Rakhlin (2020)
➢The algorithm is fast and we call it FALCON (FAst Least-squares-regression-oracle for CONtextual bandits)
[Image: a falcon, the fastest animal on earth. Source: Kirstin Fawcett, fakuto.com]
Component 1: Epoch Schedule
➢An epoch schedule $\tau_1, \tau_2, \ldots$
➢The algorithm only calls the regression oracle at the start of each epoch.
• When $\tau_m = 2^m$, it only makes $O(\log T)$ calls to the oracle over $T$ rounds
• When $T$ is known, the oracle calls can be reduced to $O(\log \log T)$ (which is not a trivial property and is useful in clinical trials)
➢This implies that the oracle is called less and less frequently as the algorithm proceeds
[Timeline: rounds $1, \ldots, T$ partitioned into epochs ending at $\tau_1, \tau_2, \ldots, \tau_{m-1}, \ldots$]
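A quick sketch of the doubling schedule and the oracle-call count it implies (illustrative only):

```python
import math

T = 1_000_000
epochs, m, tau = [], 1, 1
while tau < T:
    tau = 2 ** m               # epoch m ends at round tau_m = 2^m
    epochs.append(min(tau, T))
    m += 1

# One oracle call at the start of each epoch: O(log T) calls over T rounds.
print(f"{len(epochs)} oracle calls for T={T} (log2 T = {math.log2(T):.1f})")
```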
Component 2: Oracle Calls
➢Before the start of each epoch $m$, solve
$$\min_{f \in F} \sum_{t=1}^{\tau_{m-1}} \big(f(x_t, a_t) - r_t(a_t)\big)^2$$
via the least squares oracle, and obtain a predictor $\hat{f}_m$
➢We can replace the least squares oracle with any other offline regression oracle (e.g., regularized oracles like Ridge and Lasso)
➢What to do next for making decisions?
➢If we directly follow the predictor and choose greedy actions, the algorithm does not explore at all and may perform poorly
• We address the exploration-exploitation dilemma via sampling
Component 3: Sampling Rule
➢For each epoch $m$, we have a learning rate $\gamma_m \propto \sqrt{\tau_{m-1}}$
➢At round $t$, we compute the greedy action (the action maximizing $\hat{f}_m(x_t, \cdot)$) and randomize over actions as follows:
• The probability of selecting each non-greedy action is inversely proportional to the predicted gap between this action and the greedy action, as well as to the learning rate $\gamma_m$. This corresponds to "exploration."
• The probability of selecting the greedy action is the highest. This corresponds to "exploitation."
• The learning rate balances exploration and exploitation. The algorithm explores more at the beginning and gradually exploits more.
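In formula form, this is the inverse-gap-weighting rule of Abe and Long (1999): with greedy action $\hat{a}_t = \arg\max_a \hat{f}_m(x_t, a)$, set $p_t(a) = \frac{1}{K + \gamma_m\big(\hat{f}_m(x_t, \hat{a}_t) - \hat{f}_m(x_t, a)\big)}$ for $a \neq \hat{a}_t$, and give the remaining probability mass to $\hat{a}_t$. A minimal sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def sampling_rule(f_hat_values, gamma):
    """Inverse-gap weighting over K actions.
    f_hat_values[a] = predicted reward of action a; gamma = learning rate."""
    K = len(f_hat_values)
    greedy = int(np.argmax(f_hat_values))
    gaps = f_hat_values[greedy] - f_hat_values   # predicted gap >= 0 per action
    p = 1.0 / (K + gamma * gaps)                 # non-greedy probabilities
    p[greedy] = 0.0
    p[greedy] = 1.0 - p.sum()                    # remaining mass to the greedy action
    return p

# As gamma grows across epochs, probability mass shifts toward the greedy action.
preds = np.array([0.6, 0.5, 0.2])
for gamma in (2.0, 20.0, 200.0):
    print(gamma, np.round(sampling_rule(preds, gamma), 3))
```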
Statistical Guarantees: Finite Function Class
➢Theorem: FALCON guarantees expected regret of $\tilde{O}\big(\sqrt{KT\log|F|}\big)$ through $O(\log T)$ calls to the least-squares regression oracle. The number of oracle calls can be reduced to $O(\log \log T)$ if $T$ is known in advance.
➢Combined with an $\Omega\big(\sqrt{KT\log|F|}\big)$ lower bound (Agarwal et al. 2012), we know that our regret is minimax optimal.
Statistical Guarantees: General Function Class
➢Input (offline guarantee): Given $n$ i.i.d. samples $(x_i, a_i; r_i) \sim D$, an offline regression oracle returns an estimator $\hat{f}$ such that for all possible $D$,
$$\mathbb{E}_{(x,a;r) \sim D}\Big[\big(\hat{f}(x,a) - f^*(x,a)\big)^2\Big] \le \mathcal{E}_F(n)$$
where the estimation error guarantee $\mathcal{E}_F(n)$ depends on the number of samples $n$ and the complexity of $F$
➢Theorem: Given an offline regression oracle with estimation error $\mathcal{E}_F(n)$ for $n$ samples, FALCON guarantees expected regret of
$$\tilde{O}\Big(T\sqrt{K \cdot \mathcal{E}_F(T)}\Big)$$
through $O(\log T)$ calls to the offline regression oracle. The number of oracle calls can be reduced to $O(\log \log T)$ if $T$ is known
• Plugging in the rate-optimal $\mathcal{E}_F(n)$ ensures that the regret is optimal in terms of $T$, matching the regret lower bound proved in Foster and Rakhlin (2020).
Examples
➢When $F$ is a linear class with dimension $d$:
• The least squares estimator ensures $\mathcal{E}_F(n) = O(d/n)$
• FALCON achieves $O\big(\sqrt{KT(d + \log T)}\big)$ regret using the least squares regression oracle. While the dependence on $K$ is suboptimal, the dependence on $d$ improves over the best known algorithm by a $\log T$ factor
➢When $F$ is a linear class with sparsity $s$:
• LASSO ensures $\mathcal{E}_F(n) = \tilde{O}(s \log d / n)$ (under certain conditions on $D_X$)
• FALCON achieves $\tilde{O}\big(\sqrt{KsT\log d}\big)$ regret using LASSO as the offline oracle
➢When $F$ is a class of neural networks:
• There are many methods to find an estimator $\hat{f}$ that performs extremely well in practice
• Our results can directly transform low estimation error into the best-possible regret bound (based on such error), theoretically or empirically
➢Other examples: generalized linear models, non-parametric classes, …
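Since FALCON treats the oracle as a black box, swapping the function class amounts to swapping the regressor. A sketch using scikit-learn estimators as stand-ins (hypothetical choices, mirroring the FALCON 1–3 variants in the experiments later in the talk):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor

# Each oracle is just a regressor fit on (features(x, a), r) pairs;
# FALCON calls it as a black box, so changing F is a one-line swap.
oracles = {
    "linear / least squares": LinearRegression(),
    "linear / ridge": Ridge(alpha=1.0),
    "sparse linear / LASSO": Lasso(alpha=0.01),
    "regression trees / gradient boosting": GradientBoostingRegressor(),
}
```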
Proof Sketch
➢Translating offline estimation error guarantees to contextual bandits is a challenge
➢Data collected in the online learning process is not i.i.d.
➢Offline guarantees provide upper bounds on the "distance" between $\hat{f}$ and $f^*$ for a fixed action distribution
➢Initialization. A dual interpretation: our algorithm adaptively maintains a distribution over policies in the universal policy space $\Psi = [K]^{\mathcal{X}}$
• A policy $\pi: \mathcal{X} \mapsto [K]$ is a deterministic decision function
• Let $\pi^*$ be the true optimal policy, and $\hat{\pi}_m$ be the greedy policy induced by $\hat{f}_m$
• At epoch $m$, $\hat{f}_m$ and $\gamma_m$ induce a distribution over policies $Q_m(\cdot)$
• $Q_m(\pi) = \prod_{x \in \mathcal{X}} p_m(\pi(x) \mid x)$, where $p_m(\pi(x) \mid x)$ is the probability that the sampling rule selects action $\pi(x)$ given context $x$
Proof Sketch
Step 1. Per-round property: At each epoch $m$, given $\hat{f}_m$ and $\gamma_m$, the distribution $Q_m$ ensures that
$$\sum_{\pi \in \Psi} Q_m(\pi)\, \mathbb{E}_{D_X}\big[\hat{f}_m(x, \hat{\pi}_m(x)) - \hat{f}_m(x, \pi(x))\big] = O(K/\gamma_m)$$
(the left-hand side is our algorithm's per-round expected regret if $\hat{f}_m$ were the ground truth; this bound is directly guaranteed by the sampling rule)
Step 2. Proof by induction: At each epoch $m$, $Q_1, \ldots, Q_{m-1}$ ensure that for all $\pi \in \Psi$,
$$\mathbb{E}_{D_X}\big[f^*(x, \pi^*(x)) - f^*(x, \pi(x))\big] \le 2\, \mathbb{E}_{D_X}\big[\hat{f}_m(x, \hat{\pi}_m(x)) - \hat{f}_m(x, \pi(x))\big] + O(K/\gamma_m)$$
(the true per-round expected regret of $\pi$, on the left, is bounded via its estimated counterpart, on the right)
Step 3. Putting it together: At each epoch $m$, our per-round expected regret satisfies
$$\sum_{\pi \in \Psi} Q_m(\pi)\, \mathbb{E}_{D_X}\big[f^*(x, \pi^*(x)) - f^*(x, \pi(x))\big] \le 2 \sum_{\pi \in \Psi} Q_m(\pi)\, \mathbb{E}_{D_X}\big[\hat{f}_m(x, \hat{\pi}_m(x)) - \hat{f}_m(x, \pi(x))\big] + O(K/\gamma_m) = O(K/\gamma_m)$$
where the inequality is by Step 2 and the final equality is by Step 1.
We choose $\{\gamma_m\}$ such that Step 2 holds and Step 3 leads to the optimal accumulated regret.
A closer look at Step 2
Step 2. Proof by induction: At each epoch $m$, $Q_1, \ldots, Q_{m-1}$ ensure that for all $\pi \in \Psi$,
$$\mathbb{E}_{D_X}\big[f^*(x, \pi^*(x)) - f^*(x, \pi(x))\big] \le 2\, \mathbb{E}_{D_X}\big[\hat{f}_m(x, \hat{\pi}_m(x)) - \hat{f}_m(x, \pi(x))\big] + O(K/\gamma_m)$$
(true per-round expected regret of $\pi$ on the left; estimated per-round expected regret of $\pi$ on the right)
➢It connects $\hat{f}_m$ and $f^*$ without specifying an action distribution
• It connects the "estimated world" and the "true world"
➢It holds for the non-i.i.d. and dependent decision process (of $\{a_t\}$)
• The induction argument shows how exploration in early rounds benefits exploitation in later rounds
➢It utilizes the i.i.d. properties of $\{x_t\}$
➢It establishes a bridge from offline estimation guarantees to online decision-making guarantees
• The analysis is general and does not rely on any refined property of $F$
A Few Observations
➢The statistical guarantees hold even if the MSE loss function is replaced by a strongly convex loss function
• Generalized linear model: logistic loss function
➢Comparing FALCON to SquareCB (Foster and Rakhlin, 2020):
• FALCON assumes i.i.d. contexts; SquareCB allows for general contexts
• Computationally efficient algorithms:
• SquareCB requires computationally efficient algorithms for an online regression oracle; these are only known for specific function classes
• Many more function classes are covered by computationally efficient offline regression oracles (as required by FALCON)
• FALCON allows occasional updates; SquareCB requires continuous updates
• This is important in healthcare applications where rewards are delayed
Talk Outline
• Motivation and Research Question
• Technical Hurdles and Our Contribution
• The Algorithm and Theory
• Computational Experiments
Initial Computational Experiments
On real-world datasets for three types of problems: Multiclass Classification, Recommendation, Price Optimization.
Classification & Recommendation Datasets
➢10 multiclass classification data sets from OpenML and 2 learning-to-rank data sets from the Microsoft and Yahoo! public data set websites.
➢Reduction to online contextual bandit problems.
• 0/1 loss encoding
Retail Applications: Classification
➢Multi-class classification is a fundamental task in ML. Online multi-class classification can be used for many applications, from handwriting recognition and face recognition to customer group recognition and promotion design.
➢If we classify the customer correctly, we incur 0 loss; otherwise we incur a loss of 1.
Retailer Applications: Recommendation
➢A recommendation system seeks to predict the ranking (or rating) that a user would give to a product.
➢When a customer arrives, we want to recommend the product she likes most. If we recommend a product she ranked high, we incur a smaller loss; otherwise we incur a larger loss.
Benchmark Algorithms
➢For simplicity, we use benchmark contextual bandit algorithms implemented in Vowpal Wabbit (v8.8.0), an open-source library for online learning algorithms
• Greedy: use only the greedy action
• ε-Greedy: use the greedy action with probability (1 − ε) and uniform over all other actions
• Online Cover & Cover NU: heuristic versions of "Taming the Monster," see Agarwal et al. 2014
• Bagging & Bagging Greedy: heuristic versions of Thompson Sampling
• RegCB-elimination: a generalization of "successive elimination," see Foster et al. 2018
• RegCB-optimistic: a generalization of UCB, see Foster et al. 2018
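For reference, a minimal sketch of the simplest baseline above, ε-greedy, matching the description given (illustrative only, not the Vowpal Wabbit implementation):

```python
import numpy as np

def epsilon_greedy(predicted_rewards, epsilon, rng):
    """Pick the greedy action with probability 1 - epsilon; otherwise
    explore uniformly over the remaining actions."""
    K = len(predicted_rewards)
    greedy = int(np.argmax(predicted_rewards))
    if rng.random() < 1 - epsilon:
        return greedy
    others = [a for a in range(K) if a != greedy]
    return int(rng.choice(others))
```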
Statistically significant win-loss differences across the 12 datasets
[Table: each entry shows the statistically significant win-loss record of a row algorithm against a column algorithm; statistical significance is defined based on an approximate Z-test.]
FALCON 1 uses a linear class with a least squares estimator
FALCON 2 uses a linear class with a ridge estimator
FALCON 3 uses a regression tree class with a gradient boosting estimator
Dynamic Pricing
➢The public data set for revenue management is retrieved from CPRM, Columbia University.
➢200,000 examples of auto loans across 134 days
➢0/1 response on "apply / not apply"
➢144 features
• We selected the five most important features: CarType, Primary_FICO, Term, Competition_Rate, OneMonth.
➢We use cumulative revenue to evaluate the performance of each algorithm in the simulation.
➢Revenue performance across algorithms for two settings; below is a comparison of different algorithms in Sets 1 & 2:
[Chart: Linear Classes Across Algorithms]
➢We consider Linear / Ridge / Gradient Boosting regressions for these two settings; GBR performs better as it eliminates misspecification.
[Chart: Function Classes in FALCON]
References
➢D. Simchi-Levi and Y. Xu (2020). Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability.
• Mathematics of Operations Research, to appear.
➢D. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu (2020). Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective.
• Conference version appeared in COLT 2021.
Other Extensions
➢Xu and Zeevi (2020) extend our results to contextual bandits with infinite actions. They also introduce a new optimism-based algorithmic principle.
➢Krishnamurthy et al. (2021) extend our results to the setting where $F$ is misspecified.
➢Wei and Luo (2021) extend our results to non-stationary contextual bandits.
➢Sen et al. (2021) extend our results to contextual bandits with a combinatorial action space.
Thank you!