Adaptive Sequential Experiments
with Unknown Information Arrival Processes
Yonatan Gur
Stanford University
Ahmadreza Momeni∗
Stanford University
April 26, 2020
Abstract
Sequential experiments are often designed to strike a balance between maximizing immediate payoffs based on available information, and acquiring new information that is essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied, typically when at each time epoch feedback is received only on the action that was selected at that epoch. However, in many practical settings, including product recommendations, dynamic pricing, retail management, and health care, additional information may become available between decision epochs. We introduce a generalized MAB formulation in which auxiliary information may appear arbitrarily over time. By obtaining matching lower and upper bounds, we characterize the minimax complexity of this family of problems as a function of the information arrival process, and study how salient characteristics of this process impact policy design and achievable performance. In terms of achieving optimal performance, we establish that: (i) upper confidence bound and posterior sampling policies possess natural robustness with respect to the information arrival process without any adjustments, which uncovers a novel property of these popular families of policies and further lends credence to their appeal; and (ii) policies with exogenous exploration rate do not possess such robustness. For such policies, we devise a novel virtual time indices method for dynamically controlling the effective exploration rate. We apply our method for designing εt-greedy-type policies that, without any prior knowledge on the information arrival process, attain the best performance (in terms of regret rate) that is achievable when the information arrival process is a priori known. We use data from a large media site to analyze the value that may be captured in practice by leveraging auxiliary information for designing content recommendations.
Keywords: sequential experiments, data-driven decisions, product recommendation systems, online learning, adaptive algorithms, multi-armed bandits, exploration-exploitation, minimax regret.
1 Introduction
1.1 Background and motivation
Effective design of sequential experiments requires balancing
between new information acquisition
(exploration), and optimizing payoffs based on available
information (exploitation). A well-studied
∗The authors are grateful to Omar Besbes and Lawrence M. Wein for their valuable comments. An initial version of this work, including some preliminary results, appeared in Gur and Momeni (2018). Correspondence: [email protected], [email protected].
framework that captures the trade-off between exploration and
exploitation is that of multi-armed
bandits (MAB) that first emerged in Thompson (1933) in the
context of drug testing, and was later
extended by Robbins (1952) to a more general setting. In this
framework, an agent repeatedly chooses
between different actions (arms), where each action is
associated with an a priori unknown reward
distribution, and the objective is to maximize cumulative return
over a certain time horizon.
The MAB framework focuses on balancing exploration and
exploitation under restrictive assumptions
on the future information collection process: in every round
noisy observations are collected only on the
action that was selected at that period. Correspondingly, policy
design is typically predicated on that
information collection procedure. However, in many practical
settings to which bandit models have been
applied for designing sequential experimentation, additional
information might be realized and leveraged
over time. Examples of such settings include dynamic pricing
(Bastani et al. 2019, Bu et al. 2019), retail
management (Zhang et al. 2010), clinical trials (Bertsimas et
al. 2016, Anderer et al. 2019), as well as
many machine learning domains (see a list of relevant settings
in the survey by Pan and Yang 2009).
To illustrate one particular such setting, consider the problem
of designing product recommendations
with limited data. Product recommendation systems are widely
deployed over the web with the objective
of helping users navigate through content and consumer products
while increasing volume and revenue for
service and e-commerce platforms. These systems commonly apply
techniques that leverage information
such as explicit and implicit user preferences, product
consumption, and consumer ratings (see, e.g., Hill
et al. 1995, Konstan et al. 1997, Breese et al. 1998). While
effective when ample relevant information
is available, these techniques tend to under-perform when
encountering products that are new to the
system and have little or no trace of activity. This phenomenon,
known as the cold-start problem, has
been documented and studied in the literature; see, e.g., Schein
et al. (2002), Park and Chu (2009).
In the presence of new products, a recommendation system needs
to strike a balance between
maximizing instantaneous performance indicators (such as
revenue) and collecting valuable information
that is essential for optimizing future recommendations. To
illustrate this tradeoff (see Figure 1), consider
a consumer (A) that is searching for a product using the organic
recommendation system of an online
retailer (e.g., Amazon). The consumer provides key product
characteristics (e.g., “headphones” and
“bluetooth”), and based on this description (and, perhaps,
additional factors such as the browsing history
of the consumer) the retailer recommends products to the
consumer. While some of the candidate
products that fit the desired description may have already been
recommended many times to consumers,
and the mean returns from recommending them are known, other
candidate products might be new brands
that were not recommended to consumers before. Evaluating the
mean returns from recommending new
products requires experimentation that is essential for
improving future recommendations, but might be
Figure 1: Additional information in product recommendations. In the depicted scenario, consumers of type (A) sequentially arrive to an online retailer’s organic recommendation system to search for products that match the terms “headphones” and “bluetooth.” Based on these terms, the retailer can recommend one of two products: an old brand that was already recommended many times to consumers, and a new brand that was never recommended before. In parallel, consumers of type (B) arrive to the new brand’s product page directly from a search engine (e.g., Google) by searching for “headphones,” “bluetooth,” and “Denisy” (the name of the new brand).
costly if consumers are less likely to purchase these new
products.
Several MAB settings were suggested and applied for designing
recommendation algorithms that
effectively balance information acquisition and instantaneous
revenue maximization, where arms represent
candidate recommendations; see the overview in Madani and
DeCoste (2005), as well as later studies by
Agarwal et al. (2009), Caron and Bhagat (2013), Tang et al.
(2014), and Wang et al. (2017). Aligned
with traditional MAB frameworks, these studies consider settings
where in each time period information
is obtained only on items that are recommended at that period,
and suggest recommendation policies
that are predicated on this information collection process.
However, in many instances additional
browsing and consumption information may be maintained in
parallel to the sequential recommendation
process, as a significant fraction of website traffic may take
place through means other than its organic
recommender system; see browsing data analysis in Sharma et al.
(2015) that estimate that product
recommendation systems are responsible for at most 30% of site
traffic and revenue, and that a substantial fraction of the remaining traffic arrives to product pages directly
from search engines, as well as Besbes et al.
(2016) that report similar findings for content recommendations
in media sites.
To illustrate an additional source of traffic and revenue
information, let us return to Figure 1 and
consider a different consumer (B) that does not use the organic
recommender system of the retailer,
but rather arrives to a certain product page directly from a
search engine (e.g., Google) by searching
for that particular brand. While consumers (A) and (B) may have
inherently different preferences, the
actions taken by consumer (B) after arriving to the product page
(e.g., whether or not she bought the
product) can be informative about the returns from recommending
that product to other consumers.
This additional information could potentially be used to improve
the performance of recommendation
algorithms, especially when the browsing history that is
associated with some products is limited.
The above example illustrates that additional information often
becomes available between decision
epochs of recommender systems, and that this information might
be essential for designing effective
product recommendations. In comparison with the restrictive
information collection process that is
assumed in most MAB formulations, the availability of such
auxiliary information (and even the prospect
of obtaining it in the future) might impact the achievable
performance and the design of sequential
experiments and learning policies. When additional information
is available, one may potentially need to “sacrifice” fewer decisions for exploration in order to obtain
effective estimators for the mean rewards.
While this intuition suggests that exploration can be reduced in
the presence of additional information, it
is a priori not clear what type of performance improvement can
be expected given different information
arrival processes, and how policy design should depend on these
processes.
Moreover, monitoring exploration rates in real time in the
presence of an arbitrary information arrival
process introduces additional challenges that have distinct
practical relevance. Most importantly, an
optimal exploration rate may depend on several characteristics
of the information arrival process, such
as the amount of information that arrives and the time at which
it appears (e.g., early on versus later
on along the decision horizon). Since it may be hard to predict
upfront the salient characteristics of the
information arrival process, an important challenge is to
identify the extent to which it is possible to
adapt in real time to an a priori unknown information arrival
process, in the sense of approximating the
performance that is achievable under prior knowledge of the
sample path of information arrivals. This
paper is concerned with addressing these challenges.
The research questions that we study in this paper are along two
lines. First, we study how the
information arrival process and its characteristics impact the
performance one could aspire to when
designing a sequential experiment. Second, we study what type of
policies one should adopt for designing
sequential experiments in the presence of general information
arrival processes, especially when the
structure of these processes is a priori unknown.
1.2 Main contributions
The main contribution of this paper lies in (1) introducing a
new, generalized MAB framework with
unknown and arbitrary information arrival process; (2)
characterizing the minimax regret complexity of
this broad class of problems and, by doing so, providing a sharp
criterion for identifying policies that are
rate-optimal in the presence of auxiliary information arrivals;
and (3) identifying policies (and proposing
new ones) that are rate-optimal in the presence of general and a
priori unknown information arrival
processes. More specifically, our contribution is along the
following lines.
(1) Modeling. To capture the salient features of settings with
auxiliary information arrivals we
formulate a new class of MAB problems that generalizes the
classical stochastic MAB framework, by
relaxing strong assumptions that are typically imposed on the
information collection process. Our
formulation considers a broad class of distributions that are
informative about mean rewards, and allows
observations from these distributions to arrive at arbitrary and
a priori unknown rates and times. Our
model therefore captures a large variety of real-world
phenomena, yet maintains tractability that allows
for a crisp characterization of the performance improvement that
can be expected when more information
might become available throughout the sequential decision
process.
(2) Analysis of achievable performance. We establish lower
bounds on the performance that is
achievable by any non-anticipating policy, where performance is
measured in terms of regret relative to
the performance of an oracle that constantly selects the action
with the highest mean reward. We further
show that our lower bounds can be achieved through suitable
policy design. These results identify the
minimax complexity associated with the MAB problem with unknown
information arrival processes, as a
function of the underlying information arrival process itself
(together with other problem characteristics
such as the length of the problem horizon). This provides a
yardstick for evaluating the performance
of policies, and a sharp criterion for identifying rate-optimal
ones. Our results identify a spectrum of
minimax regret rates ranging from the classical regret rates
that appear in the stochastic MAB literature
when there is no auxiliary information, to regret that is
uniformly bounded over time when information
arrives frequently and/or early enough.
(3) Policy design. We establish that policies that are based on
posterior sampling and upper confidence
bounds may leverage additional information naturally, without
any prior knowledge on the information
arrival process. In that sense, policies such as Thompson
sampling and UCB exhibit remarkable
robustness with respect to the information arrival process:
without any adjustments, these policies
guarantee the best performance that can be achieved (in terms of
regret rate). Moreover, the regret rate
that is guaranteed by these policies cannot be improved even
with prior knowledge of the information
arrival process. The “best of all worlds” type of guarantee that
is established implies that Thompson
sampling and UCB achieve rate optimality uniformly over a
general class of information arrival processes;
Figure 2 visualizes the regret incurred by these policies as the
amount of auxiliary information increases.
This result identifies a new important property of these popular
policies, generalizes the class of problems
to which they have been applied thus far, and further lends
credence to their appeal.
Nevertheless, we observe that policies with an exogenous
exploration rate do not exhibit such robustness.
For such policies, we devise a novel virtual time indices method
for endogenizing the exploration rate.
We apply this method for designing εt-greedy-type policies that,
without any prior knowledge on the
Figure 2: Impact of stochastic information arrivals (see §3.1.1) with different rates, λ, of auxiliary information arrivals: little auxiliary data (λ = 0.001), more auxiliary data (λ = 0.01), and a lot of auxiliary data (λ = 0.05). The top plots depict regret incurred by UCB1 tuned by c = 1.0 with (aUCB1) and without (UCB1) accounting for auxiliary information arrivals. The bottom plots depict regret incurred by Thompson sampling tuned by c = 0.5 with (aTS) and without (TS) accounting for auxiliary observations. The setup consists of horizon length T = 10^4, three arms, and Gaussian rewards with means µ = (0.7, 0.5, 0.5)⊤ and standard deviation σ = 0.5. Auxiliary observations are i.i.d. samples from the reward distributions. Plots are averaged over 400 replications.
information arrival process, uniformly attain the best
performance (in terms of regret rate) that is
achievable when the information arrival process is a priori
known. Figure 3 uses the settings described
in Figure 2 to visualize the impact of adjusting the exploration
rate of an εt-greedy policy using virtual
time indices as the amount of auxiliary information increases.
The virtual time indices method adjusts
the exploration rate of the policy in real time, through
replacing the time index used by the policy
with virtual time indices that are dynamically updated over
time. Whenever auxiliary information on
a certain action arrives, the virtual time index associated with
that action is carefully advanced to
effectively reduce the rate at which the policy is experimenting
with that action.
Figure 3: The information arrival settings described in Figure 2, and performance of εt-greedy-type policies (tuned by c = 1.0) that: ignore auxiliary observations (EG); use this information to adjust estimates of mean rewards without appropriately adjusting the exploration rate of the policy (nEG); and adjust reward estimates and dynamically adjust exploration rates using virtual time indices (aEG).
Finally, using data from a large media site, we analyze the
value that may be captured by leveraging
auxiliary information in order to recommend content with unknown
impact on the future browsing path
of readers. The auxiliary information we utilize consists of users who arrived at media pages directly from search, and the browsing paths of these users once they arrived at those pages.
1.3 Related work
Multi-armed bandits. Since its inception, the MAB framework has
been adopted for studying a
variety of applications including clinical trials (Zelen 1969),
strategic pricing (Bergemann and Välimäki
1996), assortment selection (Caro and Gallien 2007), online
auctions (Kleinberg and Leighton 2003),
online advertising (Pandey et al. 2007), and product
recommendations (Madani and DeCoste 2005, Li
et al. 2010). For a comprehensive overview of MAB formulations
we refer the readers to Berry and
Fristedt (1985) and Gittins et al. (2011) for Bayesian / dynamic
programming formulations, as well as
Cesa-Bianchi and Lugosi (2006) and Bubeck and Cesa-Bianchi
(2012) that cover the machine learning
literature and the so-called adversarial setting. A sharp regret
characterization for the more traditional
formulation (random rewards realized from stationary
distributions), often referred to as the stochastic
MAB problem, was first established by Lai and Robbins (1985),
followed by analysis of policies such as
εt-greedy, UCB1, and Thompson sampling; see, e.g., Auer et al.
(2002) and Agrawal and Goyal (2013a).
The MAB framework focuses on balancing exploration and
exploitation under restrictive assumptions
on the future information collection process. Correspondingly,
optimal policy design is typically predicated
on the assumption that at each period a reward observation is
collected only on the action that is
selected at that time period (exceptions to this common
information collection process are discussed
below). In that sense, such policy design does not account for
information that may become available
between decision epochs, and that might be essential for
achieving good performance in many practical
settings. In the current paper, we relax the assumptions on the
information arrival process in the MAB
framework by allowing arbitrary information arrival processes.
Our focus is on studying the impact of
the information arrival characteristics (such as frequency and
timing) on achievable performance and
policy design, especially when the information arrival process
is a priori unknown.
A few MAB settings deviate from the information collection process
that is described above, but these
settings consider specific information arrival processes that
are known a priori to the agent, as opposed
to our formulation where the information arrival process is
arbitrary and a priori unknown. One example
includes the so-called contextual MAB setting, also referred to
as bandit problem with side observations
(Wang et al. 2005), or associative bandit problem (Strehl et al.
2006), where at each trial the decision
maker observes a context carrying information about other arms.
(In Appendix E we demonstrate
that the policy design approach we advance here could be applied
to contextual MAB settings as well.)
Another example is the full-information adversarial MAB setting,
where rewards are arbitrary and
can even be selected by an adversary (Auer et al. 1995, Freund
and Schapire 1997). In this setting,
after each decision epoch the agent has access to a reward
observation from each arm (not only the one
that was selected). The adversarial nature of this setting makes
it fundamentally different in terms of
achievable performance, analysis, and policy design, from the
stochastic formulation that is adopted
in this paper. Another related example, which could be viewed as
a special case of our framework, is
the setting of Shivaswamy and Joachims (2012), who consider a
stochastic MAB problem where initial
reward observations are available at the beginning of the
horizon.
Balancing and regulating exploration. Several papers have
considered settings of dynamic opti-
mization with partial information and distinguished between
cases where myopic policies guarantee
optimal performance, and cases where exploration is essential,
in the sense that myopic policies may lead
to incomplete learning and large losses; see, e.g., Araman and
Caldentey (2009), Farias and Van Roy
(2010), Harrison et al. (2012), and den Boer and Zwart (2013)
for dynamic pricing without knowing
the demand function, Huh and Rusmevichientong (2009) and Besbes
and Muharremoglu (2013) for
inventory management without knowing the demand distribution,
and Lee et al. (2003) in the context
of technology development. Bastani et al. (2017) consider the
contextual MAB framework and show
that if the distribution of contexts guarantees sufficient
diversity, then exploration becomes unnecessary
and greedy policies can leverage the natural exploration that is
embedded in the information diversity
to achieve asymptotic optimality. Relatedly, Woodroofe (1979)
and Sarkar (1991) consider a Bayesian
one-armed contextual MAB problem and show that a myopic policy
is asymptotically optimal when
the discount factor converges to one. Considering sequential
recommendations to customers that make
decisions based on the relevance of recommendations, Bastani et
al. (2018) show that classical MAB
policies may over-explore and propose proper modifications for
those policies.
On the other hand, a few papers have studied cases where
exploration is not only essential but should
be enhanced in order to maintain optimality. For example,
Cesa-Bianchi et al. (2006) introduce a partial
monitoring setting where after playing an arm the agent does not
observe the incurred loss but only a
limited signal about it, and show that such feedback structure
requires higher exploration rates. Besbes
et al. (2019) consider a general framework where the reward
distribution may change over time according
to a budget of variation, and characterize the manner in which
optimal exploration rates increase as a
function of said budget. Shah et al. (2018) consider a platform
in which the preferences of arriving users
may depend on the experience of previous users, show that in
this setting classical MAB policies may
under-explore, and propose a balanced-exploration approach that
leads to optimal performance.
The above studies demonstrate a variety of practical settings
where the extent of exploration that is
required to maintain optimality strongly depends on particular
problem characteristics that may often be
a priori unknown to the decision maker. This introduces the
challenge of dynamically endogenizing the
rate at which a decision-making policy explores to approximate
the best performance that is achievable
under ex ante knowledge of the underlying problem
characteristics. In this paper we address this
challenge from an information collection perspective. We
identify conditions on the information arrival
process that guarantee the optimality of myopic policies, and
further identify adaptive MAB policies
that guarantee rate-optimality without prior knowledge on the
information arrival process.
Recommender systems. An active stream of literature has been
studying recommender systems,
focusing on modelling and maintaining connections between users
and products; see, e.g., (Ansari et al.
2000), the survey by (Adomavicius and Tuzhilin 2005), and a book
by (Ricci et al. 2011). One key
element that impacts the performance of recommender systems is
the often limited data that is available.
Focusing on the prominent information acquisition aspect of the
problem, several studies (to which we
referred earlier) have addressed sequential recommendation
problems using a MAB framework where at
each time period information is obtained only on items that are
recommended at that period. Another
approach is to identify and leverage additional sources of
relevant information. Following that avenue,
Farias and Li (2019) consider the problem of estimating
user-item propensities, and propose a method
to incorporate auxiliary data such as browsing and search
histories to enhance the predictive power of
recommender systems. While their work is concerned with the impact
of auxiliary information in an offline
prediction context, our paper focuses on the impact of auxiliary
information streams on the design,
information acquisition, and appropriate exploration rate in a
sequential experimentation framework.
2 Problem formulation
We formulate a class of multi-armed bandit problems with
auxiliary information arrivals. We note that
various modeling assumptions can be generalized and are made
only to simplify exposition and analysis.
Some model assumptions, as well as few extensions of our model,
are discussed in §2.1.
Let K = {1, . . . ,K} be a set of arms (actions) and let T = {1,
. . . , T} denote a sequence of decision
epochs. At each time period t ∈ T , a decision maker selects one
of the K arms. When selecting an
arm k ∈ K at time t ∈ T , a reward Xk,t ∈ R is realized and
observed. For each t ∈ T and k ∈ K,
the reward Xk,t is assumed to be independently drawn from some σ²-sub-Gaussian distribution with mean µk.1 We denote the profile of rewards at time t by Xt = (X1,t, . . . , XK,t)⊤ and the profile of mean rewards by µ = (µ1, . . . , µK)⊤. We further denote by ν = (ν1, . . . , νK)⊤ the distribution of the rewards
1A real-valued random variable X is said to be sub-Gaussian if there is some σ > 0 such that for every λ ∈ R one has Ee^{λ(X−EX)} ≤ e^{σ²λ²/2}. This broad class of distributions includes, for instance, Gaussian random variables, as well as any random variable with a bounded support (if X ∈ [a, b] then X is ((a−b)²/4)-sub-Gaussian) such as Bernoulli random variables. In particular, if a random variable is σ²-sub-Gaussian, it is also σ̃²-sub-Gaussian for all σ̃ > σ.
profile Xt. We assume that rewards are independent across time
periods and arms. We denote the
highest expected reward and the best arm by µ∗ and k∗, respectively, that is:2

$$\mu^* = \max_{k \in \mathcal{K}} \{\mu_k\}, \qquad k^* = \arg\max_{k \in \mathcal{K}} \mu_k.$$

We denote by ∆k = µ∗ − µk the difference between the expected reward of the best arm and the expected reward of arm k, and by ∆ a lower bound such that 0 < ∆ ≤ min_{k ∈ K\{k∗}} ∆k.
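As a concrete, hypothetical instance of these primitives, the following Python snippet instantiates K = 3 Gaussian arms with the illustrative means used in Figure 2 and computes µ∗, k∗, and the gaps ∆k; all numerical values are ours, not part of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([0.7, 0.5, 0.5])    # mean rewards (illustrative, as in Figure 2)
sigma = 0.5                        # Gaussian rewards are sigma^2-sub-Gaussian
K, T = len(mu), 10_000

mu_star = mu.max()                 # mu* = max_k mu_k
k_star = int(mu.argmax())          # k* (ties broken toward the smaller index)
gaps = mu_star - mu                # Delta_k = mu* - mu_k
Delta = gaps[gaps > 0].min()       # any 0 < Delta <= min_{k != k*} Delta_k works

# Reward profiles X_t = (X_{1,t}, ..., X_{K,t}), independent across periods
X = rng.normal(mu, sigma, size=(T, K))
```

Here ∆ = 0.2 is the common gap of the two suboptimal arms, and each row of `X` is one reward profile Xt.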
Information arrival process. Before each round t, the agent may or may not observe auxiliary information for some of the arms without pulling them. For each period t and arm k, we denote by hk,t ∈ {0, 1} the indicator of observing auxiliary information on arm k between decision epochs t − 1 and t. We denote by ht = (h1,t, . . . , hK,t)⊤ the vector of auxiliary information indicators at period t, and by H = (h1, . . . , hT) the information arrival matrix; we assume that this matrix is independent of the policy’s actions. If hk,t = 1, then between decision epochs t − 1 and t the agent observes a random variable Yk,t ∼ ν′k. For each arm k we assume that there exists a mapping φk : R → R through which an unbiased estimator for µk can be constructed, that is, E[φk(Yk,t)] = µk, and that φk(Yk,t) is σ̂²-sub-Gaussian for some σ̂ > 0. We denote Yt = (Y1,t, . . . , YK,t)⊤, and assume that the random variables Yk,t are independent across time periods and arms, and are also independent from reward realizations. We denote the vector of information received between epochs t − 1 and t by Zt = (Z1,t, . . . , ZK,t)⊤, where for any k one has Zk,t = hk,t · φk(Yk,t).
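The following sketch instantiates one concrete arrival process of this kind: stationary Bernoulli arrivals with rate λ, as in the numerical examples of Figure 2, with identity mappings φk. The rate and all other values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([0.7, 0.5, 0.5])
sigma, lam, T = 0.5, 0.01, 10_000
K = len(mu)

# Information arrival matrix H: h_{k,t} ~ Bernoulli(lam), i.i.d. and
# independent of the policy's actions
H = (rng.random((K, T)) < lam).astype(int)

# Auxiliary observations Y_{k,t} ~ nu'_k; with identity mappings phi_k,
# E[phi_k(Y_{k,t})] = mu_k holds by construction
Y = rng.normal(mu[:, None], sigma, (K, T))
Z = np.where(H == 1, Y, 0.0)       # Z_{k,t} = h_{k,t} * phi_k(Y_{k,t})
```

With λ = 0.01 roughly λKT = 300 auxiliary observations arrive over the horizon, spread uniformly in time; other choices of H (e.g., arrivals concentrated early on) fit the same interface.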
Example 1 (Linear mappings) A special case of the information
arrival processes we formalized
consists of observations of independent random variables that
are linear mappings of rewards. That is,
there exist vectors (α1, . . . , αK) ∈ R^K and (β1, . . . , βK) ∈ R^K such that for all k and t,

E[βk + αk Yk,t] = µk.
This simple class of mappings illustrates the possibility of
utilizing available data in many domains,
including maximizing conversions of product recommendations (for
example, in terms of purchases of
recommended products). The practicality of using linear mappings
for improving product recommendations
will be demonstrated empirically in §6 using data from a large
media site.
Admissible policies, performance, and regret. Let U be a random variable defined over a probability space (U, U, Pu). Let πt : R^{t−1} × R^{K×t} × {0, 1}^{K×t} × U → K for t = 1, 2, 3, . . . be measurable functions (with some abuse of notation we also denote the action at time t by πt ∈ K) given by

2For simplicity, when using the arg min and arg max operators we assume ties to be broken in favor of the smaller index.
$$\pi_t = \begin{cases} \pi_1(Z_1, h_1, U) & t = 1, \\ \pi_t\left(X_{\pi_{t-1},t-1}, \ldots, X_{\pi_1,1}, Z_t, \ldots, Z_1, h_t, \ldots, h_1, U\right) & t = 2, 3, \ldots \end{cases}$$

The mappings {πt : t = 1, . . . , T}, together with the distribution Pu, define the class of admissible
policies denoted by P . Policies in P depend only on the past
history of actions and observations as well
as auxiliary information arrivals, and allow for randomization
via their dependence on U . We denote
by S = S(∆, σ², σ̂²) the class that includes pairs of allowed reward distribution profiles (ν, ν′).
We evaluate the performance of a policy π ∈ P by the regret it
incurs under information arrival
process H relative to the performance of an oracle that selects
the arm with the highest expected reward.
We define the worst-case regret as follows:
$$\mathcal{R}^{\pi}_{\mathcal{S}}(H, T) = \sup_{(\nu,\nu') \in \mathcal{S}} \mathbb{E}^{\pi}_{(\nu,\nu')}\left[\sum_{t=1}^{T} \left(\mu^* - \mu_{\pi_t}\right)\right], \qquad (1)$$

where the expectation E^π_{(ν,ν′)}[·] is taken with respect to the noisy rewards and noisy auxiliary observations, as well as to the policy’s actions (throughout the paper we will denote by P^π_{(ν,ν′)}, E^π_{(ν,ν′)}, and R^π_{(ν,ν′)} the probability, expectation, and regret when the arms are selected according to policy π, rewards are distributed according to ν, and auxiliary observations are distributed according to ν′). We note that
regret is incurred over decisions made in epochs t = 1, . . .
,T; the main distinction relative to classical
regret formulations is that in (1) the mappings {πt; t = 1, . .
. , T} can be measurable with respect to the
σ-field that also includes information that arrives between
decision epochs, as captured by the matrix H.
We denote by R∗_S(H, T) = inf_{π∈P} R^π_S(H, T) the best achievable guaranteed performance: the minimal regret that can be guaranteed by an admissible policy π ∈ P. In the following sections we study the magnitude of R∗_S(H, T) as a function of the information arrival process H.
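For a known instance (ν, ν′), the inner sum in (1) can be computed empirically from a policy's action sequence, as in the following sketch; the uniformly random policy here is just a hypothetical stand-in for some admissible π ∈ P.

```python
import numpy as np

rng = np.random.default_rng(11)
mu = np.array([0.7, 0.5, 0.5])     # mean rewards of the instance
T = 1_000

# Actions pi_t of a uniformly random (and therefore poor) policy
actions = rng.integers(len(mu), size=T)

# Cumulative gap between the oracle's mean reward mu* and the mean
# reward of each chosen arm: sum_t (mu* - mu_{pi_t})
regret = float(np.sum(mu.max() - mu[actions]))
```

For this uniform policy the expected regret is T · (∆₂ + ∆₃)/3 ≈ 133 at T = 1000, whereas the oracle that always selects k∗ incurs zero; any reasonable learning policy should land between these extremes.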
2.1 Discussion on model assumptions and extensions
Known mappings. A key modelling assumption in our framework is that the mappings {φk}, through which auxiliary observations may carry information on reward distributions, are a priori known to the decision maker. The interpretability of feedback signals for the purpose of forming mean reward estimates is a simplifying modeling assumption that is essential
estimates is a simplifying modeling assumption that is essential
for tractability in many online dynamic
optimization frameworks, and in particular in the MAB
literature.3 Therefore, our model could be
viewed as extending prior work by relaxing assumptions on the
information collection process that are
3Typically in the MAB literature, feedback simply consists of the reward observations. However, even when the relation between feedback and the reward distribution is more general, e.g., in Russo et al. (2018), it is assumed to be a priori known.
common in the MAB literature, while offering assumptions on the
information structure (and what is
known about it) that are comparable to what is assumed in
existing literature.
From a practical perspective, in many settings firms might be
able to estimate the mappings {φk}
from existing data. In the context of product recommendations,
similar products might exhibit similar
mappings from the same source of auxiliary information; for
example, conversion rates of customers that
arrive at similar products directly from search might be mapped
to conversion rates from recommendations
through similar mappings. In §6 we use data from a large US
media site to evaluate the additional
value that might be captured by utilizing one specific form of
auxiliary information, for which the
mappings {φk} are unknown upfront, and are estimated from
existing data. Finally, the challenge of
adapting to unknown mappings {φk} is further discussed in
§7.
Extension to contextual MAB. Our model extends the stochastic
MAB framework of Lai and
Robbins (1985) to allow for auxiliary information arrivals. We nevertheless note that it can be applied to more general frameworks, including ones where the connection between auxiliary observations and mean rewards is established through the parametric structure of the problem rather than through the mappings {φk}. In
Appendix E we demonstrate this by extending the model of
Goldenshluger and Zeevi (2013) where mean
rewards linearly depend on stochastic context vectors, to derive
performance bounds in the presence of
an unknown information arrival process. This extension also
captures additional problem features that
are relevant to various application domains, such as
idiosyncratic consumer characteristics.
Multiple information arrivals. For simplicity we assume that
only one information arrival can occur
between consecutive time periods for each arm (that is, hk,t ∈
{0, 1} for each time t and arm k). In fact,
all our results hold without any adjustments when there are more
than one information arrivals per time
step per arm (allowing any integer values for the entries
{hk,t}).
Endogenous information arrival processes. We focus on a setting
where the information arrival
process (the matrix H) is exogenous and does not depend on the
sequence of decisions. In Appendix D
we study policy design and achievable performance under a broad
class of information arrival processes
that are reactive to the past decisions of the policy; see also
related discussion in §7.
Non-separable mean rewards. For the sake of simplicity we refer
to the lower bound ∆ on the
differences in mean rewards relative to the best arm as a fixed
parameter that is independent of the
horizon length T . This corresponds to the case of separable
mean rewards, which is prominent in the
stochastic MAB literature. Nevertheless, we do not make any
explicit assumption on the separability
of mean rewards and our analysis and results hold for the more
general case where ∆ is a decreasing
function of the horizon length T .
3 Impact on achievable performance
In this section we study the impact that the information arrival process may have on the performance that
one could aspire to achieve. Our first result formalizes what
cannot be achieved, establishing a lower
bound on the best achievable performance as a function of the
information arrival process.
Theorem 1 (Lower bound on the best achievable performance) For any T ≥ 1 and information arrival matrix H, the worst-case regret for any admissible policy π ∈ P is bounded from below as follows:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\ge\; \frac{C_1}{\Delta}\sum_{k=1}^{K}\log\left(\frac{C_2\Delta^{2}}{K}\sum_{t=1}^{T}\exp\left(-C_3\Delta^{2}\sum_{s=1}^{t}h_{k,s}\right)\right),$$

where C1, C2, and C3 are positive constants that depend only on σ and σ̂.
The precise expressions of C1, C2, and C3 are provided in the
discussion below. Theorem 1 establishes a
lower bound on the achievable performance in the presence of an
information arrival process. This lower
bound depends on an arbitrary sample path of information
arrivals, captured by the elements of the
matrix H. In that sense, Theorem 1 provides a spectrum of bounds
on achievable performances, mapping
many potential information arrival trajectories to the best
performance they allow. In particular, when
there is no additional information over what is assumed in the
classical MAB setting, one recovers a
lower bound of order K∆ log T that coincides with the bounds
established in Lai and Robbins (1985) and
Bubeck et al. (2013) for that setting. Theorem 1 further
establishes that when additional information is
available, achievable regret rates may become lower, and that
the impact of information arrivals on the
achievable performance depends on the frequency of these
arrivals, but also on the time at which these
arrivals occur; we further discuss these observations in
§3.1.
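To make the bound concrete, the following is a minimal Python sketch that evaluates the Theorem 1 expression for an arbitrary arrival matrix H. The constants C1, C2, C3 are placeholders (the theorem only states they depend on σ and σ̂), so only the qualitative comparison between arrival trajectories is meaningful:

```python
import math

def theorem1_lower_bound(H, Delta, C1=1.0, C2=1.0, C3=1.0):
    """Evaluate the Theorem 1 lower bound for a K x T arrival matrix H.

    C1, C2, C3 stand in for the unspecified constants that depend on
    sigma and sigma_hat; the qualitative behavior is driven by H and Delta.
    """
    K, T = len(H), len(H[0])
    total = 0.0
    for k in range(K):
        cum, inner = 0, 0.0
        for t in range(T):
            cum += H[k][t]                          # sum_{s <= t} h_{k,s}
            inner += math.exp(-C3 * Delta ** 2 * cum)
        total += math.log(C2 * Delta ** 2 / K * inner)
    return (C1 / Delta) * total

K, T, Delta = 2, 10_000, 0.5
H_none = [[0] * T for _ in range(K)]                     # classical MAB: no arrivals
H_early = [[1] * 50 + [0] * (T - 50) for _ in range(K)]  # 50 early arrivals per arm
print(theorem1_lower_bound(H_none, Delta) > theorem1_lower_bound(H_early, Delta))
```

With no arrivals, the inner sum equals T and the bound scales like (K/∆) log T; fifty early arrivals per arm collapse the inner sum to a constant, illustrating how early information lowers the achievable-regret floor.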
Key ideas in the proof. The proof of Theorem 1 adapts to our
framework ideas of identifying a
worst-case nature “strategy.” While the full proof is deferred
to the appendix, we next illustrate its key
ideas using the special case of two arms. We consider two
possible profiles of reward distributions, ν and
ν ′, that are “close” enough in the sense that it is hard to
distinguish between the two, but “separated”
enough such that a considerable regret may be incurred when the
“correct” profile of distributions is
misidentified. In particular, we assume that the decision maker is a priori informed that the first arm generates rewards according to a normal distribution with standard deviation σ and a mean that is either −∆ (according to ν) or +∆ (according to ν′), and that the second arm generates rewards according to a normal distribution with standard deviation σ and mean zero. To quantify a notion of distance between the
possible profiles of reward distributions we use the
Kullback-Leibler (KL) divergence. The KL divergence
between two positive measures ρ and ρ′ with ρ absolutely
continuous with respect to ρ′, is defined as:
$$\mathrm{KL}(\rho,\rho') \;:=\; \int \log\left(\frac{d\rho}{d\rho'}\right)d\rho \;=\; \mathbb{E}_{\rho}\left[\log\frac{d\rho}{d\rho'}(X)\right],$$

where Eρ denotes the expectation with respect to the probability measure ρ. Using Lemma 2.6 from Tsybakov
(2009) that connects the KL divergence to error probabilities,
we establish that at each period t the
probability of selecting a suboptimal arm must be at least
$$p^{\mathrm{sub}}_{t} \;=\; \frac{1}{4}\exp\left(-\frac{2\Delta^{2}}{\sigma^{2}}\left(\mathbb{E}_{\nu,\nu'}[\tilde{n}_{1,T}] + \sum_{s=1}^{t}\frac{\sigma^{2}}{\hat{\sigma}^{2}}h_{1,s}\right)\right),$$
where ñ1,t denotes the number of times the first arm is pulled up to time t. Each selection of a suboptimal arm contributes ∆ to the regret, and therefore the cumulative regret must be at least $\Delta\sum_{t=1}^{T} p^{\mathrm{sub}}_{t}$. We further observe that if arm 1 has a mean reward of −∆, the cumulative regret must also be at least ∆ · Eν,ν′[ñ1,T]. Therefore the regret is bounded from below by

$$\frac{\Delta}{2}\left(\sum_{t=1}^{T} p^{\mathrm{sub}}_{t} + \mathbb{E}_{\nu,\nu'}[\tilde{n}_{1,T}]\right) \;\ge\; \frac{\sigma^{2}}{4\Delta}\log\left(\frac{\Delta^{2}}{2\sigma^{2}}\sum_{t=1}^{T}\exp\left(-\frac{2\Delta^{2}}{\hat{\sigma}^{2}}\sum_{s=1}^{t}h_{1,s}\right)\right).$$

The argument can be repeated by switching arms 1 and 2.
For K arms, we follow the lines above and average over the established bounds to obtain:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\ge\; \frac{\sigma^{2}(K-1)}{4K\Delta}\sum_{k=1}^{K}\log\left(\frac{\Delta^{2}}{\sigma^{2}K}\sum_{t=1}^{T}\exp\left(-\frac{2\Delta^{2}}{\hat{\sigma}^{2}}\sum_{s=1}^{t}h_{k,s}\right)\right),$$

which establishes the result.
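For completeness, the Gaussian KL computation behind the exponent above (a standard calculation that the sketch leaves implicit) is:

```latex
\mathrm{KL}\bigl(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\bigr)
  = \frac{(\mu_1-\mu_2)^2}{2\sigma^2},
\qquad\text{so}\qquad
\mathrm{KL}\bigl(\mathcal{N}(-\Delta,\sigma^2)\,\|\,\mathcal{N}(\Delta,\sigma^2)\bigr)
  = \frac{(2\Delta)^2}{2\sigma^2} = \frac{2\Delta^2}{\sigma^2},
```

which is exactly the factor 2∆2/σ2 that multiplies the observation counts in the exponent of psub_t.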
3.1 Discussion and subclasses of information arrival
processes
Theorem 1 suggests that auxiliary information arrivals may be
leveraged to reduce regret rates, and
that their impact on the achievable performance increases when
information arrives more frequently,
and earlier. This observation is consistent with the following intuition: (i) at early time periods we have collected only a few observations, and therefore the marginal impact of an additional observation on the stochastic error rates is large; and (ii) when information appears early on, there are more future opportunities where this information can be used. To emphasize this observation, we next demonstrate
the implications on achievable performance of two information
arrival processes of natural interest: a
process with a fixed arrival rate, and a process with a
decreasing arrival rate.
3.1.1 Stationary information arrival process
Assume that the hk,t's are i.i.d. Bernoulli random variables with mean λ. Then, for any T ≥ 1 and admissible policy π ∈ P, one obtains the following lower bounds on the achievable performance:

1. If λ ≤ σ̂2/(4∆2T), then

$$\mathbb{E}_{H}\left[\mathcal{R}^{\pi}_{\mathcal{S}}(H,T)\right] \;\ge\; \frac{\sigma^{2}(K-1)}{4\Delta}\log\left(\frac{(1-e^{-1/2})\Delta^{2}T}{K}\right).$$

2. If λ ≥ σ̂2/(4∆2T), then

$$\mathbb{E}_{H}\left[\mathcal{R}^{\pi}_{\mathcal{S}}(H,T)\right] \;\ge\; \frac{\sigma^{2}(K-1)}{4\Delta}\log\left(\frac{1-e^{-1/2}}{2\lambda K\sigma^{2}/\hat{\sigma}^{2}}\right).$$
This class includes instances in which, on average, information
arrives at a constant rate λ. Analyzing
these arrival processes reveals two different regimes. When the
information arrival rate is low enough,
auxiliary observations become essentially ineffective, and one
recovers the performance bounds that were
established for the classical stochastic MAB problem. In
particular, as long as there are no more than
order ∆−2 information arrivals over T time periods, this
information does not impact the achievable regret
rates. When ∆ is fixed and independent of the horizon length T, the lower bound scales logarithmically with T. When ∆ can scale with T, a bound of order √T is recovered when ∆ is of order T−1/2. In both cases, there are known policies (such as UCB1) that guarantee rate-optimal performance; for more details see the policies, analysis, and discussion in Auer et al. (2002).
On the other hand, when there are more than order ∆−2
observations over T periods, the lower bound
on the regret becomes a function of the arrival rate λ. When the
arrival rate is independent of the
horizon length T , the regret is bounded by a constant that is
independent of T , and a myopic policy
(e.g., a policy that for the first K periods pulls each arm
once, and at each later period pulls the arm
with the current highest estimated mean reward, while
randomizing to break ties) is optimal. For more
details see sections C.3 and C.4 of the Appendix.
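The two regimes above can be sketched in a few lines of Python; the threshold and the classification strings are illustrative labels, not quantities from the paper:

```python
import random

def stationary_arrivals(K, T, lam, seed=0):
    """Sample an arrival matrix H with i.i.d. Bernoulli(lam) entries."""
    rng = random.Random(seed)
    return [[1 if rng.random() < lam else 0 for _ in range(T)] for _ in range(K)]

def regime(lam, T, Delta, sigma_hat):
    """Classify lam against the threshold sigma_hat^2 / (4 * Delta^2 * T)."""
    threshold = sigma_hat ** 2 / (4 * Delta ** 2 * T)
    if lam <= threshold:
        return "logarithmic regret: arrivals too sparse to help"
    return "constant regret achievable"

T, Delta, sigma_hat = 100_000, 0.5, 1.0
print(regime(1 / T ** 2, T, Delta, sigma_hat))  # far below the threshold
print(regime(0.01, T, Delta, sigma_hat))        # fixed rate, above the threshold
```

A fixed rate λ independent of T always ends up in the second regime for large T, matching the constant-regret discussion above.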
3.1.2 Diminishing information arrival process
Assume that the hk,t's are random variables such that for each arm k ∈ K and time step t,

$$\mathbb{E}\left[\sum_{s=1}^{t}h_{k,s}\right] \;=\; \left\lfloor \frac{\hat{\sigma}^{2}\kappa}{2\Delta^{2}}\log t \right\rfloor,$$
for some fixed κ > 0. Then, for any T ≥ 1 and admissible policy π ∈ P, one obtains the following lower bounds on the achievable performance:

1. If κ < 1, then:

$$\mathbb{E}_{H}\left[\mathcal{R}^{\pi}_{\mathcal{S}}(H,T)\right] \;\ge\; \frac{\sigma^{2}(K-1)}{4\Delta}\log\left(\frac{\Delta^{2}/K\sigma^{2}}{1-\kappa}\left((T+1)^{1-\kappa}-1\right)\right).$$

2. If κ > 1, then:

$$\mathbb{E}_{H}\left[\mathcal{R}^{\pi}_{\mathcal{S}}(H,T)\right] \;\ge\; \frac{\sigma^{2}(K-1)}{4\Delta}\log\left(\frac{\Delta^{2}/K\sigma^{2}}{\kappa-1}\left(1-\frac{1}{(T+1)^{\kappa-1}}\right)\right).$$
This class includes information arrival processes under which
the expected number of information arrivals
up to time t is of order log t. Therefore, it demonstrates the
impact of the timing of information arrivals
on the achievable performance, and suggests that a constant
regret may be achieved when the rate of
information arrivals is decreasing. Whenever κ < 1, the lower
bound on the regret is logarithmic in T ,
and there are known policies (e.g., UCB1) that guarantee
rate-optimal performance. When κ > 1, the
lower bound becomes a constant, and one may observe that when κ
is large enough a myopic policy is
asymptotically optimal. (In the limit κ→ 1 the lower bound is of
order log log T .) For more details see
sections C.5 and C.6 of the Appendix.
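The expected arrival-count formula above is easy to tabulate; the parameter values below are illustrative:

```python
import math

def expected_cumulative_arrivals(t, kappa, Delta, sigma_hat):
    """E[sum_{s<=t} h_{k,s}] = floor(sigma_hat^2 * kappa / (2 * Delta^2) * log t)."""
    return math.floor(sigma_hat ** 2 * kappa / (2 * Delta ** 2) * math.log(t))

# Only O(log t) arrivals occur in either case, yet kappa < 1 leaves the lower
# bound logarithmic in T while kappa > 1 makes it a constant.
Delta, sigma_hat = 0.5, 1.0
for kappa in (0.5, 2.0):
    print(kappa, expected_cumulative_arrivals(10_000, kappa, Delta, sigma_hat))
```

Note how modest the difference in arrival counts is between the two regimes: the regret behavior changes qualitatively even though both processes deliver only logarithmically many observations.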
3.1.3 Discussion
One may contrast the classes of information arrival processes described in §3.1.1 and §3.1.2 by selecting κ = 2∆2λT/(σ̂2 log T). Then, in both settings the total number of information arrivals for each arm is λT on average. However, while in the first class the information arrival rate is fixed over the horizon, in the second class this arrival rate is higher at the beginning of the horizon and decreases over time. The different timing of the λT information arrivals may lead to different regret rates. For example, selecting λ = σ̂2 log T/(∆2T) implies κ = 2. Then, the lower bound in §3.1.1 is logarithmic in T (establishing the impossibility of constant
(establishing the impossibility of constant
regret in that setting), but the lower bound in §3.1.2 is
constant and independent of T (in §4 we show
that constant regret is indeed achievable in this setting). This
observation conforms to the intuition that
earlier observations have a higher impact on achievable performance, as at early periods only little information is available (and therefore the marginal impact of an additional observation is larger), and since earlier information can be used for more decisions (as the remaining horizon is longer).4
The analysis above suggests that effective optimal policy design
might depend on the information
arrival process: while policies that explore (and in that sense
are not myopic) may be rate-optimal in
some cases, a myopic policy that does not explore (except
perhaps in a small number of periods in the
4The subclasses described in §3.1.1 and §3.1.2 are special cases of the following setting. Let the hk,t's be independent random variables such that for each arm k and time t, the expected number of information arrivals up to time t satisfies:

$$\mathbb{E}\left[\sum_{s=1}^{t}h_{k,s}\right] \;=\; \lambda T\,\frac{t^{1-\gamma}-1}{T^{1-\gamma}-1}.$$

While the expected number of total information arrivals for each arm, λT, is determined by the parameter λ, the concentration of arrivals is governed by the parameter γ. When γ = 0 the arrival rate is constant, corresponding to the class described in §3.1.1. As γ increases, information arrivals concentrate at the beginning of the horizon, and γ → 1 leads to $\mathbb{E}\left[\sum_{s=1}^{t}h_{k,s}\right] = \lambda T\,\frac{\log t}{\log T}$, corresponding to the class in §3.1.2. Then, when λT is of order T^{1−γ} or higher, the lower bound is a constant independent of T.
beginning of the horizon) can be rate-optimal in other cases. However, this identification of rate-optimal policies relies on prior knowledge of the information arrival process. In the following sections we therefore study the adaptation of policies to unknown information arrival processes, in the sense of guaranteeing rate-optimality without any prior knowledge of the information arrival process.
4 Natural adaptation to the information arrival process
In this section we establish that policies based on posterior
sampling and upper confidence bounds
may naturally adapt to an a priori unknown information arrival
process to achieve the lower bound of
Theorem 1 uniformly over the general class of information
arrival processes that we consider.
4.1 Robustness of Thompson sampling
Consider Thompson sampling with Gaussian priors (Agrawal and Goyal 2012, 2013a). In the following adjustment of this policy, posteriors are updated both after the policy's actions and after auxiliary information arrivals. Denote by nk,t and X̄k,nk,t the weighted number of times a sample from arm k has been observed and the weighted empirical average reward of arm k up to time t, respectively:
$$n_{k,t} := \sum_{s=1}^{t-1}\mathbf{1}\{\pi_s = k\} + \sum_{s=1}^{t}\frac{\sigma^{2}}{\hat{\sigma}^{2}}h_{k,s}, \qquad \bar{X}_{k,n_{k,t}} := \frac{\displaystyle\sum_{s=1}^{t-1}\frac{1}{\sigma^{2}}\mathbf{1}\{\pi_s = k\}X_{k,s} + \sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}}h_{k,s}Z_{k,s}}{\displaystyle\sum_{s=1}^{t-1}\frac{1}{\sigma^{2}}\mathbf{1}\{\pi_s = k\} + \sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}}h_{k,s}}. \qquad (2)$$
Thompson sampling with auxiliary observations. Input: a tuning constant c.

1. Initialization: set initial counters nk,0 = 0 and initial empirical means X̄k,0 = 0 for all k ∈ K.
2. At each period t = 1, . . . , T:
   (a) Observe the vectors ht and Zt.
   (b) Sample θk,t ∼ N(X̄k,nk,t, cσ2(nk,t + 1)−1) for all k ∈ K, and select the arm πt = arg maxk θk,t.
   (c) Receive and observe a reward Xπt,t.
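The policy above can be sketched in Python as follows, assuming Gaussian rewards; `pull_arm` and `aux_stream` are illustrative stubs (not part of the paper), and the posterior bookkeeping follows the weighted counters and means of equation (2):

```python
import math
import random

def thompson_with_aux(K, T, pull_arm, aux_stream, sigma, sigma_hat, c=1.0, seed=0):
    """Thompson sampling with Gaussian priors, updating posteriors after both
    the policy's pulls and auxiliary observations (a sketch of the policy above)."""
    rng = random.Random(seed)
    n = [0.0] * K        # weighted counters n_{k,t}, as in eq. (2)
    sums = [0.0] * K     # precision-weighted reward sums (numerator of eq. (2))
    prec = [0.0] * K     # total precision (denominator of eq. (2))
    xbar = [0.0] * K     # weighted empirical means
    history = []
    for t in range(1, T + 1):
        # (a) fold auxiliary observations (arm, Z) into the posterior
        for k, z in aux_stream(t):
            n[k] += sigma ** 2 / sigma_hat ** 2
            sums[k] += z / sigma_hat ** 2
            prec[k] += 1.0 / sigma_hat ** 2
            xbar[k] = sums[k] / prec[k]
        # (b) sample theta_{k,t} ~ N(xbar_k, c*sigma^2/(n_k + 1)) and take the argmax
        theta = [rng.gauss(xbar[k], math.sqrt(c * sigma ** 2 / (n[k] + 1.0)))
                 for k in range(K)]
        arm = max(range(K), key=lambda k: theta[k])
        # (c) observe the reward and update the pulled arm's posterior
        x = pull_arm(arm, t)
        n[arm] += 1.0
        sums[arm] += x / sigma ** 2
        prec[arm] += 1.0 / sigma ** 2
        xbar[arm] = sums[arm] / prec[arm]
        history.append(arm)
    return history

# Illustrative run: arm 0 has mean 1.0, arm 1 has mean 0.0; every 20 periods an
# auxiliary observation of arm 1 arrives (all stubs are hypothetical).
env = random.Random(42)
hist = thompson_with_aux(
    K=2, T=400,
    pull_arm=lambda k, t: (1.0 if k == 0 else 0.0) + env.gauss(0, 0.5),
    aux_stream=lambda t: [(1, env.gauss(0, 0.5))] if t % 20 == 0 else [],
    sigma=0.5, sigma_hat=0.5)
print(hist.count(0), hist.count(1))
```

The only change relative to standard Gaussian Thompson sampling is step (a): auxiliary observations are folded in with weight σ2/σ̂2, exactly as in equation (2).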
The next result establishes that by deploying a Thompson sampling policy that updates posteriors after the policy's actions and whenever auxiliary information arrives, one may guarantee rate-optimal performance in the presence of unknown information arrival processes.
Theorem 2 (Near optimality of Thompson sampling with auxiliary observations) Let π be a Thompson sampling with auxiliary observations policy with c > 0. For every T ≥ 1 and auxiliary information arrival matrix H:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}\setminus\{k^{*}\}}\left(\frac{C_4}{\Delta_k}\log\left(\sum_{t=0}^{T}\exp\left(-\frac{\Delta_k^{2}}{C_4}\sum_{s=1}^{t}h_{k,s}\right)\right) + \frac{C_5}{\Delta_k^{3}} + C_6\Delta_k\right),$$

for some positive constants C4, C5, and C6 that depend only on c, σ, and σ̂.
The upper bound in Theorem 2 holds for any arbitrary sample path
of information arrivals that is
captured by the matrix H, and matches the lower bound in Theorem
1 with respect to the dependence
on the sample path of information arrivals hk,t’s, as well as
the time horizon T , the number of arms K,
and the minimum expected reward difference ∆. This establishes
that, by updating posteriors after both the policy's actions and auxiliary information arrivals, the
Thompson sampling policy guarantees rate
optimality uniformly over the general class of information
arrival processes defined in §2. In particular,
for any information arrival matrix H, the regret rate guaranteed
by Thompson sampling cannot be
improved by any policy, even when the information arrival
process is known upfront.
When put together, Theorem 1 and Theorem 2 establish the minimax
complexity of our problem,
which is characterized as follows.
Remark 1 Theorems 1 and 2 together identify the minimax regret rate for the MAB problem with any unknown information arrival process H as a function of the entries of H:

$$\mathcal{R}^{*}_{\mathcal{S}}(H,T) \;\asymp\; \sum_{k\in\mathcal{K}}\log\left(\sum_{t=1}^{T}\exp\left(-c\cdot\sum_{s=1}^{t}h_{k,s}\right)\right),$$

where c is a constant that depends on problem parameters such as K, ∆, and σ.
Remark 1 has two important implications. First, it identifies a spectrum of minimax regret
rates ranging from the classical regret rates that appear in the
stochastic MAB literature when there
is no auxiliary information, to regret that is uniformly bounded
over time when information arrives
frequently and/or early enough. Furthermore, identifying the
minimax complexity provides a yardstick
for evaluating the performance of policies, and a sharp
criterion for identifying rate-optimal ones. This
already identifies Thompson sampling as rate optimal, but will
shortly be used for establishing the
optimality (or sub-optimality) of other policies.
Key ideas in the proof. To establish Theorem 2 we decompose the
regret associated with each
suboptimal arm k into three components: (i) Regret from pulling
an arm when its empirical mean
deviates from its expectation; (ii) Regret from pulling an arm
when its empirical mean does not deviate
from its expectation, but θk,t deviates from the empirical mean;
and (iii) Regret from pulling an arm
when its empirical mean does not deviate from its expectation
and θk,t does not deviate from the empirical
mean. Following the analysis in Agrawal and Goyal (2013b), one may use concentration inequalities such as the Chernoff-Hoeffding bound to bound the cumulative expressions of cases (i) and (iii) uniformly over time. Case (ii) can be analyzed by considering an arrival process whose arrival rate decreases
exponentially with each arrival. We bound the total number of
arrivals of this process (Lemma 3) and
establish that this suffices to bound the cumulative regret
associated with case (ii).
4.2 Robustness of UCB
We next consider a UCB1 policy (Auer et al. 2002) in which mean reward estimates and observation counters are updated not only after the policy's actions but also after auxiliary observations. We denote by nk,t and X̄k,nk,t the observation counters and mean reward estimates defined in (2).
UCB1 with auxiliary observations. Input: a constant c.

1. At each period t = 1, . . . , T:
   (a) Observe the vectors ht and Zt.
   (b) Select the arm

$$\pi_t = \begin{cases} t & \text{if } t \le K,\\[4pt] \arg\max_{k\in\mathcal{K}}\left\{\bar{X}_{k,n_{k,t}} + \sqrt{\dfrac{c\,\sigma^{2}\log t}{n_{k,t}}}\right\} & \text{if } t > K. \end{cases}$$

   (c) Receive and observe a reward Xπt,t.
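A minimal Python sketch of this adjusted UCB1 follows; as before, `pull_arm` and `aux_stream` are hypothetical stubs, and the counters follow equation (2):

```python
import math

def ucb1_with_aux(K, T, pull_arm, aux_stream, sigma, sigma_hat, c=2.1):
    """UCB1 that updates counters and estimates (eq. (2)) after both pulls
    and auxiliary arrivals (a sketch; the stubs are illustrative)."""
    n = [0.0] * K
    sums = [0.0] * K
    prec = [0.0] * K
    xbar = [0.0] * K
    history = []
    for t in range(1, T + 1):
        for k, z in aux_stream(t):          # (a) fold in auxiliary observations
            n[k] += sigma ** 2 / sigma_hat ** 2
            sums[k] += z / sigma_hat ** 2
            prec[k] += 1.0 / sigma_hat ** 2
            xbar[k] = sums[k] / prec[k]
        if t <= K:                          # (b) pull each arm once, then the UCB rule
            arm = t - 1
        else:
            arm = max(range(K), key=lambda k: xbar[k]
                      + math.sqrt(c * sigma ** 2 * math.log(t) / n[k]))
        x = pull_arm(arm, t)                # (c) observe the reward and update
        n[arm] += 1.0
        sums[arm] += x / sigma ** 2
        prec[arm] += 1.0 / sigma ** 2
        xbar[arm] = sums[arm] / prec[arm]
        history.append(arm)
    return history

# Illustrative noiseless run: arm 0 always pays 1.0, arm 1 pays 0.0.
hist = ucb1_with_aux(K=2, T=300, pull_arm=lambda k, t: 1.0 if k == 0 else 0.0,
                     aux_stream=lambda t: [], sigma=0.5, sigma_hat=0.5)
print(hist.count(0), hist.count(1))
```

Because auxiliary arrivals inflate nk,t, the confidence radius of an arm shrinks even without pulling it, which is precisely the mechanism behind the improved bound in Theorem 3.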
The next result establishes that by updating counters and reward estimates after both the policy's actions and auxiliary information arrivals, the UCB1 policy guarantees rate-optimal performance over the general class of information arrival processes defined in §2.
Theorem 3 Let π be UCB1 with auxiliary observations, tuned by c > 2. Then, for any T ≥ 1, K ≥ 2, and additional information arrival matrix H:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}}\left(\frac{C_7}{\Delta_k^{2}}\log\left(\sum_{t=0}^{T}\exp\left(-\frac{\Delta_k^{2}}{C_7}\sum_{s=1}^{t}h_{k,s}\right)\right) + C_8\Delta_k\right),$$

where C7 and C8 are positive constants that depend only on σ and σ̂.
The upper bound in Theorem 3 holds for any arbitrary sample path of information arrivals that is captured by the matrix H. It establishes that, similarly to Thompson sampling, when accounting for auxiliary observations as suggested above, UCB guarantees rate optimality uniformly over the general class of information arrival processes defined in §2.
Key ideas in the proof. The proof adjusts the analysis of UCB1 in Auer et al. (2002). Pulling a suboptimal arm k at time step t implies that at least one of the following three events occurs: (i) the empirical average of the best arm deviates from its mean; (ii) the empirical mean of arm k deviates from its mean; or (iii) arm k has not been pulled sufficiently often, in the sense that

$$\tilde{n}_{k,t-1} \;\le\; \hat{l}_{k,t} - \sum_{s=1}^{t}\frac{\sigma^{2}}{\hat{\sigma}^{2}}h_{k,s},$$

where $\hat{l}_{k,t} = \frac{4c\sigma^{2}\log(\tau_{k,t})}{\Delta_k^{2}}$ with $\tau_{k,t} := \sum_{s=1}^{t}\exp\left(\frac{\Delta_k^{2}}{4c\hat{\sigma}^{2}}\sum_{\tau=s}^{t}h_{k,\tau}\right)$, and ñk,t−1 is the number of times arm k is pulled up to time t. The probability of the first two events can be bounded using the Chernoff-Hoeffding inequality, and the probability of the third one can be bounded using:
$$\sum_{t=1}^{T}\mathbf{1}\left\{\pi_t = k,\; \tilde{n}_{k,t-1} \le \hat{l}_{k,t} - \sum_{s=1}^{t}\frac{\sigma^{2}}{\hat{\sigma}^{2}}h_{k,s}\right\} \;\le\; \max_{1\le t\le T}\left\{\hat{l}_{k,t} - \sum_{s=1}^{t}\frac{\sigma^{2}}{\hat{\sigma}^{2}}h_{k,s}\right\}.$$
Therefore, we establish that for c > 2,

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}\setminus\{k^{*}\}}\frac{C_7}{\Delta_k^{2}}\cdot\max_{1\le t\le T}\log\left(\sum_{m=1}^{t}\exp\left(\frac{\Delta_k^{2}}{C_7}\sum_{s=m}^{t}h_{k,s} - \frac{\Delta_k^{2}}{C_7}\sum_{s=1}^{t}h_{k,s}\right)\right) + C_8\Delta_k.$$
Numerical analysis. We further evaluate the performance of
Thompson sampling and UCB policies
through numerical experiments that are detailed in Appendix F
and include various selections of tuning
parameters and information arrival scenarios. Results are
consistent with the ones depicted in Figure 2
across different settings. While optimal values for the tuning
parameters are case dependent, we note
that, in general, for both Thompson sampling and UCB,
appropriate values for the parameter c decrease
as the amount of auxiliary information that is realized over
time grows large.
4.3 Discussion
Theorems 2 and 3 establish that Thompson sampling and UCB policies guarantee the best achievable performance (up to a multiplicative constant) uniformly over the class of information arrival processes defined in §2, that is, under any arbitrary information arrival process captured by the matrix H. For
example, upper bounds that match the lower bounds established
for the settings in §3.1.1 and §3.1.2 for
any values of λ and κ are given in Appendix C.2 (see Corollaries
1 and 2).
Our results imply that Thompson sampling and UCB possess
remarkable robustness with respect to
the information arrival process: without any adjustments they
guarantee the best achievable performance,
and furthermore, guarantee a regret rate that cannot be improved
even when the information arrival
process is known upfront. These results therefore uncover a
novel property of these policies, and generalize
the class of problems to which they have been applied.
Nevertheless, policies in which the exploration rate is exogenous, such as forced sampling and εt-greedy-type policies, do not exhibit such robustness (as we will observe in §5.1).5 In order to study how policies with an exogenous exploration rate can be adjusted to account for the arrival of auxiliary information, we devote §5 to devising a novel virtual time indices method for endogenizing the effective exploration rate. We apply this method to design εt-greedy-type policies that, without any prior knowledge of the information arrival process, attain the best performance that is achievable when the information arrival process is a priori known.
5 Endogenizing exploration via virtual time indices

5.1 Policy design intuition

In this subsection we first demonstrate, through the case of the εt-greedy policy, that policies with an exogenous exploration rate may fail to achieve the lower bound in Theorem 1 in the presence of auxiliary information arrivals, and then develop intuition for how such policies may be adjusted to better leverage auxiliary information through the virtual time indices method. Formal policy design and analysis follow in §5.2.
The inefficiency of "naive" adaptations of εt-greedy. Despite the robustness established in §4 for Thompson sampling and UCB policies, in general, accounting for auxiliary information while otherwise maintaining the policy structure may not suffice for achieving the lower bound established in Theorem 1. To demonstrate this, consider the εt-greedy policy (Auer et al. 2002), which at each period t selects an arm randomly with probability εt proportional to t−1, and with probability 1 − εt selects the arm with the highest reward estimate. Without auxiliary information, εt-greedy guarantees rate-optimal regret of order log T, but this optimality does not carry over to settings with other information arrival processes: as visualized in Figure 3, using auxiliary observations to update estimates without appropriately adjusting the exploration rate leads to sub-optimality.6 For example,
5One advantage of policies with exogenous exploration is that they allow the collection of independent samples (without inner-sample correlation) and thus facilitate statistical inference based on standard methods; see, e.g., the analysis in Nie et al. (2017) and the empirical study in Villar et al. (2015). For discussion of other appealing features of such policies, such as generality and practicality, see, e.g., Kveton et al. (2018).
6A natural alternative is to increase the time index by one (or any other constant) each time auxiliary information arrives. Such an approach, which could be implemented by arm-dependent time indices with update rule τk,t = τk,t−1 + 1 + hk,t, may lead to sub-optimality as well. For example, suppose hk,1 = ⌊(C/∆2k) log T⌋ for some constant C > 0, and hk,t = 0 for all t > 1. By Theorems 2 and 3, when the constant C is sufficiently large, constant regret is achievable. However, following the above
consider stationary information arrivals (described in §3.1.1) with an arrival rate λ ≥ σ̂2/(4∆2T). While constant regret is achievable in this case, without adjusting its exploration rate, εt-greedy explores with suboptimal arms at a rate that is independent of the number of auxiliary observations, and therefore still incurs regret of order log T.
Over-exploration in the presence of additional information. For simplicity, consider a 2-armed bandit problem with µ1 > µ2. At each time t, the εt-greedy policy explores over arm 2 independently with probability εt = ct−1 for some constant c > 0. As a continuous-time proxy for the minimal number of times arm 2 is selected by time t, consider the function

$$\int_{1}^{t}\varepsilon_s\,ds \;=\; c\int_{1}^{t}\frac{ds}{s} \;=\; c\log t.$$
The probability of best-arm misidentification at period t can be bounded from above by:7

$$\exp\left(-\bar{c}\int_{1}^{t}\varepsilon_s\,ds\right) \;\le\; \tilde{c}\,t^{-1},$$

for some constants c̄, c̃ > 0. Thus, setting εt = ct−1 balances the losses from exploration and exploitation.8
Next, assume that just before time t0, an additional independent reward observation of arm 2 is collected. Then, at time t0, the minimal number of observations from arm 2 increases to (1 + c log t0), and the upper bound on the probability of best-arm misidentification decreases by a factor of e−c̄:

$$\exp\left(-\bar{c}\left(1 + \int_{1}^{t}\varepsilon_s\,ds\right)\right) \;\le\; e^{-\bar{c}}\cdot\tilde{c}\,t^{-1}, \qquad \forall\, t \ge t_0.$$
Therefore, when there are many auxiliary observations, the loss
from best-arm misidentification is
guaranteed to diminish, but performance might still be
sub-optimal due to over-exploration.
Endogenizing the exploration rate. Note that there exists a future time period t̂0 ≥ t0 such that:

$$1 + c\int_{1}^{t_0}\frac{ds}{s} \;=\; c\int_{1}^{\hat{t}_0}\frac{ds}{s}. \qquad (3)$$
In words, the minimal number of observations from arm 2 by time t0, including the one that arrived just before t0, equals (in expectation) the minimal number of observations from arm 2 by time t̂0 without any additional information arrivals. Therefore, replacing εt0 = ct0−1 with εt0 = ct̂0−1 adjusts the exploration rate to fit the amount of information actually collected at t = t0, and the respective loss from
update rule, the regret incurred due to exploration is at least of order log(∆2kT/ log T).
7This upper bound on the probability of misidentifying the best arm can be obtained using standard concentration inequalities and is formalized, for example, in Step 5 of the proof of Theorem 4.
8The design of exploration schedules that balance losses from experimenting with suboptimal arms and from misidentification of the best arm is common; see, e.g., related discussions in Auer et al. (2002), Langford and Zhang (2008), Goldenshluger and Zeevi (2013), and Bastani and Bayati (2015).
[Figure 4 appears here; axis labels omitted.]

Figure 4: Losses from exploration and exploitation (normalized through division by (µ1 − µ2)) when an additional observation is collected just before time t0, with and without replacing the standard time index t with a virtual time index τ(t). With a standard time index, exploration is performed at rate ct−1, which results in sub-optimality after time t0. With a virtual time index that is advanced at time t0, the exploration rate becomes c(τ(t))−1, which re-balances the losses from exploration and exploitation (dashed lines coincide).
exploitation. The regret reduction by this adjustment is illustrated in Figure 5. We therefore adjust the exploration rate to be εt = c(τ(t))−1 for some virtual time τ(·). We set τ(t) = t for all t < t0, and τ(t) = t̂0 + (t − t0) for t ≥ t0. Solving (3) for t̂0, we write τ(t) in closed form:

$$\tau(t) = \begin{cases} t & t < t_0,\\ c_0 t_0 + (t-t_0) & t \ge t_0, \end{cases}$$

for some constant c0 > 1. Therefore, the virtual time grows together with t, and is advanced by a multiplicative constant whenever an auxiliary observation is collected.
5.2 A rate-optimal adaptive εt-greedy policy

We apply the ideas discussed in §5.1 to design an εt-greedy policy with adaptive exploration that dynamically adjusts the exploration rate in the presence of an unknown information arrival process. For simplicity, and consistent with standard versions of εt-greedy (see, e.g., Auer et al. 2002), the policy below assumes prior knowledge of the parameter ∆. In Appendix C.1 we provide an approach for designing an εt-greedy policy that estimates ∆ throughout the decision process, and establish performance guarantees for this policy in some special cases.
Define nk,t and X̄k,nk,t as in (2), and consider the following adaptation of the εt-greedy policy.

εt-greedy with adaptive exploration. Input: a tuning parameter c > 0.

1. Initialization: set initial virtual times τk,0 = 0 for all k ∈ K.
2. At each period t = 1, 2, . . . , T:
Figure 5: Illustration of the adaptive exploration approach. (Left) The virtual time index τ(·) is advanced using multiplicative factors whenever auxiliary information arrives. (Right) The exploration rate decreases as a function of τ(·), exhibiting discrete "jumps" whenever auxiliary information is collected.
(a) Observe the vectors ht and Zt, and update the virtual time indices for all k ∈ K:

$$\tau_{k,t} = (\tau_{k,t-1} + 1)\cdot\exp\left(\frac{h_{k,t}\Delta^{2}}{c\hat{\sigma}^{2}}\right)$$

(b) With probability $\min\left\{1,\; \frac{c\sigma^{2}}{\Delta^{2}}\sum_{k'=1}^{K}\tau_{k',t}^{-1}\right\}$, select an arm at random: (exploration)

$$\pi_t = k \quad\text{with probability}\quad \frac{\tau_{k,t}^{-1}}{\sum_{k'=1}^{K}\tau_{k',t}^{-1}}, \quad\text{for all } k \in \mathcal{K}.$$

Otherwise, select an arm with the highest estimated reward: (exploitation)

$$\pi_t = \arg\max_{k\in\mathcal{K}} \bar{X}_{k,n_{k,t}}.$$

(c) Receive and observe a reward Xπt,t.
At every period t, the εt-greedy with adaptive exploration policy dynamically reacts to the information sample path by advancing the virtual time indices associated with different arms based on auxiliary observations that were collected since the last period. Then, the policy explores with probability proportional to Σ_{k′=1}^{K} τ_{k′,t}^{−1}, and otherwise pulls the arm with the highest empirical mean reward.
Every time additional information on arm k is observed, a carefully selected multiplicative factor is used to advance the virtual time index τ_{k,t} according to the update rule τ_{k,t} = (τ_{k,t−1} + 1) · exp(δ · h_{k,t}), for a suitably selected δ. In doing so, the policy effectively reduces exploration rates in order to explore over each arm k at a rate that would have been appropriate, without auxiliary information arrivals, at a future time step τ_{k,t}. This guarantees that the loss due to exploration is balanced with the loss due to best-arm misidentification throughout the horizon. The advancement of virtual time indices based on the information sample path, and its impact on the exploration rate of the policy, are illustrated in Figure 5.
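The steps above can be sketched in code. This is a minimal illustration rather than the paper's formal statement: the callbacks `reward_fn` and `aux_fn` are hypothetical stand-ins for the reward and auxiliary feedback, and mapped auxiliary observations are folded into the empirical means as in (2).

```python
import math
import random

def adaptive_eps_greedy(T, K, reward_fn, aux_fn, delta, c, sigma2, sigma2_hat):
    """One run of the epsilon_t-greedy policy with adaptive exploration.

    reward_fn(k) -> random reward of arm k.
    aux_fn(t)    -> (h, Z): h[k] counts auxiliary observations on arm k that
                    arrived between epochs t-1 and t; Z[k] lists the mapped
                    reward observations (already passed through phi).
    """
    tau = [0.0] * K    # virtual time indices, tau_{k,0} = 0
    n = [0] * K        # observation counters n_{k,t}
    xbar = [0.0] * K   # empirical mean rewards
    for t in range(1, T + 1):
        h, Z = aux_fn(t)
        for k in range(K):
            # (a) advance virtual time by a multiplicative factor per arrival
            tau[k] = (tau[k] + 1.0) * math.exp(h[k] * delta ** 2 / (c * sigma2_hat))
            for z in Z[k]:  # fold auxiliary observations into the estimates
                xbar[k] = (n[k] * xbar[k] + z) / (n[k] + 1)
                n[k] += 1
        inv = [1.0 / tau[k] for k in range(K)]
        # (b) explore with probability min{1, (c*sigma^2/Delta^2) * sum_k 1/tau_k}
        if random.random() < min(1.0, (c * sigma2 / delta ** 2) * sum(inv)):
            r, acc, arm = random.random() * sum(inv), 0.0, K - 1
            for k in range(K):  # arm k with probability proportional to 1/tau_k
                acc += inv[k]
                if r <= acc:
                    arm = k
                    break
        else:
            arm = max(range(K), key=lambda k: xbar[k])  # exploit
        x = reward_fn(arm)  # (c) receive and observe a reward
        xbar[arm] = (n[arm] * xbar[arm] + x) / (n[arm] + 1)
        n[arm] += 1
    return xbar, n, tau
```

Note that with no auxiliary arrivals the update reduces to τ_{k,t} = τ_{k,t−1} + 1, recovering the standard εt-greedy exploration schedule.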
The following result characterizes the guaranteed performance and establishes the rate optimality of εt-greedy with adaptive exploration in the presence of unknown information arrival processes.
Theorem 4 (Near optimality of εt-greedy with adaptive exploration) Let π be an εt-greedy with adaptive exploration policy, tuned by c > max{16, 10Δ²/σ²}. Then, for every T ≥ 1, and for any information arrival matrix H, one has:

RπS(H, T) ≤ Σ_{k∈K} Δ_k · ( (C9/Δ²) · log( Σ_{t=t∗+1}^{T} exp( −(Δ²/C9) · Σ_{s=1}^{t} h_{k,s} ) ) + C10 ),

where C9 and C10 are positive constants that depend only on σ and σ̂.
The upper bound in Theorem 4 holds for any arbitrary sample path of information arrivals that is captured by the matrix H. This establishes that, similarly to Thompson sampling and UCB policies (but unlike the standard εt-greedy policy), the virtual time indices method, as applied in the εt-greedy with adaptive exploration policy, leads to rate optimality uniformly over the general class of information arrival processes defined in §2.
Furthermore, we note that the virtual time indices method can also be applied to other, more general MAB settings. In Appendix E we demonstrate this by extending the contextual MAB model of Goldenshluger and Zeevi (2013), where mean rewards depend linearly on stochastic context vectors, and applying the virtual time indices method to derive improved performance bounds in the presence of unknown information arrival processes.
Key ideas in the proof. The proof decomposes regret into exploration and exploitation time periods. To bound the regret at exploration time periods, we express the virtual time indices as

τ_{k,t} = Σ_{s=1}^{t} exp( (Δ²/(c·σ̂²)) · Σ_{τ=s}^{t} h_{k,τ} ).
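As a sanity check, this closed form agrees with the policy's multiplicative update τ_{k,t} = (τ_{k,t−1} + 1) · exp(δ · h_{k,t}) with δ = Δ²/(c·σ̂²); the short sketch below (function names are ours) verifies the equivalence numerically:

```python
import math

def tau_recursive(h, delta):
    """Virtual time via the update rule tau_t = (tau_{t-1} + 1) * exp(delta * h_t)."""
    tau, out = 0.0, []
    for h_t in h:
        tau = (tau + 1.0) * math.exp(delta * h_t)
        out.append(tau)
    return out

def tau_closed_form(h, delta):
    """Closed form: tau_t = sum_{s=1}^{t} exp(delta * sum_{r=s}^{t} h_r)."""
    return [sum(math.exp(delta * sum(h[s:t])) for s in range(t))
            for t in range(1, len(h) + 1)]
```

The equivalence follows by induction: multiplying the closed form at t−1 by exp(δ·h_t) adds h_t to every inner sum, and the "+1" contributes the s = t term.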
Denoting by t_m the time step at which the m-th auxiliary observation for arm k was collected, we establish an upper bound on the expected number of exploration time periods for arm k in the time interval [t_m, t_{m+1} − 1], which scales linearly with (c·σ²/Δ²) · log(τ_{k,t_{m+1}} / τ_{k,t_m}) − 1. Summing over all values of m, we obtain that the regret over exploration time periods is bounded from above by

Σ_{k∈K} Δ_k · (c·σ²/Δ²) · log( 2 · Σ_{t=0}^{T} exp( −(Δ²/(c·σ̂²)) · Σ_{s=1}^{t} h_{k,s} ) ).
To analyze regret at exploitation time periods, we first lower bound the number of observations of each arm using Bernstein's inequality, and then apply the Chernoff-Hoeffding inequality to bound the probability that a sub-optimal arm has the highest estimated reward, given the minimal number of observations on each arm. When c > max{16, 10Δ²/σ²}, this regret component decays at a rate of at most t⁻¹.
Numerical analysis. Appendix F also includes a numerical analysis of the adjusted εt-greedy policy across different selections of the tuning parameter and various information arrival scenarios, relative to other policies (including, for example, a more "naive" version of this policy that updates the empirical means and the reward observation counters upon auxiliary information arrivals but does not update the virtual time indices as prescribed). We also tested for robustness with respect to misspecification of the gap parameter ∆, and evaluated the performance of an εt-greedy policy that is designed to estimate the gap ∆ throughout the decision process (this policy is provided in Appendix C.1, along with performance guarantees that are established in some special cases).
6 Empirical proof of concept using content recommendations
data
To demonstrate the value that may be captured by leveraging
auxiliary information, we use data from a
large US media site to empirically evaluate achievable
performance when recommending articles that
have unknown impact on the future browsing path of readers. We
next provide some relevant context on
content recommendation services, followed by the setup and the
results of our analysis.
Background. Content recommendations, which point readers to content they "may like," are a dynamic service provided by media sites and third-party providers, with the objective of increasing readership and revenue streams for media sites. In that context, Besbes et al. (2016) validate the necessity of two performance indicators that are fundamental for recommendations that guarantee good performance along the reading path of readers: (i) the likelihood of a reader to click on a recommendation; and (ii) when clicking on it, the likelihood of the reader to continue and consume additional content afterwards. The former indicator is the click-through rate (CTR) of the recommendation, and the latter can be viewed as the conversion rate (CVR) of the recommendation.9
A key operational challenge on which we focus here is that the likelihood of readers to continue consuming content after reading an article is affected by features of that article, such as its length and number of photos, that may change several times after the article's release. Changes in article features may take place with the purpose of increasing readership, or due to editing of the content itself; for example, changes in the structure of news articles are very common during the first few days after their
9While CTR is a common performance indicator in online services, the future path has been increasingly recognized as another key performance indicator in dynamic services that continue after the first click. In the context of content recommendations, a field experiment in Besbes et al. (2016) found that recommendation methods that account for the future reading path of users can increase clicks per visit by 10%.
release. In both cases, it is common that these changes occur after estimates of click-through rates have already been formed (which is typically the case a few hours after the article's release).
Our setup will describe sequential experiments designed with the
goal of evaluating how structural
changes in a given article impact the likelihood of readers to
continue consuming content after reading it
(the CVR associated with recommending that article). Using this
setup, we will simulate the extent to
which the performance of such experiments could be improved when
utilizing auxiliary observations,
available in the form of the browsing paths of readers that arrived at these articles directly from external search (and not by clicking a recommendation).
Our data set includes a list of articles and consists of: (i) times at which these articles were recommended, and the in-site browsing path that followed these recommendations; and (ii) times at which readers arrived at these articles directly from external search engines (such as Google), and the browsing path that followed these visits.
Setup. While we defer the complete setup description to Appendix G, we next provide its key elements. Rather than which article to recommend, the problem we aim to analyze here is which version of an article to recommend. For that purpose, for each article in our data we considered a one-armed bandit setting with a known outside option to simulate experimentation with a new version of that article. Each one-armed bandit experiment was based on one day of data associated with the article at hand. We constructed the decision horizon t = 1, 2, . . . , 2000 based on the first 2,000 time-stamps of that day at which the article was recommended from the highest position (out of 5 recommendations that are presented on each page). For a fixed article a and day d, we denote by CTR_{a,d} (click-through rate) the fraction of occasions on which readers clicked on a recommendation pointing to that article during day d, out of all occasions on which article a was recommended in the first position, and by CVR^recom_{a,d} (conversion rate) the fraction of occasions on which readers continued to read another article after arriving at article a by clicking on a content recommendation during day d.
We assume that the click-through rates CTR_{a,d} are known and independent of the structure of the article.10 However, we assume that conversion rates depend on the structure of the article itself, which is subject to modifications. We denote by CVR^recom_{a,d,0} the known conversion rate of article a in its current structure, and by CVR^recom_{a,d,1} a new (unknown) conversion rate that corresponds to a new structure that is under consideration for article a. We assume that the new conversion rate CVR^recom_{a,d,1} can be either slightly higher or slightly lower than the old conversion rate CVR^recom_{a,d,0}.
We assume that each time the article is recommended, the publisher can choose between recommending

10This is a realistic assumption, as these rates are driven mainly by the title of the article, which is visible to the reader in the recommendation, and typically does not change over time.
its current or its new form. We abstract away from idiosyncratic reader characteristics and potential combinatorial externalities associated with other links or ads, and adopt the objective of maximizing the (revenue-normalized) one-step lookahead recommendation value, which was shown by Besbes et al. (2016) to be an effective approximation of the recommendation value. Then, the problem of which version of the article to recommend is reduced to a one-armed bandit problem with a known outside option: recommending version k ∈ {0, 1} of the article generates a random payoff X^{(1)}_{k,t} · (1 + X^{(2)}_{k,t}), where X^{(1)}_{k,t} ∼ Ber(CTR_{a,d}) and X^{(2)}_{k,t} ∼ Ber(CVR^recom_{a,d,k}).
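This payoff structure is straightforward to simulate. The sketch below (names are ours) draws the click and conversion indicators independently, so the expected payoff of version k is CTR_{a,d} · (1 + CVR^recom_{a,d,k}):

```python
import random

def one_step_payoff(k, ctr, cvr_recom, rng=random):
    """Payoff of recommending version k: X1 * (1 + X2), with
    X1 ~ Ber(CTR_{a,d}) (click) and X2 ~ Ber(CVR^recom_{a,d,k}) (continued reading)."""
    x1 = 1 if rng.random() < ctr else 0           # click on the recommendation
    x2 = 1 if rng.random() < cvr_recom[k] else 0  # continued reading after the click
    return x1 * (1 + x2)
```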
Auxiliary information. While the observations above are drawn randomly, the matrix H and the sample path of auxiliary observations are fixed and determined from the data as follows. In each one-armed bandit experiment the matrix H is reduced to a vector {h_{1,t}}_{t=1}^{T}. For each decision epoch t, the entry h_{1,t} represents the number of readers that arrived at the article from an external search engine between epochs t − 1 and t (that is, between the time-stamps that correspond to these epochs).
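Constructing the vector {h_{1,t}} from raw time-stamps amounts to counting search arrivals between consecutive decision epochs. The following sketch (names are ours) assumes both time-stamp lists are sorted, and counts arrivals preceding the first decision epoch toward h_{1,1}:

```python
import bisect

def build_info_arrivals(decision_times, search_times):
    """h[t-1] = number of search arrivals in (decision_times[t-2], decision_times[t-1]].

    Both inputs are sorted lists of time-stamps; arrivals preceding the first
    decision epoch are counted toward h_{1,1}.
    """
    h, prev = [], float("-inf")
    for d_t in decision_times:
        lo = bisect.bisect_right(search_times, prev)  # arrivals up to previous epoch
        hi = bisect.bisect_right(search_times, d_t)   # arrivals up to current epoch
        h.append(hi - lo)
        prev = d_t
    return h
```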
We note that the information arrival process in our data often includes two or more observations between consecutive decision epochs. It is therefore important to highlight that, as discussed in §2.1, the performance bounds established in this paper hold without any adjustment for any integer values assigned to the entries of the matrix H. For each epoch t and arrival-from-search m ∈ {1, . . . , h_{1,t}}, we denote by Y_{1,t,m} ∈ {0, 1} the indicator of whether the reader continued to read another article after visiting the article at hand. We denote by CVR^search_{a,d} the fraction of readers that continued to another article after arriving at article a from search during day d. We define

α_{a,d} := CVR^recom_{a,d} / CVR^search_{a,d}

to be the ratio between the conversion rates of users that arrive at article a by clicking on a recommendation and users that arrive at it from search. Conversion rates of readers that click on a recommendation are typically higher than conversion rates of readers that arrive directly from search.11 In our data, values of α_{a,d} are in the range [1, 16].
As the CTR is identical across k ∈ {0, 1}, it suffices to focus on the conversion feedback with Z_{1,t,m} = α_{a,d} · Y_{1,t,m}; see Appendix G for more details. Then, one may observe that the auxiliary observations {Y_{1,t,m}} can be mapped to reward observations using a mapping φ defined by the class of linear mappings from Example 1, with β₁ = 0 and α₁ = α_{a,d}.
Estimating the mapping to reward observations. To demonstrate the estimation of φ from existing data, we construct an estimator of the ratio α_{a,d} based on α_{a,d−1}, the ratio of the two conversion

11See related discussions in Besbes et al. (2016) on experienced versus inexperienced readers, and in Caro and Martínez-de-Albéniz (2020) on followers versus new readers.
rates from the previous day. Note that for each a and d, up to a multiplicative constant, α_{a,d} is a ratio of two binomial random variables. As such, α_{a,d−1} is not an unbiased estimator of α_{a,d}. However, it has been shown that the distribution of such ratios can be approximated by a log-normal distribution (see, e.g., Katz et al. 1978). Furthermore, since the ratio of two log-normal random variables also has a log-normal distribution, we assume that (α_{a,d−1}/α_{a,d}) is a log-normal random variable, that is, α_{a,d−1} = α_{a,d} · exp(σ̃W) for a standard normal random variable W ∼ N(0, 1) and a global parameter σ̃ > 0 that we estimate from the data. Then, one obtains the following unbiased estimator of α_{a,d}:

α̂_{a,d} = α_{a,d−1} · exp(−σ̃²/2).
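Under this log-normal model, the bias correction can be checked by simulation. The sketch below (names are ours) draws α_{a,d−1} around a known α_{a,d} and averages the corrected estimates:

```python
import math
import random

def alpha_hat(alpha_prev, sigma_tilde_sq):
    """Bias-corrected estimator: alpha_hat_{a,d} = alpha_{a,d-1} * exp(-sigma_tilde^2 / 2)."""
    return alpha_prev * math.exp(-sigma_tilde_sq / 2.0)

def mc_mean_alpha_hat(alpha_true, sigma_tilde_sq, n, seed=0):
    """Monte Carlo mean of alpha_hat when alpha_{a,d-1} = alpha_{a,d} * exp(sigma_tilde * W)."""
    rng = random.Random(seed)
    s = math.sqrt(sigma_tilde_sq)
    draws = (alpha_true * math.exp(s * rng.gauss(0.0, 1.0)) for _ in range(n))
    return sum(alpha_hat(a, sigma_tilde_sq) for a in draws) / n
```

With σ̃² ≈ 0.53, the corrected mean is close to the true α, whereas the raw α_{a,d−1} would overshoot it by a factor of exp(σ̃²/2) ≈ 1.30 on average.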
Results. We compare the performance of a Thompson sampling policy that ignores auxiliary observations with the performance achieved when utilizing auxiliary observations (see §4.2) based on the estimates α̂. The right-hand side of Figure 6 depicts the fitted log-normal distribution of the ratio α̂/α, with the estimated parameter σ̃² ≈ 0.53. This implies that practical estimators of the mapping φ can be computed based on the previous day, with reasonable estimation errors. We note that these errors could be significantly reduced using additional data and more sophisticated estimation methods.
Figure 6: (Left) Performance of a "standard" Thompson sampling policy that ignores auxiliary observations relative to one that utilizes auxiliary observations (see §4.2), both tuned by c = 0.25. Each point corresponds to the (normalized) mean cumulative regret averaged over 200 replications for a certain article on a certain day, and the fixed sample path of auxiliary observations for that article on that day. (Right) Histogram of log(α_{a,d−1}/α_{a,d}), with the fitted normal distribution suggesting α_{a,d−1} ∼ Lognormal(log α_{a,d}, σ̃²) with σ̃² ≈ 0.53.
The performance comparison appears on the left side of Figure 6. Each point on the scatter plot corresponds to the normalized long-run average regret incurred for a certain article on a certain day by a Thompson sampling policy, either with or without utilizing auxiliary information. Without utilizing auxiliary observations, the performance is independent of the information arrival process. When utilizing auxiliary information, performance is comparable to that of ignoring auxiliary information when there are very few auxiliary observations, and significantly improves when the amount of auxiliary information
increases. The main sources of variability in performance are (i) variability in the baseline conversion rates CVR^recom_{a,d}, which affects both versions of the algorithm; and (ii) variability in the estimation quality of α, which affects performance only when utilizing auxiliary observations. Utilizing auxiliary information arrivals led to a regret reduction of 27.8% on average; we note that this performance improvement is achieved despite (i) occasional inaccuracies in the estimate of the mapping φ(·), and (ii) tuning both algorithms with the value of c that optimizes the performance of the standard Thompson sampling policy.
7 Concluding remarks and extensions
In this paper we considered an extension of the multi-armed
bandit framework, allowing for unknown
and arbitrary auxiliary information arrival processes. We
studied the impact of the information arrival
process on the performance that can be achieved. Through
matching lower and upper bounds we
identified the minimax (regret) complexity of this class of
problems as a function of the information
arrival process, which provides a sharp criterion for
identifying rate-optimal policies.
We established that Thompson sampling and UCB policies can be
leveraged to uniformly achieve rate-
optimality and, in that sense, possess natural robustness to the
information arrival process. Moreover,
the regret rate that is guaranteed by these policies cannot be
improved even with prior knowledge of
the information arrival process. This uncovers a novel property
of these popular families of policies and
further lends credence to their appeal. We also observed that
policies with exogenous exploration rate
do not possess such robustness. For such policies, we devised a
novel virtual time indices method to
endogenously control the exploration rate to guarantee
rate-optimality. Using content recommendations
data, we demonstrated the value that can be captured in practice
by leveraging auxiliary information.
We next discuss a couple of model extensions and directions for
future research.
Reactive information arrival processes. In Appendix D we
analy