
A Simple Unified Framework for High Dimensional Bandit Problems

Wenjie Li1, Adarsh Barik2, and Jean Honorio2

1 Department of Statistics, Purdue University
2 Department of Computer Science, Purdue University

Abstract

Stochastic high dimensional bandit problems with low dimensional structures are useful in different applications such as online advertising and drug discovery. In this work, we propose a simple unified algorithm for such problems and present a general analysis framework for the regret upper bound of our algorithm. We show that under some mild unified assumptions, our algorithm can be applied to different high dimensional bandit problems. Our framework utilizes the low dimensional structure to guide the parameter estimation in the problem; therefore, our algorithm achieves the best regret bounds in the LASSO bandit, as well as novel bounds in the low-rank matrix bandit, the group sparse matrix bandit, and in a new problem: the multi-agent LASSO bandit.

1 Introduction

Stochastic multiarmed contextual bandits are useful models in various application domains, such as recommendation systems, online advertising, and personalized healthcare [Auer, 2002b, Chu et al., 2011, Abbasi-Yadkori et al., 2011]. Under this setting, the agent chooses one specific arm at each round and observes a reward, which is modeled as a function of an unknown parameter and the context of the arm. In practice, such problems are often high-dimensional, but the unknown parameter is typically assumed to have low-dimensional structure, which in turn implies a succinct representation of the final reward.

For example, in high dimensional sparse linear bandits, also known as the LASSO bandit problem [Bastani and Bayati, 2020], both the contexts and the unknown parameter are high-dimensional vectors, while the parameter is assumed to be sparse with only a limited number of nonzero elements. There has been a line of research on LASSO bandits, such as Abbasi-Yadkori et al. [2011], Carpentier and Munos [2012], Bastani and Bayati [2020], Lattimore et al. [2015], Wang et al. [2018], Kim and Paik [2019], Hao et al. [2020a], and Oh et al. [2020]. Different algorithms have been proposed for this problem, and different regret analyses based on various assumptions have been provided.

When the unknown parameter becomes a matrix, some recent works have focused on low-rank matrix bandits. Katariya et al. [2017a,b] and Trinh et al. [2020] considered rank-1 matrix bandit problems, but their results cannot be extended to higher ranks. Jun et al. [2019] studied the bilinear bandit problem where the reward is modelled as a bilinear product of the left arm, the parameter matrix, and the right arm. Kveton et al. [2017] first studied the low-rank matrix bandit, but required strong assumptions on the mean reward matrix. Lu et al. [2021] extended the work by Jun et al.

arXiv:2102.09626v2 [cs.LG] 14 Jun 2021

Table 1: Summary of regret bounds generated by our framework. d and (d1, d2) are the dimension sizes for vectors and matrices, s denotes the support size of a vector, and r denotes the rank of a matrix.

Problem                    | Regret Bound                                                 | Remark
LASSO Bandit               | O(√(sT log(dT)))  (Corollary 4.1)                            | The same as the best regret bound by Oh et al. [2020]
Low-rank Matrix Bandit     | O(√(rT log T) log((d1 + d2)T))  (Corollary 4.2)              | A novel bound by our framework (see Remark 4.2)
Group-sparse Matrix Bandit | O(√(d2 sT) + √(sT log(d1 T²)))  (Corollary 4.3 with q = 2)   | A novel bound by our framework (see Remark 4.3)
Multi-agent LASSO Bandit   | O(d2 √(sT log(d1 T)))  (Theorem E.1 in Appendix E)           | A possible extension of our algorithm to a new problem; the extension can ensure group sparsity

[2019] to a more generalized low-rank matrix bandit problem, but their action set was fixed. Hao et al. [2020b] further studied the problem of low-rank tensor bandits.

Despite the recent progress in all the above high dimensional bandit problems, both experimentally and theoretically, prior works are scattered and various algorithms have been proposed for these problems. An interesting question here is: does there exist a unified algorithm that works in all the high dimensional bandit problems, and if so, does such an algorithm achieve a desirable regret bound in the different settings? In this work, we provide affirmative answers to these questions. Our work is inspired by the literature on traditional high dimensional statistical analysis [Negahban et al., 2012] and modern high dimensional bandit algorithms [Kim and Paik, 2019, Oh et al., 2020]. The only similar prior work we are aware of is the framework by Johnson et al. [2016], but their algorithm is much more complicated than ours, and requires very strong assumptions on the arm sets as well as low-dimensional structural information of the unknown parameter.

In particular, we want to highlight the following contributions of our paper:

• We present a simple and unified algorithm for high dimensional stochastic bandit problems and provide a regret analysis framework for our algorithm. We show that to ensure a desirable regret, one simply needs to ensure that two events happen with high probability under mild assumptions, and that the regularization parameter is carefully chosen.

• We demonstrate the usefulness of our framework by applying it to different high dimensional bandit problems. We show that under a mild unified assumption, our algorithm can achieve desirable regret bounds. In fact, our algorithm achieves the best regret bounds in the LASSO bandit, and novel bounds in the low-rank matrix bandit and the group sparse matrix bandit. We also show that a simple extension of our algorithm can achieve group sparsity with a desirable regret bound in a new problem: the multi-agent LASSO bandit. We summarize the results obtained in Table 1.

2 Preliminaries

In this section, we establish some important preliminary notations and definitions used in this paper.

Notations. Given a subspace M of R^p, its orthogonal complement M⊥ is defined as M⊥ := {v ∈ R^p | 〈u, v〉 = 0 for all u ∈ M}. The matrix inner product for two matrices A, B of the same size is defined as 〈〈A, B〉〉 := trace(A^T B). We use ‖·‖ for vector norms and |||·||| for matrix norms. Given a norm ‖·‖, we use the notation θ_M to represent the projection of a vector θ onto M, i.e., θ_M = argmin_{θ′∈M} ‖θ − θ′‖, and similarly for θ_{M⊥}. The dual of a norm ‖·‖ is defined as ‖u‖* := sup_{‖v‖≤1} 〈u, v〉. For the regularization norm R, we denote its dual by R*. We denote by B(ε) the ball of radius ε with respect to the norm ‖·‖ centered at the origin, i.e., B(ε) = {∆ | ‖∆‖ ≤ ε}. We use O(·) to hide logarithmic factors in big-O notation. We frequently use the notation [T] for T ∈ N to denote the set of integers {1, 2, · · · , T}. For the reader's reference, a complete list of all the norms and their duals used in this paper is provided in Appendix A.

Multiarmed Bandits. In modern multiarmed bandit problems, a set of contexts {x_{t,a_i}}_{i=1}^K for the arms is generated at every round t, and then the agent chooses an action a_t from the K arms. The contexts are assumed to be sampled i.i.d. from a distribution P_X with respect to t, but the contexts for different arms can be correlated [Bastani and Bayati, 2020]. After the action is selected, a reward y_t = f(x_{t,a_t}, θ*) + ε_t for the chosen action is received, where f is a deterministic function, θ* is an unknown parameter, and ε_t is a zero-mean random noise term which is often assumed to be sub-Gaussian or even conditionally sub-Gaussian in a few cases. In this paper, we focus on bandit problems where θ* is high dimensional but has low-dimensional structure such as a sparse vector, a low-rank matrix, or a group-sparse matrix. Let a*_t = argmax_{i∈[K]} f(x_{t,a_i}, θ*) denote the optimal action at each round. We measure the performance of all algorithms by the expectation of the regret, defined as

Regret(T) = Σ_{t=1}^T [ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) ]

The goal for all the bandit algorithms studied in this paper is to ensure a sublinear expected regret with respect to T, so that the average regret converges, thus making the chosen actions nearly optimal. Moreover, we require most of the regret bounds to depend on the low dimensional structure, for example, the number of non-zero elements of a sparse vector, rather than the large dimension size, so that the algorithm utilizes the structure of the problem to reduce the regret.

Optimization Problem. We use the shorthand notations X_t = {x_{i,a_i}}_{i=1}^t and Y_t = {y_i}_{i=1}^t to represent all the contexts of the chosen actions and the rewards received up to time t. Most of the algorithms designed for multiarmed bandit problems involve solving an online optimization problem with a loss function L_t(θ; X_t, Y_t) and a regularization norm R(θ), i.e.,

θ_t ∈ argmin_{θ∈Θ} { L_t(θ; X_t, Y_t) + λ_t R(θ) }    (1)

where λ_t is the regularization parameter to be chosen later and Θ is the parameter domain. We also often use the notation L_t(θ) to represent L_t(θ; X_t, Y_t). The solution θ_t in Eqn. (1) is a regularized estimate of θ* based on the currently available data. Our hope is that R(θ) can identify the low-dimensional structure of θ* so that θ_t can converge to θ* fast and the action chosen by the algorithm becomes optimal after a few rounds.
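To make the abstract update in Eqn. (1) concrete, here is a minimal sketch of the regularized estimation step for the sparse linear (LASSO) instance, using scikit-learn's Lasso solver; the function name, data shapes, and toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def regularized_estimate(X_t, y_t, lambda_t):
    """One instance of Eqn. (1): squared loss plus l1 regularization.

    X_t : (t, d) array of contexts of the chosen actions so far.
    y_t : (t,) array of observed rewards.
    """
    # sklearn's Lasso minimizes (1/(2t))||y - X theta||_2^2 + alpha*||theta||_1,
    # which matches L_t(theta) + lambda_t * R(theta) with alpha = lambda_t.
    model = Lasso(alpha=lambda_t, fit_intercept=False, max_iter=10000)
    model.fit(X_t, y_t)
    return model.coef_

# toy usage with random data (d = 50, sparse theta*)
rng = np.random.default_rng(0)
d, t, s = 50, 200, 5
theta_star = np.zeros(d); theta_star[:s] = 1.0
X = rng.normal(size=(t, d)) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=t)
theta_hat = regularized_estimate(X, y, lambda_t=0.05)
```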

High Dimensional Statistics. Our general theory applies to M-estimators, i.e., L_t(θ) = (1/t) Σ_{τ=1}^t l_τ(θ), where each l_τ(θ) is a convex and differentiable loss function and R(θ) is a decomposable norm. M-estimators and decomposable norms are well-accepted concepts in the high dimensional statistics literature [Negahban et al., 2012]. We first establish the definition of a decomposable norm.

Definition 1. Given a pair of subspaces M ⊆ M̄, the norm R is said to be decomposable with respect to the subspace pair (M, M̄⊥) iff R(θ + γ) = R(θ) + R(γ) for all θ ∈ M, γ ∈ M̄⊥.

We present a few examples of the pair of subspaces and the decomposable regularization, each corresponding to one application of our general theory in Section 4.

Example 1. (Sparse Vectors and the l1 Norm). Let θ* ∈ R^d be a sparse vector with s ≪ d nonzero entries, and denote by S(θ*) the set of non-zero indices of θ* (i.e., the support). The pair of subspaces M ⊆ M̄ is chosen as M = M̄ = {θ ∈ R^d | θ_j = 0 for all j ∉ S(θ*)}. Then the l1 norm is decomposable with respect to (M, M̄⊥), i.e., ‖θ + γ‖_1 = ‖θ‖_1 + ‖γ‖_1 for all θ ∈ M, γ ∈ M̄⊥, since θ and γ have non-zero elements on different entries.

Example 2. (Low Rank Matrices and the Nuclear Norm). Let Θ* ∈ R^{d1×d2} be a low rank matrix with rank r ≪ min{d1, d2}. We define the pair of subspaces (M, M̄⊥) as

M := {Θ ∈ R^{d1×d2} | row(Θ) ⊆ V, col(Θ) ⊆ U}
M̄⊥ := {Θ ∈ R^{d1×d2} | row(Θ) ⊆ V⊥, col(Θ) ⊆ U⊥},

where U and V represent the space of the left and right singular vectors of the target matrix Θ*. Note that in this case M ⊊ M̄. Then the nuclear norm |||·|||_nuc is decomposable with respect to (M, M̄⊥), i.e., |||Θ + Γ|||_nuc = |||Θ|||_nuc + |||Γ|||_nuc for all Θ ∈ M, Γ ∈ M̄⊥.

Example 3. (Group-sparse Matrices and the l1,q Norm). Let Θ* ∈ R^{d1×d2} be a matrix with group-sparse rows, i.e., each row Θ*_i is nonzero only if i ∈ S(Θ*), and |S(Θ*)| = s ≪ d1. Similar to Example 1, we define the pair of subspaces (M, M̄⊥) as

M = M̄ = {Θ ∈ R^{d1×d2} | Θ_i = 0 for all i ∉ S(Θ*)}

The orthogonal complement can be defined with respect to the matrix inner product. Then the l1,q norm, defined as |||Θ|||_{1,q} = Σ_{i=1}^{d1} [ Σ_{j=1}^{d2} |Θ_{ij}|^q ]^{1/q}, is decomposable with respect to (M, M̄⊥).

Definition 2. For a decomposable norm R on the subspace pair (M, M⊥), the constraint set is defined as

C := { ∆ | R(∆_{M⊥}) ≤ 3R(∆_M) + 4R(θ*_{M⊥}) }

It is shown in Negahban et al. [2012] that when the regularization parameter λ_t ≥ 2R*(∇L_t(θ*)), the error θ_t − θ* belongs to C. Next we present the definition of restricted strong convexity.

Definition 3. The loss function L_t(θ) is said to be restricted strongly convex (RSC) around θ* with respect to ‖·‖ with curvature α > 0 and tolerance function Z_t(θ*) if

B_t(θ, θ*) := L_t(θ) − L_t(θ*) − 〈∇L_t(θ*), θ − θ*〉 ≥ α‖θ − θ*‖² − Z_t(θ*)

In the high dimensional statistics literature, restricted strong convexity is often ensured by a sufficient number of samples for some specific distributions of X_t such as the Gaussian distribution [Negahban et al., 2012]. In the online case, we need some special assumptions to guarantee that restricted strong convexity holds after a number of rounds. In addition, the following subspace compatibility constant plays a key role in restricting the distance between the true parameter and its estimate through the low dimensional structural constraint, hence generating a desirable regret bound.

Definition 4. For a subspace M, the subspace compatibility constant with respect to the pair (‖·‖, R) is given by φ := sup_{u∈M\{0}} R(u)/‖u‖.

For instance, φ = √s for the (‖·‖_2, ‖·‖_1) norm pair and M defined as in Example 1, because √s ‖u‖_2 ≥ ‖u‖_1 for u ∈ M by the Cauchy-Schwarz inequality.
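As a quick sanity check of Definition 1 and Definition 4 in the sparse-vector setting of Example 1, the following small numpy snippet (purely illustrative, not from the paper) verifies decomposability of the l1 norm over a support and its complement, and checks the compatibility bound ‖u‖_1 ≤ √s ‖u‖_2 on M.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s = 20, 4
support = np.arange(s)                      # S(theta*): first s coordinates

# theta lives in M (support only), gamma lives in the complement
theta = np.zeros(d); theta[support] = rng.normal(size=s)
gamma = np.zeros(d); gamma[s:] = rng.normal(size=d - s)

# Decomposability of the l1 norm: ||theta + gamma||_1 = ||theta||_1 + ||gamma||_1
assert np.isclose(np.abs(theta + gamma).sum(),
                  np.abs(theta).sum() + np.abs(gamma).sum())

# Subspace compatibility (Definition 4): for u in M, ||u||_1 <= sqrt(s) * ||u||_2
u = theta
assert np.abs(u).sum() <= np.sqrt(s) * np.linalg.norm(u) + 1e-12
```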


3 Main results

In this section, we present our general algorithm and our analysis framework. Let α ∈ R be a constant to be specified later. We define the following two probability events A_t and E_t:

A_t := { λ_t ≥ 2R*(∇L_t(θ*)) },   E_t := { B_t(θ, θ*) ≥ α‖θ − θ*‖² − Z_t(θ*) }

where A_t represents a correctly chosen regularization parameter and E_t means that L_t(θ) is RSC with respect to the (yet unspecified) norm ‖·‖ with curvature α and tolerance Z_t(θ*) at round t. As we will show below, ensuring that these two events happen with high probability is of vital importance.

3.1 Oracle inequality

We first present a general oracle inequality between the estimate and the true parameter, which is extended from the results in Negahban et al. [2012].

Lemma 3.1. (Oracle Inequality) If the probability event A_t holds and E_t holds for θ − θ* ∈ C ∩ B(r_t), where r_t² ≥ 9(λ_t²/α²)φ² + (λ_t/α)(2Z_t(θ*) + 4R(θ*_{M⊥})), then the difference θ_t − θ* satisfies the bound

‖θ_t − θ*‖² ≤ 9(λ_t²/α²)φ² + (λ_t/α)[2Z_t(θ*) + 4R(θ*_{M⊥})]

Remark 3.1. The proof of Lemma 3.1 is provided in Appendix F for completeness. Lemma 3.1 states an oracle inequality for each choice of the pair of norms (‖·‖, R) and the corresponding subspaces where R is decomposable. The difference converges to zero if both λ_t and Z_t(θ*) are sublinear with respect to t, and therefore the estimate θ_t becomes more and more accurate as rounds progress. The key insight here is that we only need restricted strong convexity (E_t) to hold on a small subset of C and with some tolerance [Negahban et al., 2012]. This differs from the literature, where E_t is forced to hold on the entire set C and with zero tolerance. We will later show that this difference plays an important role in our analysis.

3.2 The general algorithm

Now we present our simple general Explore-Then-Exploit algorithm for high dimensional problems. Our algorithm consists of two stages: the exploration stage, where arms are randomly picked, and the exploitation stage, where the best arm is chosen. Algorithm 1 shows our procedure in detail.

Remark 3.2. Algorithm 1 is different from the famous Explore-Then-Commit strategy in the sense that we keep updating the parameter after the exploration stage. More importantly, we do not require any prior knowledge of the low-dimensional structure in the algorithm, such as the sparsity level in LASSO bandits or the rank in low-rank matrix bandits. Similar algorithms have been proposed in some recent papers. For example, Oh et al. [2020] proposed a sparsity agnostic algorithm for the LASSO bandit problem, which was an application of Algorithm 1 without the exploration stage, i.e., T_0 = 0, but required many assumptions in the multi-arm case. Hao et al. [2020a] proposed an Explore-Then-Commit algorithm for LASSO bandits, but its regret bound is worse than ours. Lu et al. [2021] proposed a Low-ESTR algorithm that solved the same optimization problem in the low-rank matrix bandit. Our algorithm is arguably simple and general, and it can be applied to many high-dimensional settings. Furthermore, we provide better regrets for existing problems and novel results for new problems (see Table 1, Sections 4 and 5).

Algorithm 1 The general Explore-Then-Exploit algorithm

1: Input: {λ_t}_{t=1}^T, K ∈ N, L_t(θ), R(θ), f(x, θ), θ_0, T_0
2: Initialize X_0, Y_0 = (∅, ∅), θ_t = θ_0
3: for t = 1, 2, · · · , T do
4:     Observe K contexts x_{t,1}, x_{t,2}, · · · , x_{t,K} from P_X
5:     if t ≤ T_0 then
6:         Choose action a_t uniformly at random              # Exploration Stage
7:     else
8:         Choose action a_t = argmax_a f(x_{t,a}, θ_{t−1})   # Exploitation Stage
9:     end if
10:    Receive reward y_t = f(x_{t,a_t}, θ*) + ε_t
11:    X_t = X_{t−1} ∪ {x_{t,a_t}}, Y_t = Y_{t−1} ∪ {y_t}
12:    θ_t ∈ argmin_{θ∈Θ} { L_t(θ; X_t, Y_t) + λ_t R(θ) }
13: end for
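For concreteness, here is a minimal Python sketch of the Explore-Then-Exploit loop of Algorithm 1 for the linear-reward case, reusing the l1-regularized estimation step sketched after Eqn. (1); the environment callbacks, the schedule lambda_fn, and T0 are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def explore_then_exploit(contexts_fn, reward_fn, T, T0, d, lambda_fn, seed=0):
    """Algorithm 1 specialized to f(x, theta) = <x, theta> (a sketch only).

    contexts_fn(t) -> (K, d) array of contexts for round t
    reward_fn(x)   -> noisy reward for the chosen context x
    lambda_fn(t)   -> regularization parameter lambda_t
    """
    rng = np.random.default_rng(seed)
    theta_hat = np.zeros(d)
    X_hist, y_hist = [], []
    for t in range(1, T + 1):
        X_arms = contexts_fn(t)
        if t <= T0:                                   # exploration stage
            a_t = int(rng.integers(X_arms.shape[0]))
        else:                                         # exploitation stage
            a_t = int(np.argmax(X_arms @ theta_hat))
        x_t = X_arms[a_t]
        X_hist.append(x_t)
        y_hist.append(reward_fn(x_t))
        # regularized estimate, Eqn. (1) with squared loss + l1 norm
        lasso = Lasso(alpha=lambda_fn(t), fit_intercept=False, max_iter=10000)
        lasso.fit(np.array(X_hist), np.array(y_hist))
        theta_hat = lasso.coef_
    return theta_hat
```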

3.3 Regret analysis

To analyze the regret of Algorithm 1, we impose some very weak assumptions on the reward function and the size of the contexts. These two assumptions are listed below for the general theorem, and we will later demonstrate how they can be easily guaranteed in the specific applications.

Assumption 1. x is normalized with respect to the norm ‖·‖, i.e., ‖x‖ ≤ k_1 for some constant k_1.

Assumption 2. f(x, θ) is C_1-Lipschitz over x and C_2-Lipschitz over θ with respect to ‖·‖, i.e.,

f(x_1, θ) − f(x_2, θ) ≤ C_1‖x_1 − x_2‖,   f(x, θ_1) − f(x, θ_2) ≤ C_2‖θ_1 − θ_2‖

Remark 3.3. Assumption 1 is very standard in the contextual bandit literature, see e.g., Chu et al. [2011], Bastani and Bayati [2020], Kim and Paik [2019], Lu et al. [2021], where the size of the contexts is assumed to be bounded either by 1 or by a constant x_max. In the linear case, Assumption 1 can be achieved without loss of generality through normalization of the contexts and the rewards. Assumption 2 can also be easily guaranteed with the help of Assumption 1 for linear models, as well as for some generalized linear models such as the logistic model. Given Assumptions 1 and 2, we present our main theorem on the regret bound.

Theorem 3.1. (Problem Independent Regret Bound) Suppose that Assumptions 1 and 2 hold. Then the expected cumulative regret of Algorithm 1 satisfies the bound

E[Regret(T)] ≤ 2C_1 k_1 T_0 + 2C_1 k_1 Σ_{t=T_0}^T [P(A_t^c) + P(E_t^c)] + 2C_2 Σ_{t=T_0}^T √( 9(λ_t²/α²)φ² + (λ_t/α)[2Z_t(θ*) + 4R(θ*_{M⊥})] )

where the three terms on the right-hand side are denoted (a), (b), and (c), respectively.

Remark 3.4. The proof is provided in Appendix A. The above regret bound may seem complicated at first sight, but we can interpret it in the following way. In the initial T_0 exploration rounds, since we pull arms randomly to collect more samples, we have to consider the worst case scenario and bound the regret linearly in (a). After T_0 rounds, when A_t and E_t do not happen, no conclusions can be made on the distance between θ_t and θ*, contributing to the second term (b). When both events happen, θ_t and θ* are close enough and we can carefully bound the regret with the help of Lemma 3.1, which generates the third term (c). Theorem 3.1 indicates that the expected regret upper bound of Algorithm 1 depends on the probabilities of A_t, E_t after a chosen round T_0, as well as the choice of λ_t and Z_t(θ*). Therefore, to obtain a sublinear regret, we only need to ensure that the following two things happen in the specific applications.

1. A_t, E_t are high probability events after some T_0 ∈ N.

2. λ_t, Z_t(θ*) are carefully chosen so that term (c) is sublinear.

For instance, suppose that term (b) is finite, λ_t = O(1/√t), and Z_t(θ*) = R(θ*_{M⊥}) = 0; then the expected regret is of size O(φ√T) by simple algebra. Such a result is desirable since the regret bound is sublinear with respect to T, and it depends on the subspace compatibility constant instead of the dimension size, so we utilize the low-dimensional structure of θ*. The final regret bound in Theorem 3.1 will become clearer when we discuss its specific applications.
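To see the "simple algebra" behind the O(φ√T) claim: with Z_t(θ*) = R(θ*_{M⊥}) = 0, term (c) reduces to 2C_2 Σ_t 3λ_tφ/α, and Σ_{t=1}^T 1/√t ≤ 2√T. The tiny numeric check below uses purely illustrative constants.

```python
import numpy as np

T, phi, alpha, C2, c = 10_000, 3.0, 0.5, 1.0, 1.0
t = np.arange(1, T + 1)
lam = c / np.sqrt(t)                       # lambda_t = O(1/sqrt(t))
# term (c) with Z_t = R(theta*_{M_perp}) = 0: 2*C2*sum_t sqrt(9*lam^2*phi^2/alpha^2)
term_c = 2 * C2 * np.sum(np.sqrt(9 * lam**2 * phi**2 / alpha**2))
# closed-form O(phi*sqrt(T)) bound: 12*C2*c*phi*sqrt(T)/alpha
bound = 12 * C2 * c * phi * np.sqrt(T) / alpha
assert term_c <= bound   # term (c) indeed grows like phi*sqrt(T)
```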

3.4 High probability events

Finally, we address the two high probability events, as they are needed to prove a final regret bound. As we will show in Section 4, the probability of the event A_t is often decided by the choice of λ_t and the model structure. No further assumptions are needed for A_t to hold with high probability in any of the problems in this paper. However, E_t does not necessarily hold with high probability even after a large number of rounds, and we need another assumption to guarantee its validity.

Assumption 3. (Restricted Eigenvalue Condition) Let X denote the matrix where each row is a feature vector from an arm. The population Gram matrix Σ = (1/K) E[X^T X] satisfies that there exists some constant α_0 > 0 such that β^T Σ β ≥ α_0‖β‖² for all β ∈ C.

Remark 3.5. In the case when the contexts are matrices, we simply need to vectorize them by stacking the columns and then use the vectorized contexts to obtain the matrix X. We emphasize that Assumption 3 is very general and mild because it only assumes that the population Gram matrix satisfies the restricted eigenvalue condition, which is satisfied for many distributions, for example, the uniform distribution on a Euclidean unit ball. Apart from Assumption 3, prior works need many more assumptions, for example, a symmetric distribution and balanced covariance [Oh et al., 2020], or very complicated algorithms to make E_t a high probability event [Kim and Paik, 2019]. We avoid these issues with the following induction lemma.

Lemma 3.2. (Induction Lemma) Define δ_t² = (t/α)Z_t(θ*) − ((t−1)/α)Z_{t−1}(θ*) and suppose that 0 < δ_{t+1} ≤ δ_t. Then if B_t(θ, θ*) ≥ α‖θ − θ*‖² − Z_t(θ*) for all θ − θ* ∈ C ∩ B(δ_t) and some α > 0, we have

B_{t+1}(θ, θ*) ≥ α‖θ − θ*‖² − Z_{t+1}(θ*) for all θ − θ* ∈ C ∩ B(δ_{t+1})

Remark 3.6. The proof is provided in Appendix A. This lemma essentially claims that if the RSC condition is satisfied at round t on the set C ∩ B(δ_t), then it is also satisfied at round t+1 on a smaller subset C ∩ B(δ_{t+1}). Take Z_t(θ*) = (α‖θ*‖)/√t as an example; then δ_t² = (√t − √(t−1))‖θ*‖ = O(1/√t), which decreases as t increases. In other words, if we can find an initial round T_0 where the RSC condition is guaranteed (on the set C or a large subset), then the RSC condition always holds on a subset of C with radius of order O(1/t^{1/4}) at round t > T_0 by mathematical induction. Since we sample uniformly over the arms in the exploration stage, the RSC condition can be guaranteed on the whole set C when T_0 is sufficiently large by some concentration inequalities, as shown in Section 4. Note that in Lemma 3.1, we only need the RSC condition to hold on a subset whose radius r_t is larger than the error bound in the oracle inequality. Therefore, the correctness of Lemma 3.1 is guaranteed if θ_t − θ* converges at a rate faster than O(1/t^{1/4}).

Remark 3.7. One of the main reasons why prior works need many extra assumptions or complicated algorithms to sample from different arms is that the RSC condition is forced to hold on the entire set C and with zero tolerance, i.e., Z_t(θ*) = 0, which is obviously much more challenging since Lemma 3.2 becomes useless when Z_t(θ*) = 0. Moreover, since the best context is consistently pulled in the exploitation stage, we cannot easily guarantee that the RSC condition is always satisfied on the entire set C. Lemma 3.1 and Lemma 3.2 show that these conditions can be relaxed with a carefully-chosen tolerance term Z_t(θ*) and the RSC condition holding only on small subsets of C.

Remark 3.8. Another popular assumption in the LASSO bandit problem is the compatibility condition [Bastani and Bayati, 2020, Kim and Paik, 2019], which would replace the norm in Assumption 3 by the l1 norm (the regularization norm R in the LASSO bandit problem). Although the compatibility condition is slightly weaker than the restricted eigenvalue condition, we can easily replace the norms in all our arguments by the l1 norm and still obtain favorable properties similar to Lemma 3.1 and Lemma 3.2. The final regret bound may slightly differ from our results in Section 4.1 in terms of constants if we use the compatibility condition, but the proof idea is the same. We leave this for future work because it is currently unknown whether the compatibility condition can be extended to the matrix bandit case.

4 Applications on existing problems

In this section, we present some specific applications of our general framework. Each subsection is organized in the following way. We first clarify all the unspecified notations in the framework, such as the loss, the regularization norm, the compatibility constant, and so on. Then we present two lemmas to show that A_t, E_t are indeed high probability events after a fixed number of rounds. Given the two lemmas, we derive a corollary of Theorem 3.1 to present the final regret bound of the corresponding algorithm. We emphasize that even though we focus on linear models in all examples for clarity, it is easy to extend our results to nonlinear models that satisfy Assumptions 1 and 2. For example, results for generalized linear models whose link functions are Lipschitz can be easily obtained.

4.1 LASSO bandit

We first consider the LASSO bandit problem. In this case, the reward is assumed to be a linear function of the context of the chosen action x_{t,a_t} ∈ R^d and the unknown parameter θ* ∈ R^d, i.e., y_t = 〈x_{t,a_t}, θ*〉 + ε_t, where ε_t is (conditionally) sub-Gaussian noise. In the LASSO bandit problem, the unknown parameter θ* is assumed to be sparse with only s non-zero elements and s ≪ d. This naturally leads to the use of l1 regularization.

To fit the problem into our framework so that we can get a regret bound, we first clarify the corresponding notations. The loss function and the regularization are defined to be

L_t(θ) = (1/(2t)) Σ_{i=1}^t (y_i − x_{i,a_i}^T θ)²,   R(θ) = ‖θ‖_1

In this case, we let (M, M⊥) be defined as in Example 1; then l1 regularization is decomposable with respect to this pair of spaces. We choose the norm ‖·‖ in E_t to be the l2 norm, and thus the compatibility constant is φ = √s. Following Chu et al. [2011], we assume that the contexts x_{t,a_i} and the parameter θ* are all normalized so that ‖x_{t,a_i}‖_2 ≤ 1, ‖θ*‖_2 ≤ 1. Therefore Assumptions 1 and 2 are satisfied automatically. Now, we present the following lemmas, which give the probabilities of the two events A_t and E_t. The proofs and the specific algorithm are provided in Appendix B.

Lemma 4.1. Suppose the noise ε_t is conditionally σ-sub-Gaussian. For any δ ∈ (0, 1), use λ_t = (2σ/√t) √(2 log(2d/δ)) in Algorithm 1; then with probability at least 1 − δ, we have λ_t ≥ R*(∇L_t(θ*)).

Lemma 4.2. Suppose Assumption 3 is satisfied. Then with probability at least 1 − exp(−T_0 C/4), we have

B_t(θ, θ*) ≥ (α_0/4)‖θ − θ*‖_2² − α_0‖θ*‖_2/(4√t)

for all t ≥ T_0 ≥ 2 log(d² + d)/C and θ − θ* ∈ C ∩ B(δ_t), where C > 0 is a constant and δ_t² = ‖θ*‖_2/(2√t).

Given these two lemmas, it is easy to apply Theorem 3.1 to get a specific regret bound in the LASSO bandit. Corollary 4.1 follows from taking δ = 1/t² in Lemma 4.1 and T_0 = Θ(√T).

Corollary 4.1. The expected cumulative regret of Algorithm 1 in the LASSO bandit problem is upper bounded by E[Regret(T)] = O(√(sT log(dT))).

Remark 4.1. Corollary 4.1 indicates that the regret bound is of size O(√(sT)), which matches the best existing regret bounds proved by Oh et al. [2020] and Hao et al. [2020a]. However, our algorithm is very simple and it does not require strong assumptions. Note that most of the regret bound depends on the size of the support s instead of the high dimension d. Therefore, Algorithm 1 converges much faster than directly applying algorithms such as LinUCB [Chu et al., 2011] to the sparse setting, which satisfies our requirement.
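For instance, the regularization schedule behind Corollary 4.1 (taking δ = 1/t², so λ_t = 2σ√(2 log(2dt²)/t), as in Algorithm 2 of Appendix B) can be plugged into the earlier Explore-Then-Exploit sketch; σ here is an assumed noise scale.

```python
import numpy as np

sigma, d = 0.1, 1000

def lambda_lasso(t):
    # lambda_t = 2*sigma*sqrt(2*log(2*d*t^2)/t), the choice used for Corollary 4.1
    return 2 * sigma * np.sqrt(2 * np.log(2 * d * t * t) / t)
```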

4.2 Low-Rank matrix bandit

Next, we consider the implications of our general theory in the low-rank matrix bandit problem. In this case, the reward is assumed to be a linear function of the low rank matrix Θ* ∈ R^{d1×d2} and the context matrix X_{t,a_t} ∈ R^{d1×d2}, i.e., Y_t = 〈〈X_{t,a_t}, Θ*〉〉 + ε_t, where rank(Θ*) = r ≪ min{d1, d2}. We specify the loss function and the regularization norm to be

L_t(Θ) = (1/(2t)) Σ_{i=1}^t (y_i − 〈〈X_{i,a_i}, Θ〉〉)²,   R(Θ) = |||Θ|||_nuc

By Example 2, the nuclear norm regularization is decomposable on (M, M̄⊥) defined by the left and right singular vector spaces of Θ*. Note that in this case φ = √(2r) because the space M̄ contains matrices with rank at most 2r. The norm we choose in the event E_t is the Frobenius norm. Also, we assume without loss of generality that the matrices are normalized, so that |||X_{t,a_i}|||_F ≤ 1, |||Θ*|||_F ≤ 1. Therefore, Assumptions 1 and 2 are automatically satisfied. Similar to Lemmas 4.1 and 4.2, the following two lemmas provide a good choice of λ_t for A_t and the probability of the event E_t.
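In the matrix case, Eqn. (1) becomes nuclear-norm regularized least squares. A minimal proximal-gradient sketch (singular value thresholding) is shown below; the step size, iteration count, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_estimate(X_list, y, lam, n_iter=500, eta=0.5):
    """Minimize (1/(2t)) * sum_i (y_i - <<X_i, Theta>>)^2 + lam * |||Theta|||_nuc."""
    t = len(X_list)
    d1, d2 = X_list[0].shape
    Theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        residuals = np.array([np.sum(X * Theta) for X in X_list]) - y
        grad = sum(r * X for r, X in zip(residuals, X_list)) / t
        Theta = svt(Theta - eta * grad, eta * lam)
    return Theta
```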

Lemma 4.3. Suppose the noise ε_t is σ-sub-Gaussian. At each round t, for any δ ∈ (0, 1), use

λ_t = (4σ/√t) √(3 log(2t/δ) log²(2(d1 + d2)/δ))

in Algorithm 1; then with probability at least 1 − δ, we have λ_t ≥ R*(∇L_t(Θ*)).

Lemma 4.4. Suppose Assumption 3 is satisfied. Then with probability at least 1 − exp(−T_0 C/4), we have

B_t(Θ, Θ*) ≥ (α_0/4)|||Θ − Θ*|||_F² − α_0|||Θ*|||_F/(4√t)

for all t ≥ T_0 ≥ 2 log(d1²d2² + d1d2)/C and Θ − Θ* ∈ C ∩ B(δ_t), where C > 0 is a constant and δ_t² = |||Θ*|||_F/(2√t).

Similarly, we can obtain the regret bound by setting δ = 1/t² in Lemma 4.3 and T_0 = Θ(√T).

Corollary 4.2. The expected cumulative regret of Algorithm 1 in the low-rank matrix bandit problem is upper bounded by E[Regret(T)] = O(√(rT log T) log((d1 + d2)T)).

Remark 4.2. The proofs and the corresponding matrix bandit algorithm are provided in Appendix C. The regret bound in Corollary 4.2 is of size O(√(rT)), which depends on the low rank structure and is better than directly applying linear bandit algorithms to the problem. We emphasize that our result is novel, because Lu et al. [2021] consider a fixed set of context matrices X_{t,a_t} instead of a generating distribution P_X, which is different from our setting.

4.3 Group-sparse matrix bandit

Finally, we also apply our general framework to the group-sparse matrix bandit problem. Similar to the low-rank matrix bandit case, we assume that the expected reward is a linear function of the context matrix X_{t,a_t} ∈ R^{d1×d2} and the optimal parameter Θ* ∈ R^{d1×d2}, i.e., Y_t = 〈〈X_{t,a_t}, Θ*〉〉 + ε_t. Define S(Θ*) = {i ∈ [d1] | Θ*_i ≠ 0} to be the set of non-zero rows of Θ* and assume that |S(Θ*)| = s ≪ d1, so the matrix is group sparse. We again specify the sum of squared errors as the loss, but use the l1,q norm as the regularization norm:

L_t(Θ) = (1/(2t)) Σ_{i=1}^t (y_i − 〈〈X_{i,a_i}, Θ〉〉)²,   R(Θ) = |||Θ|||_{1,q}

Note that if q = 1, the above problem is equivalent to the LASSO bandit case, because we only need to vectorize all the matrices by stacking their columns. Then L_t(Θ) becomes the mean squared error and R(Θ) becomes the l1 regularization; thus we only consider q > 1 here. Defining the subspaces using S(Θ*) as in Example 3, the l1,q norm regularization is decomposable.
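For q = 2 (the group LASSO case), the proximal operator of τ|||·|||_{1,2} shrinks each row of Θ toward zero as a block; the short sketch below is purely illustrative (not the paper's code) and could replace the SVT step in the earlier proximal-gradient loop.

```python
import numpy as np

def prox_group_lasso_rows(Theta, tau):
    """Prox of tau * |||Theta|||_{1,2}: soft-threshold each row's l2 norm."""
    row_norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return Theta * scale
```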

Similarly, we assume without loss of generality that the matrices are normalized with respect to the Frobenius norm, so that Assumptions 1 and 2 are satisfied automatically. We define the function η(d2, m) = max{1, d2^m}. This notation is used because we need slightly different choices of λ_t for q ∈ (1, 2] and q > 2. Also, the compatibility constant is φ = η(d2, 1/q − 1/2)√s, which varies for different q. Since Lemma 4.4 can be applied to this problem, we prove the following lemma for A_t.

Lemma 4.5. Suppose that ε_t is σ-sub-Gaussian. At each round t, for any δ ∈ (0, 1), use

λ_t = (2σ/√t) d2^{1−1/q} + (σ/√t) η(d2, 1/2 − 1/q) √(2 log(2d1/δ))

in Algorithm 1; then with probability at least 1 − δ, we have λ_t ≥ R*(∇L_t(Θ*)).

Given the above lemma, we can similarly derive a desirable regret bound.

Corollary 4.3. With C1(d2) = d2^{1−1/q} η(d2, 1/q − 1/2) and C2(d2) = η(d2, 1/q − 1/2) η(d2, 1/2 − 1/q) being two functions of d2, the expected cumulative regret of Algorithm 1 in the group-sparse matrix bandit problem is upper bounded by E[Regret(T)] = O( C1(d2)√(sT) + C2(d2)√(sT log(d1 T)) ).

Remark 4.3. The proofs and the algorithm are provided in Appendix D. The regret bound in Corollary 4.3 depends on the choice of the regularization norm. For example, if q = 2, then R(Θ) is the group-LASSO regularization and the final regret bound is O(√(d2 sT) + √(sT log(d1 T²))). Besides, the regret bound has only logarithmic dependence on d1, so it is desirable since we assume that the matrix is group sparse with s ≪ d1. The regret bound is again novel because Johnson et al. [2016] considered a fixed set of context matrices instead of a generating distribution.

5 Application on a new problem: the multi-agent LASSO bandit

In this work, we present a general algorithm for high dimensional bandit problems and an analysis framework for its regret. To further apply our framework to other high dimensional bandit problems, one simply needs to follow what we have done in Section 4. First, the loss L_t(θ) and the subspace pair need to be defined based on the low-dimensional structure. Then one determines a pair of norms (‖·‖, R) such that R is decomposable and all the constants φ, α, k_1, C_1, C_2 are defined. After showing that both A_t and E_t are high probability events under a reasonable assumption such as Assumption 3, one can derive a regret bound using Theorem 3.1.

Apart from direct applications, our general analysis framework can also inspire less direct use cases where the optimization problem in the bandit algorithm has the form of Eqn. (1) or where greedy actions are taken. We believe that similar algorithms only need to ensure that the two events A_t, E_t happen with high probability given correctly chosen parameters λ_t and Z_t(θ*). Here we briefly present a novel application where our framework guides a new algorithm and a new regret bound.

Suppose there are d2 agents solving LASSO bandit problems synchronously, and each agent k receives a stochastic reward y_t^{(k)} = x_{t,a_t}^{(k)T} θ^{(k)*} + ε_t^{(k)} at every round t. Here, the contexts for the different problems are similar and thus the parameters are required to be group-sparse instead of just sparse. The major challenges in this problem are that, first, there are multiple actions and multiple rewards at each round, so it would be impossible to directly apply our framework; second, the Lipschitzness assumption may only be guaranteed for each agent, but not for the whole problem. Therefore, we propose a variant of our algorithm where every agent takes greedy actions and the agents jointly solve the following optimization problem:

θ_t = argmin_Θ { (1/(2t)) Σ_{i=1}^t Σ_{k=1}^{d2} ( y_i^{(k)} − x_{i,a_i}^{(k)T} θ^{(k)} )² + λ_t |||Θ|||_{1,2} }

Note that the above problem also has the form of Eqn. (1). Therefore, following the same logic as in our analysis framework, we prove the two high probability events in this multi-agent problem, and then an O(d2 √(sT log(d1 T))) regret bound is proved in Appendix E. Although the bound is the same as applying the LASSO bandit algorithm to each agent independently, our regularization can ensure group-sparsity in the parameter, which cannot be guaranteed in the former case.
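A minimal sketch of the joint estimation step is given below, using proximal gradient with the row-wise group-lasso prox; the assumption that agent k's parameter is the k-th column of Θ, the per-agent designs, the step size, and the iteration count are all illustrative, not taken from the paper.

```python
import numpy as np

def multi_agent_group_lasso(X_agents, y_agents, lam, n_iter=500, eta=0.5):
    """Jointly estimate Theta whose k-th column is agent k's parameter (assumed layout).

    X_agents : list of (t, d1) design matrices, one per agent
    y_agents : list of (t,) reward vectors, one per agent
    Minimizes (1/(2t)) * sum_k ||y_k - X_k theta_k||_2^2 + lam * |||Theta|||_{1,2}.
    """
    t, d1 = X_agents[0].shape
    d2 = len(X_agents)
    Theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        grad = np.zeros_like(Theta)
        for k, (Xk, yk) in enumerate(zip(X_agents, y_agents)):
            grad[:, k] = Xk.T @ (Xk @ Theta[:, k] - yk) / t
        Z = Theta - eta * grad
        # prox of eta*lam*|||.|||_{1,2}: shrink each row's l2 norm (group sparsity over rows)
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Theta = Z * np.maximum(1.0 - eta * lam / np.maximum(norms, 1e-12), 0.0)
    return Theta
```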

References

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2011.

Alexandra Carpentier and Remi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Sergey Bobkov, Piotr Nayar, and Prasad Tetali. Concentration properties of restricted measures with applications to non-Lipschitz functions, 2015.

Peter Buhlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. In Advances in Neural Information Processing Systems, 2020a.

Botao Hao, Jie Zhou, Zheng Wen, and Will Wei Sun. Low-rank tensor bandits, 2020b.

Nicholas Johnson, Vidyashankar Sivakumar, and Arindam Banerjee. Structured stochastic linear bandits, 2016.

Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 3163–3172, Long Beach, California, USA, 2019. PMLR.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017a.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust Lasso bandit. In Advances in Neural Information Processing Systems 32, pages 5877–5887, 2019.

Branislav Kveton, Csaba Szepesvari, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, and S. Muthukrishnan. Stochastic low-rank bandits, 2017.

Tor Lattimore, Koby Crammer, and Csaba Szepesvari. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, volume 28, pages 964–972. Curran Associates, Inc., 2015.

Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.

Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

Minhwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic Lasso bandit, 2020.

Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(104):3413–3430, 2011.

Cindy Trinh, Emilie Kaufmann, Claire Vernade, and Richard Combes. Solving Bernoulli rank-one bandits with unimodal Thompson sampling. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.

Xue Wang, Mingcheng Wei, and Tao Yao. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Supplementary Materials to "A Simple Unified Framework for High Dimensional Bandit Problems"

A Notations and the Main Theorem

A.1 List of Norms

For a vector v ∈ Rd,

• ‖v‖_2 is the l2 norm of v, i.e., ‖v‖_2 = (v_1² + v_2² + · · · + v_d²)^{1/2}, which is self-dual.

• ‖v‖_1 is the l1 norm of v, i.e., ‖v‖_1 = |v_1| + |v_2| + · · · + |v_d|, whose dual norm is the l∞ norm.

• ‖v‖_∞ is the l∞ norm of v, i.e., ‖v‖_∞ = max_{i=1,2,··· ,d} |v_i|, whose dual norm is the l1 norm.

For a matrix M ∈ R^{d1×d2},

• |||M|||_F is the Frobenius norm of M, i.e., |||M|||_F = ( Σ_{i=1}^{d1} Σ_{j=1}^{d2} M_{ij}² )^{1/2}, which is self-dual.

• |||M|||_nuc is the nuclear norm of M, defined as |||M|||_nuc = Σ_{i=1}^{min{d1,d2}} σ_i, where the σ_i are the singular values of M. The dual norm of the nuclear norm is the l2-induced operator norm.

• |||M|||_op is the operator norm of M induced by the vector norm ‖·‖_2, i.e., |||M|||_op = sup_{‖x‖_2=1} ‖Mx‖_2. An important property of |||·|||_op is that |||M|||_op = max_{i=1,··· ,min{d1,d2}} σ_i, where the σ_i are the singular values of M. The dual norm of the induced operator norm is the nuclear norm.

• |||M|||_{1,q} = Σ_{i=1}^{d1} [ Σ_{j=1}^{d2} |M_{ij}|^q ]^{1/q} is the l1,q norm of M. For example, if q = 2, this corresponds to the group lasso regularization. The dual norm of the l1,q norm is the l∞,p norm, defined as |||M|||_{∞,p} = max_{i=1,··· ,d1} [ Σ_{j=1}^{d2} |M_{ij}|^p ]^{1/p}, with the relationship 1/p + 1/q = 1.

• |||M|||_max is the element-wise maximum norm of M, defined as |||M|||_max = max_{i=1,··· ,d1} max_{j=1,··· ,d2} |M_{ij}|.
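The following numpy one-liners (illustrative only, with an arbitrary test matrix) compute the matrix norms listed above.

```python
import numpy as np

M = np.arange(12, dtype=float).reshape(3, 4)
fro  = np.linalg.norm(M, 'fro')                     # |||M|||_F
nuc  = np.linalg.norm(M, 'nuc')                     # |||M|||_nuc (sum of singular values)
op   = np.linalg.norm(M, 2)                         # |||M|||_op (largest singular value)
l12  = np.sum(np.linalg.norm(M, ord=2, axis=1))     # |||M|||_{1,2}: sum of row l2 norms
mmax = np.max(np.abs(M))                            # |||M|||_max
```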

A.2 Proof of Lemma 3.2

Proof. By the definition of the Bregman divergence B_{t+1}(θ, θ*), the proof follows directly from our choice of δ_t. Recall that L_t(θ) is a convex function and has the form L_t(θ) = (1/t) Σ_{τ=1}^t l_τ(θ) with l_τ(θ) > 0 being convex functions. For any θ − θ* ∈ (C ∩ B(δ_{t+1})) ⊂ (C ∩ B(δ_t)), we have

B_{t+1}(θ, θ*) = L_{t+1}(θ) − L_{t+1}(θ*) − 〈∇L_{t+1}(θ*), θ − θ*〉
 = ( (t/(t+1)) L_t(θ) + (1/(t+1)) l_{t+1}(θ) ) − ( (t/(t+1)) L_t(θ*) + (1/(t+1)) l_{t+1}(θ*) ) − 〈∇( (t/(t+1)) L_t(θ*) + (1/(t+1)) l_{t+1}(θ*) ), θ − θ*〉
 = (t/(t+1)) B_t(θ, θ*) + (1/(t+1)) [ l_{t+1}(θ) − l_{t+1}(θ*) − 〈∇l_{t+1}(θ*), θ − θ*〉 ]
 ≥ (t/(t+1)) B_t(θ, θ*)
 ≥ (t/(t+1)) ( α‖θ − θ*‖² − Z_t(θ*) )
 ≥ α‖θ − θ*‖² − (α/(t+1))‖θ − θ*‖² − (t/(t+1)) Z_t(θ*)
 ≥ α‖θ − θ*‖² − (α/(t+1)) ( ((t+1)/α) Z_{t+1}(θ*) − (t/α) Z_t(θ*) ) − (t/(t+1)) Z_t(θ*)
 ≥ α‖θ − θ*‖² − Z_{t+1}(θ*) + (t/(t+1)) Z_t(θ*) − (t/(t+1)) Z_t(θ*)
 ≥ α‖θ − θ*‖² − Z_{t+1}(θ*)

In other words, we have proved that once restricted strong convexity is satisfied on the set C ∩ B(δ_t), it is also satisfied on C ∩ B(δ_{t+1}). □

A.3 Proof of Theorem 3.1

Proof. By the Lipschitzness of f over x with respect to ‖·‖, and the boundedness of ‖x‖, we have

f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) ≤ C_1‖x_{t,a*_t} − x_{t,a_t}‖ ≤ 2C_1 k_1

Then we can decompose the one-step regret at round t into three parts as follows,

f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)
 = [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t^c)
 ≤ 2C_1 k_1 I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t) + 2C_1 k_1 I(t > T_0, E_t^c)
 = 2C_1 k_1 I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t), E_t) + 2C_1 k_1 I(t > T_0, E_t^c)

where I(·) is the indicator function. The last equality is due to the choice a_t = argmax_a f(x_{t,a}, θ_t), and thus we know that f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t). We focus on the second indicator function now. By the Lipschitzness of f over θ, we have

I(t > T_0, E_t) = I(t > T_0, f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t), E_t)
 = I(t > T_0, f(x_{t,a_t}, θ_t) − f(x_{t,a*_t}, θ_t) + f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)
 = I(t > T_0, [f(x_{t,a_t}, θ_t) − f(x_{t,a_t}, θ*)] + [f(x_{t,a*_t}, θ*) − f(x_{t,a*_t}, θ_t)] ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)
 ≤ I(t > T_0, 2C_2‖θ_t − θ*‖ ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)

Substituting the above inequality back and taking expectations on both sides of the one-step regret at round t, we get

E[f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] ≤ 2C_1 k_1 for t ≤ T_0.

For t > T_0 and any constant v_t ∈ R, the expectation is bounded by

E[f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)]
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2‖θ_t − θ*‖ ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t) ] + 2C_1 k_1 P(E_t^c)
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2‖θ_t − θ*‖ ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) > 2C_2 v_t, E_t) ] + 2C_1 k_1 P(E_t^c)
   + E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2‖θ_t − θ*‖ ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) < 2C_2 v_t, E_t) ]
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2‖θ_t − θ*‖ ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) > 2C_2 v_t, E_t) ] + 2C_1 k_1 P(E_t^c) + 2C_2 v_t
 ≤ 2C_1 k_1 P(‖θ_t − θ*‖ > v_t, E_t) + 2C_1 k_1 P(E_t^c) + 2C_2 v_t

Now take v_t to be the upper bound of ‖θ_t − θ*‖ in Lemma 3.1, i.e., v_t² = 9(λ_t²/α²)φ² + (λ_t/α)[2Z_t(θ*) + 4R(θ*_{M⊥})]. We know by Lemma 3.1 that P(‖θ_t − θ*‖ > v_t, E_t) ≤ P(A_t^c, E_t). Then the expected cumulative regret becomes

E[Regret(T)] ≤ 2C_1 k_1 T_0 + Σ_{t=T_0}^T [ 2C_1 k_1 P(A_t^c, E_t) + 2C_1 k_1 P(E_t^c) + 2C_2 v_t ]
 ≤ 2C_1 k_1 T_0 + 2C_1 k_1 Σ_{t=T_0}^T [P(A_t^c) + P(E_t^c)] + 2C_2 Σ_{t=T_0}^T √( 9(λ_t²/α²)φ² + (λ_t/α)[2Z_t(θ*) + 4R(θ*_{M⊥})] )   □

A.4 Regret Bound with the Regularization Norm

In this subsection, we provide a regret bound when Assumptions 1 and 2 are stated based on the regularization norm R, which would correspond to, for example, using the l1 norm and the compatibility condition in the LASSO bandit. The regret bound we derive in Corollary A.1 can be easily extended to the LASSO bandit, the low-rank matrix bandit and the group-sparse matrix bandit problems. We first provide the second oracle inequality [Negahban et al., 2012].

Lemma A.1. (Oracle Inequality) Suppose the probability event A_t holds and E_t holds for θ − θ* ∈ C ∩ B(r_t), where r_t² ≥ 9(λ_t²/α²)φ² + (λ_t/α)(2Z_t(θ*) + 4R(θ*_{M⊥})). Further suppose that R(θ*_{M⊥}) = 0. Then the difference θ_t − θ* satisfies the bound

R(θ_t − θ*) ≤ 4φ √( 9(λ_t²/α²)φ² + (2λ_t/α) Z_t(θ*) )

Assumption 4. We assume x is normalized with respect to the norm R, i.e., R(x) ≤ k_1 for some constant k_1.

Assumption 5. Here we assume Lipschitzness conditions on f and boundedness of R(x) similar to those in Section 3. We assume f(x, θ) is C_1-Lipschitz over x and C_2-Lipschitz over θ with respect to the norm R, i.e.,

f(x_1, θ) − f(x_2, θ) ≤ C_1 R(x_1 − x_2),   f(x, θ_1) − f(x, θ_2) ≤ C_2 R(θ_1 − θ_2)

We show that based on such conditions, we can get a similar result as in Theorem 3.1.

Corollary A.1. (Regret Bound) Let T_0 ∈ N be a constant. Suppose that Assumptions 4 and 5 hold, and also that θ* ∈ M. Then the expected cumulative regret satisfies the bound

E[Regret(T)] ≤ 2C_1 k_1 T_0 + 2C_1 k_1 Σ_{t=T_0}^T [P(A_t^c) + P(E_t^c)] + 8C_2 φ Σ_{t=T_0}^T √( 9(λ_t²/α²)φ² + (2λ_t/α) Z_t(θ*) )

Proof. By the Lipschitzness of f over x with respect to R(·), and the boundedness of R(x), we have

f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) ≤ C_1 R(x_{t,a*_t} − x_{t,a_t}) ≤ 2C_1 k_1

Then we can decompose the one-step regret at round t into three parts as follows,

f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)
 = [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t^c)
 ≤ 2C_1 k_1 I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, E_t) + 2C_1 k_1 I(t > T_0, E_t^c)
 = 2C_1 k_1 I(t ≤ T_0) + [f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] I(t > T_0, f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t), E_t) + 2C_1 k_1 I(t > T_0, E_t^c)

where I(·) is the indicator function. The last equality is due to the choice a_t = argmax_a f(x_{t,a}, θ_t), and thus we know that f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t). We focus on the second indicator function now. By the Lipschitzness of f over θ, we have

I(t > T_0, E_t) = I(t > T_0, f(x_{t,a_t}, θ_t) ≥ f(x_{t,a*_t}, θ_t), E_t)
 = I(t > T_0, f(x_{t,a_t}, θ_t) − f(x_{t,a*_t}, θ_t) + f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)
 = I(t > T_0, [f(x_{t,a_t}, θ_t) − f(x_{t,a_t}, θ*)] + [f(x_{t,a*_t}, θ*) − f(x_{t,a*_t}, θ_t)] ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)
 ≤ I(t > T_0, 2C_2 R(θ_t − θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t)

Substituting the above inequality back and taking expectations on both sides of the one-step regret at round t, we get

E[f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)] ≤ 2C_1 k_1 for t ≤ T_0.

For t > T_0 and any constant v_t ∈ R, the expectation is bounded by

E[f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)]
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2 R(θ_t − θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), E_t) ] + 2C_1 k_1 P(E_t^c)
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2 R(θ_t − θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) > 2C_2 v_t, E_t) ] + 2C_1 k_1 P(E_t^c)
   + E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2 R(θ_t − θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*), f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) < 2C_2 v_t, E_t) ]
 ≤ E[ (f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*)) I(2C_2 R(θ_t − θ*) ≥ f(x_{t,a*_t}, θ*) − f(x_{t,a_t}, θ*) > 2C_2 v_t, E_t) ] + 2C_1 k_1 P(E_t^c) + 2C_2 v_t
 ≤ 2C_1 k_1 P(R(θ_t − θ*) > v_t, E_t) + 2C_1 k_1 P(E_t^c) + 2C_2 v_t

Now take v_t to be the upper bound of R(θ_t − θ*) in Lemma A.1, i.e., v_t = 4φ √( 9(λ_t²/α²)φ² + (2λ_t/α) Z_t(θ*) ). We know by Lemma A.1 that P(R(θ_t − θ*) > v_t, E_t) ≤ P(A_t^c, E_t). Then the expected cumulative regret becomes

E[Regret(T)] ≤ 2C_1 k_1 T_0 + Σ_{t=T_0}^T [ 2C_1 k_1 P(A_t^c, E_t) + 2C_1 k_1 P(E_t^c) + 2C_2 v_t ]
 ≤ 2C_1 k_1 T_0 + 2C_1 k_1 Σ_{t=T_0}^T [P(A_t^c) + P(E_t^c)] + 8C_2 φ Σ_{t=T_0}^T √( 9(λ_t²/α²)φ² + (2λ_t/α) Z_t(θ*) )   □

B Proof for the LASSO Bandit

B.1 Notations and Algorithm

Algorithm 2 The LASSO Bandit Algorithm

1: Input: K, d ∈ N, θ_0, T_0
2: Initialize θ_t = θ_0
3: for t = 1, 2, · · · , T do
4:     Observe K contexts x_{t,1}, x_{t,2}, · · · , x_{t,K}
5:     if t ≤ T_0 then
6:         Choose action a_t uniformly at random              # Exploration Stage
7:     else
8:         Choose action a_t = argmax_a 〈x_{t,a}, θ_{t−1}〉    # Exploitation Stage
9:     end if
10:    Receive reward y_t = 〈x_{t,a_t}, θ*〉 + ε_t
11:    Update λ_t = 2σ √(2 log(2dt²)/t)
12:    θ_t = argmin_{θ∈R^d} { (1/(2t)) Σ_{i=1}^t (y_i − x_{i,a_i}^T θ)² + λ_t‖θ‖_1 }
13: end for
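Putting the earlier sketches together, Algorithm 2 corresponds to calling the explore_then_exploit function from the sketch after Algorithm 1 with the λ_t schedule above; the toy environment below (dimensions, noise scale, and normalization) is entirely illustrative.

```python
import numpy as np

d, K, s, sigma = 100, 10, 5, 0.1
rng = np.random.default_rng(0)
theta_star = np.zeros(d)
theta_star[rng.choice(d, s, replace=False)] = rng.normal(size=s)
theta_star /= np.linalg.norm(theta_star)            # ||theta*||_2 <= 1

def contexts_fn(t):
    X = rng.normal(size=(K, d))
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x||_2 <= 1

def reward_fn(x):
    return x @ theta_star + sigma * rng.normal()

def lambda_fn(t):
    return 2 * sigma * np.sqrt(2 * np.log(2 * d * t * t) / t)

T = 2000
theta_hat = explore_then_exploit(contexts_fn, reward_fn, T=T,
                                 T0=int(np.sqrt(T)), d=d, lambda_fn=lambda_fn)
```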

We first clarify the notations we use throughout the proof for the LASSO bandit. We use the notations X_t ∈ R^{t×d} and Y_t, e_t ∈ R^t to represent the context matrix, the reward vector and the error vector respectively, i.e., [X_t]_i = x_{i,a_i} ∈ R^d, [Y_t]_i = y_i, [e_t]_i = ε_i for all i ∈ [t]. The loss function now becomes L_t(θ) = (1/(2t))‖Y_t − X_t θ‖_2². The derivative of L_t(θ) evaluated at θ* can be computed as

∇L_t(θ*) = −(1/t) X_t^T Y_t + (1/t) X_t^T X_t θ* = −(1/t) X_t^T e_t

The Bregman divergence B_t(θ, θ*) can therefore be computed as

B_t(θ, θ*) = L_t(θ) − L_t(θ*) − 〈∇L_t(θ*), θ − θ*〉
 = (1/(2t))‖Y_t − X_t θ‖_2² − (1/(2t))‖Y_t − X_t θ*‖_2² − (1/t)〈−X_t^T e_t, θ − θ*〉
 = (1/(2t)) (θ − θ*)^T X_t^T X_t (θ − θ*)

Therefore the event A_t is equivalent to { λ_t ≥ 2‖∇L_t(θ*)‖_∞ = (2/t)‖X_t^T e_t‖_∞ }. The event E_t (the RSC condition) is equivalent to

(1/(2t)) (θ − θ*)^T X_t^T X_t (θ − θ*) ≥ α‖θ − θ*‖_2² − Z_t(θ*)

Based on the above Bregman divergence, we define the matrix Σ_t ∈ R^{d×d} as follows:

Σ_t = (1/t) Σ_{τ=1}^t x_{τ,a_τ} x_{τ,a_τ}^T

B.2 Technical Lemmas

The following two Bernstein-type inequalities are very useful in our analysis.

Lemma B.1. (Lemma EC.1 of Bastani and Bayati [2020]) Let {D_k, F_k}_{k=1}^∞ be a martingale difference sequence, and suppose that D_k is σ-sub-Gaussian in an adapted sense, i.e., for all α ∈ R, E[e^{αD_k} | F_{k−1}] ≤ e^{α²σ²/2} almost surely. Then, for all t ≥ 0, we have

P( |Σ_{k=1}^n D_k| ≥ t ) ≤ 2e^{−t²/(2nσ²)}

Lemma B.2. (Lemma EC.4 of Bastani and Bayati [2020]) Let x_1, x_2, · · · , x_t be i.i.d. random vectors in R^d with ‖x_τ‖_∞ ≤ 1 for all τ. Let Σ = E[x_τ x_τ^T] and Σ_t = (1/t) Σ_{τ=1}^t x_τ x_τ^T. Then for any w > 0, we have

P[ |||Σ_t − Σ|||_max ≥ 2( w + √(2w) + √(2 log(d² + d)/t) + log(d² + d)/t ) ] ≤ exp(−tw)

B.3 Proof of Lemma 4.1

Proof. Denote by X_t^{(r)} the r-th column of the matrix X_t. Since we assume that each ‖x_t‖_2 ≤ 1, the matrix X_t is column normalized, i.e., ‖X_t^{(r)}‖_2 ≤ √t. From the union bound, we know that for any constant c ∈ R, we have

P( (2/t)‖X_t^T e_t‖_∞ ≤ c ) = P( ∀r ∈ [d], |〈e_t, X_t^{(r)}〉| ≤ ct/2 ) ≥ 1 − Σ_{r=1}^d P( |〈e_t, X_t^{(r)}〉| ≥ ct/2 )

Note that the terms X_{t,i}^{(r)} ε_i form a martingale difference sequence adapted to the filtration F_1 ⊂ F_2 ⊂ · · · ⊂ F_t, because E[X_{t,i}^{(r)} ε_i | F_{i−1}] = 0. Also note that each ε_t is σ-sub-Gaussian; therefore

E[ exp(α〈e_t, X_t^{(r)}〉) | F_{k−1} ] = E[ exp(α Σ_{i=1}^t X_{t,i}^{(r)} ε_i) | F_{k−1} ]
 = E_{X_t^{(r)}}[ E_{ε | x}[ exp(α Σ_{i=1}^t X_{t,i}^{(r)} ε_i) | F_{k−1}, x ] ]
 ≤ E_{X_t^{(r)}}[ exp( Σ_{i=1}^t α² (X_{t,i}^{(r)})² σ²/2 ) | F_{k−1} ]
 ≤ exp( α²σ²t/2 )

Hence we can use Lemma B.1, and thus

P( (2/t)‖X_t^T e_t‖_∞ ≤ c ) ≥ 1 − Σ_{r=1}^d P( |〈e_t, X_t^{(r)}〉| ≥ ct/2 ) ≥ 1 − 2d e^{−c²t/(8σ²)}

If we take λ_t = c = √( 8σ² log(2d/δ)/t ), then we conclude that

P( (2/t)‖X_t^T e_t‖_∞ ≤ λ_t ) ≥ 1 − δ  □

B.4 Proof of Lemma 4.2

First, we introduce the following two technical lemmas.

Lemma B.3. (Lemma 9 of Oh et al. [2020]) Suppose that Σ_1, Σ_2 ∈ R^{d×d}, β ∈ C ∩ R^d, and that the matrix Σ_1 satisfies the restricted eigenvalue condition β^T Σ_1 β ≥ α‖β‖_2² with α > 0. Moreover, suppose the two matrices are close enough such that |||Σ_2 − Σ_1|||_max ≤ δ, where 32φ²δ ≤ α. Then

β^T Σ_2 β ≥ (α/2)‖β‖_2²

Proof. The proof here modifies the proof by Oh et al. [2020], and we provide it for completeness. By the Cauchy-Schwarz inequality, we have

|β^T Σ_1 β − β^T Σ_2 β| = |β^T (Σ_1 − Σ_2) β| ≤ ‖(Σ_1 − Σ_2)β‖_∞ ‖β‖_1 ≤ |||Σ_1 − Σ_2|||_max ‖β‖_1² ≤ δ‖β‖_1²

For β ∈ C, we have the inequality ‖β_{M⊥}‖_1 ≤ 3‖β_M‖_1. Thus

‖β‖_1 ≤ 4‖β_M‖_1 ≤ 4φ‖β_M‖_2 ≤ 4φ‖β‖_2 ≤ 4φ √( (1/α) β^T Σ_1 β )

Therefore, since we assume that 32φ²δ ≤ α, we have

|β^T Σ_1 β − β^T Σ_2 β| ≤ δ‖β‖_1² ≤ (16φ²δ/α) β^T Σ_1 β ≤ (1/2) β^T Σ_1 β

By some simple algebra on the above inequality, we know that

β^T Σ_2 β ≥ (α/2)‖β‖_2²  □

Lemma B.4. (Distance Between Two Matrices) Define $C=\big(\sqrt{\frac{\alpha_0}{64\phi^2}+1}-1\big)^2$, where $\alpha_0$ is defined in Assumption 3. Then for all $T_0\ge 2\log(d^2+d)/C$, we have
$$\mathbb{P}\Big(|||\Sigma-\hat{\Sigma}_{T_0}|||_{\max} \ge \frac{\alpha_0}{32\phi^2}\Big) \le \exp\Big(-\frac{T_0 C}{2}\Big).$$

Proof. Since we sample uniformly in the exploration stage, the contexts $\{x_{\tau,a_\tau}\}_{\tau=1}^{T_0}$ are i.i.d., and thus for any constant $w$, by Lemma B.2 we have
$$\mathbb{P}\Bigg(\frac{1}{2}|||\hat{\Sigma}_{T_0}-\Sigma|||_{\max} \ge w+\sqrt{2w}+\sqrt{\frac{2\log(d^2+d)}{T_0}}+\frac{\log(d^2+d)}{T_0}\Bigg) \le \exp(-wT_0).$$
We now choose $w$ and $T_0$ so that the terms involving $w$ and the terms involving $T_0$ are each small, i.e.,
$$w+\sqrt{2w} \le \frac{\alpha_0}{128\phi^2} \quad\text{and}\quad \sqrt{\frac{2\log(d^2+d)}{T_0}}+\frac{\log(d^2+d)}{T_0} \le \frac{\alpha_0}{128\phi^2}.$$
Solving the two inequalities leads to
$$w = \frac{1}{2}\Big(\sqrt{\frac{\alpha_0}{64\phi^2}+1}-1\Big)^2 \quad\text{and}\quad T_0 \ge 2\log(d^2+d)\Big/\Big(\sqrt{\frac{\alpha_0}{64\phi^2}+1}-1\Big)^2.$$
Then we have the following inequality:
$$\mathbb{P}\Big(\frac{1}{2}|||\hat{\Sigma}_{T_0}-\Sigma|||_{\max} \ge \frac{\alpha_0}{64\phi^2}\Big) \le \exp(-wT_0) = \exp\Big(-\frac{T_0 C}{2}\Big). \qquad\square$$
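The behavior that Lemma B.4 quantifies can also be seen empirically. The short simulation below is an illustrative sketch (the uniform toy context distribution and constants are assumptions, not the paper's setup); it shows the max-norm deviation $|||\hat{\Sigma}_{T_0}-\Sigma|||_{\max}$ shrinking as $T_0$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20

def sample_context():
    """i.i.d. contexts with ||x||_inf <= 1 (assumed exploration distribution)."""
    return rng.uniform(-1.0, 1.0, size=d)

# population second-moment matrix for this toy distribution:
# E[x x^T] = (1/3) * I for independent Uniform(-1, 1) coordinates
Sigma = np.eye(d) / 3.0

for T0 in [100, 1000, 10000]:
    X = np.stack([sample_context() for _ in range(T0)])
    Sigma_hat = X.T @ X / T0
    dev = np.max(np.abs(Sigma_hat - Sigma))   # |||Sigma_hat - Sigma|||_max
    print(T0, dev)
```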

Finally, we provide the proof of Lemma 4.2.

Proof. The proof follows from combining Lemma B.3 and Lemma B.4. The restricted eigenvalue condition holds for $\Sigma$ by Assumption 3, i.e.,
$$(\theta-\theta^*)^T\Sigma(\theta-\theta^*) \ge \alpha_0\|\theta-\theta^*\|_2^2, \quad \forall\,\theta-\theta^*\in\mathcal{C}.$$
Also, the two matrices $\Sigma$ and $\hat{\Sigma}_{T_0}$ are close enough when $T_0$ is large, by Lemma B.4:
$$\mathbb{P}\Big(|||\Sigma-\hat{\Sigma}_{T_0}|||_{\max} \ge \frac{\alpha_0}{32\phi^2}\Big) \le \exp\Big(-\frac{T_0 C}{2}\Big),$$
where $C$ is the constant defined in Lemma B.4. By Lemma B.3 we can therefore claim that
$$(\theta-\theta^*)^T\hat{\Sigma}_{T_0}(\theta-\theta^*) \ge \frac{\alpha_0}{2}\|\theta-\theta^*\|_2^2, \quad \forall\,\theta-\theta^*\in\mathcal{C}$$
with probability at least $1-\exp(-\frac{T_0 C}{2})$ when $T_0\ge 2\log(d^2+d)/C$. Therefore
$$B_{T_0}(\theta,\theta^*) = \frac{1}{2}(\theta-\theta^*)^T\hat{\Sigma}_{T_0}(\theta-\theta^*) \ge \frac{\alpha_0}{4}\|\theta-\theta^*\|_2^2, \quad \forall\,\theta-\theta^*\in\mathcal{C}.$$
Take $Z_t(\theta^*) = (\alpha_0\|\theta^*\|_2)/(4\sqrt{t})$ in Lemma 3.2; then $\delta_t^2 = \|\theta^*\|_2(\sqrt{t}-\sqrt{t-1}) \ge \|\theta^*\|_2/(2\sqrt{t})$. Note that $\delta_{t+1}<\delta_t$. Therefore, for any $t\ge T_0$, by induction we have
$$B_t(\theta,\theta^*) \ge \frac{\alpha_0}{4}\|\theta-\theta^*\|_2^2 - \frac{\alpha_0\|\theta^*\|_2}{4\sqrt{t}}, \quad \forall\,\theta-\theta^*\in\mathcal{C}\cap\mathbb{B}(\delta_t). \qquad\square$$

B.5 Proof of Corollary 4.1

Now, given the results in Lemma 4.1, Lemma 4.2 and Theorem 3.1, we can easily derive an upper bound for our LASSO bandit algorithm. We specify all the constants here: the Lipschitz constants are $C_1=C_2=1$, the bound of the norm is $k_1=1$, and we set $T_0 = \Theta(\sqrt{T}) \ge 2\log(d^2+d)/C$, where $C$ is the constant defined in Lemma B.4. The restricted strong convexity holds with $\alpha=\alpha_0/4$ and $Z_t(\theta^*) = (\alpha_0\|\theta^*\|_2)/(4\sqrt{t})$. The compatibility constant is $\phi=\sqrt{s}$. Also, $\theta^*_{M^\perp}=0$, so $R(\theta^*_{M^\perp})=0$.

Proof. Note that in Theorem 3.1, we require
$$\delta_t^2 \ge \frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{4\lambda_t}{\alpha_0}\big[2Z_t(\theta^*)+4R(\theta^*_{M^\perp})\big].$$
If we take $\delta=1/t^2$ in Lemma 4.1, we get $\lambda_t = 2\sigma\sqrt{(2\log(2dt^2))/t}$. Therefore, the condition is equivalent to
$$\frac{\|\theta^*\|_2}{2\sqrt{t}} \ge \frac{1152 s\sigma^2\log(2dt^2)}{\alpha_0^2 t} + \frac{4\sigma\sqrt{2\log(2dt^2)}\,\|\theta^*\|_2}{t} = O\Big(\frac{1}{t}\Big).$$
The above condition can be guaranteed when $t\ge T_0=\Theta(\sqrt{T})$ and when $T$ is large enough. In other words, we can use the oracle inequality after the exploration stage. Using Theorem 3.1, we get the following upper bound:
$$\mathbb{E}[\mathrm{Regret}(T)] \le 2T_0 + 2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{8\lambda_t}{\alpha_0}Z_t(\theta^*)}.$$
Since we take $\delta=1/t^2$ in Lemma 4.1, $\mathbb{P}(\mathcal{A}_t^c)\le 1/t^2$. Furthermore,
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{A}_t^c) \le \sum_{t=T_0}^{T}\frac{1}{t^2} \le \sum_{t=1}^{\infty}\frac{1}{t^2} \le 1+\Big[-\frac{1}{t}\Big]\Big|_{1}^{\infty} = 2.$$
Also, by our choice of $T_0$, we know that
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{E}_t^c) \le \sum_{t=T_0}^{T}\exp\Big(-\frac{T_0 C}{2}\Big) \le T\exp\Big(-\frac{T_0 C}{2}\Big) = O(1),$$
where the last equality holds because $T_0=\Theta(\sqrt{T})$, so that $T\exp(-T_0C/2)$ has the form $x^2e^{-ax}$ with $x=\sqrt{T}$ and $a>0$, and $x^2e^{-ax}$ is bounded above by the constant $4/(a^2e^2)$. Therefore we arrive at the following bound:
$$\begin{aligned}
\mathbb{E}[\mathrm{Regret}(T)] &\le 2T_0 + 2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{8\lambda_t}{\alpha_0}Z_t(\theta^*)}\\
&\le 2T_0 + 2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sum_{t=T_0}^{T}\sqrt{\frac{1152 s\sigma^2\log(2dt^2)}{\alpha_0^2 t} + \frac{4\sigma\sqrt{2\log(2dt^2)}\,\|\theta^*\|_2}{t}}\\
&\le 2T_0 + 2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sum_{t=1}^{T}\sqrt{\Big(\frac{1152 s\sigma^2}{\alpha_0^2}+4\sigma\Big)\frac{\log(2dT^2)}{t}}\\
&\le 2T_0 + 2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 4\sqrt{\Big(\frac{1152 s\sigma^2}{\alpha_0^2}+4\sigma\Big)T\log(2dT^2)}\\
&= O\big(\sqrt{sT\log(2dT^2)}\big),
\end{aligned}$$
where the second-to-last inequality follows from the fact that $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\le 2\sqrt{T}-1$. $\square$

C Proof for the Low-Rank Matrix Bandit

C.1 Notations and Algorithm

Algorithm 3 The Low-Rank Matrix Bandit Algorithm

1: Input: $K, d_1, d_2 \in \mathbb{N}$, $\Theta_0$, $T_0$
2: Initialize $\Theta_t = \Theta_0$
3: for $t = 1, 2, \cdots, T$ do
4:   Observe $K$ contexts $X_{t,1}, X_{t,2}, \cdots, X_{t,K}$
5:   if $t \le T_0$ then
6:     Choose action $a_t$ uniformly at random # Exploration Stage
7:   else
8:     Choose action $a_t = \mathrm{argmax}_a\, \langle\langle X_{t,a}, \Theta_{t-1}\rangle\rangle$ # Exploitation Stage
9:   end if
10:  Receive reward $y_t = \langle\langle X_{t,a_t}, \Theta^*\rangle\rangle + \varepsilon_t$
11:  Update $\lambda_t = \frac{4\sigma}{\sqrt{t}}\sqrt{\log(2t^3)\log(2(d_1+d_2)t^2)}$
12:  $\Theta_t = \mathrm{argmin}_{\Theta\in\mathbb{R}^{d_1\times d_2}}\big\{\frac{1}{2t}\sum_{i=1}^{t}(y_i - \langle\langle X_{i,a_i},\Theta\rangle\rangle)^2 + \lambda_t|||\Theta|||_{\mathrm{nuc}}\big\}$
13: end for
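Line 12 of Algorithm 3 is a nuclear-norm-penalized least-squares problem. One standard way to compute it is proximal gradient descent, where the proximal operator of $\lambda|||\cdot|||_{\mathrm{nuc}}$ soft-thresholds the singular values. The sketch below is illustrative only (numpy; the step size and iteration count are assumptions, not taken from the paper) and solves that subproblem for given contexts and rewards.

```python
import numpy as np

def svt(M, tau):
    """Proximal operator of tau * ||.||_nuc: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_estimate(contexts, rewards, lam, n_iters=300, step=None):
    """Minimize (1/2t) * sum_i (y_i - <<X_i, Theta>>)^2 + lam * ||Theta||_nuc.

    contexts: array (t, d1, d2); rewards: array (t,).
    """
    t, d1, d2 = contexts.shape
    if step is None:
        # crude Lipschitz estimate for the smooth part (assumption)
        L = np.sum(contexts.reshape(t, -1) ** 2) / t
        step = 1.0 / max(L, 1e-12)
    Theta = np.zeros((d1, d2))
    for _ in range(n_iters):
        resid = np.tensordot(contexts, Theta, axes=([1, 2], [0, 1])) - rewards
        grad = np.tensordot(resid, contexts, axes=(0, 0)) / t
        Theta = svt(Theta - step * grad, step * lam)
    return Theta
```

In the exploitation stage, the chosen action would then be $a_t = \mathrm{argmax}_a\langle\langle X_{t,a},\Theta_{t-1}\rangle\rangle$, e.g. `np.argmax([np.sum(X * Theta) for X in arm_contexts])` for a hypothetical list of arm contexts.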

We first establish the notation and the corresponding algorithm for the low-rank matrix bandit problem. We use the shorthand vector notations $\mathcal{X}_t(\Theta^*), Y_t, e_t \in \mathbb{R}^t$ such that $[\mathcal{X}_t(\Theta^*)]_i = \langle\langle X_{i,a_i},\Theta^*\rangle\rangle$, $[Y_t]_i = y_i$, and $[e_t]_i = \varepsilon_i$. For any matrix $A$, $\mathrm{vec}(A)$ denotes the vectorization of the matrix obtained by stacking all rows of the matrix into a vector, i.e., if $A$ has entries
$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n}\\ A_{21} & A_{22} & \cdots & A_{2n}\\ & \cdots & \\ A_{m1} & A_{m2} & \cdots & A_{mn}\end{pmatrix},$$
then $\mathrm{vec}(A) = (A_{11}, A_{12}, \cdots, A_{1n}, A_{21}, A_{22}, \cdots, A_{2n}, \cdots, A_{m1}, A_{m2}, \cdots, A_{mn})^T$, and $\langle\langle A,B\rangle\rangle = \mathrm{tr}(A^T B) = \mathrm{vec}(A)^T\mathrm{vec}(B)$.

The loss function now becomes $L_t(\Theta) = \frac{1}{2t}\|Y_t - \mathcal{X}_t(\Theta)\|_2^2$, and the derivative with respect to $\Theta^*$ can be computed as
$$\nabla L_t(\Theta^*) = \frac{\partial}{\partial\Theta^*}\frac{1}{2t}(Y_t-\mathcal{X}_t(\Theta^*))^T(Y_t-\mathcal{X}_t(\Theta^*)) = \frac{1}{2t}\frac{\partial}{\partial\Theta^*}\big(-2\mathcal{X}_t(\Theta^*)^T Y_t + \mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big) = -\frac{1}{t}\sum_{i=1}^{t} X_{i,a_i}\varepsilon_i,$$
where the differentiation is based on the chain rule of matrix calculus and the following facts:
$$\frac{\partial\big(-2\mathcal{X}_t(\Theta^*)^T Y_t + \mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big)}{\partial\mathcal{X}_t(\Theta^*)} = -2e_t, \qquad \frac{\partial[\mathcal{X}_t(\Theta^*)]_i}{\partial\Theta^*_{j,k}} = \frac{\partial\,\mathrm{tr}(X_{i,a_i}^T\Theta^*)}{\partial\Theta^*_{j,k}} = X_{i,a_i;j,k}.$$
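Since the row-stacking convention for $\mathrm{vec}(\cdot)$ matters for the identity $\langle\langle A,B\rangle\rangle = \mathrm{tr}(A^TB) = \mathrm{vec}(A)^T\mathrm{vec}(B)$, the following small numpy check (illustrative only) makes the convention explicit: with row-major flattening, the trace inner product and the vectorized dot product coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

vec = lambda M: M.reshape(-1)          # row-stacking, matching the text
lhs = np.trace(A.T @ B)                # <<A, B>> = tr(A^T B)
rhs = vec(A) @ vec(B)                  # vec(A)^T vec(B)
print(np.allclose(lhs, rhs))           # True
```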

Therefore the Bregman divergence $B_t(\Theta,\Theta^*)$ in the low-rank matrix setting is
$$\begin{aligned}
B_t(\Theta,\Theta^*) &= L_t(\Theta) - L_t(\Theta^*) - \langle\nabla L_t(\Theta^*), \Theta-\Theta^*\rangle\\
&= \frac{1}{2t}\|Y_t-\mathcal{X}_t(\Theta)\|_2^2 - \frac{1}{2t}\|Y_t-\mathcal{X}_t(\Theta^*)\|_2^2 + \Big\langle\Big\langle\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i,\ \Theta-\Theta^*\Big\rangle\Big\rangle\\
&= \frac{1}{2t}\big[-2\mathcal{X}_t(\Theta)^T Y_t + \mathcal{X}_t(\Theta)^T\mathcal{X}_t(\Theta) + 2\mathcal{X}_t(\Theta^*)^T Y_t - \mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big] + \Big\langle\Big\langle\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i,\ \Theta-\Theta^*\Big\rangle\Big\rangle\\
&= \frac{1}{2t}\big[-2\mathcal{X}_t(\Theta)^T e_t - 2\mathcal{X}_t(\Theta)^T\mathcal{X}_t(\Theta^*) + \mathcal{X}_t(\Theta)^T\mathcal{X}_t(\Theta) + 2\mathcal{X}_t(\Theta^*)^T e_t + \mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big] + \Big\langle\Big\langle\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i,\ \Theta-\Theta^*\Big\rangle\Big\rangle\\
&= \frac{1}{2t}\big[\mathcal{X}_t(\Theta)^T\mathcal{X}_t(\Theta) - 2\mathcal{X}_t(\Theta)^T\mathcal{X}_t(\Theta^*) + \mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big]\\
&= \frac{1}{2t}\big(\mathcal{X}_t(\Theta)-\mathcal{X}_t(\Theta^*)\big)^T\big(\mathcal{X}_t(\Theta)-\mathcal{X}_t(\Theta^*)\big)\\
&= \frac{1}{2t}\sum_{i=1}^{t}\langle\langle X_{i,a_i},\Theta-\Theta^*\rangle\rangle^2,
\end{aligned}$$
where the fourth line substitutes $Y_t = \mathcal{X}_t(\Theta^*) + e_t$, and the resulting cross terms involving $e_t$ cancel with the inner-product term.

Note that one important property of the above Bregman divergence is
$$B_t(\Theta,\Theta^*) = \frac{1}{2t}\sum_{i=1}^{t}\langle\langle X_{i,a_i},\Theta-\Theta^*\rangle\rangle^2 = \frac{1}{2t}\sum_{i=1}^{t}\big[\mathrm{vec}(X_{i,a_i})^T\mathrm{vec}(\Theta-\Theta^*)\big]^2 = \frac{1}{2t}\sum_{i=1}^{t}\mathrm{vec}(\Theta-\Theta^*)^T\big[\mathrm{vec}(X_{i,a_i})\mathrm{vec}(X_{i,a_i})^T\big]\mathrm{vec}(\Theta-\Theta^*).$$
Based on the above discussion, we define the matrix $\hat{\Sigma}_t\in\mathbb{R}^{d_1d_2\times d_1d_2}$ as follows:
$$\hat{\Sigma}_t = \frac{1}{t}\sum_{\tau=1}^{t}\mathrm{vec}(X_{\tau,a_\tau})\mathrm{vec}(X_{\tau,a_\tau})^T.$$
Also, by Assumption 3, we know that for any $B\in\mathcal{C}$,
$$\mathrm{vec}(B)^T\Sigma\,\mathrm{vec}(B) \ge \alpha_0\|\mathrm{vec}(B)\|_2^2 = \alpha_0|||B|||_F^2.$$

C.2 Technical Lemmas

The following matrix version of the Bernstein inequality is useful in our analysis of matrix bandits.

Lemma C.1. (Theorem 3.2 of Recht [2011]) Let $X_1,\dots,X_L$ be independent zero-mean random matrices of dimension $d_1\times d_2$. Suppose $\rho_k^2 = \max\big\{|||\mathbb{E}[X_kX_k^T]|||_{\mathrm{op}},\ |||\mathbb{E}[X_k^TX_k]|||_{\mathrm{op}}\big\}$ and $|||X_k|||_{\mathrm{op}}\le M$ almost surely for all $k$. Then for any $\tau>0$, we have
$$\mathbb{P}\Bigg(\Big|\Big|\Big|\sum_{k=1}^{L}X_k\Big|\Big|\Big|_{\mathrm{op}} > \tau\Bigg) \le (d_1+d_2)\exp\Bigg(\frac{-\tau^2/2}{\sum_{k=1}^{L}\rho_k^2 + M\tau/3}\Bigg).$$
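Several matrix-norm comparisons are used silently in this appendix (for instance $|||A|||_{\mathrm{op}}\le|||A|||_F$ in the proof of Lemma 4.3, and a max-norm comparison in Lemma C.3). The tiny numpy check below (illustrative only) spells out the three norms so the inequalities can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))

op_norm = np.linalg.norm(A, ord=2)       # largest singular value
fro_norm = np.linalg.norm(A, ord='fro')
max_norm = np.max(np.abs(A))

print(op_norm <= fro_norm)                                      # always True
print(op_norm <= np.sqrt(A.shape[0] * A.shape[1]) * max_norm)   # always True
```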

C.3 Proof of Lemma 4.3

Proof. The requirement on $\lambda_t$ in terms of the spectral norm of $\nabla L_t(\Theta^*)$ is
$$\mathbb{P}\Bigg(\lambda_t \ge 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}}\Bigg) = 1 - \mathbb{P}\Bigg(\lambda_t \le 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}}\Bigg).$$
Here we define the high-probability event $\mathcal{B} := \{\max_{i=1,\dots,t}|\varepsilon_i| < v\}$, and we first bound the probability conditioned on fixed matrices $\{X_{i,a_i}\}_{i=1}^{t}$. By the definition of sub-Gaussian random variables,
$$\mathbb{P}(\mathcal{B}^c) = \mathbb{P}\big(\max_{i=1,\dots,t}|\varepsilon_i| > v\big) \le \sum_{i=1}^{t}\mathbb{P}(|\varepsilon_i|>v) \le t\exp\Big(-\frac{v^2}{2\sigma^2}\Big),$$
where the first inequality is due to the union bound. Hence if we take $v=\sigma\sqrt{2\log(2t/\delta)}$, then $\mathbb{P}(\mathcal{B}^c)\le\delta/2$. Under the event $\mathcal{B}$, the operator norm of each $X_{i,a_i}\varepsilon_i$ can be bounded by
$$|||X_{i,a_i}\varepsilon_i|||_{\mathrm{op}} \le |||X_{i,a_i}\varepsilon_i|||_F = \sqrt{\sum_{j=1}^{d_1}\sum_{k=1}^{d_2}X_{a_i;j,k}^2\,\varepsilon_i^2} \le v\,|||X_{i,a_i}|||_F \le \sigma\sqrt{2\log(2t/\delta)}.$$
Moreover, under the event $\mathcal{B}$, we have the following bounds:
$$\max\Big\{\big|\big|\big|\mathbb{E}\big[\varepsilon_i^2 X_{i,a_i}X_{i,a_i}^T \mid \{X_{i,a_i}\}_{i=1}^{t}\big]\big|\big|\big|_{\mathrm{op}},\ \big|\big|\big|\mathbb{E}\big[\varepsilon_i^2 X_{i,a_i}^TX_{i,a_i} \mid \{X_{i,a_i}\}_{i=1}^{t}\big]\big|\big|\big|_{\mathrm{op}}\Big\} \le 2\sigma^2\log(2t/\delta).$$
Therefore we have
$$\begin{aligned}
\mathbb{P}\Bigg(\lambda_t \ge 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}} \ \Bigg|\ \{X_{i,a_i}\}_{i=1}^{t}\Bigg)
&= 1 - \mathbb{P}\Bigg(\lambda_t \le 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}},\ \mathcal{B}^c\ \Bigg|\ \{X_{i,a_i}\}_{i=1}^{t}\Bigg) - \mathbb{P}\Bigg(\lambda_t \le 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}},\ \mathcal{B}\ \Bigg|\ \{X_{i,a_i}\}_{i=1}^{t}\Bigg)\\
&\ge 1 - \mathbb{P}(\mathcal{B}^c) - \mathbb{P}\Bigg(\lambda_t \le 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}},\ \mathcal{B}\ \Bigg|\ \{X_{i,a_i}\}_{i=1}^{t}\Bigg)\\
&\ge 1 - \Bigg[\frac{\delta}{2} + (d_1+d_2)\exp\Bigg(\frac{-\lambda_t^2 t^2/8}{2t\sigma^2\log(2t/\delta) + \lambda_t t\sigma\sqrt{2\log(2t/\delta)}/6}\Bigg)\Bigg],
\end{aligned}$$
where the last inequality is by Lemma C.1 with the corresponding upper-bound constants computed above. Now we can further bound the probability using a specific choice of $\lambda_t$: we want
$$(d_1+d_2)\exp\Bigg(\frac{-t\lambda_t^2/4}{4\sigma^2\log(2t/\delta) + \lambda_t\sigma\sqrt{2\log(2t/\delta)}/3}\Bigg) \le \frac{\delta}{2}.$$
Rearranging the above inequality, it suffices to have
$$t\lambda_t^2 \ge 16\sigma^2\log(2t/\delta)\log\Big(\frac{2(d_1+d_2)}{\delta}\Big) + \frac{4}{3}\lambda_t\sigma\sqrt{2\log(2t/\delta)}\log\Big(\frac{2(d_1+d_2)}{\delta}\Big).$$
Therefore, if we take $\lambda_t^2 \ge \frac{128}{9t^2}\sigma^2\log(2t/\delta)\log^2\big(\frac{2(d_1+d_2)}{\delta}\big) + \frac{32}{t}\sigma^2\log(2t/\delta)\log\big(\frac{2(d_1+d_2)}{\delta}\big)$, then the above inequality holds, because then
$$\frac{t\lambda_t^2}{2} \ge 16\sigma^2\log(2t/\delta)\log\Big(\frac{2(d_1+d_2)}{\delta}\Big) \quad\text{and}\quad \frac{t\lambda_t^2}{2} \ge \frac{4}{3}\lambda_t\sigma\sqrt{2\log(2t/\delta)}\log\Big(\frac{2(d_1+d_2)}{\delta}\Big).$$
Note that $d_1+d_2\ge 2$ and hence $\log(2(d_1+d_2)/\delta)\ge 1$. Thus, if we use
$$\lambda_t^2 = \frac{48}{t}\sigma^2\log(2t/\delta)\log^2\big(2(d_1+d_2)/\delta\big),$$
then we have
$$\mathbb{P}\Bigg(\lambda_t \ge 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}}\Bigg) = \mathbb{E}_{\{X_{i,a_i}\}_{i=1}^{t}}\Bigg[\mathbb{P}\Bigg(\lambda_t \ge 2\Big|\Big|\Big|\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i\Big|\Big|\Big|_{\mathrm{op}} \ \Bigg|\ \{X_{i,a_i}\}_{i=1}^{t}\Bigg)\Bigg] \ge 1-\Big(\frac{\delta}{2}+\frac{\delta}{2}\Big) \ge 1-\delta. \qquad\square$$

C.4 Proof of Lemma 4.4

First, we introduce the following two technical lemmas.

Lemma C.2. Suppose that $\Sigma_1,\Sigma_2\in\mathbb{R}^{d_1d_2\times d_1d_2}$, $\Delta\in\mathcal{C}$, and the matrix $\Sigma_1$ satisfies the restricted strong convexity condition $\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta)\ge\alpha|||\Delta|||_F^2$ with $\alpha>0$. Moreover, suppose the two matrices are close enough such that $|||\Sigma_2-\Sigma_1|||_{\mathrm{op}}\le\delta$, where $2\delta\le\alpha$. Then
$$\mathrm{vec}(\Delta)^T\Sigma_2\mathrm{vec}(\Delta) \ge \frac{\alpha}{2}|||\Delta|||_F^2.$$

Proof. The proof here is inspired by the proof of Lemma 6.17 in Buhlmann and Van De Geer [2011] and by Lemma B.3 in Appendix B. A similar argument can be made using the nuclear norm $|||\Delta|||_{\mathrm{nuc}}$ since $|||\Delta|||_F\le|||\Delta|||_{\mathrm{nuc}}$; that is, the RSC condition can also be assumed in terms of the nuclear norm. Interested readers may verify this by replicating all the lemmas in this section. Note that
$$\big|\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta)-\mathrm{vec}(\Delta)^T\Sigma_2\mathrm{vec}(\Delta)\big| = \big|\mathrm{vec}(\Delta)^T(\Sigma_1-\Sigma_2)\mathrm{vec}(\Delta)\big| \le \|(\Sigma_1-\Sigma_2)\mathrm{vec}(\Delta)\|_2\|\mathrm{vec}(\Delta)\|_2 \le |||\Sigma_1-\Sigma_2|||_{\mathrm{op}}\|\mathrm{vec}(\Delta)\|_2^2 = |||\Sigma_1-\Sigma_2|||_{\mathrm{op}}|||\Delta|||_F^2 \le \delta|||\Delta|||_F^2,$$
where the first inequality is by Cauchy–Schwarz and the second inequality is by the definition of the operator norm. Therefore, since we assume that $2\delta\le\alpha$, we know that
$$\big|\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta)-\mathrm{vec}(\Delta)^T\Sigma_2\mathrm{vec}(\Delta)\big| \le \delta|||\Delta|||_F^2 \le \frac{\delta}{\alpha}\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta) \le \frac{1}{2}\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta).$$
Therefore
$$\mathrm{vec}(\Delta)^T\Sigma_2\mathrm{vec}(\Delta) \ge \frac{1}{2}\mathrm{vec}(\Delta)^T\Sigma_1\mathrm{vec}(\Delta) \ge \frac{\alpha}{2}|||\Delta|||_F^2. \qquad\square$$

Lemma C.3. (Distance Between Two Matrices) Define $C = \big(\sqrt{\frac{\alpha_0}{4\sqrt{d_1d_2}}+1}-1\big)^2$, where $\alpha_0$ is defined in Assumption 3. Then for all $T_0\ge 2\log(d_1^2d_2^2+d_1d_2)/C$, we have
$$\mathbb{P}\Big(|||\Sigma-\hat{\Sigma}_{T_0}|||_{\mathrm{op}} \ge \frac{\alpha_0}{2}\Big) \le \exp\Big(-\frac{T_0C}{2}\Big).$$

Proof. Since we sample uniformly in the exploration stage, we know that the contexts $\{X_{\tau,a_\tau}\}_{\tau=1}^{T_0}$ are i.i.d., and thus for any constant $w$, by Lemma B.2 (applied to the vectorized contexts $\mathrm{vec}(X_{\tau,a_\tau})\in\mathbb{R}^{d_1d_2}$) we have
$$\mathbb{P}\Bigg(\frac{1}{2}|||\hat{\Sigma}_{T_0}-\Sigma|||_{\max} \ge w+\sqrt{2w}+\sqrt{\frac{2\log(d_1^2d_2^2+d_1d_2)}{T_0}}+\frac{\log(d_1^2d_2^2+d_1d_2)}{T_0}\Bigg) \le \exp(-wT_0).$$
Note that by the properties of matrix norms, $|||\Sigma-\hat{\Sigma}_{T_0}|||_{\mathrm{op}} \le \sqrt{d_1d_2}\,|||\Sigma-\hat{\Sigma}_{T_0}|||_{\max}$, therefore we get
$$\mathbb{P}\Big(\frac{1}{2}|||\Sigma-\hat{\Sigma}_{T_0}|||_{\max} \ge \frac{\alpha_0}{4\sqrt{d_1d_2}}\Big) \ge \mathbb{P}\Big(\frac{1}{2}|||\Sigma-\hat{\Sigma}_{T_0}|||_{\mathrm{op}} \ge \frac{\alpha_0}{4}\Big).$$
Now we choose $w$ and $T_0$ so that
$$w+\sqrt{2w} \le \frac{\alpha_0}{8\sqrt{d_1d_2}} \quad\text{and}\quad \sqrt{\frac{2\log(d_1^2d_2^2+d_1d_2)}{T_0}}+\frac{\log(d_1^2d_2^2+d_1d_2)}{T_0} \le \frac{\alpha_0}{8\sqrt{d_1d_2}},$$
which leads to the following choices:
$$w = \frac{1}{2}\Big(\sqrt{\frac{\alpha_0}{4\sqrt{d_1d_2}}+1}-1\Big)^2 \quad\text{and}\quad T_0 \ge 2\log(d_1^2d_2^2+d_1d_2)\Big/\Big(\sqrt{\frac{\alpha_0}{4\sqrt{d_1d_2}}+1}-1\Big)^2.$$
Then we have the following inequality:
$$\mathbb{P}\Big(\frac{1}{2}|||\Sigma-\hat{\Sigma}_{T_0}|||_{\mathrm{op}} \ge \frac{\alpha_0}{4}\Big) \le \mathbb{P}\Big(\frac{1}{2}|||\Sigma-\hat{\Sigma}_{T_0}|||_{\max} \ge \frac{\alpha_0}{4\sqrt{d_1d_2}}\Big) \le \exp(-T_0w) = \exp\Big(-\frac{T_0C}{2}\Big). \qquad\square$$

Finally, we provide the proof of Lemma 4.4.

Proof. The proof follows from Lemma C.2 and Lemma C.3. The restricted eigenvalue condition holds for $\Sigma$ by Assumption 3, i.e.,
$$\mathrm{vec}(\Theta-\Theta^*)^T\Sigma\,\mathrm{vec}(\Theta-\Theta^*) \ge \alpha_0|||\Theta-\Theta^*|||_F^2, \quad \forall\,\Theta-\Theta^*\in\mathcal{C}.$$
Also, the two matrices $\Sigma$ and $\hat{\Sigma}_{T_0}$ are close enough when $T_0$ is large, by Lemma C.3:
$$\mathbb{P}\Big(|||\Sigma-\hat{\Sigma}_{T_0}|||_{\mathrm{op}} \ge \frac{\alpha_0}{2}\Big) \le \exp\Big(-\frac{T_0C}{2}\Big),$$
where $C$ is the constant defined in Lemma C.3. By Lemma C.2 we can claim that
$$\mathrm{vec}(\Theta-\Theta^*)^T\hat{\Sigma}_{T_0}\mathrm{vec}(\Theta-\Theta^*) \ge \frac{\alpha_0}{2}|||\Theta-\Theta^*|||_F^2, \quad \forall\,\Theta-\Theta^*\in\mathcal{C}$$
with probability at least $1-\exp(-\frac{T_0C}{2})$ when $T_0\ge 2\log(d_1^2d_2^2+d_1d_2)/C$. Therefore
$$B_{T_0}(\Theta,\Theta^*) = \frac{1}{2}\mathrm{vec}(\Theta-\Theta^*)^T\hat{\Sigma}_{T_0}\mathrm{vec}(\Theta-\Theta^*) \ge \frac{\alpha_0}{4}|||\Theta-\Theta^*|||_F^2, \quad \forall\,\Theta-\Theta^*\in\mathcal{C}.$$
Take $Z_t(\Theta^*) = (\alpha_0|||\Theta^*|||_F)/(4\sqrt{t})$ in Lemma 3.2; then $\delta_t^2 = |||\Theta^*|||_F(\sqrt{t}-\sqrt{t-1}) \ge |||\Theta^*|||_F/(2\sqrt{t})$. Note that $\delta_{t+1}<\delta_t$. Therefore, for any $t\ge T_0$, by induction we have
$$B_t(\Theta,\Theta^*) \ge \frac{\alpha_0}{4}|||\Theta-\Theta^*|||_F^2 - \frac{\alpha_0|||\Theta^*|||_F}{4\sqrt{t}}, \quad \forall\,\Theta-\Theta^*\in\mathcal{C}\cap\mathbb{B}(\delta_t). \qquad\square$$


C.5 Proof of Corollary 4.2

Now, given the probabilities of $\mathcal{A}_t$ and $\mathcal{E}_t$ in Lemma 4.3 and Lemma 4.4, we can easily derive a regret upper bound for our low-rank matrix bandit algorithm. We specify all the constants here: the Lipschitz constants are $C_1=C_2=1$, the bound of the norm is $k_1=1$, and we set $T_0=\Theta(\sqrt{T})\ge 2\log(d_1^2d_2^2+d_1d_2)/C$, where $C$ is the constant defined in Lemma C.3. The restricted strong convexity holds with $\alpha=\alpha_0/4$ and $Z_t(\Theta^*) = (\alpha_0|||\Theta^*|||_F)/(4\sqrt{t})$. The compatibility constant is $\phi=\sqrt{2r}$. Also, $\Theta^*_{M^\perp}=0$, so $R(\Theta^*_{M^\perp})=0$.

Proof. Note that in Theorem 3.1, we require
$$\delta_t^2 \ge \frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{4\lambda_t}{\alpha_0}\big[2Z_t(\Theta^*)+4R(\Theta^*_{M^\perp})\big].$$
If we take $\delta=1/t^2$ in Lemma 4.3, we get $\lambda_t = \frac{4\sigma}{\sqrt{t}}\sqrt{3\log(2t^3)}\,\log(2(d_1+d_2)t^2)$. Therefore, the condition is equivalent to
$$\frac{|||\Theta^*|||_F}{2\sqrt{t}} \ge \frac{6912\sigma^2\phi^2}{\alpha_0^2 t}\log(2t^3)\log^2(2(d_1+d_2)t^2) + \frac{8|||\Theta^*|||_F\sigma}{t}\sqrt{3\log(2t^3)}\,\log(2(d_1+d_2)t^2) = O\Big(\frac{1}{t}\Big).$$
The above condition can be guaranteed when $t\ge T_0=\Theta(\sqrt{T})$ and when $T$ is large enough. In other words, we have the oracle inequality after the exploration stage. Using Theorem 3.1, we get the following upper bound:
$$\mathbb{E}[\mathrm{Regret}(T)] \le 2T_0 + 2C_1k_1\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2C_2\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2+\frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*)}.$$
Since we take $\delta=1/t^2$ in Lemma 4.3, $\mathbb{P}(\mathcal{A}_t^c)\le 1/t^2$. Furthermore,
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{A}_t^c) \le \sum_{t=T_0}^{T}\frac{1}{t^2} \le \sum_{t=1}^{\infty}\frac{1}{t^2} \le 1+\Big[-\frac{1}{t}\Big]\Big|_{1}^{\infty} = 2.$$
Also, by our choice of $T_0=\Theta(\sqrt{T})$, we know that
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{E}_t^c) \le \sum_{t=T_0}^{T}\exp\Big(-\frac{T_0C}{2}\Big) \le T\exp\Big(-\frac{T_0C}{2}\Big) = O(1),$$
where the last equality again uses the fact that $x^2e^{-ax}$ is bounded above by a constant when $a>0$. The last summation term in the regret can be bounded as
$$\begin{aligned}
\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2+\frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*)}
&\le \sum_{t=T_0}^{T}\sqrt{\Big(\frac{13824\sigma^2 r}{\alpha_0^2}+8\sigma\Big)\sqrt{3\log(2t^3)}\,\log(2(d_1+d_2)t^2)}\cdot\frac{1}{\sqrt{t}}\\
&\le \sum_{t=1}^{T}\sqrt{\Big(\frac{13824\sigma^2 r}{\alpha_0^2}+8\sigma\Big)\sqrt{3\log(2T^3)}\,\log(2(d_1+d_2)T^2)}\cdot\frac{1}{\sqrt{t}}\\
&\le 2\sqrt{\Big(\frac{13824\sigma^2 r}{\alpha_0^2}+8\sigma\Big)\sqrt{3\log(2T^3)}\,\log(2(d_1+d_2)T^2)\,T},
\end{aligned}$$
where the last inequality is from the fact that $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\le 2\sqrt{T}-1$. Therefore we arrive at the following final regret bound:
$$\mathbb{E}[\mathrm{Regret}(T)] = O\big(\sqrt{rT\log(2T^3)\log(2(d_1+d_2)T^2)}\big). \qquad\square$$

D Proof for the Group-sparse Bandit

D.1 Notations and Algorithm

Algorithm 4 The Group-Sparse Matrix Bandit Algorithm

1: Input: $K, d_1, d_2 \in \mathbb{N}$, $\Theta_0$, $T_0$
2: Initialize $\Theta_t = \Theta_0$
3: for $t = 1, 2, \cdots, T$ do
4:   Observe $K$ contexts $X_{t,1}, X_{t,2}, \cdots, X_{t,K}$
5:   if $t \le T_0$ then
6:     Choose action $a_t$ uniformly at random # Exploration Stage
7:   else
8:     Choose action $a_t = \mathrm{argmax}_a\, \langle\langle X_{t,a}, \Theta_{t-1}\rangle\rangle$ # Exploitation Stage
9:   end if
10:  Receive reward $y_t = \langle\langle X_{t,a_t}, \Theta^*\rangle\rangle + \varepsilon_t$
11:  Update $\lambda_t = \frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\big(d_2,\tfrac{1}{2}-\tfrac{1}{q}\big)\sqrt{\frac{2\log(2d_1/\delta)}{t}}$
12:  $\Theta_t = \mathrm{argmin}_{\Theta\in\mathbb{R}^{d_1\times d_2}}\big\{\frac{1}{2t}\sum_{i=1}^{t}(y_i - \langle\langle X_{i,a_i},\Theta\rangle\rangle)^2 + \lambda_t|||\Theta|||_{1,q}\big\}$
13: end for
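The regularizer in line 12 of Algorithm 4 is the $\ell_{1,q}$ norm (the sum over rows of the $\ell_q$ norms of the rows), and the event $\mathcal{A}_t$ below is phrased in terms of its dual, the $\ell_{\infty,q/(q-1)}$ norm. The helper below is a small numpy sketch (not from the paper; dimensions are illustrative) of the two norms for a $d_1\times d_2$ matrix, together with the Hölder-type bound that links them.

```python
import numpy as np

def l1q_norm(Theta, q):
    """|||Theta|||_{1,q}: sum over rows of the l_q norms of the rows."""
    return np.sum(np.linalg.norm(Theta, ord=q, axis=1))

def dual_linfq_norm(G, q):
    """|||G|||_{inf, q/(q-1)}: max over rows of the l_{q/(q-1)} norms."""
    q_star = q / (q - 1.0)
    return np.max(np.linalg.norm(G, ord=q_star, axis=1))

rng = np.random.default_rng(4)
Theta, G = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
q = 2.0
# Hölder-type bound: <<G, Theta>> <= |||G|||_{inf,q*} * |||Theta|||_{1,q}
print(np.sum(G * Theta) <= dual_linfq_norm(G, q) * l1q_norm(Theta, q))  # True
```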

We first clarify the notation used throughout this section. As discussed in Section 4, $\Theta^* = [\theta^{(1)*},\theta^{(2)*},\cdots,\theta^{(d_2)*}]$ denotes the matrix whose columns are vectors with similar supports, so that $|S(\Theta^*)| = s \ll d_1$. Similarly, we use the shorthand vector notations $\mathcal{X}_t(\Theta^*), Y_t, e_t\in\mathbb{R}^t$ such that $[\mathcal{X}_t(\Theta^*)]_i = \langle\langle X_{i,a_i},\Theta^*\rangle\rangle$, $[Y_t]_i=y_i$, and $[e_t]_i=\varepsilon_i$. By our results in the low-rank matrix bandit problem, the derivative with respect to $\Theta^*$ is
$$\nabla L_t(\Theta^*) = \frac{\partial}{\partial\Theta^*}\frac{1}{2t}(Y_t-\mathcal{X}_t(\Theta^*))^T(Y_t-\mathcal{X}_t(\Theta^*)) = \frac{1}{2t}\frac{\partial}{\partial\Theta^*}\big(-2\mathcal{X}_t(\Theta^*)^TY_t+\mathcal{X}_t(\Theta^*)^T\mathcal{X}_t(\Theta^*)\big) = -\frac{1}{t}\sum_{i=1}^{t}X_{i,a_i}\varepsilon_i.$$
Therefore the event $\mathcal{A}_t$ is equivalent to
$$\Big\{\lambda_t \ge 2|||\nabla L_t(\Theta^*)|||_{\infty,q/(q-1)} = 2\max_{i\in[d_1]}\Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Big\}.$$

D.2 Proof of Lemma 4.5

Proof. The proof here basically follows Lemma 5 in Negahban et al. [2012], except for the function $\eta:\mathbb{N}\times\mathbb{R}\to\mathbb{R}$, which we introduce below. From the union bound, we know that for any constant $\lambda_t\in\mathbb{R}$, we have
$$\mathbb{P}\big(\lambda_t \ge 2|||\nabla L_t(\Theta^*)|||_{\infty,q/(q-1)}\big) = \mathbb{P}\Bigg(\forall i\in[d_1],\ \lambda_t \ge \Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Bigg) \ge 1-\sum_{i=1}^{d_1}\mathbb{P}\Bigg(\lambda_t \le \Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Bigg).$$
Next, we establish a tail bound for the random variable $\big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\big)^{(q-1)/q}$. Let $X_{a_t}^{(i)}$ be the $i$-th row of $X_{a_t}$ and define $\mathbb{X} = [X_{a_1}^{(i)T}, X_{a_2}^{(i)T}, \cdots, X_{a_t}^{(i)T}]$. For any two vectors $w, w'$, we have
$$\Big|\big\|\tfrac{1}{t}\mathbb{X}w\big\|_{q/(q-1)} - \big\|\tfrac{1}{t}\mathbb{X}w'\big\|_{q/(q-1)}\Big| \le \frac{1}{t}\|\mathbb{X}(w-w')\|_{q/(q-1)} = \frac{1}{t}\sup_{\|\theta\|_q=1}\langle\mathbb{X}^T\theta, w-w'\rangle \le \frac{1}{t}\sup_{\|\theta\|_q=1}\|\mathbb{X}^T\theta\|_2\,\|w-w'\|_2.$$
Now if $q\in(1,2]$, we have
$$\sup_{\|\theta\|_q=1}\|\mathbb{X}^T\theta\|_2 \le \sup_{\|\theta\|_2=1}\|\mathbb{X}^T\theta\|_2 \le \|\mathbb{X}^T\|_F \le \sqrt{t}.$$
If $q>2$, then we have a different inequality:
$$\sup_{\|\theta\|_q=1}\|\mathbb{X}^T\theta\|_2 \le d_2^{1/2-1/q}\sup_{\|\theta\|_2=1}\|\mathbb{X}^T\theta\|_2 \le d_2^{1/2-1/q}\sqrt{t}.$$
Therefore, if we define $\eta(d_2,m) = \max\{1, d_2^m\}$, we have $\big|\|\tfrac{1}{t}\mathbb{X}w\|_{q/(q-1)} - \|\tfrac{1}{t}\mathbb{X}w'\|_{q/(q-1)}\big| \le \frac{\eta(d_2,1/2-1/q)}{\sqrt{t}}\|w-w'\|_2$. Thus the function $w\mapsto\|\tfrac{1}{t}\mathbb{X}w\|_{q/(q-1)}$ is Lipschitz with constant $\eta(d_2,1/2-1/q)/\sqrt{t}$. Based on the concentration-of-measure inequality for Lipschitz functions [Bobkov et al., 2015, Negahban et al., 2012], we know that
$$\mathbb{P}\Big(\frac{1}{t}\|\mathbb{X}e_t\|_{q/(q-1)} \ge \mathbb{E}\Big[\frac{1}{t}\|\mathbb{X}e_t\|_{q/(q-1)}\Big] + \sigma\delta\Big) \le 2\exp\Big(-\frac{t\delta^2}{2\eta(d_2,1/2-1/q)^2}\Big).$$
By Lemma 5 in Negahban et al. [2012], the mean is bounded by $2d_2^{1-\frac{1}{q}}\sigma/\sqrt{t}$, and therefore
$$\mathbb{P}\Bigg(\frac{1}{t}\|\mathbb{X}e_t\|_{q/(q-1)} \ge \frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\delta\Bigg) \le 2\exp\Big(-\frac{t\delta^2}{2}\Big).$$
By a change of variables and the union bound, we obtain the claim in the lemma. $\square$


D.3 Proof of Corollary 4.3

Now, given the probabilities of $\mathcal{A}_t$ and $\mathcal{E}_t$ in Lemma 4.5 and Lemma 4.4, we can derive a regret upper bound for our group-sparse matrix bandit algorithm. Similarly, we specify all the constants here: the Lipschitz constants are $C_1=C_2=1$, the bound of the norm is $k_1=1$, and we set $T_0=\Theta(\sqrt{T})\ge 2\log(d_1^2d_2^2+d_1d_2)/C$, where $C$ is the constant defined in Lemma C.3. The restricted strong convexity holds with $\alpha=\alpha_0/4$ and $Z_t(\Theta^*)=(\alpha_0|||\Theta^*|||_F)/(4\sqrt{t})$. The compatibility constant is $\phi=\eta(d_2,1/q-1/2)\sqrt{s}$. Also, $\Theta^*_{M^\perp}=0$, so $R(\Theta^*_{M^\perp})=0$.

Proof. Note that in Theorem 3.1, we require
$$\delta_t^2 \ge \frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{4\lambda_t}{\alpha_0}\big[2Z_t(\Theta^*)+4R(\Theta^*_{M^\perp})\big].$$
If we take $\delta=1/t^2$ in Lemma 4.5, we get $\lambda_t = \frac{2\sigma d_2^{1-\frac{1}{q}}}{\sqrt{t}} + \sigma\eta\big(d_2,\tfrac{1}{2}-\tfrac{1}{q}\big)\sqrt{\frac{2\log(2d_1t^2)}{t}}$. Therefore, the condition is equivalent to
$$\frac{|||\Theta^*|||_F}{2\sqrt{t}} \ge \frac{144s\,\eta(d_2,\tfrac{1}{q}-\tfrac{1}{2})^2}{\alpha_0^2}\Bigg(\frac{2\sigma d_2^{1-\frac{1}{q}}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{\frac{2\log(2d_1t^2)}{t}}\Bigg)^2 + \frac{2\sigma d_2^{1-\frac{1}{q}}|||\Theta^*|||_F}{t} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)|||\Theta^*|||_F\frac{\sqrt{2\log(2d_1t^2)}}{t} = O\Big(\frac{1}{t}\Big).$$
The above condition can be guaranteed when $t\ge T_0=\Theta(\sqrt{T})$ and when $T$ is large enough. In other words, we have the oracle inequality after the exploration stage. Using Theorem 3.1, we get the following upper bound:
$$\mathbb{E}[\mathrm{Regret}(T)] \le 2T_0 + 2C_1k_1\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2C_2\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2+\frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*)}.$$
Since we take $\delta=1/t^2$ in Lemma 4.5, $\mathbb{P}(\mathcal{A}_t^c)\le 1/t^2$. Furthermore,
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{A}_t^c) \le \sum_{t=T_0}^{T}\frac{1}{t^2} \le \sum_{t=1}^{\infty}\frac{1}{t^2} \le 1+\Big[-\frac{1}{t}\Big]\Big|_{1}^{\infty} = 2.$$
Also, by our choice of $T_0=\Theta(\sqrt{T})$, we know that
$$\sum_{t=T_0}^{T}\mathbb{P}(\mathcal{E}_t^c) \le \sum_{t=T_0}^{T}\exp\Big(-\frac{T_0C}{2}\Big) \le T\exp\Big(-\frac{T_0C}{2}\Big) = O(1),$$
where the last equality again uses the fact that $x^2e^{-ax}$ is bounded above by a constant when $a>0$. The last summation term in the regret can be bounded as
$$\begin{aligned}
\sum_{t=T_0}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2+\frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*)}
&\le \sum_{t=T_0}^{T}\Bigg(17\sqrt{2s}\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)d_2^{1-\frac{1}{q}}\frac{\sqrt{\sigma+\sigma^2}}{\alpha_0}\Bigg)\frac{1}{\sqrt{t}} + \sum_{t=T_0}^{T}17\sqrt{s}\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\frac{\sqrt{\sigma+\sigma^2}}{\alpha_0}\frac{\sqrt{2\log(2d_1T^2)}}{\sqrt{t}}\\
&\le \frac{17\sqrt{2}\,\eta(d_2,\tfrac{1}{q}-\tfrac{1}{2})\,d_2^{1-\frac{1}{q}}\sqrt{\sigma+\sigma^2}}{\alpha_0}\sqrt{sT} + \frac{17\,\eta(d_2,\tfrac{1}{q}-\tfrac{1}{2})\,\eta(d_2,\tfrac{1}{2}-\tfrac{1}{q})\sqrt{\sigma+\sigma^2}}{\alpha_0}\sqrt{2s\log(2d_1T^2)T},
\end{aligned}$$
where the last inequality is from the fact that $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\le 2\sqrt{T}-1$. Therefore we arrive at the following bound:
$$\mathbb{E}[\mathrm{Regret}(T)] = O\Big(C_1(d_2)\sqrt{sT} + C_2(d_2)\sqrt{2s\log(2d_1T^2)T}\Big),$$
where $C_1(d_2) = \eta(d_2,\tfrac{1}{q}-\tfrac{1}{2})\,d_2^{1-\frac{1}{q}}$ and $C_2(d_2) = \eta(d_2,\tfrac{1}{q}-\tfrac{1}{2})\,\eta(d_2,\tfrac{1}{2}-\tfrac{1}{q})$. $\square$

E Proof for the Multi-agent LASSO

E.1 Notations and Algorithm

Algorithm 5 The Multi-Agent LASSO Bandit Algorithm

1: Input: $\{\lambda_t\}_{t=1}^{T}$, $K, d_1, d_2\in\mathbb{N}$, $L_t(\theta)$, $R(\theta)$, $f(x,\theta)$, $\theta_0^{(k)}$, $T_0$
2: Initialize $\theta_t^{(k)} = \theta_0^{(k)}$
3: for $t = 1, 2, \cdots, T$ do
4:   Observe $K$ contexts $x_{t,1}, x_{t,2}, \cdots, x_{t,K}$
5:   for $k = 1, 2, \cdots, d_2$ do
6:     if $t \le T_0$ then
7:       Agent $k$ chooses action $a_t^{(k)}$ uniformly at random # Exploration Stage
8:     else
9:       Agent $k$ chooses action $a_t^{(k)} = \mathrm{argmax}_a\, f(x_{t,a}^{(k)}, \theta_{t-1}^{(k)})$ # Exploitation Stage
10:    end if
11:    Receive reward $y_t^{(k)} = f(x_{t,a_t}^{(k)}, \theta^{(k)*}) + \varepsilon_t^{(k)}$
12:  end for
13:  Update $\lambda_t = \frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\big(d_2,\tfrac{1}{2}-\tfrac{1}{q}\big)\sqrt{\frac{2\log(2d_1/\delta)}{t}}$
14:  $\Theta_t = \mathrm{argmin}_{\Theta}\big\{\frac{1}{2t}\sum_{i=1}^{t}\sum_{k=1}^{d_2}(y_i^{(k)} - x_{i,a_i}^{(k)T}\theta^{(k)})^2 + \lambda_t|||\Theta|||_{1,q}\big\}$
15: end for
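To make the joint estimation step in line 14 concrete for the common case $q=2$, the sketch below is a minimal numpy implementation of the group-sparse update via proximal gradient descent (the step size, iteration count, and data layout are illustrative assumptions, not taken from the paper). The proximal operator of $\lambda|||\cdot|||_{1,2}$ shrinks the $\ell_2$ norm of each row.

```python
import numpy as np

def group_soft_threshold(Theta, tau):
    """Prox of tau * |||Theta|||_{1,2}: shrink each row's l2 norm by tau."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Theta * scale

def multi_agent_group_lasso(X, Y, lam, n_iters=200):
    """Minimize (1/2t) sum_i sum_k (y_i^(k) - x_i^(k)^T theta^(k))^2
    + lam * |||Theta|||_{1,2} via proximal gradient (ISTA).

    X: array (t, d2, d1) of chosen contexts per round and agent.
    Y: array (t, d2) of rewards. Returns Theta of shape (d1, d2).
    """
    t, d2, d1 = X.shape
    Theta = np.zeros((d1, d2))
    step = t / max(np.sum(X ** 2), 1e-12)   # crude step-size choice (assumption)
    for _ in range(n_iters):
        pred = np.einsum('tkd,dk->tk', X, Theta)        # x_i^(k)^T theta^(k)
        grad = np.einsum('tkd,tk->dk', X, pred - Y) / t
        Theta = group_soft_threshold(Theta - step * grad, step * lam)
    return Theta
```

Each agent $k$ would then exploit via $a_t^{(k)} = \mathrm{argmax}_a\, x_{t,a}^{(k)T}\theta_t^{(k)}$, where $\theta_t^{(k)}$ is column $k$ of the returned matrix.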

Define $\Theta^* = [\theta^{(1)*},\theta^{(2)*},\cdots,\theta^{(d_2)*}]\in\mathbb{R}^{d_1\times d_2}$ and $S(\Theta^*) = \{i\in[d_1] \mid \Theta^*_{i\cdot}\ne 0\}$ as the set of nonzero rows; by our setting, $|S(\Theta^*)| = s \ll d_1$. Let the loss be the sum-of-squared-errors function and the regularization function be the $\ell_{1,q}$ norm. More formally,
$$L_t(\Theta) = \frac{1}{2t}\sum_{i=1}^{t}\sum_{k=1}^{d_2}\big(y_i^{(k)} - x_{i,a_i}^{(k)T}\theta^{(k)}\big)^2, \qquad R(\Theta) = |||\Theta|||_{1,q}.$$
Define $(M, M^\perp)$ as in Example 3; then the $\ell_{1,q}$ norm is decomposable and $\Theta^*_{M^\perp} = 0$. For each agent $j$, we use the notations $X_t^{(j)}\in\mathbb{R}^{t\times d_1}$ and $Y_t^{(j)}, e_t^{(j)}\in\mathbb{R}^t$ to represent the context matrix, the reward vector, and the error vector, i.e., $[X_t^{(j)}]_i = x_{i,a_i}^{(j)}$, $[Y_t^{(j)}]_i = y_i^{(j)}$, and $[e_t^{(j)}]_i = \varepsilon_i^{(j)}$ for all $i\in[t]$, $j\in[d_2]$. The derivative with respect to $\Theta^*$ can be computed as
$$\nabla_{\Theta^*}L_t(\Theta^*) = \frac{1}{2t}\sum_{i=1}^{t}\sum_{k=1}^{d_2}\nabla_{\Theta^*}\big(y_i^{(k)} - x_{i,a_i}^{(k)T}\theta^{(k)*}\big)^2.$$
Note that if we take partial derivatives, we get
$$\frac{\partial L_t(\Theta^*)}{\partial\theta^{(j)*}} = \frac{1}{2t}\sum_{i=1}^{t}\sum_{k=1}^{d_2}\frac{\partial}{\partial\theta^{(j)*}}\big(y_i^{(k)} - x_{i,a_i}^{(k)T}\theta^{(k)*}\big)^2 = \frac{1}{2t}\sum_{i=1}^{t}\frac{\partial}{\partial\theta^{(j)*}}\big(y_i^{(j)} - x_{i,a_i}^{(j)T}\theta^{(j)*}\big)^2 = -\frac{1}{t}X_t^{(j)T}e_t^{(j)}.$$
Therefore $\nabla L_t(\Theta^*) = -\frac{1}{t}\big[X_t^{(1)T}e_t^{(1)}, X_t^{(2)T}e_t^{(2)}, \cdots, X_t^{(d_2)T}e_t^{(d_2)}\big]$. Now we can compute the Bregman divergence as follows:
$$\begin{aligned}
B_t(\Theta,\Theta^*) &= L_t(\Theta) - L_t(\Theta^*) - \langle\nabla L_t(\Theta^*), \Theta-\Theta^*\rangle\\
&= \frac{1}{2t}\sum_{k=1}^{d_2}\|Y_t^{(k)} - X_t^{(k)}\theta^{(k)}\|_2^2 - \frac{1}{2t}\sum_{k=1}^{d_2}\|Y_t^{(k)} - X_t^{(k)}\theta^{(k)*}\|_2^2 + \frac{1}{t}\sum_{k=1}^{d_2}e_t^{(k)T}X_t^{(k)}(\theta^{(k)} - \theta^{(k)*})\\
&= \frac{1}{2t}\sum_{k=1}^{d_2}(\theta^{(k)} - \theta^{(k)*})^TX_t^{(k)T}X_t^{(k)}(\theta^{(k)} - \theta^{(k)*}).
\end{aligned}$$
Therefore the event $\mathcal{A}_t$ is equivalent to
$$\Big\{\lambda_t \ge 2|||\nabla L_t(\Theta^*)|||_{\infty,q/(q-1)} = 2\max_{i\in[d_1]}\Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Big\}.$$
The event $\mathcal{E}_t$ (the RSC condition) is equivalent to
$$\Big\{\frac{1}{2t}\sum_{k=1}^{d_2}(\theta^{(k)} - \theta^{(k)*})^TX_t^{(k)T}X_t^{(k)}(\theta^{(k)} - \theta^{(k)*}) \ge \alpha|||\Theta-\Theta^*|||_F^2 - Z_t(\Theta^*)\Big\}.$$
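The column-stacked form of the gradient derived above is easy to sanity-check numerically. The snippet below (illustrative dimensions, numpy only; not part of the paper) compares the closed form $-\frac{1}{t}[X_t^{(1)T}e_t^{(1)},\cdots,X_t^{(d_2)T}e_t^{(d_2)}]$ against a finite-difference approximation of the loss.

```python
import numpy as np

rng = np.random.default_rng(5)
t, d1, d2 = 50, 6, 3
X = rng.normal(size=(d2, t, d1))          # X^(j) in R^{t x d1} for each agent j
Theta_star = rng.normal(size=(d1, d2))
E = rng.normal(size=(d2, t))              # errors e^(j)
Y = np.stack([X[j] @ Theta_star[:, j] + E[j] for j in range(d2)])

def loss(Theta):
    return sum(np.sum((Y[j] - X[j] @ Theta[:, j]) ** 2) for j in range(d2)) / (2 * t)

closed_form = -np.stack([X[j].T @ E[j] for j in range(d2)], axis=1) / t

# central finite differences of the loss at Theta_star
num = np.zeros_like(Theta_star)
eps = 1e-6
for a in range(d1):
    for b in range(d2):
        P = np.zeros_like(Theta_star); P[a, b] = eps
        num[a, b] = (loss(Theta_star + P) - loss(Theta_star - P)) / (2 * eps)

print(np.max(np.abs(num - closed_form)))   # tiny: the two gradients agree
```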

E.2 Useful Lemmas

Lemma E.1. (Good Choice of Lambda) Suppose that $\varepsilon_t$ is $\sigma$-sub-Gaussian. For any $\delta\in(0,1)$, use
$$\lambda_t = \frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{\frac{2\log\frac{2d_1}{\delta}}{t}}$$
at each round $t$ in Algorithm 5; then with probability at least $1-\delta$, we have $\lambda_t \ge R^*(\nabla L_t(\Theta^*))$.

Proof. From the union bound, we know that for any constant $\lambda_t\in\mathbb{R}$, we have
$$\mathbb{P}\big(\lambda_t \ge 2|||\nabla L_t(\Theta^*)|||_{\infty,q/(q-1)}\big) = \mathbb{P}\Bigg(\forall i\in[d_1],\ \lambda_t \ge \Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Bigg) \ge 1-\sum_{i=1}^{d_1}\mathbb{P}\Bigg(\lambda_t \le \Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Bigg).$$
Note that the $j$-th row of $-t\nabla L_t(\Theta^*)$ consists of
$$\Big[\sum_i(x_{i,a_i}^{(1)})_j\varepsilon_i^{(1)},\ \sum_i(x_{i,a_i}^{(2)})_j\varepsilon_i^{(2)},\ \cdots,\ \sum_i(x_{i,a_i}^{(d_2)})_j\varepsilon_i^{(d_2)}\Big],$$
whose entries are all $\sqrt{t}\sigma$-sub-Gaussian random variables. For any two vectors $w, w'$, we have
$$\Big|\big\|\tfrac{1}{t}w\big\|_{q/(q-1)} - \big\|\tfrac{1}{t}w'\big\|_{q/(q-1)}\Big| \le \frac{1}{t}\|w-w'\|_{q/(q-1)} = \frac{1}{t}\sup_{\|\theta\|_q=1}\langle\theta, w-w'\rangle \le \frac{1}{t}\sup_{\|\theta\|_q=1}\|\theta\|_2\,\|w-w'\|_2.$$
Now if $q\in(1,2]$, we have $\|\theta\|_2\le\|\theta\|_q = 1$. If $q>2$, then we have a different inequality: $\|\theta\|_2\le d_2^{1/2-1/q}\|\theta\|_q\le d_2^{1/2-1/q}$. Therefore, with $\eta(d_2,m)=\max\{1,d_2^m\}$, we have $\big|\|\tfrac{1}{t}w\|_{q/(q-1)}-\|\tfrac{1}{t}w'\|_{q/(q-1)}\big|\le\frac{\eta(d_2,1/2-1/q)}{t}\|w-w'\|_2$. Thus the function is Lipschitz with constant $\eta(d_2,1/2-1/q)/t$. Based on the concentration-of-measure inequality for Lipschitz functions [Bobkov et al., 2015, Negahban et al., 2012], we know that
$$\mathbb{P}\Bigg(\Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q} \ge \mathbb{E}\Big[\Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q}\Big] + \delta\Bigg) \le 2\exp\Big(-\frac{t^2\delta^2}{2\eta(d_2,1/2-1/q)^2 t\sigma^2}\Big).$$
By similar arguments as in the proof of Lemma 4.5, we know that
$$\mathbb{P}\Bigg(\Big(\sum_{j=1}^{d_2}|\nabla L_t(\Theta^*)_{ij}|^{q/(q-1)}\Big)^{(q-1)/q} \ge \frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\delta\Bigg) \le 2\exp\Big(-\frac{t\delta^2}{2}\Big).$$
By a change of variables and the union bound, we obtain the results in the lemma. $\square$

Lemma E.2. (RSC condition) Suppose Assumption 3 is satisfied. Then with probability at least $1-d_2\exp\big(-\frac{T_0C}{2}\big)$, we have
$$B_t(\Theta,\Theta^*) \ge \frac{\alpha_0}{4}|||\Theta-\Theta^*|||_F^2 - \frac{\alpha_0|||\Theta^*|||_F}{4\sqrt{t}}$$
for all $T_0\ge 2\log(d_1^2+d_1)/C$ and $\Theta-\Theta^*\in\mathcal{C}\cap\mathbb{B}(\delta_t)$, where $C$ is the constant defined in Lemma B.4 and $\delta_t^2 = |||\Theta^*|||_F/(2\sqrt{t})$.

Proof. Note that the RSC condition (event $\mathcal{E}_t$) is equivalent to
$$\frac{1}{2t}\sum_{k=1}^{d_2}(\theta^{(k)}-\theta^{(k)*})^TX_t^{(k)T}X_t^{(k)}(\theta^{(k)}-\theta^{(k)*}) \ge \alpha|||\Theta-\Theta^*|||_F^2 - Z_t(\Theta^*) = \sum_{k=1}^{d_2}\alpha\|\theta^{(k)}-\theta^{(k)*}\|_2^2 - Z_t(\Theta^*).$$
By the results of Lemma 4.2 for the LASSO bandit problem, we know that for each agent $k$, with probability at least $1-\exp(-\frac{T_0C}{2})$,
$$\frac{1}{2}(\theta^{(k)}-\theta^{(k)*})^T\hat{\Sigma}_{T_0}^{(k)}(\theta^{(k)}-\theta^{(k)*}) \ge \frac{\alpha_0}{4}\|\theta^{(k)}-\theta^{(k)*}\|_2^2$$
when $T_0\ge 2\log(d_1^2+d_1)/C$, where $C$ is defined in Lemma B.4. Therefore, by the Fréchet inequality, we know that
$$\begin{aligned}
\mathbb{P}\Bigg(\sum_{k=1}^{d_2}\frac{1}{2T_0}(\theta^{(k)}-\theta^{(k)*})^TX_{T_0}^{(k)T}X_{T_0}^{(k)}(\theta^{(k)}-\theta^{(k)*}) \ge \sum_{k=1}^{d_2}\frac{\alpha_0}{4}\|\theta^{(k)}-\theta^{(k)*}\|_2^2\Bigg)
&\ge \mathbb{P}\Bigg(\frac{1}{2T_0}(\theta^{(k)}-\theta^{(k)*})^TX_{T_0}^{(k)T}X_{T_0}^{(k)}(\theta^{(k)}-\theta^{(k)*}) \ge \frac{\alpha_0}{4}\|\theta^{(k)}-\theta^{(k)*}\|_2^2,\ \forall k\in[d_2]\Bigg)\\
&\ge \sum_{k=1}^{d_2}\mathbb{P}\Bigg(\frac{1}{2T_0}(\theta^{(k)}-\theta^{(k)*})^TX_{T_0}^{(k)T}X_{T_0}^{(k)}(\theta^{(k)}-\theta^{(k)*}) \ge \frac{\alpha_0}{4}\|\theta^{(k)}-\theta^{(k)*}\|_2^2\Bigg) - (d_2-1)\\
&\ge 1 - d_2\exp\Big(-\frac{T_0C}{2}\Big).
\end{aligned}$$
Take $Z_t(\Theta^*) = (\alpha_0|||\Theta^*|||_F)/(4\sqrt{t})$ in Lemma 3.2; then $\delta_t^2 = |||\Theta^*|||_F(\sqrt{t}-\sqrt{t-1}) \ge |||\Theta^*|||_F/(2\sqrt{t})$. Therefore, for any $t\ge T_0$, by induction we have
$$B_t(\Theta,\Theta^*) \ge \frac{\alpha_0}{4}|||\Theta-\Theta^*|||_F^2 - \frac{\alpha_0|||\Theta^*|||_F}{4\sqrt{t}}, \quad \forall\,\Theta-\Theta^*\in\mathcal{C}\cap\mathbb{B}(\delta_t). \qquad\square$$

E.3 Regret Analysis

Theorem E.1. Suppose that Assumption 1 and Assumption 2 hold for each agent (with respect to the $\|\cdot\|_2$ norm). Also suppose that Assumption 3 is satisfied. Then the expected cumulative regret of Algorithm 5 is upper bounded by
$$\mathbb{E}[\mathrm{Regret}(T)] = O\Big(d_2\sqrt{sT} + \sqrt{d_2sT\log(2d_1T^2)}\Big).$$

Proof. By the boundedness of $\|x\|_2$ and $\|\theta\|_2$, we know that
$$f(x_{t,a_t^*}^{(k)}, \theta^{(k)*}) - f(x_{t,a_t}^{(k)}, \theta^{(k)*}) = x_{t,a_t^*}^{(k)T}\theta^{(k)*} - x_{t,a_t}^{(k)T}\theta^{(k)*} = \big(x_{t,a_t^*}^{(k)} - x_{t,a_t}^{(k)}\big)^T\theta^{(k)*} \le 2.$$

Then we can decompose the one-step regret at round $t$, summed across the different agents, into three parts as follows:
$$\begin{aligned}
\sum_{k=1}^{d_2}&\Big[f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Big]\\
&= \Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]\mathbb{I}(t\le T_0) + \Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]\mathbb{I}(t> T_0,\ \mathcal{E}_t) + \Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]\mathbb{I}(t> T_0,\ \mathcal{E}_t^c)\\
&\le 2d_2\,\mathbb{I}(t\le T_0) + \Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]\mathbb{I}(t> T_0,\ \mathcal{E}_t) + 2d_2\,\mathbb{I}(t> T_0,\ \mathcal{E}_t^c)\\
&= 2d_2\,\mathbb{I}(t\le T_0) + \Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]\mathbb{I}\Big(t> T_0,\ f(x_{t,a_t}^{(k)},\theta_t^{(k)}) \ge f(x_{t,a_t^*}^{(k)},\theta_t^{(k)}),\ \forall k,\ \mathcal{E}_t\Big) + 2d_2\,\mathbb{I}(t> T_0,\ \mathcal{E}_t^c),
\end{aligned}$$
where $\mathbb{I}(\cdot)$ is the indicator function. The last equality is due to the choice $a_t^{(k)} = \mathrm{argmax}_a f(x_{t,a}^{(k)},\theta_t^{(k)})$, so that $f(x_{t,a_t}^{(k)},\theta_t^{(k)}) \ge f(x_{t,a_t^*}^{(k)},\theta_t^{(k)})$. We now focus on the second indicator function. By the Lipschitzness of $f$ over $\theta$, we have
$$\begin{aligned}
&\mathbb{I}\Big(t>T_0,\ f(x_{t,a_t}^{(k)},\theta_t^{(k)}) \ge f(x_{t,a_t^*}^{(k)},\theta_t^{(k)}),\ \forall k,\ \mathcal{E}_t\Big)\\
&= \mathbb{I}\Big(t>T_0,\ f(x_{t,a_t}^{(k)},\theta_t^{(k)}) - f(x_{t,a_t^*}^{(k)},\theta_t^{(k)}) + f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}) \ge f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}),\ \forall k,\ \mathcal{E}_t\Big)\\
&= \mathbb{I}\Big(t>T_0,\ \big[f(x_{t,a_t}^{(k)},\theta_t^{(k)}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\big] + \big[f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t^*}^{(k)},\theta_t^{(k)})\big] \ge f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}),\ \forall k,\ \mathcal{E}_t\Big)\\
&\le \mathbb{I}\Big(t>T_0,\ 2\|\theta_t^{(k)}-\theta^{(k)*}\|_2 \ge f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}),\ \forall k,\ \mathcal{E}_t\Big)\\
&\le \mathbb{I}\Bigg(t>T_0,\ 2\sum_{k=1}^{d_2}\|\theta_t^{(k)}-\theta^{(k)*}\|_2 \ge \sum_{k=1}^{d_2}\big[f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\big],\ \mathcal{E}_t\Bigg)\\
&\le \mathbb{I}\Bigg(t>T_0,\ 2\sqrt{d_2}\,|||\Theta_t-\Theta^*|||_F \ge \sum_{k=1}^{d_2}\big[f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\big],\ \mathcal{E}_t\Bigg),
\end{aligned}$$
where the last inequality is by the Cauchy–Schwarz inequality. Substituting the above inequality back and taking expectations on both sides of the one-step regret at round $t$, we get

$$\mathbb{E}\Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg] \le 2d_2 \quad\text{for } t\le T_0.$$
For $t>T_0$ and any constant $v_t\in\mathbb{R}$, the expectation is bounded by
$$\begin{aligned}
\mathbb{E}\Bigg[\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg]
&\le \mathbb{E}\Bigg[\Bigg(\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg)\mathbb{I}\Bigg(2\sqrt{d_2}|||\Theta_t-\Theta^*|||_F \ge \sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}),\ \mathcal{E}_t\Bigg)\Bigg] + 2d_2\mathbb{P}(\mathcal{E}_t^c)\\
&\le \mathbb{E}\Bigg[\Bigg(\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg)\mathbb{I}\Bigg(2\sqrt{d_2}|||\Theta_t-\Theta^*|||_F \ge \sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*}) \ge 2\sqrt{d_2}v_t,\ \mathcal{E}_t\Bigg)\Bigg] + 2d_2\mathbb{P}(\mathcal{E}_t^c) + 2\sqrt{d_2}v_t\\
&\le \mathbb{E}\Bigg[\Bigg(\sum_{k=1}^{d_2}f(x_{t,a_t^*}^{(k)},\theta^{(k)*}) - f(x_{t,a_t}^{(k)},\theta^{(k)*})\Bigg)\mathbb{I}\big(|||\Theta_t-\Theta^*|||_F \ge v_t,\ \mathcal{E}_t\big)\Bigg] + 2d_2\mathbb{P}(\mathcal{E}_t^c) + 2\sqrt{d_2}v_t\\
&\le 2d_2\mathbb{P}\big(|||\Theta_t-\Theta^*|||_F \ge v_t,\ \mathcal{E}_t\big) + 2d_2\mathbb{P}(\mathcal{E}_t^c) + 2\sqrt{d_2}v_t.
\end{aligned}$$

By Lemma E.2, the RSC condition is satisfied when $T_0 = \Theta(\sqrt{T}) \ge 2\log(d_1^2+d_1)/C$, where $C$ is defined in Lemma B.4. Also, the radius $\delta_t$ is larger than the error bound when
$$\frac{|||\Theta^*|||_F}{2\sqrt{t}} \ge \frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*) = O\Big(\frac{1}{t}\Big),$$
which is satisfied when $T$ is large enough. Now take $v_t$ to be the upper bound on $|||\Theta_t-\Theta^*|||_F$ in Lemma 3.1, i.e.,
$$v_t^2 = \frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\Theta^*) + 4R(\Theta^*_{M^\perp})\big].$$
Then, by Lemma 3.1, the expected cumulative regret becomes
$$\begin{aligned}
\mathbb{E}[\mathrm{Regret}(T)] &\le 2d_2T_0 + \sum_{t=T_0}^{T}\Big[2d_2\mathbb{P}(\mathcal{A}_t^c,\mathcal{E}_t) + 2d_2\mathbb{P}(\mathcal{E}_t^c) + 2\sqrt{d_2}v_t\Big]\\
&\le 2d_2T_0 + 2d_2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sqrt{d_2}\sum_{t=1}^{T}\sqrt{\frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\Theta^*)+4R(\Theta^*_{M^\perp})\big]}.
\end{aligned}$$
By setting $\delta = 1/t^2$ in Lemma E.1 and Lemma E.2, the second term can be bounded as
$$2d_2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] \le 2d_2\sum_{t=1}^{T}\Big[\frac{1}{t^2} + d_2\exp\Big(-\frac{T_0C}{2}\Big)\Big] = O(1).$$

The last term can be bounded as
$$\begin{aligned}
2\sqrt{d_2}\sum_{t=1}^{T}\sqrt{\frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\Theta^*)+4R(\Theta^*_{M^\perp})\big]}
&\le 2\sqrt{d_2}\sum_{t=1}^{T}\sqrt{\frac{144\lambda_t^2}{\alpha_0^2}\phi^2 + \frac{8\lambda_t}{\alpha_0}Z_t(\Theta^*)}\\
&\le 2\sqrt{d_2}\sum_{t=1}^{T}\sqrt{\Big(\frac{144\phi^2}{\alpha_0^2}+\frac{2}{\sigma}\Big)\lambda_t^2}\\
&\le 2\sqrt{d_2\Big(\frac{144\phi^2}{\alpha_0^2}+\frac{2}{\sigma}\Big)}\sum_{t=1}^{T}\Bigg(\frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{\frac{2\log(2d_1t^2)}{t}}\Bigg)\\
&\le 2\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\sqrt{d_2s\Big(\frac{144}{\alpha_0^2}+\frac{2}{\sigma}\Big)}\sum_{t=1}^{T}\Bigg(\frac{2\sigma d_2^{1-1/q}}{\sqrt{t}} + \sigma\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{\frac{2\log(2d_1t^2)}{t}}\Bigg)\\
&= O\Bigg(\sqrt{d_2s}\,d_2^{1-\frac{1}{q}}\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\sqrt{T} + \sqrt{d_2s}\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{T\log(2d_1T^2)}\Bigg).
\end{aligned}$$
Therefore the final regret bound is of size
$$\begin{aligned}
2d_2T_0 &+ 2d_2\sum_{t=T_0}^{T}\big[\mathbb{P}(\mathcal{A}_t^c)+\mathbb{P}(\mathcal{E}_t^c)\big] + 2\sqrt{d_2}\sum_{t=1}^{T}\sqrt{\frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\Theta^*)+4R(\Theta^*_{M^\perp})\big]}\\
&= O\Bigg(2d_2\sqrt{T} + \sqrt{d_2s}\,d_2^{1-\frac{1}{q}}\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\sqrt{T} + \sqrt{d_2s}\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{T\log(2d_1T^2)}\Bigg)\\
&= O\Bigg(\sqrt{d_2s}\,d_2^{1-\frac{1}{q}}\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\sqrt{T} + \sqrt{d_2s}\,\eta\Big(d_2,\frac{1}{q}-\frac{1}{2}\Big)\eta\Big(d_2,\frac{1}{2}-\frac{1}{q}\Big)\sqrt{T\log(2d_1T^2)}\Bigg).
\end{aligned}$$
Taking $q = 2$, we get the result in the theorem. $\square$

F Proofs Related to the Oracle Inequalities

Recall the definition of the constraint set $\mathcal{C}$. We define the subset with bounded norm $\mathbb{K}(\delta) := \mathcal{C}\cap\mathbb{B}(\delta)$. Also define the function $F_t(\Delta)$ as
$$F_t(\Delta) = L_t(\theta^*+\Delta) - L_t(\theta^*) + \lambda_t\big(R(\theta^*+\Delta) - R(\theta^*)\big).$$
The proof here follows Negahban et al. [2012] because we choose the same error upper bounds. The reason why we only need the RSC condition on the constraint set is that we only need to prove $F_t(\Delta)>0$ for $\Delta\in\mathbb{K}(\delta)$ (Lemma F.2). We directly specify
$$\delta^2 = \frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\theta^*) + 4R(\theta^*_{M^\perp})\big]$$
here, since we are simply validating the correctness of the proof under the condition that RSC is only satisfied on $\mathcal{C}\cap\mathbb{B}(r_t)$. For more details on why this error bound is chosen, please refer to Negahban et al. [2012] and its supplementary material.


F.1 Proof of Lemma 3.1

First, we introduce the following two lemmas from Negahban et al. [2012].

Lemma F.1. (Lemma 3 of Negahban et al. [2012]) For any vectors $\theta^*, \Delta$, and a decomposable norm $R$ with respect to $(M, M^\perp)$, we have the following inequality:
$$R(\theta^*+\Delta) - R(\theta^*) \ge R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp}).$$

Lemma F.2. (Lemma 4 of Negahban et al. [2012]) If $F_t(\Delta)>0$ for all vectors $\Delta\in\mathbb{K}(\delta)$ for a constant $\delta$, then $\|\theta_t-\theta^*\|\le\delta$.

Now, we provide the proof of Lemma 3.1.

Proof. Note that $\delta^2\le r_t^2$. By the restricted strong convexity of $L_t$ on $\mathbb{K}(\delta)$, we have
$$\begin{aligned}
F_t(\Delta) &= L_t(\theta^*+\Delta) - L_t(\theta^*) + \lambda_t\big(R(\theta^*+\Delta) - R(\theta^*)\big)\\
&\ge \langle\nabla L_t(\theta^*),\Delta\rangle + \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\theta^*+\Delta) - R(\theta^*)\big)\\
&\ge \langle\nabla L_t(\theta^*),\Delta\rangle + \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp})\big)\\
&\ge -|\langle\nabla L_t(\theta^*),\Delta\rangle| + \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp})\big)\\
&\ge -R^*(\nabla L_t(\theta^*))R(\Delta) + \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp})\big)\\
&\ge \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp})\big) - \frac{\lambda_t}{2}R(\Delta)\\
&\ge \alpha\|\Delta\|^2 - Z_t(\theta^*) + \lambda_t\big(R(\Delta_{M^\perp}) - R(\Delta_M) - 2R(\theta^*_{M^\perp})\big) - \frac{\lambda_t}{2}R(\Delta_{M^\perp}) - \frac{\lambda_t}{2}R(\Delta_M)\\
&= \alpha\|\Delta\|^2 - Z_t(\theta^*) + \frac{\lambda_t}{2}\big(R(\Delta_{M^\perp}) - 3R(\Delta_M) - 4R(\theta^*_{M^\perp})\big),
\end{aligned}$$
where the second inequality is by Lemma F.1, the fourth inequality is by the generalized Cauchy–Schwarz (Hölder) inequality, the fifth inequality follows from our setting $\lambda_t\ge 2R^*(\nabla L_t(\theta^*))$, and the last inequality uses the triangle inequality $R(\Delta) = R(\Delta_{M^\perp}+\Delta_M)\le R(\Delta_{M^\perp})+R(\Delta_M)$. Now, by the subspace compatibility constant, we know that
$$R(\Delta_M) \le \phi\|\Delta_M\| = \phi\|\Pi_M(\Delta)-\Pi_M(0)\| \le \phi\|\Delta-0\| = \phi\|\Delta\|,$$
where the last inequality holds because $0\in M$ and the projection operator is non-expansive. Therefore we can continue to lower bound $F_t(\Delta)$ in the following way:
$$F_t(\Delta) \ge \alpha\|\Delta\|^2 - Z_t(\theta^*) - \frac{\lambda_t}{2}\big(3R(\Delta_M)+4R(\theta^*_{M^\perp})\big) \ge \alpha\|\Delta\|^2 - Z_t(\theta^*) - \frac{3\lambda_t\phi}{2}\|\Delta\| - 2\lambda_tR(\theta^*_{M^\perp}).$$
Now, since we take $\|\Delta\|^2 = \delta^2 = \frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\theta^*)+4R(\theta^*_{M^\perp})\big]$, by the same algebraic manipulations as in Negahban et al. [2012], we have $F_t(\Delta)>0$. Then, by Lemma F.2, we know that
$$\|\theta_t-\theta^*\|^2 \le \frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{\lambda_t}{\alpha}\big[2Z_t(\theta^*)+4R(\theta^*_{M^\perp})\big].$$
The second oracle inequality in Lemma A.1 can be proved easily by the triangle-inequality decomposition and the definition of $\mathcal{C}$. That is,
$$R(\theta_t-\theta^*) = R\big((\theta_t-\theta^*)_{M^\perp}+(\theta_t-\theta^*)_M\big) \le R\big((\theta_t-\theta^*)_{M^\perp}\big)+R\big((\theta_t-\theta^*)_M\big) \le 4R\big((\theta_t-\theta^*)_M\big)+4R(\theta^*_{M^\perp}) \le 4\phi\|(\theta_t-\theta^*)_M\|+4R(\theta^*_{M^\perp}) \le 4\phi\|\theta_t-\theta^*\|+4R(\theta^*_{M^\perp}).$$
If $\theta^*\in M$, then $R(\theta^*_{M^\perp})=0$. Therefore we know that
$$R(\theta_t-\theta^*) \le 4\phi\sqrt{\frac{9\lambda_t^2}{\alpha^2}\phi^2 + \frac{2\lambda_t}{\alpha}Z_t(\theta^*)}. \qquad\square$$

G Validation Experiments

In this section, we provide experiments that validate the claims of our theoretical results. We run our algorithm (Algorithm 2, to be specific) on the LASSO bandit problem and validate the corresponding regret upper bound in Corollary 4.1.

Specifically, we generate the true $\theta^*$ by randomly choosing its non-zero indices, generating each of its non-zero values uniformly at random from $[0,1]$, and then normalizing. We set $K=10$ so that ten different contexts are available at each round. The contexts $\{x_{t,a_i}\}_{i=1}^{K}$ are generated from the zero-mean, identity-covariance normal distribution and then normalized so that $\|x_{t,a_i}\|_2\le 1$. We consider the following two settings:

• Fix the sparsity level $s=10$ and run the algorithm with different dimensions $d\in\{50,100,200\}$.

• Fix the dimension $d=100$ and run the algorithm with different sparsity levels $s\in\{5,10,20\}$.

[Figure 1: The curves of Regret(t)/$\sqrt{st\log(dt)}$ in the two settings. (a) With $s=10$ and different dimension $d$; (b) with $d=100$ and different sparsity $s$.]

The experiment results are shown in Figure 1. As we can observe, the figures show that the regret is at most a constant times $\sqrt{st\log(dt)}$, and thus our bounds correctly delineate the order of the regret. Therefore, the experiment results validate our claims in Section 4.
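A compact way to reproduce the shape of these curves is sketched below. This is an illustrative numpy-only sketch, not the exact experimental configuration: the LASSO step is a plain ISTA solver, the refit frequency, horizon, exploration length, and noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
d, s, K, T, sigma = 100, 10, 10, 2000, 0.05
T0 = int(np.sqrt(T))

theta = np.zeros(d)                       # sparse true parameter
idx = rng.choice(d, size=s, replace=False)
theta[idx] = rng.uniform(size=s)
theta /= np.linalg.norm(theta)

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_ista(X, y, lam, n_iters=100):
    n, p = X.shape
    w = np.zeros(p)
    step = n / max(np.sum(X ** 2), 1e-12)
    for _ in range(n_iters):
        w = soft(w - step * (X.T @ (X @ w - y)) / n, step * lam)
    return w

theta_hat = np.zeros(d)
Xs, ys, regret = [], [], []
for t in range(1, T + 1):
    arms = rng.normal(size=(K, d))
    arms /= np.maximum(np.linalg.norm(arms, axis=1, keepdims=True), 1.0)
    a = rng.integers(K) if t <= T0 else int(np.argmax(arms @ theta_hat))
    y = arms[a] @ theta + sigma * rng.normal()
    Xs.append(arms[a]); ys.append(y)
    regret.append(np.max(arms @ theta) - arms[a] @ theta)
    if t % 50 == 0:                      # refit periodically to keep it fast
        lam = 2 * sigma * np.sqrt(2 * np.log(2 * d * t ** 2) / t)
        theta_hat = lasso_ista(np.array(Xs), np.array(ys), lam)

cum = np.cumsum(regret)
ts = np.arange(1, T + 1)
print(cum[-1], (cum / np.sqrt(s * ts * np.log(d * ts)))[-1])  # normalized curve value
```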
