
arXiv:1610.03348v1 [cs.NI] 11 Oct 2016

Near Optimal Adaptive Shortest Path Routing with Stochastic Links States under Adversarial Attack

Pan Zhou, Member, IEEE, Lin Cheng, Member, IEEE, Dapeng Oliver Wu, Fellow, IEEE

Abstract—We consider shortest path routing (SPR) in a network with stochastically time-varying link metrics under potential adversarial attacks. Due to potential denial-of-service attacks, the distributions of link states could be stochastic (benign) or adversarial at different temporal and spatial locations. Without any a priori knowledge, designing an adaptive SPR protocol that copes optimally with all possible situations in practice is very challenging. In this paper, we present the first solution by formulating it as a multi-armed bandit (MAB) problem. By introducing a novel control parameter into the exploration phase for each link, a martingale inequality is applied in our combinatorial adversarial MAB framework. As such, our proposed algorithms can automatically detect features of the environment within a unified framework and find the optimal SPR strategies with almost optimal learning performance in all possible cases over time. Moreover, we study important issues related to practical implementation, such as decoupling route selection from multi-path route probing, cooperative learning among multiple sources, the "cold-start" issue, and delayed feedback. The proposed SPR algorithms can be implemented with low complexity, and they are proved to scale very well with the network size. Compared to existing approaches in a typical network scenario under jamming attacks, our algorithm achieves a 65.3% improvement in network delay for a given learning period and an 81.5% improvement in learning duration under a specified network delay.

Index Terms—Shortest path routing, online learning, jamming, stochastic and adversarial multi-armed bandits

I. INTRODUCTION

Shortest path routing (SPR) is a basic functionality of networks to route packets from sources to destinations. Consider a network with known topology deployed in a wireless environment, where the link qualities vary stochastically with time. Since such a network is vulnerable to a wide variety of attacks, security is critical to its performance. For example, a malicious attacker may perform a denial-of-service (DoS) attack by jamming a selected area of links or creating a routing worm [1] to cause severe congestion over the network. As a result, the link metrics for the SPR (e.g., link delays) are hard to predict. Although the source can measure links by sending traceroute probing packets along selected paths, it

Parts of this work have been presented at the IEEE International Conference on Sensing, Communications and Networking (SECON 2016) [18], London, UK.

Pan Zhou is with Huazhong University of Science & Technology, Wuhan, 430074, Hubei, China.

Lin Cheng is with the Department of Engineering, Trinity College, Hartford, CT, 06106 USA.

Dapeng Oliver Wu is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida, 32611 USA. Email: [email protected], [email protected], [email protected]

This work was supported by the National Science Foundation of China under Grant 61401169 and NSF CNS-1116970.

is hard to obtain an accurate link measurement from a single trial due to noise, inherent link dynamics (e.g., fading and short-term interference) and unpredictable adversarial behaviors (e.g., DoS attacks on traffic and jamming attacks). In contrast to the classic SPR problem, where the average link metrics are assumed to be known a priori, the source needs to learn the link metrics over time.

A fair amount of SPR algorithms have been proposed by considering either stochastically distributed link metrics [2], [4], [5] (i.i.d.) or security issues where all link metrics are assumed to be adversarially distributed [6]–[8] (non-i.i.d.), i.e., they can be varied in an arbitrary way by attackers. In particular, the respective online learning problems fit perfectly into the stochastic multi-armed bandit (MAB) problem [12] and the adversarial MAB problem [11]. The main idea is that, by probing each link while balancing "exploration" and "exploitation" over sets of routes, the true average link metrics are gradually learned and the optimal SPR can be found by minimizing the "regret" that quantifies the learning performance, i.e., the gap, accumulated over time, between the routes selected by the SPR algorithm and the optimal one known in hindsight. A known fact is that stochastic MAB and adversarial MAB have optimal regrets O(log(t)) [12] and O(√t) [11] over time t, respectively. Obviously, the learning performance of stochastic MAB is much better than that of adversarial MAB.
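To make the gap between these two rates concrete, the following sketch (with all constants set to 1, purely for illustration) compares the growth of the two regret bounds:

```python
import math

# Growth of the optimal regret bounds (constants omitted):
# stochastic MAB ~ log(t), adversarial MAB ~ sqrt(t).
for t in (10**3, 10**6, 10**9):
    print(f"t={t:>10}  log(t)={math.log(t):8.1f}  sqrt(t)={math.sqrt(t):10.1f}")
```

Even at moderate horizons the √t bound is orders of magnitude larger, which is why detecting a benign (stochastic) environment pays off.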

The assumption in most existing works that the nature of the environment, i.e., stochastic or adversarial, is known is very restrictive in describing practical network environments. On the one hand, existing SPR protocols may perform poorly in practice. In a network deployed in a potentially hostile environment, the mobility patterns, attacking approaches and strengths, and numbers and locations of attackers are often unknown. In this case, most likely, certain portions of links may (or may not) suffer from denial-of-service attacks and thus be adversarial, while the unaffected others are stochastically distributed. Adopting the typical adversarial MAB model [6]–[8] on all links will lead to undesirable learning performance (large regrets) in finding the SPR, since a great portion of links can be benign with stochastically distributed link states.

On the other hand, applying the stochastic MAB model [2], [4], [5] faces practical implementation issues even when no adversarial behavior exists. In almost all practical networks (e.g., ad hoc and sensor networks), commonly seen occasional disturbing events contaminate the stochastically distributed link metrics. These include burst traffic injection, the jitter effect of electromagnetic waves, periodic battery replacements, and unexpected routing table corruptions and reconfigurations. In this case, the link metric distributions will not be i.i.d. for a small portion of time during the whole learning process. Thus, it is unclear whether stochastic MAB theory can still be applied, how the contamination affects the learning performance, and to what extent it is negligible. Therefore, designing an SPR protocol without any prior knowledge of the operating environment is very challenging.

In this paper, we propose a novel adaptive online-learning based SPR protocol as a first attempt to address this challenge; it achieves near-optimal learning performance in all different situations within a unified online-learning framework. The proposed algorithm needs neither to distinguish between the stochastic and adversarial MAB problems nor to know the time horizon over which the protocol runs. Our idea builds on the well-known EXP3 algorithm from adversarial MAB [11] by introducing a novel control parameter into the exploration probability to detect the metric evolution of each link. In contrast to hop-by-hop routing, where each intermediate node decides the next hop, our online routing decision is made at the endhost (i.e., the source node), which is capable of selecting a globally optimal path. Owing to the lack of link quality knowledge, the limited observation of the network from the endhost makes online-learning based SPR very challenging: the regret grows not only with time but also with the network size. Moreover, to further accelerate the learning process in practical large-scale networks, we need to study the following important issues: decoupling route selection and probing at each endhost, which in every time slot sends "smart" probing packets (carrying no useful data) over multiple paths to measure link metrics along with the selected path; cooperative learning among multiple endhosts; and the "cold-start" and delayed-feedback issues in practical deployments. Our main contributions are summarized as follows:

1) We design the first adaptive SPR protocol that brings the stochastic and adversarial MABs into a unified framework with promising practical applications in unknown network environments. The environments are generally categorized into four typical regimes, where our proposed SPR algorithms are shown to achieve almost optimal regret performance in all regimes and to be resilient to different kinds of attacks.

2) We extend our algorithm to accelerated learning and obtain a 1/m-factor reduction in regret for a probing rate of m. We also consider the practical "cold-start" issue of the SPR algorithms, i.e., when the endhost is unaware of m and the total number of links n at the beginning, and the sensitivity of the algorithm to that missing information, as well as the delayed-feedback issue.

3) The proposed algorithms can be implemented by dynamic programming, with time and space complexities comparable to the classic Dijkstra's algorithm. Importantly, they achieve optimal regret bounds with respect to the network size.

4) We conduct diversified experiments on both real trace-driven and synthetic datasets and demonstrate that all the claimed advantages of the algorithms hold in practice.

The rest of this paper is organized as follows. Section II discusses related work. Section III describes the problem formulation. Section IV studies the single-source adaptive optimal SPR problem with solid performance analysis. In Section V, we study accelerated learning and practical implementation issues. Section VI discusses the computationally efficient implementation of AOSPR-EXP3++. Section VII conducts numerical experiments. Important proofs for the single-source and accelerated-learning SPR algorithms are given in Section VIII and Section IX, respectively. The paper concludes in Section X.

II. RELATED WORK

Online learning-based routing has been proposed to deal with networks in dynamically changing environments, especially wireless ad hoc networks with fixed topology. Some existing solutions focus on hop-by-hop optimization of route selection, e.g., [3], [5], and references therein. Meanwhile, most other works consider the much more challenging endhost-based routing, e.g., [2], [4], [8], [24]. In [3], reinforcement learning (RL) techniques are used to update the link-level metrics. It is worth pointing out that RL is generally targeted at a broader set of learning problems in Markov Decision Processes (MDPs) [17]. It is well known that such learning algorithms can guarantee optimality only asymptotically, which cannot be relied upon in mission-critical applications. MAB problems constitute a special class of MDPs, for which the regret-learning framework is generally viewed as more effective, in both convergence and computational complexity, for finite-time optimality; hence the use of MAB models is well motivated. If path measurements are available for a set of independent paths, the problem belongs to the classic MAB setting. If link measurements are available, such that dependent paths can share this information, it is known as the combinatorial semi-bandit problem [20]. Obviously, exploiting shared measurements of overlapping links among different paths can accelerate learning and results in much lower regrets and better scalability [20].

Importantly, existing works are mainly based on two types of MAB models: adversarial MAB [6]–[8] and stochastic MAB [2], [4], [5], [24]. The work in [6] studied minimal-delay SPR against an oblivious adversary, with a suboptimal regret of O(t^{2/3}). Throughput-competitive route selection against an adaptive adversary was studied in [7] with regret O(t^{2/3}), which yields the worst routing performance. Gyorgy et al. [8] provided a complete routing framework under oblivious adversary attack, based on both link and path measurements, with order-optimal regret O(t^{1/2}). The works in [2], [4], [5], [24] considered benign environments better modeled by the stochastic setting without adversarial events, where link weights follow unknown stochastic (i.i.d.) distributions. Bhorkar et al. [5] consider routing on a per-link (hop-by-hop) basis, with an order-optimal regret of O(log t). The first solution treating SPR as a stochastic combinatorial semi-bandit MAB problem appeared in [4], with a regret of O(n^4 log t) given the number of links n. As noticed, the regret of endhost-based routing grows greatly with the network size. [2] probed the least-measured links for i.i.d. distributed links and considered the practical delayed-feedback issue, with improved regrets compared with [4]. Although that algorithm can handle temporally correlated links, it is not suitable for adversarial link conditions. In [24], the authors proposed an adaptive SPR algorithm under stochastically varying link states that achieves an O(k^4 log t) regret, where k is the dimension of the path set.

The stochastic and adversarial MABs have coexisted in parallel for almost two decades. Recently, [15] tried to bring them together in the classic MAB framework. Our current work is motivated by [15] in using a novel exploration parameter over each channel to detect its evolving pattern, i.e., stochastic, contaminated, or adversarial; however, [15] does not generalize the idea to describe general environmental scenarios (no mixed adversarial and stochastic regime, which is a very typical scenario) with potential engineering and security applications. Our current work introduces the novel exploration parameter of [15] into our special combinatorial semi-bandit MAB problem by exploiting the link dependency among different paths, which is a nontrivial and much harder problem. This new framework avoids the computational inefficiency issue of general combinatorial adversarial bandit problems as indicated in [21], [13]. It achieves a regret bound of order O(k√(tn ln n)), which is only a factor of O(√k) off the optimal O(√(ktn ln n)) bound in the combinatorial adversarial bandit setting [20]. However, we do believe that the regret bound in our framework is optimal for exponential-weight (e.g., EXP3 [11]) types of algorithms that are computationally efficient. Thus, our work is also the first computationally efficient combinatorial MAB algorithm for general unknown environments^1. What is more surprising and encouraging, in the stochastic regimes (including the contaminated stochastic regime), our algorithms achieve a regret bound of order O(nk log(t)/∆). In terms of the number of channels n and the number of links k within each strategy, this is the best result to date for combinatorial stochastic bandit problems [16]. Please note that [4] has a regret bound of order O(n^4 log(t)/∆); in [23], the bound is O(n^3 log(t)/∆); in [24], it is O(k^4 log(t)/∆); and in [25], it is O(n^2 log^3(t)/∆). Thus, our proposed algorithms are order-optimal with respect to n and k in all different regimes, which indicates optimal scalability for general wireless communication systems and networks.

III. PROBLEM FORMULATION

A. Network Model

We consider a network modeled by a directed acyclic graph with a set of vertices connected by edges, where source vertices have streams of data packets to send to distinguished destination vertices. Formally, let V denote the set of nodes and E the set of links, with |E| = n. For any given source-destination pair (s, d), let P denote the set of

^1 As noticed, the stochastic combinatorial bandit problem does not have this issue, as indicated in [21], [16].

all candidate paths serving as routing strategies for (s, d), with |P| = N. We represent each path i as a routing strategy i ∈ P ⊂ {0, 1}^n. Overlaps (shared links) between different paths are allowed. Let k_i denote the length of each path i and k denote the maximum length of the paths within P. Thus, N is upper bounded by n^k, which is exponential in the number of edges n; a computationally efficient algorithm is therefore desirable.
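As a sketch of this encoding (over a hypothetical 4-node DAG; the topology and edge indexing are made up for illustration), each s→d path can be enumerated as a 0/1 indicator vector over the n edges:

```python
# Hypothetical DAG; edges are indexed 0..n-1, so |E| = n = 5 here.
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "d"), ("b", "d")]
n = len(edges)
adj = {}
for idx, (u, v) in enumerate(edges):
    adj.setdefault(u, []).append((idx, v))

def paths(u, d):
    """Yield every u->d path as an edge-indicator vector i in {0,1}^n."""
    if u == d:
        yield [0] * n
        return
    for idx, v in adj.get(u, []):
        for vec in paths(v, d):
            vec[idx] = 1
            yield vec

P = list(paths("s", "d"))  # N = |P| routing strategies; k = max path length
```

Here N = 3 while n = 5, and the sum of each vector is the path length k_i, so overlapping links (e.g., edge (b, d)) are shared by several strategies.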

At each time slot t, depending on the traffic and link quality, each edge e may experience a different, unknown, time-varying link weight ℓ_t(e). A packet traversing the chosen path i incurs a sum of weights ℓ_t(i) = ∑_{e∈i} ℓ_t(e) over the links composing the path. If adversarial events are imposed on a link (or on the related routers, which ultimately affect the link weight), the link is attacked. We denote the set and number of attacked links by E_a and k_a, respectively. We assume the link weights to be additive; the typical additive metric is link delay (there are others, e.g., the log of delivery ratio). We make no assumption on the distribution of each ℓ_t(e), ∀e ∈ E: by default it can follow some unknown stochastic (i.i.d.) process, and it can be attacked arbitrarily by diversified attackers (non-i.i.d.) in a way that differs across links. Without loss of generality (w.l.o.g.), we normalize the link weights such that ℓ_t(e) ∈ [0, 1] for all e and t, and we assume a single attacker launches all attacks.

B. Problem Description

The main task for a given source-destination pair is to find a path i ∈ P with minimized path weight ℓ_t(i) over time. If each link weight ℓ_t(e) were known at every time slot, the problem could be solved efficiently by classic routing solutions (e.g., Dijkstra's algorithm). Otherwise, it necessitates online learning-based approaches.
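For the full-information baseline mentioned above, a compact Dijkstra sketch (over a made-up weighted graph; node names and weights are illustrative only):

```python
import heapq

def dijkstra(adj, s, d):
    """Shortest s->d cost when every link weight is known (non-negative)."""
    dist, heap = {s: 0.0}, [(0.0, s)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == d:
            return cost
        if cost > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = cost + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# adjacency list: node -> [(neighbor, link weight in [0, 1])]
adj = {"s": [("a", 0.2), ("b", 0.5)], "a": [("d", 0.4)], "b": [("d", 0.3)]}
```

The online-learning problem studied here arises precisely because these weights are not observable in advance.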

W.l.o.g., we consider source routing, where the source periodically sends probes along the potential paths to measure the network and adjusts its choices over time. We use link-level measurements to record the link weights, as in traceroute, on the probed paths: if path i is probed at the beginning of time slot t, all of its link weights are observed at the end of t. If multi-path probing is allowed with a budget of M_t paths at time t, all the probed paths i_1, ..., i_{M_t} are traced out and their link weights observed. Let L_t(i) = ∑_{s=1}^t ℓ_s(i) = ∑_{s=1}^t ∑_{e∈i} ℓ_s(e) be the cumulative weight up to t for a selected path i. Then, i* ≜ argmin_{i∈P} {L_t(i)} denotes the expected minimum weight path. Let I_t denote the particular path chosen at time slot t from P; then, for a particular SPR algorithm, the cumulative weight up to time slot t is L_t(I_s) = ∑_{s=1}^t ℓ_s(I_s) = ∑_{s=1}^t ∑_{e∈I_s} ℓ_s(e). Our goal is to jointly select a path I_s (and a set of probing paths if allowed, i.e., i ∈ M_s) at each time slot s up to time t (s = 1, 2, ..., t) such that I_s converges to i* as fast as possible in all different situations. Specifically, the performance of the SPR algorithm is quantified by the regret R(t), defined as the difference between the weights of the paths selected by the proposed algorithm and the expected minimum weight path up to t time slots. Note that R(t) is a random variable because I_t depends on link measurements. We use E_t[·] to denote expectations on


Fig. 1: SPR in Different Regimes of Unknown Environments

the realization of all strategies as random variables up to round t. Therefore, the expected regret can be written as

R(t) = E[∑_{s=1}^t E_s[∑_{e∈I_s} ℓ_s(e)]] − min_{i∈P} E[∑_{s=1}^t E_s[∑_{e∈i} ℓ_s(e)]].  (1)

The goal of the algorithm is to minimize the regret.
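For intuition, the hindsight benchmark inside this regret can be computed directly on a toy loss sequence (all numbers below are made up; paths are given as sets of link indices):

```python
# Toy loss sequence: losses[s][e] = loss of link e at slot s.
losses = [
    [0.1, 0.9, 0.3],
    [0.2, 0.8, 0.2],
    [0.1, 0.7, 0.4],
]
# Two hypothetical routing strategies as sets of link indices.
toy_paths = {"p1": {0, 2}, "p2": {1}}

def cum_loss(path):
    # L_t(i) = sum over slots s of sum over links e in i of loss_s(e)
    return sum(sum(row[e] for e in path) for row in losses)

i_star = min(toy_paths, key=lambda i: cum_loss(toy_paths[i]))  # best in hindsight
```

An algorithm's regret at t = 3 is then its own accumulated loss minus cum_loss(toy_paths[i_star]).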

C. The Four Regimes of Network Environments

Since our algorithm does not need to know the nature of the environment, different characteristics of the environment will affect its performance differently. We categorize them into four typical regimes, as shown in Fig. 1.

1) Adversarial Regime: In this regime, an attacker attacks (e.g., by sending interfering power or corrupting routers with worms) all n links, so that every link weight suffers (see Fig. 1(a)), leading to metric-value losses (e.g., increased link delay). Note that the adversarial regime, as the classic model of the well-known non-stochastic MAB problem [11], implies that the attacker launches an attack in every time slot. It is the most general setting, and the other three regimes can be regarded as special cases of it.

Attack Model: Different attack philosophies lead to different levels of effectiveness. We focus on the following two types of jammers in the adversarial regime:

a) Oblivious attacker: an oblivious attacker attacks different links with different attacking strengths, resulting in different data-rate reductions, independently of the past communication records it might have observed.

b) Adaptive attacker: an adaptive attacker selects its attacking strength on the targeted (sub)set of links by utilizing its past experience and observation of previous communication records. It is very powerful: it can infer the SPR protocol and launch attacks of different strengths over a subset of links or routers within a single time slot based on historical monitoring records. As shown in a recent work [9], no bandit algorithm can guarantee sublinear regret o(t) against an adaptive adversary with unbounded memory, because such an adversary can mimic the behavior of the SPR protocol, which leads to linear regret (the attack cannot be defended against). Therefore, we consider the more practical θ-memory-bounded adaptive adversary model [9]: an adversary constrained to loss functions that depend only on the θ + 1 most recent strategies.

2) Stochastic Regime: In this regime, the transceiver communicates over n stochastic links, as shown in Fig. 1(b). The link weights ℓ_t(e), ∀e ∈ {1, ..., n}, of each link e are sampled independently from an unknown distribution that depends on e, but not on t. We use µ(e) = E[ℓ_t(e)] to denote the expected loss of link e. We call link e a best link if µ(e) = min_{e'}{µ(e')} and a suboptimal link otherwise; let e* denote some best link. For each link e, define the gap ∆(e) = µ(e) − µ(e*), and let ∆_e = min_{e:∆(e)>0}{∆(e)} denote the minimal gap among links. The regret can be rewritten as

R(t) = ∑_{e=1}^n E[N_t(e)] ∆(e),  (2)

where N_t(e) is the number of times link e has been played up to time t. Note that we can calculate the regret either from the perspective of links e ∈ {1, ..., n} or from the perspective of strategies i ∈ P. However, because the set of strategies (paths) grows exponentially with n and the strategy perspective does not exploit the link dependency between different strategies, we calculate the regret from the links, where tight regret bounds are achievable.
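The link-level decomposition in (2) can be evaluated numerically; in this sketch the expected losses, gaps, and play counts are invented purely for illustration:

```python
# R(t) = sum_e E[N_t(e)] * gap(e), per Eq. (2).
mu = [0.2, 0.5, 0.9]               # made-up expected link losses mu(e)
gaps = [m - min(mu) for m in mu]   # gap(e) = mu(e) - mu(e*)
pulls = [900, 70, 30]              # made-up expected play counts E[N_t(e)]
regret = sum(N_e * g for N_e, g in zip(pulls, gaps))
```

Only suboptimal links (positive gap) contribute, which is why per-link bounds on E[N_t(e)] directly bound the regret.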

3) Mixed Adversarial and Stochastic Regime: This regime assumes that the attacker attacks only k_a out of the k active links at each time slot, as shown in Fig. 1(c). There is always a k_a/k portion of links under adversarial attack, while the other (k − k_a)/k portion is stochastically distributed.

Attack Model: We consider the same attack model as in the adversarial regime. The difference here is that the attacker attacks only a subset of links of size k_a out of the k links in total.

4) Contaminated Stochastic Regime: The definition of the contaminated stochastic regime comes from many practical observations that only a few links (or routers) and time slots are exposed to the adversary. In this regime, an oblivious attacker selects some slot-link pairs (t, e) as "locations" to attack before the SPR starts, while the remaining link weights are generated as in the stochastic regime. We introduce an attacking strength parameter ζ ∈ [0, 1/2). After a certain τ time slots, for all t > τ, the total number of contaminated locations of each suboptimal link up to time t is t∆(e)ζ, and the number of contaminated locations of each best link is t∆_e ζ. We call a contaminated stochastic regime moderately contaminated if ζ is at most 1/4; in that case we can prove that, for all t > τ, on average over the stochasticity of the loss sequence, the adversary can reduce the gap of every link by at most one half.
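A loss-generation sketch for this regime (the horizon, gap, and ζ values below are arbitrary): the adversary pre-selects at most t∆(e)ζ contaminated slots for a link, and every other slot draws a stochastic loss.

```python
import random

random.seed(0)
T, gap, zeta = 1000, 0.4, 0.25       # made-up horizon, link gap, and strength
budget = int(T * gap * zeta)          # contamination budget t * gap(e) * zeta
attacked = set(random.sample(range(T), budget))  # pre-selected slots for one link

def link_loss(t, mean=0.5):
    """Loss of one link at slot t: corrupted on attacked slots, else Bernoulli."""
    if t in attacked:
        return 1.0                    # adversary overwrites the loss
    return float(random.random() < mean)
```

With ζ = 1/4 only 10% of the 1000 slots are corrupted here, which matches the intuition that a moderate contamination shrinks, but does not erase, the gap.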

IV. SINGLE-SOURCE ADAPTIVE OPTIMAL SPR

A. Coupled Probing and Routing

This section develops an SPR algorithm for a single source. The design philosophy is that the source collects the link delays of previously chosen paths, based on which it decides the routing strategy of the next time slot. The main difficulty is that the algorithm must appropriately balance exploitation and exploration. On the one hand, it needs to keep exploring for the best set of paths; on the other hand, it needs to exploit the best set of paths already found so that they are not underutilized.

We describe Algorithm 1, namely AOSPR-EXP3++, a variant of the EXP3 algorithm whose performance in the four regimes is proved to be asymptotically optimal. Our new algorithm uses the fact that when the link delays of the chosen path are revealed, they also provide useful information about other paths sharing common links. During each time slot, we assign each link a weight that is dynamically adjusted based on the link delays revealed to the source. The weight of a


Page 5: Near Optimal Adaptive Shortest Path Routing with Stochastic … · 2016-10-12 · Near Optimal Adaptive Shortest Path Routing with Stochastic Links States under Adversarial Attack

Algorithm 1 AOSPR-EXP3++: An MAB-based Algorithm for AOSPR

Input: n, k, t; see the text for the definitions of η_t and ξ_t(e).
Initialization: Set the initial link losses ℓ̂_0(e) = 0, ∀e ∈ [1, n]; then set the initial link and path weights w_0(e) = 1, ∀e ∈ [1, n], and W_0(i) = k, ∀i ∈ [1, N], respectively.
Set: β_t = (1/2)√(ln n/(tn)); ε_t(e) = min{1/(2n), β_t, ξ_t(e)}, ∀e ∈ [1, n].

for time slot t = 1, 2, ... do
1: The source selects a path I_t at random according to the probability ρ_t(i), ∀i ∈ P, with ρ_t(i) computed as follows:

ρ_t(i) = (1 − Σ_{e=1}^{n} ε_t(e)) w_{t−1}(i)/W_{t−1} + Σ_{e∈i} ε_t(e),  if i ∈ C;
ρ_t(i) = (1 − Σ_{e=1}^{n} ε_t(e)) w_{t−1}(i)/W_{t−1},  if i ∉ C.   (3)

2: The source computes the probability ρ_t(e), ∀e ∈ E, as

ρ_t(e) = Σ_{i: e∈i} ρ_t(i) = (1 − Σ_{e=1}^{n} ε_t(e)) Σ_{i: e∈i} w_{t−1}(i)/W_{t−1} + ε_t(e) |{i ∈ C : e ∈ i}|.   (4)

3: Observe the suffered link loss ℓ_t(e), ∀e ∈ I_t, and update its estimated value by ℓ̂_t(e) = ℓ_t(e)/ρ_t(e), ∀e ∈ I_t; otherwise, ℓ̂_t(e) = 0, ∀e ∉ I_t.
4: The source updates all the weights as w_t(e) = w_{t−1}(e) e^{−η_t ℓ̂_t(e)} = e^{−η_t L̂_t(e)} and w_t(i) = Π_{e∈i} w_t(e) = w_{t−1}(i) e^{−η_t ℓ̂_t(i)}, where L̂_t(e) = L̂_{t−1}(e) + ℓ̂_t(e), ℓ̂_t(i) = Σ_{e∈i} ℓ̂_t(e), and L̂_t(i) = L̂_{t−1}(i) + ℓ̂_t(i). The sum of the weights of all strategies is calculated as W_t = Σ_{i∈P} w_t(i).
end for

path is determined by the product of the weights of all its links. Our algorithm has two control parameters: the learning rate η_t and the exploration parameter ε_t(e) for each link e. To facilitate adaptive and optimal SPR without knowledge about the nature of the environment, the crucial innovation is the introduction of the exploration parameter ξ_t(e) into ε_t(e) for each link e, which is tuned individually for each arm depending on the past observations.

Let N denote the total number of strategies at the source side. A set of covering strategies C is defined to ensure that each link is sampled sufficiently often: for each link e, there is a strategy i ∈ C such that e ∈ i. Since there are only n links and each strategy includes k links, we set |C| = ⌈n/k⌉. As such, there is no overlapping among the different paths in the covering set, which maximizes the covering range. The value Σ_{e∈i} ε_t(e) is the randomized exploration probability for each strategy i ∈ C, i.e., the sum of the exploration probabilities ε_t(e) of the links e belonging to strategy i. The introduction of Σ_{e∈i} ε_t(e) ensures ρ_t(i) ≥ Σ_{e∈i} ε_t(e), so that ρ_t is a mixture of an exponential-weight distribution and a uniform exploration distribution [14].
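A minimal way to build such a covering set, under the simplifying assumption that links are indexed 0..n−1 and every size-k block of indices corresponds to a feasible strategy (in a real topology, C must of course consist of actual paths), is to partition the links into ⌈n/k⌉ disjoint blocks:

```python
import math

def covering_strategies(n, k):
    """Partition the n links into ceil(n/k) disjoint blocks of at most k
    links each, so that every link belongs to exactly one covering strategy
    and receives the exploration mass of the second term in (3)."""
    C = [list(range(s, min(s + k, n))) for s in range(0, n, k)]
    assert len(C) == math.ceil(n / k)
    return C

C = covering_strategies(10, 4)   # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because the blocks are disjoint, each link appears in exactly one covering strategy, which is what makes the exploration terms in (3) sum to Σ_e ε_t(e).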

In the following discussion, we show that tuning only the learning rate η_t is sufficient to control the regret of AOSPR-EXP3++ in the adversarial regime, regardless of the choice of the exploration parameter ξ_t(e). Then we show that tuning only the exploration parameter ξ_t(e) is sufficient to control the regret of AOSPR-EXP3++ in the stochastic regimes, regardless of the choice of η_t, as long as η_t ≥ β_t.

To run the AOSPR-EXP3++ algorithm without knowing the nature of the environment, we apply the two control parameters simultaneously: we set η_t = β_t and use the control parameter ξ_t(e) tuned for the stochastic regimes, so that the algorithm achieves the optimal "root-t" regret in the adversarial regime and an almost optimal "logarithmic-t" regret in the stochastic regime (though with a suboptimal power of the logarithm).
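The interplay of the two control parameters can be sketched in code. The following is an illustrative flat-enumeration version of Algorithm 1 under our own simplifications (hypothetical names; the path set is enumerated explicitly rather than handled by the efficient implementation of Section VI, and ξ_t(e) is a caller-supplied function):

```python
import math
import random

def aospr_exp3pp(paths, cover, n, losses, xi, rng):
    """Illustrative sketch of Algorithm 1 over an explicit path set.

    paths  : list of paths, each a list of link indices (paths may share links)
    cover  : indices of the covering strategies C (disjoint paths touching
             every link), which receive the extra exploration mass of (3)
    losses : losses[t][e] in [0, 1] for each slot t and link e
    xi     : callable (t, e) -> xi_t(e), the per-link exploration parameter
    """
    L = [0.0] * n                          # cumulative estimated link losses
    picks = []
    for t in range(1, len(losses) + 1):
        beta = 0.5 * math.sqrt(math.log(n) / (t * n))
        eta = beta                         # eta_t = beta_t: regime-oblivious tuning
        eps = [min(1.0 / (2 * n), beta, xi(t, e)) for e in range(n)]
        w = [math.exp(-eta * sum(L[e] for e in p)) for p in paths]
        W = sum(w)
        # Mixture of exponential weights and per-link exploration, cf. (3):
        rho = [(1 - sum(eps)) * w[i] / W
               + (sum(eps[e] for e in paths[i]) if i in cover else 0.0)
               for i in range(len(paths))]
        i = rng.choices(range(len(paths)), weights=rho)[0]
        picks.append(i)
        for e in paths[i]:                 # importance-weighted estimates, cf. (4)
            rho_e = sum(rho[j] for j, p in enumerate(paths) if e in p)
            L[e] += losses[t - 1][e] / rho_e
    return picks

rng = random.Random(1)
paths = [[0, 1], [2, 3], [0, 3]]           # paths share links, as in the paper
losses = [[0.9, 0.9, 0.1, 0.1]] * 500      # links 2 and 3 are good: path [2, 3] is best
picks = aospr_exp3pp(paths, cover={0, 1}, n=4, losses=losses,
                     xi=lambda t, e: 0.05, rng=rng)
```

With deterministic losses as above, the sampling distribution concentrates on the best path [2, 3] well before the horizon; the constant ξ used here is only a stand-in for the gap-dependent tunings of Theorems 3 and 4.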

B. Performance Results in Different Regimes

We present the regret performance of our proposed AOSPR-EXP3++ algorithm in the different regimes as follows. The analysis involves martingale theory and some special concentration inequalities; the proofs are given in Section VII.

1) Adversarial Regime: We first show that tuning η_t is sufficient to control the regret of AOSPR-EXP3++ in the adversarial regime; this is a general result that holds in all other regimes as well.

Theorem 1. Under the oblivious adversary, no matter how the status of the links changes (potentially in an adversarial manner), for η_t = β_t and any ξ_t(e) ≥ 0, the regret of the AOSPR-EXP3++ algorithm for any t satisfies

R(t) ≤ 4k√(tn ln n).

Note that Theorem 1 attains the same result as [8] in the adversarial regime with an oblivious adversary; based on it, we obtain the result for the adaptive adversary in the following.

Theorem 2. Under the θ-memory-bounded adaptive adversary, no matter how the status of the links changes (potentially in an adversarial manner), for η_t = β_t and any ξ_t(e) ≥ 0, the regret of the AOSPR-EXP3++ algorithm for any t satisfies

R(t) ≤ (θ + 1)(4k√(n ln n))^{2/3} t^{2/3} + o(t^{2/3}).

2) Stochastic Regime: Now we show that, for any η_t ≥ β_t, tuning the exploration parameters ξ_t(e) is sufficient to control the regret of the algorithm in the stochastic regime. We also consider a different way of tuning the exploration parameters ξ_t(e) for practical implementation considerations. We begin with the idealistic assumption that the gaps ∆(e), ∀e ∈ [1, n], are known, just to give an idea of the best result we can obtain and of the general idea behind all our proofs.

Theorem 3. Assume that the gaps ∆(e), ∀e ∈ [1, n], are known. Let t*(e) be the minimal integer that satisfies t*(e) ≥ 4c²n ln(t*(e)∆(e)²)² / (∆(e)⁴ ln(n)). For any choice of η_t ≥ β_t and any c ≥ 18, the regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c ln(t∆(e)²)/(t∆(e)²) in the stochastic regime satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O(k ln(t)²/∆(e)) + Σ_{e: ∆(e)>0} ∆(e)t*(e)
     = O(kn ln(t)²/∆(e)) + Σ_{e: ∆(e)>0} O(n/∆(e)³).

From the upper bound results, we note that the leading constants k and n are optimal and tight, as indicated by the CombUCB1 [16] algorithm. However, our regret is a factor of ln(t) worse than the optimal "logarithmic-t" regret in [2], [4], [5], [12], [16], [24]; this performance gap is practically negligible (see the numerical results in Section VII).



A Practical Implementation by Estimating the Gap: The gaps ∆(e), ∀e ∈ [1, n], cannot be known in advance before running the algorithm. Next, we show a more practical result that uses the empirical gap as an estimate of the true gap. The estimation can be performed in the background for each link e from the start of the algorithm, i.e.,

∆̂_t(e) = min{1, (1/t)(L̂_t(e) − min_{e′} L̂_t(e′))}. (5)

This yields the first algorithm of this kind that can be used in many real-world applications.
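Eq. (5) is straightforward to implement. The sketch below is our own illustration, assuming L_hat holds the cumulative estimated losses L̂_t(e) maintained by Algorithm 1:

```python
def estimate_gaps(L_hat, t):
    """Empirical gap estimate of Eq. (5): the per-slot excess of each link's
    cumulative estimated loss over the current best link, clipped to 1."""
    best = min(L_hat)
    return [min(1.0, (L - best) / t) for L in L_hat]

# After t = 100 slots, the empirically best link (index 2) gets gap 0
# and every other link gets its per-slot excess loss, clipped at 1.
gaps = estimate_gaps([30.0, 80.0, 10.0, 200.0], t=100)   # [0.2, 0.7, 0.0, 1.0]
```

Plugging these estimates into ξ_t(e) in place of the true gaps gives the AOSPR-EXP3++^{AVG} variant analyzed in Theorem 4.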

Theorem 4. Let c ≥ 18 and η_t ≥ β_t. Let t* be the minimal integer that satisfies t* ≥ 4c² ln(t*)⁴ n/ln(n); let t*(e) = max{t*, ⌈e^{1/∆(e)²}⌉} and t* = max_{e∈[1,n]} t*(e). The regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(t∆̂_{t−1}(e)²), termed AOSPR-EXP3++^{AVG}, in the stochastic regime satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O(k ln(t)³/∆(e)) + Σ_{e: ∆(e)>0} ∆(e)t*(e)
     = O(nk ln(t)³/∆(e)) + nt*.

From the theorem, we observe another factor of ln(t) of regret degradation compared with the idealistic case. Also, the additive constant t* in this theorem can be very large. However, our experimental results show that a minor modification of this algorithm achieves performance comparable to CombUCB1 [16] in the stochastic regime.

3) Mixed Adversarial and Stochastic Regime: This regime mixes adversarial and stochastic behavior. Since there is always a jammer randomly and constantly attacking k_a links out of the k links over time, we have the following theorem for the AOSPR-EXP3++^{AVG} algorithm, which gives a much more refined regret bound than the general bound of the adversarial regime.

Theorem 5. Let c ≥ 18 and η_t ≥ β_t. Let t* be the minimal integer that satisfies t* ≥ 4c² ln(t*)⁴ n/ln(n); let t*(e) = max{t*, ⌈e^{1/∆(e)²}⌉} and t* = max_{e∈[1,n]} t*(e). The regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(t∆̂_{t−1}(e)²), termed AOSPR-EXP3++^{AVG}, under oblivious jamming attack in the mixed stochastic and adversarial regime satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O((k−k_a) ln(t)³/∆(e)) + Σ_{e: ∆(e)>0} ∆(e)t*(e) + 4k_a√(tn ln n)
     = O(n(k−k_a) ln(t)³/∆(e)) + nt* + O(k_a√(tn ln n)).

Note that the results in Theorem 5 give better regret performance than the purely adversarial MAB bound of Theorem 1 and the adaptive SPR algorithm in [7]. Similarly, we have the following result under adaptive adversarial attack.

Theorem 6. Let c ≥ 18 and η_t ≥ β_t. Let t* be the minimal integer that satisfies t* ≥ 4c² ln(t*)⁴ n/ln(n); let t*(e) = max{t*, ⌈e^{1/∆(e)²}⌉} and t* = max_{e∈[1,n]} t*(e). The regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(t∆̂_{t−1}(e)²), termed AOSPR-EXP3++^{AVG}, under θ-memory-bounded adaptive adversarial attack in the mixed stochastic and adversarial regime satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O((k−k_a) ln(t)³/∆(e)) + Σ_{e: ∆(e)>0} ∆(e)t*(e) + (θ + 1)(4k_a√(n ln n))^{2/3} t^{2/3} + o(t^{2/3})
     = O((k−k_a)n ln(t)³/∆(e)) + nt* + O((θ + 1)(k_a√(n ln n))^{2/3} t^{2/3}).

4) Contaminated Stochastic Regime: We show that the algorithm AOSPR-EXP3++^{AVG} still retains "polylogarithmic-t" regret in the contaminated stochastic regime. The following is the result for the moderately contaminated stochastic regime.

Theorem 7. Under the setting of all parameters given in Theorem 3, for t*(e) = max{t*, ⌈e^{4/∆(e)²}⌉}, where t* is defined as before and t*_3 = max_{e∈[1,n]} t*(e), and for attacking strength parameter ζ ∈ [0, 1/2), the regret of the AOSPR-EXP3++ algorithm in the contaminated stochastic regime that is contaminated after τ steps satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O(k ln(t)³/((1−2ζ)∆(e))) + Σ_{e: ∆(e)>0} ∆(e) max{t*(e), τ}
     = O(nk ln(t)³/((1−2ζ)∆(e))) + nt*_3.

If ζ ∈ (1/4, 1/2), the leading factor 1/(1−2ζ) becomes very large, and we call the regime severely contaminated. The obtained regret bound is then not very meaningful: it could be much worse than the regret in the adversarial regime for both the oblivious and the adaptive adversary.

V. ACCELERATED AOSPR ALGORITHM

This section focuses on accelerated learning via multi-path probing, cooperative learning between multiple source-destination pairs, and other practical issues. All important proofs are given in Section VIII.

A. Multi-Path Probing for Adaptive Online SPR

Intuitively, probing multiple paths simultaneously offers the source more information to make decisions, which results in faster learning and a smaller regret. At each time slot t, the source gets a budget 1 ≤ M_t ≤ N and picks a subset O_t ⊆ {1, ..., N} of M_t paths to probe, observing the link weights of these routes. Note that the link weights belonging to the un-probed set of paths P \ O_t remain unrevealed. Accordingly, we have the probed and observed set of links Ō_t with the simple property e ∈ Ō_t, ∀e ∈ i ∈ O_t. The proposed Algorithm 2 is based on Algorithm 1 with ω_{t−1}(i) = w_{t−1}(i)/W_{t−1}, ∀i ∈ P, and ω_{t−1}(e) = Σ_{i: e∈i} w_{t−1}(i)/W_{t−1}, ∀e ∈ E. The probability ϱ_t = (ϱ_t(1), ..., ϱ_t(N)) of each observed path is computed as

ϱ_t(i) = ρ_t(i) + (1 − ρ_t(i))(M_t − 1)/(N − 1), if i ∈ O_t, (6)

where a new mixture exploration probability (M_t − 1)/(N − 1) is introduced and ρ_t(i) is defined in (3). Similarly, the link probability ϱ̃_t = (ϱ̃_t(1), ..., ϱ̃_t(n)) is computed as

ϱ̃_t(e) = ρ_t(e) + (1 − ρ_t(e))(m_t − 1)/(n − 1), if e ∈ Ō_t. (7)

Here, we have a link-level mixing exploration probability (m_t − 1)/(n − 1), and ρ_t(e) is defined in (4). The



probing rate m_t denotes the number of simultaneously probed links at time slot t. Assume the link weights measured by different probes within the same time slot also satisfy the assumption in Section II-A. The mixing probability (m_t − 1)/(n − 1) is informed by the source to all links along the probed and observed paths over the total of n links, and the number of probed paths M_t is a constant for a network with fixed topology. The source needs to know the number n and gradually collect the value of m_t over time; thus, the algorithm faces the problems of "cold-start" and delayed feedback. The design of (6) and (7) and the proofs of all results in this section are non-trivial tasks in our unified framework.
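The path-level mixing rule (6) can be sketched as follows (the function name and the explicit dictionary output are our own; Eq. (7) applies the same rule at the link level with (m_t − 1)/(n − 1)):

```python
def probe_path_probabilities(rho, probe_set, M):
    """Mix the base path distribution with uniform multi-path probing,
    cf. Eq. (6): every probed path i gains the extra observation
    probability (1 - rho[i]) * (M - 1) / (N - 1)."""
    N = len(rho)
    return {i: rho[i] + (1 - rho[i]) * (M - 1) / (N - 1) for i in probe_set}

rho = [0.4, 0.3, 0.2, 0.1]                 # base distribution over N = 4 paths
q = probe_path_probabilities(rho, probe_set=[0, 2], M=2)
# q[0] = 0.4 + 0.6/3 = 0.6;  q[2] = 0.2 + 0.8/3
```

The mixture raises the observation probability of every probed path without touching the un-probed ones, which is what makes the importance-weighted loss estimates (8) and (9) unbiased for the observed set.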

Algorithm 2 AOSPR-MP-EXP3++: Prediction with Multi-Path Probing

Input: M_1, M_2, ..., such that M_t ≤ N. Set β_t, ε_t(e), ξ_t(e) as in Alg. 1. ∀i ∈ P, L̂_0(i) = 0, and ∀e ∈ E, ℓ̂_0(e) = 0.
for time slot t = 1, 2, ... do
1: Choose one path H_t according to ρ_t in (3) as the selected path. Sample M_t − 1 additional paths uniformly over N. Denote the set of sampled paths by O_t, where H_t ∈ O_t and |O_t| = M_t. Let 1_t^h = 1_{h∈O_t}.
2: Update the path probabilities ϱ_t(i) according to (6). The loss of each observed path is

ℓ̂_t(i) = (ℓ_t(i)/ϱ_t(i)) 1_t^i, ∀i ∈ O_t. (8)

3: Compute the probability ρ_t(e) of choosing each link that belongs to the selected paths according to (4).
4: Let 1_t(e) = 1_{e∈h, h∈O_t}. Update the link probabilities ϱ̃_t(e) according to (7). The losses of the observed links are

ℓ̂_t(e) = (ℓ_t(e)/ϱ̃_t(e)) 1_t(e), ∀e ∈ Ō_t. (9)

5: Update all weights w_t(e), w_t(i), W_t as in Alg. 1.
end for

The Performance Results of Multi-Path Probing in the Four Regimes: If m_t is a constant or lower bounded by m, we have the following results.

Theorem 8. Under the oblivious attack with the same setting as Theorem 1, the regret of the AOSPR-EXP3++ algorithm under accelerated learning with probing rate m satisfies

R(t) ≤ 4k√((tn/m) ln n).

Theorem 9. Under the θ-memory-bounded adaptive attack with the same setting as Theorem 2, the regret of the AOSPR-EXP3++ algorithm under accelerated learning with probing rate m satisfies

R(t) ≤ (θ + 1)(4k√((n/m) ln n))^{2/3} t^{2/3} + o(t^{2/3}).

We consider the practical implementation in the stochastic regime by estimating the gap as in (5); the result under accelerated learning is given as follows.

Theorem 10. With all other parameters as in Theorem 4, the regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(mt∆̂_{t−1}(e)²) under accelerated learning with probing rate m, in the stochastic regime, satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O(k ln(t)³/(m∆(e))) + Σ_{e: ∆(e)>0} ∆(e)t*(e)
     = O(nk ln(t)³/(m∆(e))) + nt*.

Theorem 11. With all other parameters as in Theorem 5, the regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(mt∆̂_{t−1}(e)²) under oblivious jamming attack and accelerated learning with probing rate m, in the mixed stochastic and adversarial regime, satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O((k−k_a) ln(t)³/(m∆(e))) + Σ_{e: ∆(e)>0} ∆(e)t*(e) + 4k_a√((tn/m) ln n)
     = O(n(k−k_a) ln(t)³/(m∆(e))) + nt* + O(k_a√((tn/m) ln n)).

Theorem 12. With all other parameters as in Theorem 6, the regret of the AOSPR-EXP3++ algorithm with ξ_t(e) = c(ln t)²/(mt∆̂_{t−1}(e)²) under the θ-memory-bounded adaptive attack and accelerated learning with probing rate m, in the mixed stochastic and adversarial regime, satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O((k−k_a) ln(t)³/(m∆(e))) + Σ_{e: ∆(e)>0} ∆(e)t*(e) + (θ + 1)(4k_a√((n/m) ln n))^{2/3} t^{2/3} + o(t^{2/3})
     = O(n(k−k_a) ln(t)³/(m∆(e))) + nt* + O((θ + 1)(k_a√((n/m) ln n))^{2/3} t^{2/3}).

Theorem 13. With all other parameters as in Theorem 7, the regret of the AOSPR-EXP3++ algorithm under accelerated learning with probing rate m, in the contaminated stochastic regime, satisfies

R(t) ≤ Σ_{e: ∆(e)>0} O(k ln(t)³/(m(1−2ζ)∆(e))) + Σ_{e: ∆(e)>0} ∆(e) max{t*(e), τ}
     = O(nk ln(t)³/(m(1−2ζ)∆(e))) + nt*_3.

B. Multi-Source Learning for SPR Routing

So far we have focused on a single source-destination pair. Now we turn to the more practical multi-source learning with multiple source-destination pairs {1, ..., S} (we set S = m to ease comparison), which can also accelerate learning if the sources share information. Depending on the approach to information sharing, we consider two typical cases: coordinated probing and uncoordinated probing. Both cases assume the sources share link measurements; the difference is whether the selection of probing paths is centralized or distributed.

In the coordinated probing case, the probing paths are selected globally, either by a cluster head or by the different sources. We refer to this algorithm as AOSPR-CP-EXP3++. Given a total of M_t source-destination pairs at time t, the M_t probing paths are chosen sequentially, satisfying a probing rate of one path per source-destination pair. This is identical to the multi-path probing case, except that the candidate paths now come not from one's own path set but from all source-destination pairs. Thus, the same results hold as in Theorem 8-Theorem 13.

Theorem 14. The regret upper bounds in all the different regimes for AOSPR-CP-EXP3++ hold the same as in Theorem 8-Theorem 13.

In the uncoordinated probing case, the probing paths areselected in a distributed manner using AOSPR-EXP3++. We



denote the algorithm as AOSPR-UP-EXP3++. In this case, links are no longer evenly measured, since some links may be covered by more source-destination pairs than others. By applying a linear program to estimate the lower bound on the least-probed link over time t, we obtain a scale factor κ that defines the dependency degree of overlapping paths. Then, we obtain the following result.

Theorem 15. The regret upper bounds in all the different regimes for AOSPR-UP-EXP3++ are as follows: if κ = 1, it is equivalent to the single source-destination pair case and the same results hold as in Theorem 1-Theorem 7; if κ = m, it is equivalent to the accelerated learning of the multi-path probing case, where the regret results hold after substituting κ for m in Theorem 8-Theorem 13.

C. The Cold-Start and Delayed Feedback Issues

1) The Cold-Start Issue: Before the initialization of the algorithm, the source knows neither the number of links n nor the number of simultaneously probed links m_t, both of which are required for the probing probability calculation in (7). Thus, the algorithm faces the "cold-start" problem. Note that since N can be a complete collection of paths of the source, it must contain a set of covering strategies C′ in which all links of the network are covered, so the total number of links n is acquirable. Let M = min_t{M_1, M_2, ..., M_t, ...} denote the minimal number of probed paths from the source over time. We have the following Corollary 16, which indicates how long it takes for the AOSPR-MP-EXP3++ algorithm to start working normally.

Corollary 16. It takes at most N/M time slots for the AOSPR-MP-EXP3++ algorithm to finish the "cold-start" phase and start working normally.

Proof: Denote by X the event that the source node probes a path uniformly over the N possible paths at each time slot, and by Y the event that a link is among the probed links over the total of n links. Taking the conditional probability, we have E[X] = E[X|Y]E[Y], where E[X] = E[M_t]/N and E[Y] = E[m_t]/n. Due to the potential dependency among different paths, E[X|Y] ≤ 1. Thus E[X] ≤ E[Y], i.e., E[M_t]/N ≤ E[m_t]/n, which indicates N/M_t ≥ n/m_t, ∀t. Each link has probability p = m_t/n of being probed in every time slot. According to the geometric distribution, the expected time until every link is probed is 1/p = n/m_t ≤ N/M_t ≤ N/M. This completes the proof.

Nevertheless, for practical implementations, accurate values of m and n are still hard to obtain, and errors often occur in acquiring these two values. Hence, we need to know the sensitivity of the regret performance to deviations from the two true values. The result is summarized as follows.

Theorem 17. Given a deviation m_∆ of the observed value m in (7), the upper bound of the regret deviation R_{m_∆}(t) with respect to the original R(t), given its upper bound R̄(t), is:
(a) in the adversarial regime, −(1/2) m_∆ (n/m) R̄(t) for the oblivious jammer and −(1/3) m_∆ (n/m) R̄(t) for the adaptive adversary, respectively;
(b) in the stochastic regime and the contaminated regime, both −(1/2)(m_∆/m) R̄(t);
(c) in the mixed adversarial and stochastic regime, −(1/2)(m_∆/m) R̄^{k−k_a}(t) − (1/2) m_∆ (n/m) R̄^{k_a}(t) for the oblivious adversary and −(1/3)(m_∆/m) R̄^{k−k_a}(t) − (1/2) m_∆ (n/m) R̄^{k_a}(t) for the adaptive adversary, where R̄^{k−k_a}(t) and R̄^{k_a}(t) represent the upper bounds with k − k_a and k_a links in the stochastic regime and the adversarial regime, respectively.
Given a deviation n_∆ of the observed value n in (7), the upper bound of the regret deviation R_{n_∆}(t) with respect to the original R(t), given its upper bound R̄(t), is:
(d) in the adversarial regime, (1/2) n_∆ (n/m)((m−1)/(n−1)) R̄(t) ≅ (1/2) n_∆ R̄(t) for the oblivious jammer and (1/3) n_∆ (n/m)((m−1)/(n−1)) R̄(t) ≅ (1/3) n_∆ R̄(t) for the adaptive jammer, respectively;
(e) in the stochastic regime and the contaminated regime, both R_{n_∆}(t) = 0;
(f) in the mixed adversarial and stochastic regime, (1/2) n_∆ R̄^{k_a}(t) for the oblivious adversary and (1/3) n_∆ R̄^{k_a}(t) for the adaptive adversary.

From Theorem 17, we know that the regret in the adversarial regime is more sensitive to the deviation m_∆ than to the deviation n_∆, which guides the design of the network to acquire an accurate value of m_t during the probing phase. For the stochastic regimes, we also see that the regret is more sensitive to the deviation m_∆ than to the deviation n_∆. Moreover, the relative deviations of R(t) in the stochastic regimes, i.e., R_{m_∆}(t)/R(t) = Θ(m_∆/m) and R_{n_∆}(t)/R(t) = 0, are much smaller (less sensitive) than those in the adversarial regimes, i.e., R_{m_∆}(t)/R(t) = Θ(m_∆ n/m) and R_{n_∆}(t)/R(t) = Θ(n_∆). We observe all these phenomena in the simulations.

2) Delayed Feedback Issue: In a network with a large number of links, feeding the link delays back to the source node takes considerable time, which is prohibitive for real-time processing; hence each link's feedback arrives at the source with a varying delay. Moreover, if the path is switched in the middle of a long streaming transmission, the network SPR protocol needs a while to find the new optimal transmission rate, and the delay of the first few packets after the switch can be very large. In a nutshell, the delayed feedback issue is practically important, and we have the following results.

Theorem 18. Given the largest expected deviation τ* of the observed link delays, the expected delayed-feedback regret R_d(t) with respect to the original R(t) satisfies: (a) assuming the delays depend only on time but not on links, in the oblivious adversarial regime it is upper bounded by d_t E[R(t/d_t)], where d_t = min{t, τ*_t + 1} and τ*_t is the largest link delay at time t; (b) assuming the delays are independent of the rewards of the actions, in the stochastic regime and the contaminated regime it is upper bounded by E[R(t)] + Σ_{e=1}^{n} ∆(e) E[τ*_{e,t}].

VI. THE COMPUTATIONALLY EFFICIENT IMPLEMENTATION OF THE AOSPR-EXP3++ ALGORITHM

The implementation of Algorithm 1 requires the computation of probability distributions over, and the storage of, N strategies, which obviously has time and space complexity O(n^{k_i}) for a given path of length k_i. As the number of links increases, the number of paths becomes exponentially large, which is hard to scale and results in low efficiency. To address this important problem, we propose a computationally efficient enhanced algorithm utilizing dynamic programming techniques, as shown in Algorithm 3. The key idea of the enhanced algorithm is to select the links of the chosen path one by one until k_i links are chosen, instead of choosing a path from the large path space in each time slot.



We use S̄(e, k) to denote the set of paths each of which selects k links from {e, e+1, ..., n}. We also use S(e, k) to denote the set of paths each of which selects k links from {1, 2, ..., e}. We define W̄_t(e, k) = Σ_{i∈S̄(e,k)} Π_{e′∈i} w_t(e′) and W_t(e, k) = Σ_{i∈S(e,k)} Π_{e′∈i} w_t(e′). Note that they have the following properties:

W̄_t(e, k) = W̄_t(e+1, k) + w_t(e) W̄_t(e+1, k−1), (10)
W_t(e, k) = W_t(e−1, k) + w_t(e) W_t(e−1, k−1), (11)

which implies that both W̄_t(e, k) and W_t(e, k) can be calculated in O(k_i n) time (letting W̄_t(e, 0) = W_t(e, 0) = 1 and W̄_t(n+1, k) = W_t(0, k) = 0 for k > 0) by dynamic programming, for all 1 ≤ e ≤ n and 1 ≤ k ≤ k_i.
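The suffix recursion (10) is a standard elementary-symmetric-function dynamic program. A self-contained sketch with 0-indexed links and a hypothetical function name:

```python
def suffix_weight_table(w, k_max):
    """Dynamic program for Wbar_t(e, k) of Eq. (10): the total weight of all
    ways to pick k links from {e, ..., n-1} (0-indexed), where the weight of
    a selection is the product of its link weights.
    Base cases: Wbar(e, 0) = 1 and Wbar(n, k) = 0 for k > 0."""
    n = len(w)
    Wbar = [[0.0] * (k_max + 1) for _ in range(n + 1)]
    for e in range(n + 1):
        Wbar[e][0] = 1.0
    for e in range(n - 1, -1, -1):
        for k in range(1, k_max + 1):
            # Either skip link e, or take it and pick k-1 links after it:
            Wbar[e][k] = Wbar[e + 1][k] + w[e] * Wbar[e + 1][k - 1]
    return Wbar

w = [1.0, 2.0, 3.0]
Wbar = suffix_weight_table(w, 2)
# Wbar[0][2] sums the products over all 2-subsets: 1*2 + 1*3 + 2*3 = 11
```

The prefix table of Eq. (11) is symmetric, scanning links in increasing order; together the two tables cost O(k_i n) time and space, as stated above.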

In step 1, instead of drawing a path directly, we select the links of the path one by one until a path is found. Here, we select links in increasing order of link indices, i.e., we first determine whether link 1 should be selected, then link 2, and so on. For any link e, if k′ ≤ k_i links have already been chosen among links 1, ..., e−1, we select link e with probability

w_{t−1}(e) W̄_{t−1}(e+1, k_i−k′−1) / W̄_{t−1}(e, k_i−k′) (12)

and do not select e with probability W̄_{t−1}(e+1, k_i−k′)/W̄_{t−1}(e, k_i−k′). Let w(e) = w_{t−1}(e) if link e is selected in path i and w(e) = 1 otherwise, so that w(e) is the weight that e contributes to the path weight and w_{t−1}(i) = Π_{e=1}^{n} w(e). Let c(e) = 1 if e is selected in i and c(e) = 0 otherwise; the term Σ_{e′=1}^{e} c(e′) denotes the number of links chosen among links 1, 2, ..., e in path i. In this implementation, the probability that a path i is selected, i.e., w_{t−1}(i)/W_{t−1}, can be written as

Π_{e=1}^{n} [ w(e) W̄_{t−1}(e+1, k_i − Σ_{e′=1}^{e} c(e′)) / W̄_{t−1}(e, k_i − Σ_{e′=1}^{e−1} c(e′)) ] = Π_{e=1}^{n} w(e) / W̄_{t−1}(1, k_i). (13)

This probability is equivalent to that in Algorithm 1, which implies that the implementation is correct. Because we do not maintain w_t(i), it is impossible to compute ρ_t(e) as described in Algorithm 1. Instead, ρ_t(e) can be computed within O(k_i n) per round as in Eq. (14).
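The link-by-link sampling rule (12), together with the telescoping identity (13), can be checked empirically: drawing links sequentially reproduces sampling a k-subset with probability proportional to the product of its link weights. A self-contained sketch (0-indexed links, our own names):

```python
import random

def sample_path(w, k, rng):
    """Sample a k-link subset with probability proportional to the product of
    its link weights, via the link-by-link rule of Eq. (12): scan links in
    index order and include link e with probability
    w(e) * Wbar(e+1, r-1) / Wbar(e, r), where r links remain to be chosen."""
    n = len(w)
    # Suffix DP table of Eq. (10): Wbar[e][r] = weight of picking r links from e..n-1.
    Wbar = [[0.0] * (k + 1) for _ in range(n + 1)]
    for e in range(n + 1):
        Wbar[e][0] = 1.0
    for e in range(n - 1, -1, -1):
        for r in range(1, k + 1):
            Wbar[e][r] = Wbar[e + 1][r] + w[e] * Wbar[e + 1][r - 1]
    path, remaining = [], k
    for e in range(n):
        if remaining == 0:
            break
        if rng.random() < w[e] * Wbar[e + 1][remaining - 1] / Wbar[e][remaining]:
            path.append(e)
            remaining -= 1
    return path

rng = random.Random(7)
counts = {}
for _ in range(3000):
    p = tuple(sample_path([1.0, 2.0, 3.0], 2, rng))
    counts[p] = counts.get(p, 0) + 1
# Subset {1, 2} carries weight 2*3 = 6 of the total 11, so it dominates.
```

When the remaining links are exactly as many as still needed, the rule selects them with probability 1, so a valid k-subset is always produced.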

For the exploration parameters ε_t(e), since there are k_i parameters ε_t(e) in the last term of Eq. (14) below and there are n links, the storage complexity is O(kn). Similarly, we have time complexity O(knt) for the maintenance of the exploration parameters ε_t(e). Based on the above analysis, we summarize these conclusions in the following theorem. Moreover, under delayed feedback, since the base algorithm has a memory requirement of O(kn), the memory required by the delayed AOSPR-EXP3++ by time step t is upper bounded by O(knτ*_t).

Theorem 19. Algorithm 3 has polynomial time complexity O(knt), space complexity O(kn), and space complexity O(knτ*_t) under delayed feedback, with respect to the number of rounds t and the parameters k and n.

Besides, because the link selection probability q_t(e) and the updated weights of Algorithm 3 equal those of Algorithm 1, all the performance results in Sections III and IV still hold for Algorithm 3.

Algorithm 3 A Computationally Efficient Implementation of AOSPR-EXP3++

Input: n, k_i, t; see the text for the definitions of η_t and ξ_t(e).
Initialization: Set the initial link weights w_0(e) = 1, ∀e ∈ [1, n]. Let W̄_t(e, 0) = W_t(e, 0) = 1 and W̄_t(n+1, k′) = W_t(0, k′) = 0, and compute W̄_0(e, k′) and W_0(e, k′) following Eqs. (10) and (11), respectively.
for time slot t = 1, 2, ... do
1: The source selects links e ∈ [1, n] one by one according to the link probabilities computed following Eq. (12), until a path with k_i chosen links is selected.
2: The source computes the probability q_t(e), ∀e ∈ [1, n], according to Eq. (14).
3: The source calculates the loss ℓ_{t−1}(e), ∀e ∈ I_t, of each chosen link based on the observed link losses, and computes the estimated loss ℓ̂_t(e), ∀e ∈ [1, n], as follows:

ℓ̂_t(e) = ℓ_t(e)/q_t(e) if link e ∈ I_t; ℓ̂_t(e) = 0 otherwise.

4: The source updates all link weights as w_t(e) = w_{t−1}(e) e^{−η_t ℓ̂_t(e)} = e^{−η_t L̂_t(e)}, ∀e ∈ [1, n], and computes W̄_t(e, k′) and W_t(e, k′) following Eqs. (10) and (11), respectively.
end for

VII. NUMERICAL AND SIMULATION RESULTS

We evaluate the performance of our online adaptive SPR algorithm using a wireless sensor network (WSN) adopting the IEEE 802.15.4 standard, deployed in a university building. The trace contains detailed link-quality QoS metrics, i.e., delay, goodput and packet loss rate, measured under an extensive set of parameter configurations. Close to 50 thousand parameter configurations were experimented with, and measurement data of more than 200 million packets were collected over a period of 6 months. Each sender-receiver pair employs two TelosB nodes, each equipped with a TI CC2420 radio using the IEEE 802.15.4 stack implementation in TinyOS, placed in the hallways of a five-floor building. The WSN contains 16 nodes, and there is a line-of-sight path between the two nodes of a link at a specific distance, which was varied across experiments from 10 meters to 35 meters. Each node forwards packets under a particular stack parameter configuration, where the configuration set is finite.

The delay perceived by a packet mainly consists of two parts, queuing delay and service-time delay, which are measured for every data packet. More specifically, it includes the ACK frame transmission time, the retransmission duration, the maximal ACK timeout if damage occurs due to the adversarial attack, etc. To quantitatively answer how all the stack-layer parameters contribute to the delay performance, four different types of datasets are used to emulate the four typical regimes of the environments: 1) the link-quality data measured at night, where the link state distributions are benign and affected only by multi-path reflections from the walls; 2) the contaminated link-quality data measured at daytime from



q_t(e) = (1 − Σ_{e′=1}^{n} ε_t(e′)) · (Σ_{k′=0}^{k_i−1} W_{t−1}(e−1, k′) w_{t−1}(e) W̄_{t−1}(e+1, k_i−k′−1)) / W̄_{t−1}(1, k_i) + ε_t(e) |{i ∈ C : e ∈ i}|   (14)
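The exploitation term of (14) marginalizes over the number k′ of links chosen before link e; it can be verified against a brute-force enumeration. A 0-indexed sketch with hypothetical names (the exploration mixture is omitted):

```python
from itertools import combinations

def marginal_link_probability(w, k, e):
    """Marginal probability that link e appears in a path drawn with
    probability proportional to the product of its link weights, i.e.,
    the exploitation term of Eq. (14).  Uses the prefix table of Eq. (11)
    and the suffix table of Eq. (10)."""
    n = len(w)
    W = [[0.0] * (k + 1) for _ in range(n + 1)]      # W[i][r]: r links from 0..i-1
    Wbar = [[0.0] * (k + 1) for _ in range(n + 2)]   # Wbar[i][r]: r links from i..n-1
    for i in range(n + 1):
        W[i][0] = 1.0
    for i in range(n + 2):
        Wbar[i][0] = 1.0
    for i in range(1, n + 1):
        for r in range(1, k + 1):
            W[i][r] = W[i - 1][r] + w[i - 1] * W[i - 1][r - 1]
    for i in range(n - 1, -1, -1):
        for r in range(1, k + 1):
            Wbar[i][r] = Wbar[i + 1][r] + w[i] * Wbar[i + 1][r - 1]
    total = Wbar[0][k]
    # Sum over k' links chosen before e, times w(e), times k-k'-1 links after e:
    return sum(W[e][kp] * w[e] * Wbar[e + 1][k - kp - 1] for kp in range(k)) / total

w = [1.0, 2.0, 3.0, 4.0]
p = marginal_link_probability(w, 2, e=3)
# Brute force: weight of 2-subsets containing link 3, over the weight of all
brute = (sum(w[a] * w[b] for a, b in combinations(range(4), 2) if 3 in (a, b))
         / sum(w[a] * w[b] for a, b in combinations(range(4), 2)))
```

Both computations give 24/35 here; the DP version runs in O(k_i n) per link, matching the complexity claimed for Algorithm 3.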

[Fig. 2: Regret in Stochastic Regime — cumulative regret versus t for SPR-EXP3 [8], AOSPR-EXP3++ (m = 1, 6, 16) and OSPR [2] (m = 1, 6, 16).]

[Fig. 3: Regret in Contaminated Regime — the same algorithms, with the contaminated time slots and the contaminated duration marked.]

[Fig. 4: Regret in Adversarial Regime (oblivious adversary) — cumulative regret versus t for SPR-EXP3 [8] and AOSPR-EXP3++ (m = 1, 6, 16).]

3:00pm-4:00pm, when university students and employees walk most frequently in the hallways, which is a particularly harsh wireless environment; 3) the measured adversarial link quality, obtained with the same type of TelosB nodes working under the same stack parameter configuration but sending garbage data to launch an oblivious jamming attack during the run of the algorithm (link delays labeled N/A are replaced by 1111 or 999 in the dataset to indicate complete data packet loss); 4) the measured adversarial link quality under adaptive jamming attack, where the attack is implemented by a set of θ-memory jammers targeting our proposed AOSPR-EXP3++ algorithm. We omit the mixed adversarial and stochastic regime for brevity.

All computations on the collected datasets were conducted on an off-the-shelf desktop with dual 6-core Intel i7 CPUs clocked at 2.66 GHz. We make ten repetitions of each experiment to reduce potential performance bias. The solid lines in the graphs represent the mean performance over the experiments, and the dashed lines represent the mean plus one standard deviation (std) over the ten repetitions of the corresponding experiments. To show the advantage of our AOSPR-EXP3++ algorithms, we compare their performance to other existing MAB-based algorithms: the EXP3-based SPR algorithm in [8], named "SPR-EXP3"; the Upper-Confidence-Bound (UCB) based online SPR algorithm "OSPR" in [2]; and their variations. We parameterize all versions of our AOSPR-EXP3++ algorithms by ξ_t(e) = ln(t∆̂_t(e)²)/(32t∆̂_t(e)²), where ∆̂_t(e) is the empirical estimate of ∆(e) defined in (5).

In our first group of experiments, in the stochastic regime (environment) shown in Fig. 2, it is clear that AOSPR-EXP3++ enjoys almost the same (cumulative) regret as OSPR [2] and has much lower regret over time than the adversarial SPR-EXP3 [8]. We also see a significant regret reduction when accelerated learning (m = 6, 16) is employed, for both OSPR and AOSPR-EXP3++.

In our second group of experiments in the moderately contaminated stochastic environment, there are several contaminated time slots, as labeled in Fig. 3. In this case, the contamination is not fully adversarial, but drawn from a different stochastic model. Despite the corrupted rounds, the AOSPR-EXP3++ algorithm successfully returns to the stochastic operation mode and achieves better results than SPR-EXP3 [8]. With light contamination, the performance of OSPR in [2] is comparable to AOSPR-EXP3++, although OSPR is not applicable here due to its i.i.d. assumption.

We conducted the third group of experiments in the adversarial regimes. We studied the oblivious adversary case in Fig. 4. Due to the strong interference effect on each link and the arbitrarily changing jamming behavior, all algorithms experience very high accumulated regret. It can be seen that our AOSPR-EXP3++ algorithm has close and slightly worse learning performance compared to SPR-EXP3 [8], which confirms our theoretical analysis. Note that we do not implement stochastic MAB algorithms such as OSPR [2], since they are inapplicable in this regime. Moreover, we studied the adaptive adversary case in Fig. 5. Compared with Fig. 4, the learning performance is much worse, resulting in close to linear (but still sublinear) regret values, especially when the memory size Θ is large. From the collected data, we see a 252% increase in the network delay under the adaptive adversary with Θ = 4 compared to the oblivious adversary conditions. The value becomes 845% when Θ = 20, which shows that the adaptive attacker is very hard to defend against.

The centralized and distributed implementations of our cooperative learning AOSPR-EXP3++ algorithms are presented in Fig. 6. The sensitivity to deviations of the observed values m and n in the stochastic and adversarial regimes is presented in Fig. 8. It is obvious that the effects of m∆ and n∆ on the regret of AOSPR-EXP3++ in the stochastic regime are much smaller than their counterparts in the adversarial regimes. On average, we see a regret deviation of about 12% in the stochastic regime for the values of m∆ and n∆ shown in Fig. 8, while the regret deviation is about 126% in the adversarial regime. This indicates that our algorithm is more sensitive to attacked environments than to benign environments.

We evaluate the "Cold-Start" and delayed-feedback versions of all algorithms in Fig. 7 in a general unknown environment


[Fig. 5 plot: cumulative regret (×10⁵) vs. time (×10⁶) under an adaptive adversary; curves: Θ = 20, SPR-EXP3 [8]; Θ = 20, AOSPR-EXP3++; Θ = 20, AOSPR-EXP3++ m = 6; Θ = 20, AOSPR-EXP3++ m = 16; Θ = 4, AOSPR-EXP3++; Θ = 4, AOSPR-EXP3++ m = 6; Θ = 4, AOSPR-EXP3++ m = 16; annotations t^α and t^β with α < β.]

Fig. 5: Regret in Adversarial Regime

[Fig. 6 plot: cumulative regret (×10⁴) vs. time (×10⁶); curves: AOSPR-EXP3++; AOSPR-CP-EXP3++ (m = 6); AOSPR-CP-EXP3++ (m = 16); AOSPR-UP-EXP3++ (κ = 1); AOSPR-UP-EXP3++ (κ = m = 6); AOSPR-UP-EXP3++ (κ = m = 16).]

Fig. 6: Centr. and Distr. Implementations

[Fig. 7 plot: cumulative regret (×10⁴) vs. time (×10⁶) with a cold-start phase and delayed feedback τ* = 800 ms; curves: SPR-EXP3 [8]; AOSPR-EXP3++; AOSPR-EXP3++ m = 6; AOSPR-EXP3++ m = 16; OSPR [2]; OSPR [2] m = 6; OSPR [2] m = 16; annotations "Cold-start Phase" and "Delayed Feedback".]

Fig. 7: Cold Start and Delayed Feedback

[Fig. 8 plots: cumulative regret vs. time in the stochastic regime with m∆ and n∆ (left) and the adversarial regime with m∆ and n∆ (right); curves: AOSPR-EXP3++ (m = 6, n = 16) and its deviations m∆ = −1, −2, −4 and n∆ = 1, 2, 4.]

Fig. 8: Sensitivity of m and n in Adver. and Stoc. Regimes

that consists of data randomly mixed from all four regimes. The "Cold-Start" phase takes about 3∼20 packet-delivery time slots. Although it is hard to see the first 20 rounds on the plot, their effect on all the algorithms is clearly visible. For the delayed-feedback problem, we see a "quick jump" of regret for adversarial MAB algorithms (e.g., SPR-EXP3 [8]) in the initial rounds, which confirms the multiplicative effect of τ*, while a relatively small regret increase is seen for stochastic MAB algorithms (e.g., OSPR [2] and AOSPR-EXP3++), which confirms the additive effect of τ*.

We also compared the averaged received data packet delays for different network sizes, as shown in Fig. 9, in the mixed stochastic and adversarial regime under different numbers of links after a relatively long learning period of n = 7×10⁷ rounds. We find that with increasing network size, the learning performance of our AOSPR-EXP3++ approaches the state-of-the-art algorithms OSPR [2] and SPR-EXP3 [8] in the stochastic and adversarial regimes, which indicates its superior flexibility in large-scale network deployments.

Comparing the average received data packet delays of our AOSPR-EXP3++ to those of the classic algorithm SPR-EXP3, we see a 65.3% improvement of the average delay over the different sets of links n = 4, 8, 16, 32, 64 under oblivious

[Fig. 9 plots: averaged received data packet delays (ms) vs. number of links (4, 8, 16, 32, 64) in the stochastic regime (left) and the (oblivious) adversarial regime (right); curves: SPR-EXP3 [8]; AOSPR-EXP3++; OSPR [2]; AOSPR-EXP3++ m = 6.]

Fig. 9: Delay Performance with Different Network Size

jamming attack, and a 124.8% improvement under adaptive jamming attack. Moreover, to reach the same delay value as the AOSPR-EXP3++ algorithm, SPR-EXP3 takes a total of n = 12.8×10⁷ learning rounds. This indicates an 81.5% improvement in the learning period of our proposed AOSPR-EXP3++ algorithm.

Moreover, we test the computational efficiency of our algorithms; the computation times are compared in Table I. From the results, we see that the computationally efficient version of the AOSPR-EXP3++ algorithm, i.e., Algorithm 3, takes several hundred microseconds on average, while the original algorithm takes several hundred seconds, which is prohibitive in practical implementations.

VIII. PROOFS OF REGRETS IN DIFFERENT REGIMES FOR THE SINGLE-SOURCE AOSPR

We prove the theorems of the performance results in Section III in the order they were presented.

A. The Adversarial Regimes

The proof of Theorem 1 borrows some of the analysis of EXP3 for the loss model in [10]. However, the introduction of the new mixing exploration parameter and the link dependency, a special type of combinatorial MAB problem in the loss model, make the proof a non-trivial task, and we prove it for the first time.

Proof of Theorem 1.

Proof: Note first that the following equalities can be easily verified:

E_{i∼ρ_t} ℓ̃_t(i) = ℓ_t(I_t),  E_{I_t∼ρ_t} ℓ̃_t(i) = ℓ_t(i),  E_{i∼ρ_t} ℓ̃_t(i)² = ℓ_t(I_t)²/ρ_t(I_t),  and  E_{I_t∼ρ_t} [1/ρ_t(I_t)] = N.
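These identities are the standard unbiasedness and variance properties of the importance-weighted loss estimator ℓ̃_t(i) = (ℓ_t(i)/ρ_t(i)) 1{I_t = i}; a quick Monte-Carlo sanity check with toy numbers (the distribution `rho` and the losses below are made up by us, not from the paper):

```python
import random

random.seed(0)
rho = [0.5, 0.3, 0.2]        # sampling distribution over N = 3 strategies
loss = [0.9, 0.4, 0.7]       # true losses at round t (toy values)
N = len(rho)

est = [0.0] * N              # Monte-Carlo average of the estimator l~_t(i)
inv = 0.0                    # Monte-Carlo average of 1 / rho(I_t)
T = 200_000
for _ in range(T):
    It = random.choices(range(N), weights=rho)[0]
    for i in range(N):
        est[i] += (loss[i] / rho[i]) * (It == i) / T
    inv += (1.0 / rho[It]) / T

# E_{I_t ~ rho}[l~_t(i)] = loss[i]  and  E_{I_t ~ rho}[1 / rho(I_t)] = N
assert all(abs(est[i] - loss[i]) < 0.02 for i in range(N))
assert abs(inv - N) < 0.05
```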


TABLE I: Computation Time Comparisons of Algorithm 1 and Algorithm 3

Alg. Ver. vs. Comp. Time (microseconds) | (12, 4) | (24, 4) | (48, 6) | (48, 12) | (64, 6) | (64, 12) | (64, 24)
AOSPR-EXP3++: Algorithm 1 | 46.1610 | 267.3351 | 819.7124 | 2622.1341 | 11087.0957 | 222376.0135 | 1868341.2324
AOSPR-EXP3++: Algorithm 3 | 14.5137 | 29.1341 | 61.6366 | 157.6732 | 258.3622 | 456.1143 | 790.5101

(The column headers give the network configuration (n, k).)

Then, we can immediately rewrite R(t) and have

R(t) = E_t [ ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) − ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i) ].

The key step here is to consider the expectation of the cumulative losses ℓ̃_t(i) under the distribution i ∼ ρ_t. Let ε_t(i) = ∑_{e∈i} ε_t(e). However, because of the mixing terms of ρ_t, we need to introduce a few more notations. Let

u = ( ∑_{e∈1} ε_t(e), ..., ∑_{e∈i} ε_t(e), ..., ∑_{e∈|C|} ε_t(e), 0, ..., 0 ),

with the first |C| entries for i ∈ C and zeros for i ∉ C, be the distribution over all the strategies, and let ω_{t−1} = (ρ_t − u)/(1 − ∑_e ε_t(e)) be the distribution induced by AOSPR-EXP3++ at time t without mixing. Then we have:

E_{i∼ρ_s} ℓ̃_s(i) = (1 − ∑_e ε_s(e)) E_{i∼ω_{s−1}} ℓ̃_s(i) + E_{i∼u} ℓ̃_s(i)
= (1 − ∑_e ε_s(e)) (1/η_s) ln E_{i∼ω_{s−1}} exp(−η_s(ℓ̃_s(i) − E_{j∼ω_{s−1}} ℓ̃_s(j)))
− ((1 − ∑_e ε_s(e))/η_s) ln E_{i∼ω_{s−1}} exp(−η_s ℓ̃_s(i)) + E_{i∼u} ℓ̃_s(i).   (14)

Recall that for all the strategies we have the distribution ω_{t−1} = (ω_{t−1}(1), ..., ω_{t−1}(N)) with

ω_{t−1}(i) = exp(−η_t L̃_{t−1}(i)) / ∑_{j=1}^N exp(−η_t L̃_{t−1}(j)),   (15)

and for all the links we have the distribution ω_{t−1,e} = (ω_{t−1,e}(1), ..., ω_{t−1,e}(n)) with

ω_{t−1,e}(e′) = ∑_{i: e′∈i} exp(−η_t L̃_{t−1}(i)) / ∑_{j=1}^N exp(−η_t L̃_{t−1}(j)).   (16)
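For intuition, (15) and (16) can be evaluated directly on a toy strategy set (the paths, cumulative losses, and learning rate below are illustrative values of ours, not from the paper):

```python
import math

# Toy network: 4 links, 3 paths (strategies); each path is a set of link ids.
paths = [{0, 1}, {0, 2}, {2, 3}]
L = [3.0, 1.5, 2.2]          # cumulative estimated path losses L~_{t-1}(i)
eta = 0.1                    # learning rate eta_t

# Path distribution (15): exponential weights on negative cumulative losses.
w = [math.exp(-eta * Li) for Li in L]
W = sum(w)
omega_path = [wi / W for wi in w]

# Link quantity (16): total mass of all paths traversing link e'.
omega_link = [sum(omega_path[i] for i, p in enumerate(paths) if e in p)
              for e in range(4)]

assert abs(sum(omega_path) - 1.0) < 1e-12
# Link 0 lies on paths 0 and 1, so it accumulates both their masses.
assert abs(omega_link[0] - (omega_path[0] + omega_path[1])) < 1e-12
```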

In the second step, we use the inequalities ln x ≤ x − 1 and exp(−x) − 1 + x ≤ x²/2, for all x ≥ 0, and the fact that taking expectations over j ∼ ω_{s−1} and over i ∼ ω_{s−1} are equivalent, to obtain:

ln E_{i∼ω_{s−1}} exp(−η_s(ℓ̃_s(i) − E_{j∼ω_{s−1}} ℓ̃_s(j)))
= ln E_{i∼ω_{s−1}} exp(−η_s ℓ̃_s(i)) + η_s E_{i∼ω_{s−1}} ℓ̃_s(i)
≤ E_{i∼ω_{s−1}} ( exp(−η_s ℓ̃_s(i)) − 1 + η_s ℓ̃_s(i) )
≤ E_{i∼ω_{s−1}} [ η_s² ℓ̃_s(i)² / 2 ].   (17)

Taking expectations over all random strategies of the losses ℓ̃_s(i)², we have

E_t[E_{i∼ω_{s−1}} ℓ̃_s(i)²] = E_t[ ∑_{i=1}^N ω_{s−1}(i) ℓ̃_s(i)² ]
= E_t[ ∑_{i=1}^N ω_{s−1}(i) (∑_{e∈i} ℓ̃_s(e))² ]
≤ E_t[ ∑_{i=1}^N ω_{s−1}(i) k ∑_{e∈i} ℓ̃_s(e)² ]
= k E_t[ ∑_{e=1}^n ℓ̃_s(e)² ∑_{i∈P: e∈i} ω_{s−1}(i) ]
= k E_t[ ∑_{e′=1}^n ℓ̃_s(e′)² ω_{s−1,e}(e′) ]
= k E_s[ ∑_{e′=1}^n ( (ℓ_s(e′)/ρ_s(e′)) 1_s(e′) )² ω_{s−1,e}(e′) ]
≤ k E_s[ ∑_{e′=1}^n (ω_{s−1,e}(e′)/ρ_s(e′)²) 1_s(e′) ]
= k ∑_{e′=1}^n ω_{s−1,e}(e′)/ρ_s(e′)
= k ∑_{e′=1}^n ω_{s−1,e}(e′) / ( (1 − ∑_e ε_s(e)) ω_{s−1,e}(e′) + ∑_{e∈i} ε_s(e) |{i ∈ C : e ∈ i}| ) ≤ 2kn,   (18)

where the last inequality follows from the fact that (1 − ∑_e ε_t(e)) ≥ 1/2 by the definition of ε_t(e).

In the third step, note that L̃_0(i) = 0. Let Φ_t(η) = (1/η) ln (1/N) ∑_{i=1}^N exp(−η L̃_t(i)) and Φ_0(η) = 0. The second term in (14) can be bounded by using the same technique as in [10] (pages 26-28). Let us substitute inequality (18) into (17), then substitute (17) into equation (14), sum over t, and take expectations over all random strategies of the losses up to time t; we obtain

E_t[ ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) ] ≤ kn ∑_{s=1}^t η_s + ln N/η_t + ∑_{s=1}^t E_{i∼u} ℓ̃_s(i) + E_t[ ∑_{s=1}^{t−1} Φ_s(η_{s+1}) − Φ_s(η_s) ] + ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i).

Then, we get

R(t) = E_t ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) − E_t ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i)
≤ kn ∑_{s=1}^t η_s + ln N/η_t + ∑_{s=1}^t E_{i∼u} ℓ̃_s(i)
(a)≤ kn ∑_{s=1}^t η_s + ln N/η_t + k ∑_{s=1}^t ∑_{e=1}^n ε_s(e)
(b)≤ 2kn ∑_{s=1}^t η_s + ln N/η_t
(c)≤ 2kn ∑_{s=1}^t η_s + k ln n/η_t.

Note that inequality (a) holds by setting ℓ̃_s(i) = k, ∀i, s, so that the upper bound is k ∑_{i∈C} ∑_{e∈i} ε_t(e) = k ∑_{s=1}^t ∑_{e=1}^n ε_s(e). Inequality (b) holds because, for every time slot t, η_t ≥ ε_t(e). Inequality (c) is due to the fact that N ≤ n^k. Setting η_t = β_t, we prove the theorem.

Proof of Theorem 2.

Proof: To defend against the θ-memory-bounded adaptive adversary, we adopt the idea of the mini-batch protocol proposed in [9]. We define a new algorithm by wrapping AOSPR-EXP3++ with a mini-batching loop [26]. We specify a batch size τ and name the new algorithm AOSPR-EXP3++τ. The idea is to group the overall time slots 1, ..., t into consecutive and disjoint mini-batches of size τ. A single mini-batch can be viewed as one round (time slot), and the average loss suffered during that mini-batch is used to feed the original AOSPR-EXP3++. Note that our new algorithm does not need to know m, which only appears as a constant as shown in Theorem 2. So our new AOSPR-EXP3++τ algorithm still runs in an adaptive way without any prior knowledge about the environment. If we set the batch size τ = (4k√(n ln n))^{−1/3} t^{1/3} in Theorem 2 of [9], we get the regret upper bound in our Theorem 2.
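The mini-batching wrapper described above can be sketched as follows, assuming a generic base bandit learner exposing `select()`/`update(loss)` methods (all class and method names are ours, not from the paper):

```python
class MiniBatchWrapper:
    """Group time slots into disjoint mini-batches of size tau; hold one
    action for the whole batch and feed the base learner the batch's
    average loss, so one batch looks like a single round to it."""

    def __init__(self, base, tau):
        self.base, self.tau = base, tau
        self.action, self.buf = None, []

    def select(self):
        if self.action is None:               # start of a new batch:
            self.action = self.base.select()  # pick once, hold for tau slots
        return self.action

    def update(self, loss):
        self.buf.append(loss)
        if len(self.buf) == self.tau:         # batch finished:
            self.base.update(sum(self.buf) / self.tau)
            self.action, self.buf = None, []

class CountingBase:
    """Stub learner that just counts how many (batched) rounds it sees."""
    def __init__(self): self.rounds = 0
    def select(self): return 0
    def update(self, loss): self.rounds += 1

base = CountingBase()
mb = MiniBatchWrapper(base, tau=5)
for _ in range(20):
    mb.select()
    mb.update(1.0)
assert base.rounds == 4   # 20 slots / tau = 4 base-algorithm rounds
```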


B. The Stochastic Regime

Our proofs are based on the following form of Bernstein's inequality with a minor improvement, as shown in [15].

Lemma 1 (Bernstein's inequality for martingales). Let X_1, ..., X_m be a martingale difference sequence with respect to the filtration F = (F_k)_{1≤k≤m}, and let Y_k = ∑_{j=1}^k X_j be the associated martingale. Assume that there exist positive numbers ν and c such that X_j ≤ c for all j with probability 1 and ∑_{k=1}^m E[(X_k)² | F_{k−1}] ≤ ν with probability 1. Then for any b > 0,

P[Y_m > √(2νb) + cb/3] ≤ e^{−b}.
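Lemma 1 can be sanity-checked by simulation; bounded i.i.d. zero-mean increments are a special case of a martingale difference sequence (toy parameters ours):

```python
import random

random.seed(1)
m, c, b = 1000, 1.0, 3.0
trials = 2000
exceed = 0
for _ in range(trials):
    # X_j uniform on [-1, 1]: martingale differences with X_j <= c = 1.
    Y = sum(random.uniform(-1, 1) for _ in range(m))
    nu = m * (1.0 / 3.0)                 # sum of variances: m * Var(U[-1,1])
    threshold = (2 * nu * b) ** 0.5 + c * b / 3
    if Y > threshold:
        exceed += 1

# Bernstein: P[Y_m > sqrt(2*nu*b) + c*b/3] <= e^{-b} ~ 0.0498;
# the empirical frequency should stay well below that bound.
assert exceed / trials <= 0.05 + 0.02
```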

We also need to use the following technical lemma, whose proof can be found in [15].

Lemma 2. For any c > 0, we have ∑_{t=0}^∞ e^{−c√t} = O(2/c²).
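Since the summand is decreasing in t, the sum in Lemma 2 is sandwiched between ∫₀^∞ e^{−c√t} dt = 2/c² and 1 + 2/c², which a quick numeric check confirms (sketch, ours):

```python
import math

def tail_sum(c, T=200_000):
    """Truncated version of sum_{t=0}^inf e^{-c sqrt(t)}; the tail beyond
    T is negligible for the values of c used here."""
    return sum(math.exp(-c * math.sqrt(t)) for t in range(T))

for c in (0.5, 1.0, 2.0):
    S = tail_sum(c)
    # monotone summand: integral <= sum <= first term + integral
    assert 2 / c**2 <= S <= 1 + 2 / c**2
```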

To obtain the tight regret performance of AOSPR-EXP3++, we need to study and estimate the number of times each link is selected up to time t, i.e., N_t(e). We summarize it in the following lemma.

Lemma 3. Let {ε_t(e)}_{t=1}^∞ be non-increasing deterministic sequences, such that ε_t(e) ≤ ε̃_t(e) with probability 1 and ε_t(e) ≤ ε_t(e*) for all t and e. Define ν_t(e) = ∑_{s=1}^t 1/(k ε_s(e)), and define the event E^e_t:

t∆(e) − (L̃_t(e) − L̃_t(e*)) ≤ √(2(ν_t(e) + ν_t(e*)) b_t) + (1/k + 0.25) b_t / (3k ε_t(e*)).   (E^e_t)

Then for any positive sequence b_1, b_2, ... and any t* ≥ 2, the number of times link e is played by AOSPR-EXP3++ up to round t is bounded as:

E[N_t(e)] ≤ (t* − 1) + ∑_{s=t*}^t e^{−b_s} + k ∑_{s=t*}^t ε_s(e) 1{E^e_t} + ∑_{s=t*}^t e^{−η_s h_{s−1}(e)},   (19)

where

h_t(e) = t∆(e) − √( 2t b_t (1/(k ε_t(e)) + 1/(k ε_t(e*))) ) − (1/4 + 1/k) b_t / (3 ε_t(e*)).

Proof: Note that the elements of the martingale difference sequence {∆(e) − (ℓ̃_t(e) − ℓ̃_t(e*))}_{t=1}^∞ are bounded by max{∆(e) + ℓ̃_t(e*)} = 1/(k ε_t(e*)) + 1. Since ε_t(e*) ≤ ε̃_t(e*) ≤ 1/(2n) ≤ 1/4, we can simplify the upper bound by using 1/(k ε_t(e*)) + 1 ≤ (1/4 + 1/k)/ε_t(e*). We further note that

∑_{s=1}^t E_s[ (∆(e) − (ℓ̃_s(e) − ℓ̃_s(e*)))² ] ≤ ∑_{s=1}^t E_s[ (ℓ̃_s(e) − ℓ̃_s(e*))² ]
= ∑_{s=1}^t ( E_s[ℓ̃_s(e)²] + E_s[ℓ̃_s(e*)²] )
≤ ∑_{s=1}^t ( 1/ρ_s(e) + 1/ρ_s(e*) )
(a)≤ ∑_{s=1}^t ( 1/(k ε̃_s(e)) + 1/(k ε̃_s(e*)) )
≤ ∑_{s=1}^t ( 1/(k ε_s(e)) + 1/(k ε_s(e*)) ) = ν_t(e) + ν_t(e*)

with probability 1. The above inequality (a) is due to the fact that ρ_t(e) ≥ ∑_{e∈i} ε_t(e) |{i ∈ C : e ∈ i}|. Since each e belongs to only one of the covering strategies i ∈ C, |{i ∈ C : e ∈ i}| equals 1 at time slot t if link e is selected. Thus, ρ_t(e) ≥ ∑_{e∈i} ε_t(e) = k ε_t(e).

Let Ē^e_t denote the complement of the event E^e_t. Then by Bernstein's inequality, P[Ē^e_t] ≤ e^{−b_t}. The number of times link e is selected up to round t is bounded as:

E[N_t(e)] = ∑_{s=1}^t P[A_s = e]
= ∑_{s=1}^t ( P[A_s = e | E^e_{s−1}] P[E^e_{s−1}] + P[A_s = e | Ē^e_{s−1}] P[Ē^e_{s−1}] )
≤ ∑_{s=1}^t ( P[A_s = e | E^e_{s−1}] 1{E^e_{s−1}} + P[Ē^e_{s−1}] )
≤ ∑_{s=1}^t ( P[A_s = e | E^e_{s−1}] 1{E^e_{s−1}} + e^{−b_{s−1}} ).

We further upper bound P[A_s = e | E^e_{s−1}] 1{E^e_{s−1}} as follows:

P[A_s = e | E^e_{s−1}] 1{E^e_{s−1}} = ρ_s(e) 1{E^e_{s−1}}
≤ (ω_{s−1}(e) + k ε_s(e)) 1{E^e_{s−1}}
= ( k ε_s(e) + ∑_{i: e∈i} w_{s−1}(i)/W_{s−1} ) 1{E^e_{s−1}}
= ( k ε_s(e) + ∑_{i: e∈i} e^{−η_s L̃_{s−1}(i)} / ∑_{i=1}^N e^{−η_s L̃_{s−1}(i)} ) 1{E^e_{s−1}}
(a)≤ ( k ε_s(e) + e^{−η_s (L̃_{s−1}(i) − L̃_{s−1}(i*))} ) 1{E^e_{s−1}}
(b)≤ ( k ε_s(e) + e^{−η_s (L̃_{s−1}(e) − L̃_{s−1}(e*))} ) 1{E^e_{s−1}}
(c)≤ k ε_s(e) 1{E^e_{s−1}} + e^{−η_s h_{s−1}(e)}.

The above inequality (a) is due to the fact that link e belongs to only one chosen path i at time t−1, inequality (b) holds because the cumulative regret of each path is greater than the cumulative regret of each link belonging to the path, and in the last inequality (c) we used the fact that, since ε_t(e) is non-increasing, ν_t(e) ≤ t/(k ε_t(e)). Substituting this result back into the computation of E[N_t(e)] completes the proof.

Proof of Theorem 3.

Proof: The proof is based on Lemma 3. Let b_t = ln(t∆(e)²) and ε_t(e) = ε̃_t(e). For any c ≥ 18 and any t ≥ t*, where t* is the minimal integer for which t* ≥ 4c²n ln(t*∆(e)²)² / (∆(e)⁴ ln(n)), we have

h_t(e) = t∆(e) − √( 2t b_t (1/(k ε_t(e)) + 1/(k ε_t(e*))) ) − (1/4 + 1/k) b_t / (3 ε_t(e*))
≥ t∆(e) − 2√( t b_t / (k ε_t(e)) ) − (1/4 + 1/k) b_t / (3 ε_t(e))
= t∆(e) ( 1 − 2/√(kc) − (1/4 + 1/k)/(3c) )
(a)≥ t∆(e) ( 1 − 2/√c − 1.25/(3c) ) ≥ (1/2) t∆(e).

The above inequality (a) is due to the fact that (1 − 2/√(kc) − (1/4 + 1/k)/(3c)) is an increasing function with respect to k (k ≥ 1). Moreover, as indicated in [16], by a bit more sophisticated bounding, c can be made almost as small as 2 in our case. By substitution of the lower bound on h_t(e) into Lemma 3, we have

E[Nt(e)] ≤ t∗ + ln(t)

∆(e)2+ k c ln (t)2

∆(e)2+

t∑

s=1

(

e−∆(e)

4

(s−1)ln(n)n

)

≤ k c ln (t)2

∆(e)2+ ln(t)

∆(e)2+O( n

∆(e)2) + t∗,

where lemma 3 is used to bound the sum of the exponents. Inaddition, please note thatt∗ is of the orderO( kn

∆(e)4 ln(n)).


Proof of Theorem 4.

Proof: The proof is based on ideas similar to Theorem 2 and Lemma 3. Note that by our definition ∆̂_t(e) ≤ 1 and that the sequence ε_t(e) = ε_t = min{1/(2n), β_t, c ln(t)²/t} satisfies the condition of Lemma 3. Note that when β_t ≥ c ln(t)²/t, i.e., for t large enough such that t ≥ 4c² ln(t)⁴ n / ln(n), we have ε_t = c ln(t)²/t. Let b_t = ln(t) and let t* be large enough, so that for all t ≥ t* we have t ≥ 4c² ln(t)⁴ n / ln(n) and t ≥ e^{1/∆(e)²}. With these parameters

and conditions at hand, we are going to bound the remaining three terms in the bound on E[N_t(e)] in Lemma 3. The upper bound of ∑_{s=t*}^t e^{−b_s} is easy to obtain. For bounding k ∑_{s=t*}^t ε_s(e) 1{E^e_{s−1}}, we note that E^e_t holds, and for c ≥ 18 we have

∆̂_t(e) ≥ (1/t)(L̃_t(e) − min_{e′} L̃_t(e′)) ≥ (1/t)(L̃_t(e) − L̃_t(e*)) ≥ (1/t) h_t(e)
= (1/t) ( t∆(e) − 2√( t b_t / (k ε_t) ) − (1/4 + 1/k) b_t / (3 ε_t) )
= (1/t) ( t∆(e) − 2t/√(ck ln(t)) − (1/4 + 1/k) t/(3c ln(t)) )
(a)≥ (1/t) ( t∆(e) − 2t/√(c ln(t)) − 1.25t/(3c ln(t)) )
(b)≥ ∆(e) ( 1 − 2/√c − 1.25/(3c) ) ≥ (1/2)∆(e),

where inequality (a) is due to the fact that (1/t)(t∆(e) − 2t/√(ck ln(t)) − (1/4 + 1/k)t/(3c ln(t))) is an increasing function with respect to k (k ≥ 1), and inequality (b) is due to the fact that for t ≥ t* we have √(ln(t)) ≥ 1/∆(e). Thus,

ε_t(e) 1{E^e_{t−1}} ≤ c(ln t)² / (t ∆̂_t(e)²) ≤ 4c²(ln t)² / (t∆(e)²)

and k ∑_{s=t*}^t ε_s(e) 1{E^e_{s−1}} = O( k ln(t)³ / ∆(e)² ). Finally, for the last term in Lemma 3, we have already obtained h_t(e) ≥ (1/2) t∆(e) for t ≥ t* as an intermediate step in the calculation of the bound on ∆̂_t(e). Therefore, the last term is bounded in the order of O(n/∆(e)²). Using all these results together, we obtain the results of the theorem. Note that the results hold for any η_t ≥ β_t.

C. Mixed Adversarial and Stochastic Regime

Proof of Theorem 5.

Proof: The proof of the regret performance in the mixed adversarial and stochastic regime is simply a combination of the performance of the AOSPR-EXP3++AVG algorithm in the adversarial and stochastic regimes. It is very straightforward from Theorem 1 and Theorem 3.

Proof of Theorem 6.

Proof: Similarly as above, the proof is very straightforward from Theorem 2 and Theorem 3.

D. Contaminated Stochastic Regime

Proof of Theorem 7.

Proof: The key idea of proving the regret bound in the moderately contaminated stochastic regime relies on how to estimate the performance loss by taking into account the contaminated pairs. Let 1*_{t,e} denote the indicator function of the occurrence of contamination at location (t, e), i.e., 1*_{t,e} takes value 1 if contamination occurs and 0 otherwise. Let m_t(e) = 1*_{t,e} ℓ_t(e) + (1 − 1*_{t,e}) μ(e). If base arm e was contaminated on round t, then m_t(e) is adversarially assigned a loss value that is arbitrarily affected by some adversary; otherwise we use the expected loss. Let Z_t(e) = ∑_{s=1}^t m_s(e); then (Z_t(e) − Z_t(e*)) − (L_t(e) − L_t(e*)) is a martingale. After τ steps, for t ≥ τ,

Z_t(e) − Z_t(e*) ≥ t min{1*_{t,e}, 1*_{t,e*}} (ℓ_t(e) − ℓ_t(e*)) + t min{1 − 1*_{t,e}, 1 − 1*_{t,e*}} (μ(e) − μ(e*))
≥ −ζ t∆(e) + (t − ζ t∆(e)) ∆(e) ≥ (1 − 2ζ) t∆(e).

Define the event Z^e_t:

(1 − 2ζ) t∆(e) − (L̃_t(e) − L̃_t(e*)) ≤ 2√(ν_t b_t) + (1/4 + 1/k) b_t / (3 ε_t),

where ε_t is defined in the proof of Theorem 3 and ν_t = ∑_{s=1}^t 1/(k ε_s). Then by Bernstein's inequality, P[Z̄^e_t] ≤ e^{−b_t}. The remaining proof is identical to the proof of Theorem 3.

For the regret performance in the moderately contaminated stochastic regime, according to our definition with attacking strength ζ ∈ [0, 1/4], we only need to replace ∆(e) by ∆(e)/2 in Theorem 5.

IX. PROOF OF REGRET FOR THE ACCELERATED AOSPR ALGORITHM

A. Accelerated Learning in Adversarial Regime

The proof of Theorem 8 requires the following lemma, which follows from Lemma 7 of [19]. We restate it for completeness.

Lemma 4. For any probability distribution ω on {1, ..., n} and any m ∈ [1, n]:

∑_{e=1}^n ω(e)(n − 1) / ( ω(e)(n − m) + m − 1 ) ≤ n/m.
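Lemma 4 is easy to check numerically on random distributions; equality is attained at the uniform distribution (sketch, ours):

```python
import random

def lemma4_lhs(omega, m):
    """Left-hand side of Lemma 4 for a distribution omega on n elements."""
    n = len(omega)
    return sum(w * (n - 1) / (w * (n - m) + m - 1) for w in omega)

random.seed(2)
n = 10
for m in (2, 5, 10):
    for _ in range(100):
        raw = [random.random() for _ in range(n)]
        omega = [r / sum(raw) for r in raw]          # random distribution
        assert lemma4_lhs(omega, m) <= n / m + 1e-9  # Lemma 4 bound

# Equality at the uniform distribution: LHS = n/m exactly.
assert abs(lemma4_lhs([1.0 / n] * n, 2) - n / 2) < 1e-9
```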

Proof of Theorem 8.

Proof: Note first that the following equalities can be easily verified:

E_{i∼ρ̃_t} ℓ̃_t(i) = ℓ_t(I_t),  E_{I_t∼ρ̃_t} ℓ̃_t(i) = ℓ_t(i),  E_{i∼ρ̃_t} ℓ̃_t(i)² = ℓ_t(I_t)²/ρ̃_t(I_t),  and  E_{I_t∼ρ̃_t} [1/ρ̃_t(I_t)] = N.

Then, we can immediately rewrite R(t) and have

R(t) = E_t [ ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) − ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i) ].

The key step here is to consider the expectation of the cumulative losses ℓ̃_t(i) under the distribution i ∼ ρ̃_t. Let ε_t(i) = ∑_{e∈i} ε_t(e). However, because of the mixing terms of ρ̃_t, we need to introduce a few more notations. Let

ϕ_t = ( ∑_{e∈1} ε_t(e), ..., ∑_{e∈i} ε_t(e), ..., ∑_{e∈|C|} ε_t(e), 0, ..., 0 ),

with the first |C| entries for i ∈ C and zeros for i ∉ C, be the distribution over all the strategies, and let ω_{t−1} = (ρ̃_t − ϕ_t)/(1 − ∑_e ε_t(e)) be the distribution induced by AOSPR-EXP3++ at time t without mixing. Then we have:

E_{i∼ρ_s} ℓ̃_s(i) = (1 − ∑_e ε_s(e)) E_{i∼ω_{s−1}} ℓ̃_s(i) + E_{i∼ϕ_s} ℓ̃_s(i)
= (1 − ∑_e ε_s(e)) (1/η_s) ln E_{i∼ω_{s−1}} exp(−η_s(ℓ̃_s(i) − E_{j∼ω_{s−1}} ℓ̃_s(j)))
− ((1 − ∑_e ε_s(e))/η_s) ln E_{i∼ω_{s−1}} exp(−η_s ℓ̃_s(i)) + E_{i∼ϕ_s} ℓ̃_s(i).   (20)

14

Page 15: Near Optimal Adaptive Shortest Path Routing with Stochastic … · 2016-10-12 · Near Optimal Adaptive Shortest Path Routing with Stochastic Links States under Adversarial Attack

In the second step, by similar arguments as in the proof of Theorem 1, we have:

ln E_{i∼ω_{s−1}} exp(−η_s(ℓ̃_s(i) − E_{j∼ω_{s−1}} ℓ̃_s(j)))
= ln E_{i∼ω_{s−1}} exp(−η_s ℓ̃_s(i)) + η_s E_{j∼ω_{s−1}} ℓ̃_s(j)
≤ E_{i∼ω_{s−1}} ( exp(−η_s ℓ̃_s(i)) − 1 + η_s ℓ̃_s(i) )
≤ E_{i∼ω_{s−1}} [ η_s² ℓ̃_s(i)² / 2 ].   (21)

Taking expectations over all random strategies of the losses ℓ̃_s(i)², we have

E_t[E_{i∼ω_{s−1}} ℓ̃_s(i)²] = E_t[ ∑_{i=1}^N ω_{s−1}(i) ℓ̃_s(i)² ]
= E_t[ ∑_{i=1}^N ω_{s−1}(i) (∑_{e∈i} ℓ̃_s(e))² ]
≤ E_t[ ∑_{i=1}^N ω_{s−1}(i) k ∑_{e∈i} ℓ̃_s(e)² ]
= k E_t[ ∑_{e=1}^n ℓ̃_s(e)² ∑_{i∈P: e∈i} ω_{s−1}(i) ]
= k E_t[ ∑_{e′=1}^n ℓ̃_s(e′)² ω_{s−1,e}(e′) ]
= k E_t[ ∑_{e=1}^n ( (ℓ_s(e)/ρ̃_s(e)) 1_s(e) )² ω_{s−1}(e) ]
≤ k E_t[ ∑_{e=1}^n (ω_{s−1}(e)/ρ̃_s(e)²) 1_s(e) ]
= k ∑_{e=1}^n ω_{s−1}(e)/ρ̃_s(e)
= k ∑_{e=1}^n ω_{s−1}(e) / ( ρ_s(e) + (1 − ρ_s(e)) (m_s − 1)/(n − 1) )
(a)≤ k ∑_{e=1}^n 2ρ_s(e) / ( ρ_s(e) + (1 − ρ_s(e)) (m_s − 1)/(n − 1) )
(b)≤ 2k n/m,   (22)

where inequality (a) follows from the fact that (1 − ∑_e ε_t(e)) ≥ 1/2 by the definition of ε_t(e) and equality (4), and inequality (b) follows from Lemma 4. Note that ϕ_{s−1}(e) = ∑_{e∈i} ε_t(e) |{i ∈ C : e ∈ i}|, ∀e ∈ [1, n]. Taking expectations over all random strategies of the losses ℓ̃_s(i) with respect to the distribution ϕ_s, we have

E_t[E_{i∼ϕ_s} ℓ̃_s(i)] = E_t[ ∑_{i=1}^N ϕ_s(i) ℓ̃_s(i) ]
= E_t[ ∑_{i=1}^N ϕ_s(i) (∑_{e∈i} ℓ̃_s(e)) ]
= E_t[ ∑_{e=1}^n ℓ̃_s(e) ∑_{i∈P: e∈i} ϕ_s(i) ]
= E_t[ ∑_{e′=1}^n ℓ̃_s(e′) ϕ_s(e′) ]
≤ k E_t[ ∑_{e′=1}^n (ϕ_s(e′)/ρ̃_s(e′)) 1_s(e′) ]
= k ∑_{e′=1}^n ϕ_s(e′)/ρ̃_s(e′)
= k ∑_{e′=1}^n ϕ_s(e′) / ( ρ_s(e′) + (1 − ρ_s(e′)) (m_s − 1)/(n − 1) )
(a)≤ k ∑_{e′=1}^n ρ_s(e′) / ( ρ_s(e′) + (1 − ρ_s(e′)) (m_s − 1)/(n − 1) )
≤ 2k n/m,   (23)

where inequality (a) is because ρ_t(e) ≥ ϕ_{s−1}(e).

In the third step, note that L̃_0(i) = 0. Let Φ_t(η) = (1/η) ln (1/N) ∑_{i=1}^N exp(−η L̃_t(i)) and Φ_0(η) = 0. The second term in (20) can be bounded by using the same technique as in [10] (pages 26-28). Let us substitute inequality (22) into (21), then substitute (21) into equation (20), sum over t, and take expectations over all random strategies of the losses up to time t; we obtain

E_t[ ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) ] ≤ (kn/m) ∑_{s=1}^t η_s + ln N/η_t + ∑_{s=1}^t E_{i∼ϕ_s} ℓ̃_s(i) + E_t[ ∑_{s=1}^{t−1} Φ_s(η_{s+1}) − Φ_s(η_s) ] + ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i).

Then, we get

R(t) = E_t ∑_{s=1}^t E_{i∼ρ_s} ℓ̃_s(i) − E_t ∑_{s=1}^t E_{I_s∼ρ_s} ℓ_s(i)
≤ (kn/m) ∑_{s=1}^t η_s + ln N/η_t + ∑_{s=1}^t E_{i∼ϕ_s} ℓ̃_s(i)
(a)≤ (kn/m) ∑_{s=1}^t η_s + ln N/η_t + k(n/m) ∑_{s=1}^t ∑_{e=1}^n ε_s(e)
(b)≤ (2kn/m) ∑_{s=1}^t η_s + ln N/η_t
(c)≤ (2kn/m) ∑_{s=1}^t η_s + k ln n/η_t.   (24)

Note that inequality (a) holds according to (23). Inequality (b) holds because, for every time slot t, η_t ≥ ε_t(e). Inequality (c) is due to the fact that N ≤ n^k. Setting η_t = β_t, we prove the theorem.

Proof of Theorem 9.

Proof: The proof of Theorem 9 for the adaptive adversary is based on Theorem 8 and uses the same idea as in the proof of Theorem 2. Here, if we set the batch size τ = (4k√((n/m) ln n))^{−1/3} t^{1/3} in Theorem 2 of [9], we get the regret upper bound in our Theorem 9.

B. Accelerated AOSPR Algorithm in The Stochastic Regime

To obtain the tight regret performance of AOSPR-MP-EXP3++, we need to study and estimate the number of times each link is selected up to time t, i.e., N_t(e). We summarize it in the following lemma.

Lemma 5. In the multipath probing case, let {ε_t(e)}_{t=1}^∞ be non-increasing deterministic sequences, such that ε_t(e) ≤ ε̃_t(e) with probability 1 and ε_t(e) ≤ ε_t(e*) for all t and e. Define ν_t(e) = ∑_{s=1}^t 1/(k ε_s(e)), and define the event Ξ^e_t:

mt∆(e) − (L̃_t(e) − L̃_t(e*)) ≤ √(2(ν_t(e) + ν_t(e*)) b_t) + (1/k + 0.25) b_t / (3k ε_t(e*)).   (Ξ^e_t)

Then for any positive sequence b_1, b_2, ... and any t* ≥ 2, the number of times link e is played by AOSPR-EXP3++ up to round t is bounded as:

E[N_t(e)] ≤ (t* − 1) + ∑_{s=t*}^t e^{−b_s} + k ∑_{s=t*}^t ε_s(e) 1{Ξ^e_t} + ∑_{s=t*}^t e^{−η_s ℏ_{s−1}(e)},

where

ℏ_t(e) = mt∆(e) − √( 2mt b_t (1/(k ε_t(e)) + 1/(k ε_t(e*))) ) − (1/4 + 1/k) b_t / (3 ε_t(e*)).

Proof: Note that AOSPR-MP-EXP3++ probes M_t paths rather than a single path in each time slot t. Let #{·} stand for the number of elements in the set {·}. Hence,

E[N_t(e)] = E[ #{1 ≤ s ≤ t : A_s = e, Ξ^e_t} + #{1 ≤ s ≤ t : A_s = e, Ξ̄^e_t} ],

where A_s denotes the action of link selection at time slot s.


By the following simple trick, we have

E[N_t(e)] = E[ #{1 ≤ s ≤ t : A_s = e, Ξ^e_t} ] + E[ #{1 ≤ s ≤ t : A_s = e, Ξ̄^e_t} ]
≤ E[ ∑_{s=1}^t 1{1≤s≤t: A_s=e} P[#{Ξ^e_t}] ] + E[ ∑_{s=1}^t 1{1≤s≤t: A_s=e} P[#{Ξ̄^e_t}] ]   (25)
≤ E[ ∑_{s=1}^t 1{1≤s≤t: A_s=e} P[Ξ^e_{mt}] ] + E[ ∑_{s=1}^t 1{1≤s≤t: A_s=e} P[Ξ̄^e_{mt}] ].   (26)

Note that the elements of the martingale difference sequence {∆(e) − (ℓ̃_t(e) − ℓ̃_t(e*))}_{t=1}^∞ are bounded by max{∆(e) + ℓ̃_t(e*)} = 1/(k ε_t(e*)) + 1. Since ε_t(e*) ≤ ε̃_t(e*) ≤ 1/(2n) ≤ 1/4, we can simplify the upper bound by using 1/(k ε_t(e*)) + 1 ≤ (1/4 + 1/k)/ε_t(e*).

We further note that

E_s{ #{ ∑_{s=1}^t [ (∆(e) − (ℓ̃_s(e) − ℓ̃_s(e*)))² ] } }
(a)≤ E_s{ m ∑_{s=1}^t [ (∆(e) − (ℓ̃_s(e) − ℓ̃_s(e*)))² ] }
≤ m ∑_{s=1}^t E_s[ (ℓ̃_s(e) − ℓ̃_s(e*))² ]
= m ∑_{s=1}^t ( E_s[ℓ̃_s(e)²] + E_s[ℓ̃_s(e*)²] )
≤ m ∑_{s=1}^t ( 1/ρ̃_s(e) + 1/ρ̃_s(e*) )
(b)≤ m ∑_{s=1}^t ( 1/(k ε̃_s(e)) + 1/(k ε̃_s(e*)) )
≤ m ∑_{s=1}^t ( 1/(k ε_s(e)) + 1/(k ε_s(e*)) ) = m ν_t(e) + m ν_t(e*)

with probability 1. The above inequality (a) is because the number of probes of each link e at time slot s is at most m, and so is the accumulated value of the variance (∆(e) − (ℓ̃_s(e) − ℓ̃_s(e*)))². The above inequality (b) is due to the fact that ρ̃_t(e) ≥ ρ_t(e) ≥ ∑_{e∈i} ε_t(e) |{i ∈ C : e ∈ i}|. Since each e belongs to only one of the covering strategies i ∈ C, |{i ∈ C : e ∈ i}| equals 1 at time slot t if link e is selected. Thus, ρ_t(e) ≥ ∑_{e∈i} ε_t(e) = k ε_t(e).

Let Ξ̄^e_t denote the complement of the event Ξ^e_t. Then by Bernstein's inequality, P[Ξ̄^e_t] ≤ e^{−b_t}. According to (26), the number of times link e is selected up to round t is bounded as:

E[N_t(e)] ≤ ∑_{s=1}^t ( P[A_s = e | Ξ^e_{s−1}] P[Ξ^e_{s−1}] + P[A_s = e | Ξ̄^e_{s−1}] P[Ξ̄^e_{s−1}] )
≤ ∑_{s=1}^t ( P[A_s = e | Ξ^e_{s−1}] 1{Ξ^e_{s−1}} + P[Ξ̄^e_{s−1}] )
≤ ∑_{s=1}^t ( P[A_s = e | Ξ^e_{s−1}] 1{Ξ^e_{s−1}} + e^{−b_{s−1}} ).

We further upper bound P[A_s = e | Ξ^e_{s−1}] 1{Ξ^e_{s−1}} as follows:

P[A_s = e | Ξ^e_{s−1}] 1{Ξ^e_{s−1}} = ρ_s(e) 1{Ξ^e_{s−1}}
≤ (ω_{s−1}(e) + k ε_s(e)) 1{Ξ^e_{s−1}}
= ( k ε_s(e) + ∑_{i: e∈i} w_{s−1}(i)/W_{s−1} ) 1{Ξ^e_{s−1}}
= ( k ε_s(e) + ∑_{i: e∈i} e^{−η_s L̃_{s−1}(i)} / ∑_{i=1}^N e^{−η_s L̃_{s−1}(i)} ) 1{Ξ^e_{s−1}}
(a)≤ ( k ε_s(e) + e^{−η_s (L̃_{s−1}(i) − L̃_{s−1}(i*))} ) 1{Ξ^e_{s−1}}
(b)≤ ( k ε_s(e) + e^{−η_s (L̃_{s−1}(e) − L̃_{s−1}(e*))} ) 1{Ξ^e_{s−1}}
(c)≤ k ε_s(e) 1{Ξ^e_{s−1}} + e^{−η_s ℏ_{s−1}(e)}.

The above inequality (a) is due to the fact that link e belongs to only one chosen path i at time t−1, inequality (b) holds because the cumulative regret of each path is greater than the cumulative regret of each link belonging to the path, and in the last inequality (c) we used the fact that, since ε_t(e) is non-increasing, ν_t(e) ≤ t/(k ε_t(e)). Substituting this result back into the computation of E[N_t(e)] completes the proof.

Proof of Theorem 10.

Proof: The proof is based on Lemma 5. Let b_t = ln(t∆(e)²) and ε_t(e) = ε̃_t(e). For any c ≥ 18 and any t ≥ t*, where t* is the minimal integer for which t* ≥ 4c²n ln(t*∆(e)²)² / (m²∆(e)⁴ ln(n)), we have

ℏ_t(e) = mt∆(e) − √( 2mt b_t (1/(k ε_t(e)) + 1/(k ε_t(e*))) ) − (1/4 + 1/k) b_t / (3 ε_t(e*))
≥ mt∆(e) − 2√( mt b_t / (k ε_t(e)) ) − (1/4 + 1/k) b_t / (3 ε_t(e))
= mt∆(e) ( 1 − 2/√(kc) − (1/4 + 1/k)/(3c) )
(a)≥ mt∆(e) ( 1 − 2/√c − 1.25/(3c) ) ≥ (1/2) mt∆(e),

where ε_t(e) = c ln(t∆(e)²) / (tm∆(e)²). By substitution of the lower bound on ℏ_t(e) into Lemma 5, we have

E[N_t(e)] ≤ t* + ln(t)/∆(e)² + kc ln(t)²/(m∆(e)²) + ∑_{s=1}^t e^{−(m∆(e)/4)√((s−1) ln(n)/n)}
≤ kc ln(t)²/(m∆(e)²) + ln(t)/∆(e)² + O(n/(m²∆(e)²)) + t*,   (27)

where Lemma 2 is used to bound the sum of the exponents. In addition, please note that t* is of the order O(kn/(m²∆(e)⁴ ln(n))).

Proof of Theorem 11-Theorem 13. The proofs of Theorem 11-Theorem 13 use similar ideas as in the proof of Theorem 14. They are omitted here for brevity.

Proof of Theorem 14.

Proof: For the AOSPR-CP-EXP3++ algorithm, multiple source-destination pairs are coordinated so as to probe overlapping paths as little as possible, where now the statistically collected link-level probing rate m′_t is no less than m_t at each time slot. Thus, the actual link probability ρ̃_t(e) is no less than the one in (7). Following the same line of analysis, the regret upper bounds in Theorems 8-13 hold for the AOSPR-CP-EXP3++ algorithm.

Proof of Theorem 15.

Proof: The proof of Theorem 15 also relies on Theorems 8-13. Moreover, it requires the construction of a linear program. Let C_{es} (e = 1, ..., n, s = 1, ..., S) be the indicator that link e is covered by the paths of source-destination pair s, let E_s ≜ {e ∈ E : C_{es} = 1} be the subset of links constructing the paths, and let k_s ≜ ∑_{e=1}^n C_{es} be the size of this subset. Consider a source-destination pair s. The key point is to bound the minimum link sample size min_{e∈E′_s} z_e(t) for a general set of C_{es}. It is obvious that ∑_e z_{es}(t) ≥ t for all s = 1, ..., S. In the worst case, we have ∑_e z_{es}(t) = t. In Step 4 of Algorithm 2, the following integer linear program (LP) is iteratively solved:

max κ
s.t. ∑_{s=1}^S z_{es}(t) C_{es} ≥ κ,  e = 1, ..., n,
∑_{e=1}^n z_{es}(t) ≤ t,  s = 1, ..., S,
z_{es}(t) ∈ ℕ,  ∀e, s.   (28)

The aim of this LP is to distribute the $t$ probings of each source-destination pair $s$ so as to evenly cover the links and maximize the minimum link sample size $\sum_{s=1}^{S} z_e^s(t) C_e^s$. In particular, we consider the minimum link sample size for source-destination pair $s'$, i.e., $\min_{e \in E'_s} \sum_{s=1}^{S} z_e^s(t) C_e^s$. Denote the maximum value of the LP (28) by $\kappa^*$. Note that $z_e^s(t) = \lfloor t/k_s \rfloor C_e^s$ is a feasible solution to (28). Thus, $\min_{e \in E_s} z_e(t) \ge \kappa^* \ge \min_{e \in E_s} \sum_{s=1}^{S} \lfloor t/k_s \rfloor C_e^s \triangleq \kappa(t)$. We then normalize $\kappa(t)$ to $\bar{\kappa} = \sum_{\tau=1}^{t} \kappa(\tau)/t$, which is the average probing rate up to time slot $t$; the AOSPR-CP-EXP3++ algorithm uses it in the link probability calculation of (7). Under complete overlap of paths over the entire network, i.e., $C_e^s \equiv 1$ and $k_s \equiv n$, we have $\bar{\kappa} = S = m$. Following the same line of analysis, the regret upper bounds in Theorems 8-13 hold for the AOSPR-CP-EXP3++ algorithm in the multi-source accelerated learning case by replacing $m$ with $S$. In the absence of any overlap, i.e., $\sum_{s=1}^{S} C_e^s \equiv 1$, we have probing rate $\bar{\kappa} = 1$. This corresponds to the single source-destination case, and then Theorems 1-7 hold for the AOSPR-CP-EXP3++ algorithm.
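As a concrete illustration, the feasible solution $z_e^s(t) = \lfloor t/k_s \rfloor C_e^s$ gives a directly computable lower bound $\kappa(t)$ on the optimum of LP (28). The following sketch (function and variable names are ours, not from the paper) evaluates it for the two extreme coverage patterns discussed above:

```python
# Sketch: lower bound kappa(t) on LP (28) via the feasible solution
# z_e^s(t) = floor(t / k_s) * C_e^s.  Names here are illustrative only.
def kappa_lower_bound(C, t):
    """C[s][e] = 1 iff link e lies on the paths of source-destination pair s."""
    S, n = len(C), len(C[0])
    k = [sum(row) for row in C]                  # k_s = number of links pair s covers
    covered = [e for e in range(n) if any(C[s][e] for s in range(S))]
    # min over covered links of sum_s floor(t/k_s) * C_e^s
    return min(sum((t // k[s]) * C[s][e] for s in range(S)) for e in covered)

# Complete overlap (C_e^s = 1, k_s = n): every probe raises every link's count.
C_full = [[1] * 4 for _ in range(3)]             # S = 3 pairs, n = 4 links
print(kappa_lower_bound(C_full, t=100))          # 3 * floor(100/4) = 75

# No overlap (each link covered by exactly one pair): kappa(t) = floor(t/k_s).
C_disjoint = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(kappa_lower_bound(C_disjoint, t=100))      # floor(100/2) = 50
```

The complete-overlap case makes every probe count toward every link, while disjoint coverage reduces to the single-pair probing rate, matching the two limiting cases in the proof.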

Proof of Theorem 17.

Proof: To analyze the deviation of the regret $R(t)$ with respect to $m_\Delta$ and $n_\Delta$ in the adversarial regime, we need to focus on the following function
$$f(m, n, \tilde{\rho}_t(e)) = \sum_{e=1}^{n} \frac{\tilde{\rho}_t(e)}{\tilde{\rho}_t(e) + (1 - \tilde{\rho}_t(e))\frac{m-1}{n-1}}$$
subject to $\sum_{e=1}^{n} \tilde{\rho}_t(e) = 1$. The corresponding Lagrangian is
$$\mathcal{L}(m, n, \tilde{\rho}_t(e)) = \sum_{e=1}^{n} \frac{\tilde{\rho}_t(e)}{\tilde{\rho}_t(e) + (1 - \tilde{\rho}_t(e))\frac{m-1}{n-1}} + \lambda\Big(1 - \sum_{e=1}^{n} \tilde{\rho}_t(e)\Big).$$
As shown in [19], $\tilde{\rho}_t(e) = \frac{1}{n}$, $\forall e \in [1, n]$, is the only maximizer of $f(m, n, \tilde{\rho}_t(e))$.
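This maximizer claim can be checked numerically: each summand of $f$ is concave in $\tilde{\rho}_t(e)$, so the uniform point attains the maximum value $n/m$, and random points on the simplex never exceed it. A minimal sketch (helper names are ours):

```python
import random

# f(m, n, rho) from the proof: sum_e rho_e / (rho_e + (1 - rho_e)(m-1)/(n-1)).
def f(m, n, rho):
    c = (m - 1) / (n - 1)
    return sum(r / (r + (1 - r) * c) for r in rho)

m, n = 6, 16
uniform = [1.0 / n] * n
assert abs(f(m, n, uniform) - n / m) < 1e-12     # value at the maximizer is n/m

random.seed(0)
for _ in range(1000):                            # random simplex points stay below
    w = [random.random() for _ in range(n)]
    rho = [x / sum(w) for x in w]
    assert f(m, n, rho) <= f(m, n, uniform) + 1e-12
print("uniform point attains the maximum n/m =", n / m)
```

The value $n/m$ at the uniform point is exactly the leading term of the adversarial regret bound below.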

First, taking the first derivative of the Lagrangian with respect to $m$, we get
$$\frac{\partial \mathcal{L}(m,n,\tilde{\rho}_t(e))}{\partial m}\bigg|_{\tilde{\rho}_t(e)=\frac{1}{n}} = \sum_{e=1}^{n} \frac{-\tilde{\rho}_t(e)\big(1-\tilde{\rho}_t(e)\big)\frac{1}{n-1}}{\Big(\tilde{\rho}_t(e)+\big(1-\tilde{\rho}_t(e)\big)\frac{m-1}{n-1}\Big)^2} = -\frac{n^2}{m^2}.$$
We can then make the first-order approximation
$$f(m+m_\Delta, n, \tilde{\rho}_t(e)) \simeq f(m,n,\tilde{\rho}_t(e)) + m_\Delta\,\frac{\partial \mathcal{L}(m,n,\tilde{\rho}_t(e))}{\partial m}\bigg|_{\tilde{\rho}_t(e)=\frac{1}{n}} = f(m,n,\tilde{\rho}_t(e)) - \frac{n^2}{m^2}\,m_\Delta.$$

Then, according to (22) and (23), we have the deviated version of the regret in (24) as
$$R(t) \le 2k\Big(\frac{n}{m} - \frac{n^2 m_\Delta}{m^2}\Big)\sum_{s=1}^{t} \eta_s + \frac{k \ln n}{\eta_t} \le 4k\sqrt{t\Big(\frac{n}{m} - \frac{n^2 m_\Delta}{m^2}\Big)\ln n}.$$

Making a first-order approximation of the upper bound of $R(t)$, i.e., $\bar{R}(t)$, around $\frac{n}{m}$, we get $R_{m_\Delta}(t) = \frac{1}{2} m_\Delta \frac{n}{m} \bar{R}(t)$. Using a similar approach, the result for the adaptive jammer is $\frac{1}{3} m_\Delta \frac{n}{m} \bar{R}(t)$. Combining the two, we prove part (a) of Theorem 17.
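The linearization behind part (a) can be verified numerically. The sketch below (parameter values are ours, chosen for illustration) checks that, for small $m_\Delta$, the exact relative change of a bound of the form $4k\sqrt{t\,x\ln n}$ with $x = \frac{n}{m} - \frac{n^2 m_\Delta}{m^2}$ matches the first-order prediction $-\frac{1}{2}\frac{n}{m}m_\Delta$:

```python
import math

def regret_bound(k, t, x, n):
    # adversarial-regime upper bound of the form 4k * sqrt(t * x * ln n)
    return 4 * k * math.sqrt(t * x * math.log(n))

k, t, n, m = 3, 10**6, 16, 6
m_delta = 0.01                                   # small, so the linearization is tight
x0 = n / m                                       # nominal leading term
x1 = x0 - (n**2 / m**2) * m_delta                # deviated leading term
exact = regret_bound(k, t, x1, n) / regret_bound(k, t, x0, n) - 1
first_order = -0.5 * (n / m) * m_delta           # predicted relative change
assert abs(exact - first_order) < 1e-3
print(f"exact {exact:.5f} vs first-order {first_order:.5f}")
```

The agreement degrades as $m_\Delta$ grows, which is why the theorem is stated as a first-order (small-deviation) result.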

To prove part (b) of the theorem, recall the proof of the upper bound of $R(t)$ in the stochastic regime in (27). Taking a first-order approximation in $m_\Delta$ of the leading term $\frac{1}{m}$ (as a function of $m$), we easily get $R_{m_\Delta}(t) = \frac{1}{2}\frac{m_\Delta}{m}\bar{R}(t)$. Similarly, the results hold in the contaminated stochastic regimes.

The result (c) in the mixed adversarial and stochastic regime is straightforward: it is just a combination of the results in the adversarial and stochastic regimes.

Second, taking the first derivative of the Lagrangian with respect to $n$, we get
$$\frac{\partial \mathcal{L}(m,n,\tilde{\rho}_t(e))}{\partial n}\bigg|_{\tilde{\rho}_t(e)=\frac{1}{n}} = \sum_{e=1}^{n} \frac{\tilde{\rho}_t(e)\big(1-\tilde{\rho}_t(e)\big)\frac{m-1}{(n-1)^2}}{\Big(\tilde{\rho}_t(e)+\big(1-\tilde{\rho}_t(e)\big)\frac{m-1}{n-1}\Big)^2} = \frac{n^2}{m^2}\,\frac{m-1}{n-1}.$$
We then take the first-order approximation
$$f(m, n+n_\Delta, \tilde{\rho}_t(e)) \simeq f(m,n,\tilde{\rho}_t(e)) + n_\Delta\,\frac{\partial \mathcal{L}(m,n,\tilde{\rho}_t(e))}{\partial n}\bigg|_{\tilde{\rho}_t(e)=\frac{1}{n}} = f(m,n,\tilde{\rho}_t(e)) + n_\Delta\,\frac{n^2}{m^2}\,\frac{m-1}{n-1}.$$

Then, according to (22) and (23), we have the deviated version of the regret in (24) as
$$R(t) \le 2k\Big(\frac{n}{m} + n_\Delta\,\frac{n^2}{m^2}\,\frac{m-1}{n-1}\Big)\sum_{s=1}^{t} \eta_s + \frac{k \ln n}{\eta_t} \le 4k\sqrt{t\Big(\frac{n}{m} + n_\Delta\,\frac{n^2}{m^2}\,\frac{m-1}{n-1}\Big)\ln n}.$$

Making a first-order approximation of the upper bound of $R(t)$, i.e., $\bar{R}(t)$, around $\frac{n}{m}$, we get $R_{n_\Delta}(t) = \frac{1}{2} n_\Delta \frac{n}{m}\frac{m-1}{n-1} \bar{R}(t) \simeq \frac{1}{2} n_\Delta \bar{R}(t)$. Using a similar approach, the result for the adaptive jammer is $\frac{1}{3} n_\Delta \bar{R}(t)$. Combining the two, we prove part (d) of Theorem 17.

To prove part (e) of the theorem, recall the proof of the upper bound of $R(t)$ in the stochastic regime in (27), and take a first-order approximation in $n_\Delta$ of the leading regret term. Since there is no estimated value of $n$ in (7), we easily get $R_{n_\Delta}(t) = 0$. Similarly, the results hold in the contaminated stochastic regimes.

The result (f) in the mixed adversarial and stochastic regime is straightforward: it is just a combination of the results in the adversarial and stochastic regimes.

Proof of Theorem 18.

Proof: The delayed regret upper bounds of Theorem 18 come from the general results for adversarial and stochastic MABs in Theorem 1 and Theorem 6 of [22], respectively. The regret upper bound under delayed feedback in the adversarial regime is proved by a simple black-box transformation in a non-delayed oblivious MAB environment, which is a general result. For the stochastic regimes (contaminated regimes, etc.), we need to study the following high-probability bound (19)

$$\mathbb{E}[N_t(e)] \le (t^* - 1) + \sum_{\tau=t^*}^{t} e^{-b\tau} + k\sum_{\tau=t^*}^{t} \varepsilon_\tau(e)\,\mathbf{1}\{E^e_t\} + \sum_{\tau=t^*}^{t} e^{-\eta_\tau h_{\tau-1}(e)}$$
again. In the delayed-feedback setting, we use the upper confidence bounds $E^e_{s(t)}$ instead of $E^e_t$, where $s(t)$ is the number of rewards of link $e$ observed up to and including time instant $t$. In the same way as above, we can write

$$\mathbb{E}[N_t(e)] \le (t^* - 1) + \sum_{\tau=t^*}^{t} e^{-b\tau} + k\sum_{\tau=t^*}^{t} \varepsilon_\tau(e)\,\mathbf{1}\{E^e_{s(t)}\} + \sum_{\tau=t^*}^{t} e^{-\eta_\tau h_{\tau-1}(e)}. \qquad (29)$$

Since $N_{t-1}(e) = \tau^* + S_{t-1}(e)$, we get
$$\mathbb{E}[N_t(e)] \le \tau^* + (t^* - 1) + \sum_{\tau=t^*}^{t} e^{-b\tau} + k\sum_{\tau=t^*}^{t} \varepsilon_\tau(e)\,\mathbf{1}\{E^e_{s(t)}\} + \sum_{\tau=t^*}^{t} e^{-\eta_\tau h_{\tau-1}(e)}. \qquad (30)$$

Now the same concentration inequalities used to bound (29) in the analysis of the non-delayed setting can be used to upper-bound the expected value of the sums in (30). By the same technique, the result holds for the other stochastic regimes.
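The quantity $s(t)$ in this proof is simply a count of how many of a link's rewards have arrived by time $t$. A minimal bookkeeping sketch (the delay values and names are hypothetical, for illustration only):

```python
# s(t): number of rewards of a link observed up to and including time t,
# when the reward generated at round tau arrives at time tau + d_tau.
def observed_count(delays, t):
    """delays[tau-1] = feedback delay of the reward generated at round tau."""
    return sum(1 for tau, d in enumerate(delays, start=1) if tau + d <= t)

delays = [0, 2, 0, 1, 0]              # rounds 1..5 (hypothetical delays)
print(observed_count(delays, 3))      # rounds 1 and 3 have arrived by t=3 -> 2
print(observed_count(delays, 5))      # all five rewards have arrived by t=5 -> 5
```

Replacing the round index $t$ by $s(t)$ in the confidence bounds is exactly the substitution that turns (19) into (29).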

X. CONCLUSION AND FUTURE WORKS

In this paper, we propose the first adaptive online SPR algorithm, which can automatically detect the features of the environment and achieve almost optimal learning performance in all the different regimes. We have conducted extensive experiments to verify the flexibility of our algorithm and have observed performance improvements over classic approaches. We also considered many practical implementation issues to make our algorithm more useful and computationally efficient in practice. Our algorithm can be especially useful for sensor, ad hoc, and military networks in dynamic environments. In the near future, we plan to extend our model to mobile networks and networks with node failure and inaccessibility to gain more insight into the learnability of the online SPR algorithm.

REFERENCES

[1] C. Zou, D. Towsley, W. Gong, and S. Cai, "Routing worm: A fast, selective attack worm based on IP address information," In Proc. of the 19th IEEE Workshop on Principles of Advanced and Distributed Simulation, pp. 199-206, 2005.

[2] T. He, D. Goeckel, R. Raghavendra, and D. Towsley, "Endhost-based shortest path routing in dynamic networks: An online learning approach," In Proc. of IEEE International Conference on Computer Communications (INFOCOM), pp. 2202-2210, April 2013.

[3] A. Bhorkar, M. Naghshvar, T. Javidi, and B. Rao, "Adaptive Opportunistic Routing for Wireless Ad Hoc Networks," IEEE/ACM Transactions on Networking, vol. 20, no. 1, pp. 243-256, 2012.

[4] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations," IEEE/ACM Transactions on Networking, vol. 20, no. 5, pp. 1466-1478, 2012.

[5] A. A. Bhorkar and T. Javidi, "No regret routing for ad-hoc wireless networks," In Proc. of Asilomar Conference on Signals, Systems, and Computers, pp. 68-75, Nov. 2010.

[6] B. Awerbuch and R. D. Kleinberg, "Adaptive routing with end-to-end feedback: distributed learning and geometric approaches," In Proc. of the 36th Annual ACM Symposium on Theory of Computing (STOC 2004), pp. 45-53, 2004.

[7] B. Awerbuch, D. Holmer, H. Rubens, and R. Kleinberg, "Provably Competitive Adaptive Routing," In Proc. of IEEE International Conference on Computer Communications (INFOCOM 2005), pp. 1345-1356, 2005.

[8] A. Gyorgy, T. Linder, G. Lugosi, and G. Ottucsak, "The on-line shortest path problem under partial monitoring," Journal of Machine Learning Research, vol. 8, pp. 2369-2403, 2007.

[9] R. Arora, O. Dekel, and A. Tewari, "Online bandit learning against an adaptive adversary: from regret to policy regret," In Proc. of International Conference on Machine Learning (ICML 2012), pp. 366-377, 2012.

[10] S. Bubeck and N. Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends in Machine Learning, vol. 5, 2012.

[11] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48-77, 2002.

[12] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, 1985.

[13] N. Cesa-Bianchi and G. Lugosi, "Combinatorial bandits," Journal of Computer and System Sciences, vol. 78, no. 5, pp. 1404-1422, 2012.

[14] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," In Proc. of IEEE FOCS'95, pp. 322-331, 1995.

[15] Y. Seldin and A. Slivkins, "One practical algorithm for both stochastic and adversarial bandits," In Proc. of the 31st International Conference on Machine Learning (ICML 2014), pp. 286-294, 2014.

[16] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari, "Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits," In Proc. of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), pp. 1-9, 2015.

[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[18] P. Zhou, L. Chen, and D. P. Wu, "Shortest Path Routing in Unknown Environments: Is the Adaptive Optimal Strategy Available?" In Proc. of IEEE International Conference on Sensing, Communications and Networking (SECON 2016), pp. 1-9, London, UK, 2016.

[19] Y. Seldin, P. Bartlett, K. Crammer, and Y. Abbasi-Yadkori, "Prediction with Limited Advice and Multiarmed Bandits with Paid Observations," In Proc. of the 31st International Conference on Machine Learning (ICML 2014), pp. 280-287, 2014.

[20] J.-Y. Audibert, S. Bubeck, and G. Lugosi, "Regret in Online Combinatorial Optimization," Mathematics of Operations Research, vol. 39, no. 1, pp. 31-45, 2014.

[21] Y. Zhou, Q. Huang, F. Li, X.-Y. Li, M. Liu, Z. Li, and Z. Yin, "Almost Optimal Channel Access in Multi-Hop Networks With Unknown Channel Variables," In Proc. of IEEE 34th International Conference on Distributed Computing Systems (ICDCS 2014), pp. 234-245, 2014.

[22] P. Joulani, A. Gyorgy, and C. Szepesvari, "Online Learning under Delayed Feedback," In Proc. of the 30th International Conference on Machine Learning (ICML 2013), pp. 1453-1461, 2013.

[23] K. Liu and Q. Zhao, "Online learning for stochastic linear optimization problems," In Proc. of IEEE Information Theory and Applications Workshop (ITA 2012), pp. 363-367, 2012.

[24] K. Liu and Q. Zhao, "Adaptive shortest-path routing under unknown and stochastically varying link states," In Proc. of the 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt 2012), pp. 232-237, 2012.

[25] V. Dani, T. P. Hayes, and S. M. Kakade, "Stochastic Linear Optimization under Bandit Feedback," In Proc. of Conference on Learning Theory (COLT 2008), pp. 355-366, 2008.

[26] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," In Proc. of the 29th International Conference on Machine Learning (ICML 2012), pp. 58-70, 2012.



[Figure: Cumulative regret vs. time in the adversarial regime with $m_\Delta$ and $n_\Delta$ deviations. Curves: AOSPR-EXP3++ ($m=6$, $n=16$); AOSPR-EXP3++ with $m_\Delta = -1, -2, -4$; AOSPR-EXP3++ with $n_\Delta = 1, 2, 4$.]


[Figure: Cumulative regret vs. time for centralized and distributed implementations. Curves: AOSPR-EXP3++; AOSPR-EXP3++ Centr. ($m=6$ and $m=16$); AOSPR-EXP3++ Distr. ($\kappa=1$, $\kappa=m=6$, $\kappa=m=16$).]