Scalable Lifelong Reinforcement Learning
Yusen Zhan
The School of Electrical Engineering and Computer Science, Washington State University,
Pullman, WA 99163, USA
Haitham Bou Ammar
Prowler i.o., Cambridge, United Kingdom
Matthew E. Taylor∗
The School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163, USA
Abstract
Lifelong reinforcement learning provides a successful framework for agents to
learn multiple consecutive tasks sequentially. Current methods, however, suffer
from scalability issues when the agent has to solve a large number of tasks. In
this paper, we remedy the above drawbacks and propose a novel scalable tech-
nique for lifelong reinforcement learning. We derive an algorithm which assumes
the availability of multiple processing units and computes shared repositories
and local policies using only local information exchange. We then show that our approach attains a linear convergence rate, improving on current lifelong policy search methods. Finally, we evaluate our technique on a set of benchmark
dynamical systems and demonstrate learning speed-ups and reduced running
times.
Keywords: Reinforcement Learning, Lifelong Learning, Distributed
Optimization, Transfer Learning
∗Corresponding author
Email addresses: [email protected] (Yusen Zhan), [email protected] (Haitham Bou Ammar), [email protected] (Matthew E. Taylor)
1. Introduction
Reinforcement learning (RL) provides the ability to solve sequential decision-
making problems with limited feedback. Applications with these characteristics
range from robotics control [1] to personalized medicine [2, 3]. Though successful, typical RL methods require a substantial amount of experience before acquiring acceptable behavior. The cost of obtaining such experience, however, can be prohibitively expensive in terms of time and data [4]. These costs only worsen when considering multiple tasks.^1 Transfer learning^2 and multi-task learning [7, 8] have been developed to remedy these problems by allowing agents to reuse knowledge from other tasks. Unfortunately, both of these techniques suffer from scalability problems as the number of tasks grows large. In a recent attempt to target these issues, Bou Ammar et al. proposed PG-ELLA, an online hybrid of these two paradigms (transfer and multi-task learning) [9]. PG-ELLA decomposes a task's policy parameters into a shared latent repository L and task-specific coefficients s_t, one for each task. Consequently, it allows for knowledge transfer between tasks (through the latent repository L) while scaling multi-task learning by streaming problems online. In the original paper, the authors derive a closed-form solution for the shared repository.
In a lifelong setting, the agent deals with a sequence of tasks, potentially leading to computational intractability when the agent observes and learns many tasks (e.g., in a big-data scenario, the agent may observe a very large number of tasks over its lifetime). Unfortunately, PG-ELLA suffers when considering a large number of tasks or dimensions (determining the shared knowledge base L becomes intractable) due to two inefficiencies. The first inefficiency is that computing the expansion's operating point amounts to solving a local RL problem described by the current task's observed trajectories. The second inefficiency reducing PG-ELLA's scalability arises when updating the shared repository L.
^1 Multi-task learning attempts to learn a single policy or strategy that handles a set of tasks; see [5] for details.
^2 The idea of transfer learning is to reuse previously learned information to speed up the learning process; see [6] for details.
In this paper, we remedy the above drawbacks of lifelong policy search and propose a scalable method capable of handling a large number of tasks and high-dimensional policies, while ensuring meaningful parameters at each iteration of the algorithm (i.e., each step conveys more information than a single gradient step, as done in PG-ELLA). Our method assumes multiple processing units and derives a novel algorithm for lifelong policy search. Such an assumption leads to better convergence rates and more accurate local policies. Using augmented Lagrangian methods, we propose two algorithms for computing the local operating point and the shared repository, while allowing for scalable solutions. We show that such updates can be computed locally by each processing unit in closed form under broad assumptions. We analyze the performance of these new techniques both theoretically and empirically. On the theoretical side, we demonstrate superiority to PG-ELLA by proving dimensionality-free linear convergence in both determining local policies and updating the shared repository. On the empirical side, we show that our algorithm outperforms existing algorithms on a set of 5 domains with 50 variations each, including the control of a simulated helicopter. In a final set of experiments, we also demonstrate the improved generalization capabilities of this new method on unobserved tasks and report a decrease in learning time relative to current techniques. The contributions of this paper can be summarized as: i) deriving a novel scalable algorithm for lifelong policy search, ii) acquiring a linear convergence rate in both steps of the method, iii) demonstrating the effectiveness of our technique on 5 dynamical systems with 50 variations each, and iv) demonstrating learning speed-ups on unobserved tasks.
The rest of the paper is organized as follows: Section 2 introduces the background; Section 3 reviews related work; Section 4 defines our lifelong policy search problem; Section 5 presents the scalable lifelong policy search method; Section 6 reports the experimental results; and Section 7 concludes the paper.
2. Reinforcement Learning
In reinforcement learning (RL) an agent must sequentially select actions to maximize its total expected pay-off. These problems are typically formalized as Markov decision processes (MDPs) 〈X, U, P, R, γ〉, where X ⊆ R^d and U ⊆ R^m denote the state and action spaces. P : X × U × X → [0, 1] represents the transition probability governing the dynamics of the system, R : X × U → R is the reward function quantifying the performance of the agent, and γ ∈ [0, 1) is a discount factor specifying the degree to which rewards are discounted over time. At each time step m, the agent is in state x_m ∈ X and must choose an action u_m ∈ U, transitioning it to a successor state x_{m+1} ∼ p(x_{m+1} | x_m, u_m) as given by P and yielding a reward r_{m+1} = R(x_m, u_m). A policy π : X × U → [0, 1] is defined as a probability distribution over state-action pairs, where π(u_m | x_m) denotes the probability of choosing action u_m in state x_m.
Policy gradients [1, 4] are a class of reinforcement learning algorithms that have shown success in solving complex robotic problems [1]. Such methods represent the policy π_θ(u_m | x_m) by an unknown vector of parameters θ ∈ R^d. The goal is to determine the optimal parameters θ* that maximize the expected average pay-off:
\[
\mathcal{J}(\theta) = \int_{\tau} p_{\theta}(\tau)\, \mathcal{R}(\tau)\, d\tau,
\]
where τ = [x_{0:M}, u_{0:M}] denotes a trajectory over a possibly infinite horizon M. The probability of acquiring a trajectory, p_θ(τ), under the policy parameterization π_θ(·), and the average per-time-step return R(τ), are given by:
\[
p_{\theta}(\tau) = p_0(x_0) \prod_{m=0}^{M-1} p\big(x_{m+1} \mid x_m, u_m\big)\, \pi_{\theta}\big(u_m \mid x_m\big), \qquad
\mathcal{R}(\tau) = \frac{1}{M} \sum_{m=0}^{M} r_{m+1},
\]
with an initial state distribution p_0 : X → [0, 1]. Policy gradient methods, such as episodic REINFORCE [10] and the Natural Actor-Critic [11, 12], typically employ a lower bound on the expected return J(θ) for fitting the unknown policy parameters. To achieve this, such algorithms generate trajectories using the current policy θ, and then compare performance against a new parameterization θ̃. As detailed in [1], the lower bound on the expected return can be attained using Jensen's inequality and the concavity of the logarithm:
\[
\log \mathcal{J}(\tilde{\theta}) = \log \int_{\tau} p_{\tilde{\theta}}(\tau)\, \mathcal{R}(\tau)\, d\tau
\;\geq\; \int_{\tau} p_{\theta}(\tau)\, \mathcal{R}(\tau)\, \log \frac{p_{\tilde{\theta}}(\tau)}{p_{\theta}(\tau)}\, d\tau + \text{const.}
\;\propto\; -\mathrm{KL}\!\left( p_{\theta}(\tau)\, \mathcal{R}(\tau)\, \big\|\, p_{\tilde{\theta}}(\tau) \right) = \mathcal{J}_{L,\theta}(\tilde{\theta}),
\]
where KL(p(τ) ‖ q(τ)) = ∫_τ p(τ) log (p(τ)/q(τ)) dτ. Thus, we can directly optimize the lower bound instead of the original objective.
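As an illustration of this family of methods, the sketch below estimates a REINFORCE-style policy gradient for a linear-Gaussian policy from sampled trajectories. It is a minimal example under assumed conventions (scalar actions, identity features); the function names and the data layout are ours, not taken from the paper.

```python
import numpy as np

# Minimal REINFORCE-style sketch for a linear-Gaussian policy
# u ~ N(x @ theta, sigma^2), with scalar actions and identity features.
# All names and the trajectory layout are illustrative assumptions.

def grad_log_pi(theta, x, u, sigma=0.5):
    """Gradient of log N(u; x @ theta, sigma^2) with respect to theta."""
    return x * (u - x @ theta) / sigma ** 2

def reinforce_gradient(theta, trajectories, sigma=0.5):
    """Estimate grad J(theta) ~ E[ R(tau) * sum_m grad log pi(u_m | x_m) ]."""
    grads = []
    for traj in trajectories:                      # traj: list of (x, u, r) tuples
        R = np.mean([r for _, _, r in traj])       # average per-step return R(tau)
        g = sum(grad_log_pi(theta, x, u, sigma) for x, u, _ in traj)
        grads.append(R * g)
    return np.mean(grads, axis=0)

# Toy usage: random trajectories of length 10 in a 3-dimensional state space
rng = np.random.default_rng(0)
theta = np.zeros(3)
trajs = [[(rng.normal(size=3), rng.normal(), rng.normal()) for _ in range(10)]
         for _ in range(20)]
print(reinforce_gradient(theta, trajs))
```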
3. Related Work
Transfer learning aims to improve the learning speed of an agent on a new target task by transferring knowledge learned from one or more previous source tasks [6]. However, most current transfer learning methods only consider the target task instead of all previously visited tasks. In contrast, multi-task learning (MTL) methods optimize performance over all tasks by training models over all possible tasks [7, 8], but they are computationally expensive in the lifelong learning setting, since the agent must learn a large number of tasks over time [13, 14].
There are many works that consider parallelizing reinforcement learning. For example, Caarls and Schuitema apply parallel online Temporal Difference (TD) learning to motor control [15]. Gu et al. proposed an asynchronous parallel method for deep reinforcement learning [16]. Yahya et al. introduce a distributed asynchronous guided policy search method for deep reinforcement learning [17]. Levine et al. also employed Bregman ADMM, a centralized ADMM method, in the guided policy search setting [18]. However, all of the above methods are based on standard reinforcement learning rather than the lifelong learning setting.
PG-ELLA is a recent lifelong policy gradient RL algorithm that can efficiently learn multiple tasks consecutively while sharing knowledge between task policies to accelerate learning [9]. In fact, PG-ELLA, which is a centralized method, can be viewed as the single-agent (one-node) case of the algorithm we propose in this paper. MTL for policy gradients has also been explored by Deisenroth et al. [19] through customizing a single parameterized controller to individual tasks that differ only in the reward function. Another closely related work is on hierarchical Bayesian MTL [20], which can learn RL tasks consecutively but, unlike our approach, requires discrete states and actions. Snel and Whiteson's representation learning approach is also related, but it is limited to a potential function representation [21]. GO-MTL [22] is a related multi-task learning algorithm, but it focuses on supervised learning. All of these MTL methods suffer from scalability issues due to their reliance on centralized computation.
4. Lifelong Policy Search
We adopt the lifelong learning framework previously introduced in [9]. An agent has to master multiple RL tasks while supporting transfer to improve learning. Let T denote the set of tasks, in which each element is an MDP. At any time, the learner may face any of the previously seen tasks, and it must maximize its performance across all tasks. The goal is to learn optimal policies π*_{θ*_1}, …, π*_{θ*_{|T|}} for all tasks with corresponding parameters θ*_1, …, θ*_{|T|}, where θ*_j ∈ R^d parameterizes the policy for task j. For each task j, the learner samples a set of n_j trajectories T^{(j)} = {τ^{(1)}_j, …, τ^{(n_j)}_j} from the environment using predefined policies, and each trajectory has length M^{(j)}. To transfer knowledge between tasks, we adopt the shared knowledge-base representation, where the policy parameters for each task are represented as a linear combination of a shared latent basis L ∈ R^{d×p} and coefficient vectors s_j ∈ R^p, i.e., θ_j = L s_j. Here, each column of L encodes a set of transferable knowledge that is tailored to suit the peculiarities of each task using s_j. Such a decomposition has been widely used in the multi-task learning literature [9, 13, 22, 23]. Hence, the optimization problem for a total of t tasks can be written as:
\[
\min_{L,\, s_1:s_t}\; \frac{1}{t} \sum_{j=1}^{t} \Big( -\mathcal{J}_j(L s_j) + \mu_1 \| s_j \|_1 \Big) + \mu_2 \| L \|_F^2, \tag{1}
\]
where ‖v‖_1 and ‖Z‖_F denote the L1 norm of v ∈ R^p and the Frobenius norm of Z ∈ R^{d×p}, respectively, µ_1 and µ_2 denote regularization parameters, and
J_j(L s_j) is the total pay-off for task j:
\[
\mathcal{J}_j(L s_j) = \int_{\tau^{(j)} \in \mathbb{T}^{(j)}} p^{(j)}_{L s_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)\, d\tau^{(j)}.
\]
Here, p^{(j)}_{L s_j}(τ^{(j)}) and R^{(j)}(τ^{(j)}) denote the task-specific trajectory probability and expected total reward:
\[
p^{(j)}_{L s_j}\big(\tau^{(j)}\big) = p^{(j)}_0\big(x^{(j)}_0\big) \prod_{m=0}^{M^{(j)}-1} p^{(j)}\big(x^{(j)}_{m+1} \mid x^{(j)}_m, u^{(j)}_m\big)\, \pi^{(j)}_{L s_j}\big(u^{(j)}_m \mid x^{(j)}_m\big),
\qquad
\mathcal{R}^{(j)}\big(\tau^{(j)}\big) = \frac{1}{M^{(j)}} \sum_{m=0}^{M^{(j)}} \mathcal{R}^{(j)}\big(x^{(j)}_m, u^{(j)}_m\big).
\]
As noted in [9], Equation (1) describes batch multi-task learning as opposed to the online lifelong learning setting. This is due to the dependency of the above optimization problem on all trajectories from all tasks. To remedy this problem, the authors employed a local second-order Taylor expansion of the lower bound to J_t(θ_t) around θ*_t = argmin_{θ̃} KL(p_θ(τ) R(τ) ‖ p_{θ̃}(τ)), i.e., the maximizer of the lower bound J_{L,θ}(θ̃) from Section 2, leading to:
\[
\min_{L,\, s_1:s_t}\; \frac{1}{t} \sum_{j=1}^{t} \Big( \big\| \theta^{\star}_j - L s_j \big\|^2_{H^{(j)}} + \mu_1 \| s_j \|_1 \Big) + \mu_2 \| L \|_F^2, \tag{2}
\]
where j denotes task j, ‖v‖²_{H^{(j)}} = vᵀ H^{(j)} v, and H^{(j)} denotes the Hessian defined as:
\[
H^{(j)} = -\mathbb{E}_{\tau^{(j)} \sim p_{\theta^{\star}_j}(\tau^{(j)})}\left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=1}^{M^{(j)}} \nabla^2_{\theta_j, \theta_j} \log \pi_{\theta_j}\big(u^{(j)}_m \mid x^{(j)}_m\big) \right].
\]
The problems of PG-ELLA. Although successful, as shown in [9, 23], PG-ELLA
is unscalable to a large number of tasks, which is typical of the lifelong learning setting. Here, we differentiate the two sources of inefficiency in PG-ELLA. First, we note that although the dependency on all tasks has been remedied using a second-order expansion, computing the operating point θ*_j can be expensive. In essence, determining θ*_j equates to solving a local MDP (i.e., the jth) defined over T^{(j)}. This led the authors to propose a one-step gradient update given a small set of observed trajectories for a task j. Apart from the reduction in the number of available trajectories, gradient methods are known to exhibit slow convergence rates and are prone to local minima. Consequently, a PG-ELLA update can lead to a low-quality latent repository L due to the error in approximating θ*_j.
The second inefficiency reducing PG-ELLA's scalability arises when updating the shared repository L. On one hand, the update involves an inversion of a dp × dp matrix, leading to a complexity of O(d³p²) in the best case. Consequently, such a method is scalable for neither high-dimensional policy parameterizations nor large latent dimensions p, and, as mentioned in [23], gradient-based alternatives share the problems mentioned above. On the other hand, as the number of tasks grows large, these centralized methods (updating L or computing θ*_j) become intractable due to computational and memory constraints.
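For concreteness, the following minimal sketch evaluates the objective of Equation (2) for a batch of observed tasks; the array layout and helper name are our own assumptions rather than the paper's code.

```python
import numpy as np

# Illustrative evaluation of the objective in Equation (2).
# theta_star: (t, d) operating points, H: (t, d, d) task Hessians,
# L: (d, p) shared repository, S: (t, p) task-specific coefficients.
def ella_objective(theta_star, H, L, S, mu1=0.1, mu2=0.1):
    t = theta_star.shape[0]
    total = 0.0
    for j in range(t):
        r = theta_star[j] - L @ S[j]                    # theta*_j - L s_j
        total += r @ H[j] @ r + mu1 * np.abs(S[j]).sum()
    return total / t + mu2 * np.linalg.norm(L, 'fro') ** 2
```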
5. Scalable Lifelong Policy Search
Our strategy to remedy the two problems above consists of both scaling PG-ELLA to multiple trajectories and multiple tasks through more processing units, as well as generalizing gradient methods to algorithms with faster convergence rates. We assume the existence of multiple processors defined through an undirected graph G = (V, E), with V denoting the set of nodes and E the set of edges. We assume a natural ordering among the nodes from 1, …, |V|, and e_{ij} ∈ E denotes the edge between nodes i and j with i < j. We define a block matrix A with |E| block rows and |V| block columns. Suppose edge e_{ij} is the kth edge in E, associated with the kth block row of A; then block (k, i) of A equals the identity matrix I ∈ R^{d×d}, i.e., A(k, i) = I, block (k, j) equals −I, i.e., A(k, j) = −I, and all other blocks are zero. A(I) denotes the general edge-node incidence matrix built with the identity matrix I.
The intuition is that we distribute all the tasks among the nodes in the graph, and each node computes its local variables using only the local variables of the nodes connected to it by edges in E (a small construction of A(I) is sketched below). One potential drawback is the communication cost between nodes.
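A minimal sketch of this block incidence matrix, assuming 0-indexed nodes and an explicit edge list (both conventions are ours, purely for illustration):

```python
import numpy as np

# Block edge-node incidence matrix A(I) as described above: for the k-th
# edge (i, j) with i < j, block (k, i) = I_d and block (k, j) = -I_d.
def incidence_matrix(num_nodes, edges, d):
    A = np.zeros((d * len(edges), d * num_nodes))
    for k, (i, j) in enumerate(edges):
        A[k * d:(k + 1) * d, i * d:(i + 1) * d] = np.eye(d)
        A[k * d:(k + 1) * d, j * d:(j + 1) * d] = -np.eye(d)
    return A

# Example: a 3-node path graph (0-1, 1-2) with d = 2
A = incidence_matrix(3, [(0, 1), (1, 2)], d=2)
```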
5.1. Scaling the Computation of θ*_j
As mentioned earlier, the first problem of current lifelong policy search algorithms relates to the scalability of determining the local operating point θ*_j for the second-order Taylor expansion in Equation (2). Given a set of trajectories T^{(j)} for a task j, θ*_j is determined as the solution to:
\[
\theta^{\star}_j = \arg\min_{\tilde{\theta}}\; -\mathcal{J}^{(j)}_{L,\theta}(\tilde{\theta})
= \arg\min_{\tilde{\theta}}\; \mathrm{KL}\!\left( p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)\, \big\|\, p^{(j)}_{\tilde{\theta}}\big(\tau^{(j)}\big) \right)
= \arg\min_{\tilde{\theta}} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\!\left[ \frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\tilde{\theta}}\big(\tau^{(j)}\big) } \right] d\tau^{(j)},
\]
where p^{(j)}_θ denotes the trajectory distribution under the current (behavior) policy that generated T^{(j)} and θ̃ is the optimization variable.
Given the processing units above, we assume a random split of the set T^{(j)} among the nodes 1, …, |V| such that T^{(j)} = ∪_{i=1}^{|V|} T^{(j)}_i, introduce a local copy θ^{(j)}_i of the decision variable at each node i, and rewrite the optimization problem for θ*_j as:
\[
\min_{\theta^{(j)}_1 : \theta^{(j)}_{|V|}}\; \sum_{i=1}^{|V|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\!\left[ \frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \right] d\tau^{(j)} \tag{3}
\]
\[
\text{s.t.}\quad \theta^{(j)}_i = \theta^{(j)}_l, \quad \text{for all } e_{il} \in \mathcal{E} \text{ and } i < l,
\]
where the constraints θ^{(j)}_i = θ^{(j)}_l for all e_{il} ∈ E with i < l have been added so as to ensure agreement across the different processing units in G. To rewrite the constraints in a compact representation, we define the vector θ^{(j)}_: = [θ^{(j),T}_1, …, θ^{(j),T}_{|V|}]^T ∈ R^{d|V|} and the matrix A = A(I_{d×d}) ∈ R^{d|E| × d|V|}, where I_{d×d} is the d-dimensional identity matrix. Using this notation, we reformulate the problem in Equation (3) as:
\[
\min_{\theta^{(j)}_1 : \theta^{(j)}_{|V|}}\; \sum_{i=1}^{|V|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\!\left[ \frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \right] d\tau^{(j)}
\quad \text{s.t.}\quad A\,\theta^{(j)}_: = 0.
\]
The above is a constrained optimization problem that we aim to solve efficiently. In general, this problem can be intractable due to the non-convexity of each of the i objectives. Under the broad class of exponential-family policies, however,
each of the local objectives is convex in θ^{(j)}_i and can thus be solved optimally and efficiently. Our technique for determining the solution relies on a blend of an augmented Lagrangian and a decomposition-coordination procedure similar to the alternating direction method of multipliers (ADMM) [24]. In contrast to standard ADMM, our method is required to determine the feasible point based exclusively on local interactions among the computational nodes. This setting is well known in the distributed optimization literature [25]. Unfortunately, current methods for distributed ADMM mostly focus on the univariate case (as detailed in [25]). The univariate case is limited by its one-dimensional setting; we generalize a distributed version of ADMM to our multi-dimensional setting to propose a scalable and efficient solver for lifelong policy search. Let λ ∈ R^{d|E|} be a vector of Lagrange multipliers associated with the constraint Aθ^{(j)}_: = 0, and let ρ > 0 be the constant of the penalty term ‖Aθ^{(j)}_:‖_2^2. Following augmented Lagrangian methods (similar to existing work [24, 25]), the Lagrangian function becomes:
\[
\mathcal{L}_{\text{Aug}}\big(\theta^{(j)}_:, \lambda\big) = \sum_{i=1}^{|V|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\!\left[ \frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \right] d\tau^{(j)}
- \lambda^{T} A\,\theta^{(j)}_: + \frac{\rho}{2} \big\| A\,\theta^{(j)}_: \big\|_2^2.
\]
At this stage, we could use standard ADMM to acquire primal-dual updates and thus determine the solution θ^{*,(j)}_:. Standard ADMM, however, is not readily applicable to our setting due to the need for a distributed solution in which each node performs parameter updates based only on local information exchange with its neighbors. To achieve such a distributed solution, we generalize the approach in [25] to the multi-dimensional setting and associate a dual variable with the constraint on each edge. Each processor i keeps a local decision estimate θ^{(j)}_i and a vector of dual variables λ_{ki} with k < i. We define two sets of neighbors of a node i, denoted P_i and S_i. Here, P_i collects all nodes having an index lower than i, i.e., P_i = {l | e_{li} ∈ E, l < i}, and S_i all nodes with an index higher than i, i.e., S_i = {l | e_{il} ∈ E, i < l}.
Algorithm 1 Scalable Operating Point Computation
1: Initialize θ^{(j)}_i, ∀i ∈ {1, …, |V|}, and set ρ > 0.
2: for k = 0, …, K do
3: Each agent i updates θ^{(j)}_i in sequential order i = 1, …, |V| using:
\[
\theta^{(k+1),(j)}_i = \arg\min_{\theta^{(j)}_i} \Bigg[ \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \, d\tau^{(j)}
+ \frac{\rho}{2} \sum_{l \in P_i} \Big\| \theta^{(k+1),(j)}_l - \theta^{(j)}_i - \frac{1}{\rho}\lambda^{(k)}_{li} \Big\|_2^2
+ \frac{\rho}{2} \sum_{l \in S_i} \Big\| \theta^{(j)}_i - \theta^{(k),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{il} \Big\|_2^2 \Bigg].
\]
4: Each agent updates λ_{li} for l ∈ P_i as:
\[
\lambda^{(k+1)}_{li} = \lambda^{(k)}_{li} - \rho\left( \theta^{(k+1),(j)}_l - \theta^{(k+1),(j)}_i \right).
\]
5: end for
Our algorithm for determining θ*_j is summarized by the set of instructions in Algorithm 1. Algorithm 1 consists of two steps. First, primal updates are performed by each node in G (line 3). Second, given the updated primal variables, the dual variables are computed (line 4).
Note that in the case of Gaussian policies, which can be generalized to any policy from the exponential family, the primal updates on line 3 of Algorithm 1 can be computed in closed form (see Appendix B for the derivation):
\[
\theta^{(k+1),(j)}_i = \left[ -\frac{\rho\, \mathrm{deg}_i}{2} I_{d\times d} + \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)}\!\left( \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m) \right) \right]^{-1}
\times \left[ \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)}\!\left( \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, u^{(j)}_m \right)
- \frac{\rho}{2} \left( \sum_{l \in S_i} \Big[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \Big] + \sum_{l \in P_i} \Big[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \Big] \right) \right],
\]
where deg_i is the degree of processing unit i (i.e., the number of its neighbors) and Φ(·) is a
feature map of the system's state space. Having solved the problem of computing the linearization operating point, the second step needed for scaling lifelong policy search is handling the inefficiency in updating the shared repository. We detail the solution to this step in the next section.
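Before moving on, the following minimal sketch illustrates the primal-dual structure of Algorithm 1 on a toy problem. To keep the example self-contained, the local RL term of each node is replaced by a simple quadratic surrogate 0.5‖θ_i − b_i‖²; the surrogate, the names, and the data layout are ours, not the paper's.

```python
import numpy as np

# Toy illustration of Algorithm 1's sweep: sequential primal updates per
# node followed by a dual update per edge.  The RL term is replaced by the
# quadratic surrogate 0.5 * ||theta_i - b_i||^2 so the update is explicit.
def distributed_consensus_admm(b, edges, rho=1.0, iters=100):
    n, d = b.shape
    theta = np.zeros((n, d))
    lam = {e: np.zeros(d) for e in edges}                  # one dual per edge (l, i), l < i
    pred = {i: [l for (l, j) in edges if j == i] for i in range(n)}
    succ = {i: [j for (l, j) in edges if l == i] for i in range(n)}
    for _ in range(iters):
        for i in range(n):                                 # line 3: sequential primal sweep
            deg = len(pred[i]) + len(succ[i])
            rhs = b[i].copy()
            for l in pred[i]:                              # predecessors use fresh iterates
                rhs += rho * theta[l] - lam[(l, i)]
            for l in succ[i]:                              # successors use previous iterates
                rhs += rho * theta[l] + lam[(i, l)]
            theta[i] = rhs / (1.0 + rho * deg)
        for (l, i) in edges:                               # line 4: dual update per edge
            lam[(l, i)] -= rho * (theta[l] - theta[i])
    return theta

# Usage: 4 nodes on a chain; all local estimates approach the mean of b
b = np.array([[0.0], [1.0], [2.0], [3.0]])
print(distributed_consensus_admm(b, [(0, 1), (1, 2), (2, 3)]))
```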
Theoretical Guarantees: The following theorem establishes the convergence rate of the proposed algorithm.
Theorem 1. Consider the sequence {θ^{(k),(j)}_:, λ^{(k)}}_{k≥0} with
\[
\theta^{(k),(j)}_: = \big[ \theta^{(k),(j),T}_1, \ldots, \theta^{(k),(j),T}_{|V|} \big]^T \quad \text{and} \quad \lambda^{(k)} = \big[ \lambda^{(k),T}_1, \ldots, \lambda^{(k),T}_{|E|} \big]^T,
\]
where λ^{(k)}_i ∈ R^d is a dual vector. Following Algorithm 1, the sequence converges with a rate of O(1/k).
(The interested reader can find the proof in Appendix A.)
5.2. Scaling the Model Update
The lifelong policy search problem is framed as the solution to the optimization problem in Equation (2). It is easy to see that such a problem is not jointly convex in the shared repository L and the coefficient vectors s_j. Holding one variable fixed, however, leads to a convex problem in the other. Consequently, following such an alternating procedure, one can solve a sequence of convex optimization problems iteratively. The computations over the s_j vectors are relatively simple to scale. The scalability problem arises when trying to update the latent repository L due to its dependency on all tasks observed thus far. To scale the computation of the shared repository L, we follow a similar strategy to that of the previous section. Holding the s_j fixed, to compute L we need to solve:
\[
\min_{L}\; \frac{1}{t} \sum_{j=1}^{t} \big\| \theta^{\star}_j - L s_j \big\|^2_{H_j} + \mu_2 \| L \|_F^2.
\]
First, we introduce the vec(Z) notation, which vectorizes a matrix Z ∈ R^{d×p} into a vector of size dp, i.e.,
\[
\mathrm{vec}(Z) = [a_{1,1}, \ldots, a_{d,1}, a_{1,2}, \ldots, a_{d,2}, \ldots, a_{1,p}, \ldots, a_{d,p}]^T,
\]
where a_{i,j} denotes the (i, j)-th element of matrix Z. Consequently, we can write vec(L s_j) = (s_j^T ⊗ I_{d×d}) vec(L), with ⊗ being the Kronecker product and I_{d×d} the d-dimensional identity matrix. Hence, we can rewrite the optimization problem in the following equivalent form:
\[
\min_{\mathrm{vec}(L)}\; \frac{1}{t} \sum_{j=1}^{t} \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big) \mathrm{vec}(L) \big\|^2_{H_j} + \mu_2 \| \mathrm{vec}(L) \|_2^2,
\]
where ‖L‖_F^2 = ‖vec(L)‖_2^2. Splitting this problem across G leads us to the following:
\[
\min_{\mathrm{vec}(L_1) : \mathrm{vec}(L_{|V|})}\; \sum_{i=1}^{|V|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big) \mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \| \mathrm{vec}(L_i) \|_2^2 \Big)
\]
\[
\text{s.t.}\quad \mathrm{vec}(L_l) = \mathrm{vec}(L_i) \quad \text{for all } e_{li} \in \mathcal{E} \text{ and } l < i,
\]
where vec(L_i) is the copy of the shared repository learned by processing unit i and t_i is the total number of tasks assigned to that processing unit.
Similarly, we can rewrite the constraints compactly in the form A \overline{vec}(L) = 0, where \overline{vec}(L) = [vec(L_1)^T, …, vec(L_{|V|})^T]^T. Thus, we can write the augmented Lagrangian as:
\[
\mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda_L\big) = \sum_{i=1}^{|V|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big)\mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \Big)
- \lambda_L^{T} A\, \overline{\mathrm{vec}}(L) + \frac{\rho}{2} \big\| A\, \overline{\mathrm{vec}}(L) \big\|_2^2,
\]
where ρ > 0 is the penalty parameter and λ_L is a vector of Lagrange multipliers. At this stage, we can follow a similar strategy to that of the previous section to determine the optimal solution \overline{vec}(L)*. To update the primal variables, each processing unit in G solves the following problem:
\[
\mathrm{vec}\big(L^{(k+1)}_i\big) = \arg\min_{\mathrm{vec}(L_i)} \Bigg[ \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big)\mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \Big)
+ \frac{\rho}{2} \sum_{l \in P_i} \Big\| \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}(L_i) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \Big\|_2^2
+ \frac{\rho}{2} \sum_{l \in S_i} \Big\| \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,il} \Big\|_2^2 \Bigg], \tag{4}
\]
Algorithm 2 Scalable Operating Point Computation for L
1: Initialize vec(L^{(0)}_i), ∀i ∈ {1, …, |V|}, and set ρ > 0.
2: for k = 0, …, K do
3: Each agent i updates vec(L^{(k+1)}_i) in sequential order i = 1, …, |V| using Equation (4).
4: Each agent updates λ_{L,li} for l ∈ P_i as:
\[
\lambda^{(k+1)}_{L,li} = \lambda^{(k)}_{L,li} - \rho\left( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big) \right).
\]
5: end for
where k is the iteration number. Given an updated primal, the dual variables λ_L are then computed according to:
\[
\lambda^{(k+1)}_{L,li} = \lambda^{(k)}_{L,li} - \rho\left( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big) \right). \tag{5}
\]
Equation (5) is similar to step 4 in Algorithm 1. We provide the pseudocode in Algorithm 2. Note that, under the Gaussian policy assumption, the update rule can be acquired in closed form, leading to the following (see Appendix B for details):
\[
\mathrm{vec}\big(L^{(k+1)}_i\big) = \left[ \frac{1}{t_i} \sum_{j=1}^{t_i} \Gamma_j^{T} H_j \Gamma_j + \frac{\rho\, \mathrm{deg}_i}{2} I_{dp\times dp} \right]^{-1}
\left( \frac{1}{t_i} \sum_{j=1}^{t_i} \Gamma_j^{T} H_j\, \theta^{\star}_j
+ \frac{\rho}{2} \left( \sum_{l \in S_i} \Big[ \mathrm{vec}\big(L^{(k)}_l\big) + \frac{1}{\rho}\lambda^{(k)}_{L,il} \Big] + \sum_{l \in P_i} \Big[ \mathrm{vec}\big(L^{(k+1)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \Big] \right) \right),
\]
with Γ_j = s_j^T ⊗ I_{d×d}. Given the solution to the shared repository, the task-specific coefficients are then computed locally using a LASSO [26] step for each of the t_i tasks (a sketch of this node-level update follows below).
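The sketch below mirrors this node-level closed-form update with numpy; Γ_j is formed via a Kronecker product and vec(·) is taken column-major to match the definition above. All names and the array layout are illustrative assumptions, and the subsequent per-task s_j step could, for instance, use an off-the-shelf LASSO solver.

```python
import numpy as np

# One node's closed-form repository update (the vec(L_i) formula above).
# theta_star: (t_i, d), H: (t_i, d, d), S: (t_i, p) are the node's local tasks;
# neighbor_term: the (d*p,) vector  sum_{S_i}[vec(L_l)+lam/rho] + sum_{P_i}[vec(L_l)-lam/rho].
def update_L_block(theta_star, H, S, neighbor_term, rho, deg_i, d, p):
    t_i = theta_star.shape[0]
    M = (rho * deg_i / 2.0) * np.eye(d * p)
    v = (rho / 2.0) * neighbor_term
    for j in range(t_i):
        Gamma = np.kron(S[j], np.eye(d))          # (d, d*p); equals s_j^T kron I_d
        M += (Gamma.T @ H[j] @ Gamma) / t_i
        v += (Gamma.T @ H[j] @ theta_star[j]) / t_i
    vec_L = np.linalg.solve(M, v)
    return vec_L.reshape(d, p, order='F')         # undo the column-major vec()
```

As a quick sanity check on the Kronecker identity used here, np.kron(s, np.eye(d)) @ L.flatten(order='F') coincides with L @ s for any L of shape (d, p).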
Theoretical Guarantees: Our next theorem derives the convergence rate of this second algorithm.
Theorem 2. Consider the sequence {\overline{vec}(L^{(k)}), λ_L^{(k)}}_{k≥0} with
\[
\overline{\mathrm{vec}}\big(L^{(k)}\big) = \big[ \mathrm{vec}(L_1)^{(k),T}, \ldots, \mathrm{vec}(L_{|V|})^{(k),T} \big]^T \quad \text{and} \quad \lambda^{(k)}_L = \big[ \lambda^{(k),T}_{L,1}, \ldots, \lambda^{(k),T}_{L,|E|} \big]^T,
\]
where λ^{(k)}_{L,i} ∈ R^{dp} is a dual vector. Following Equations (4) and (5), the sequence converges with a rate of O(1/k).
(The interested reader can find the proof in Appendix A.)
Theorems 1 and 2 show that our technique improves over PG-ELLA in both the policy gradient update step and the computation of the shared repository. This improvement is significant, as it establishes convergence guarantees for both steps.
6. Experimental Results
Next, we present an empirical validation on five benchmark dynamical systems introduced in [4, 23].
Cart Pole: The cart pole (CP) system is characterized by the cart's mass m_c in kg, the pole's mass m_p in kg, and the pole's length l in meters. The state is given by the cart's position x and velocity v, as well as the pole's angle θ and angular velocity θ̇. The goal is to control the pole so that it stays in an upright position.
Double Inverted Pendulum: The double inverted pendulum (DIP) is an extension of the cart pole system. It has one cart with mass m_0 in kg and two poles of lengths l_1 and l_2 in meters. We assume the poles themselves are massless and that two masses m_1 and m_2 in kg sit on top of each pole. The state consists of the cart's position x_1 and velocity v_1, the lower pole's angle θ_1 and angular velocity θ̇_1, as well as the upper pole's angle θ_2 and angular velocity θ̇_2. The goal is to learn a policy that drives the two poles to a specific state.
Helicopter: This linearized model of a CH-47 (HC) tandem-rotor helicopter assumes horizontal motion at 40 knots. It is characterized by two matrices A ∈ R^{4×4} and B ∈ R^{4×2}. The main goal is to stabilize the helicopter by controlling the collective and differential rotor thrust.
Simple Mass: The simple mass (SM) system is characterized by the spring constant k in N/m, the damping constant d in Ns/m, and the mass m in kg. The system's state is given by the position x and velocity v of the mass. The goal is to train a policy for guiding the mass to a specific state.
Double Mass: The double mass (DM) system is an extension of the simple mass system. It has two masses m_1, m_2 in kg and two springs with spring constants k_1 and k_2 in N/m, as well as damping constants d_1 and d_2 in Ns/m. The state consists of the first mass' position x_1 and velocity v_1, as well as the second mass' position x_2 and velocity v_2. The goal is to learn a policy that drives the two masses to a specific state.
Table 1: Parameter ranges used in the experiments
CP:  m_c, m_p ∈ [0, 1];  l ∈ [0.2, 0.8]
DIP: m_0 ∈ [1.5, 3.5];  m_1, m_2 ∈ [0.055, 0.1];  l_1 ∈ [0.4, 0.8];  l_2 ∈ [0.5, 0.9]
HC:  A with random entries A(i, j) ∈ [−32, 3];  B with random entries B(i, j) ∈ [−9, 1]
SM:  m ∈ [3, 5];  k ∈ [1, 7]
DM:  m_1 ∈ [1, 7];  m_2 ∈ [1, 10];  k_1 ∈ [1, 5];  k_2 ∈ [1, 7];  d_1, d_2 ∈ [0.01, 0.1]
We generated 50 tasks for each domain by varying the dynamical parameters of each domain (250 tasks in total). For reproducibility, we summarize these ranges in Table 1. The reward was given by −√(‖x_m − x‖_2²) − √(u_m^T u_m), where x is the goal state, x_m is the current state, and u_m is the action the agent has chosen. We ran each task for a total of 200 iterations. At each iteration, the learner observed a task through 50 trajectories of 150 steps and performed algorithmic updates. We used eNAC [27], a standard PG algorithm, as the base learner. We compare our method (ePG-ELLA) to standard PG, PG-ELLA [9], and GO-MTL [22].
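A one-line helper corresponding to the reward stated above (the function name and the explicit goal-state argument are ours, for illustration only):

```python
import numpy as np

# Reward used across domains: negative root of the squared distance to the
# goal state minus the root of the squared action magnitude.
def reward(x_m, u_m, x_goal):
    return -np.sqrt(np.sum((x_m - x_goal) ** 2)) - np.sqrt(u_m @ u_m)
```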
Figure 1: Figures (a)-(e) report average reward versus iterations on the SM, DM, CP, DCP, and HC systems, respectively, and demonstrate that ePG-ELLA is capable of outperforming the other methods. Figure (f) shows the agreement (consensus) error across processors on 3 sample systems.
Figure 2: Figures (a) and (b) show that ePG-ELLA is capable of outperforming the others in both jump-start and asymptotic performance. Figure (c) shows that ePG-ELLA takes less time to optimize the objective function. Finally, Figure (d) demonstrates the generalization capability of our method in terms of average reward on unobserved tasks in the HC domain.
To distribute our computations, we made use of
MATLAB's parallel pool running on 10 nodes. For distributed computation in ePG-ELLA, we assigned the 50 tasks equally to the 10 agents, that is, each agent learns 5 tasks in parallel. The edges between agents were generated randomly without duplication, and the agents exchange information over the resulting graph. To make sure every node in the graph has at least one predecessor and at least one successor, we connected node 1 to node 2, node 2 to node 3, …, and node 9 to node 10. We then randomly assigned 20 additional edges to ensure a connected graph (a sketch of this construction follows below).
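A minimal sketch of this graph construction, under the assumption that the 20 extra edges are sampled uniformly over node pairs (the seed, the helper name, and the sampling rule are our own choices):

```python
import random

# Chain 1-2, 2-3, ..., 9-10 guarantees predecessors/successors; 20 extra
# random edges are then added (duplicates are skipped via the set).
def build_graph(num_nodes=10, extra_edges=20, seed=0):
    rng = random.Random(seed)
    edges = {(i, i + 1) for i in range(1, num_nodes)}
    while len(edges) < (num_nodes - 1) + extra_edges:
        i, j = sorted(rng.sample(range(1, num_nodes + 1), 2))
        edges.add((i, j))            # stored as (i, j) with i < j
    return sorted(edges)

print(build_graph())
```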
Figures 1a-1e report the average reward on the different benchmarks as a function of the number of iterations. In all these systems, our method is capable of outperforming the others in terms of total reward. Figure 1f shows that our method is capable of acquiring feasible solutions by reporting the consensus error on 3 different systems. Figures 2a and 2b demonstrate that our algorithm is capable of outperforming PG-ELLA and GO-MTL in terms of jump-start and asymptotic performance. Figure 2c demonstrates that our method takes less running time to optimize the objective function.
Transfer provides significant advantages when the lifelong learning agent faces a novel task. To evaluate this, we chose the most complex of the task domains (HC) and trained the lifelong learner on 49 tasks to yield an effective shared knowledge base for each of the algorithms. Then, we evaluated the algorithms' ability to learn a new (unseen) task from the helicopter domain, comparing the benefit of transferring from the repository learned by ePG-ELLA (L_ePG-ELLA) with PG-ELLA (L_PG-ELLA), GO-MTL (L_GO-MTL), and standard PG. Figure 2d depicts the result of learning on this novel task, averaged over HC tasks, showing that ePG-ELLA converges fastest in this scenario in terms of average reward.
Negative transfer may occur during training with a small probability. However, we ran the experiments over 200 iterations and averaged the results. Any cases where transfer was negative were statistically overwhelmed by the positive results in our setting.
7. Conclusion and Future Work
We proposed a scalable and efficient lifelong learner for policy gradient methods. Our approach achieves linear convergence rates both in acquiring the linearization operating points and in updating the shared knowledge base. We have shown that our new technique can outperform state-of-the-art methods on a variety of benchmark control systems. Future work will include targeting the more general lifelong learning setting in different domains and applying the method to real-world applications such as robotics.
8. Acknowledgements
We thank the anonymous reviewers for their useful suggestions and comments. This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL's support includes NASA NNX16CD07C, NSF IIS-1149917, NSF IIS-1643614, and USDA 2014-67021-22174.
Appendix A. Convergence and Convergence Rate
In this section, we show the proof of Theorem 2; Theorem 1 can be proved in a similar way. To show that the sequence {\overline{vec}(L^{(k)}), λ_L^{(k)}}_{k≥0} converges with rate O(1/k), define the function
\[
F\big(\overline{\mathrm{vec}}(L)\big) = \sum_{i=1}^{|V|} f_i\big(\mathrm{vec}(L_i)\big)
= \sum_{i=1}^{|V|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big)\mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \Big)
\]
and the Lagrangian function
\[
\mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda_L\big) = \sum_{i=1}^{|V|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big)\mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \Big)
- \lambda_L^{T} A\, \overline{\mathrm{vec}}(L) + \frac{\rho}{2} \big\| A\, \overline{\mathrm{vec}}(L) \big\|_2^2, \tag{A.1}
\]
where
\[
\overline{\mathrm{vec}}\big(L^{(k)}\big) = \Big[ \mathrm{vec}\big(L^{(k)}_1\big)^{T}, \ldots, \mathrm{vec}\big(L^{(k)}_{|V|}\big)^{T} \Big]^T \quad \text{and} \quad \lambda^{(k)}_L = \big[ \lambda^{(k),T}_{L,1}, \ldots, \lambda^{(k),T}_{L,|E|} \big]^T,
\]
with λ^{(k)}_{L,i} ∈ R^{dp} being a dual vector. It is easy to verify that each cost function
\[
f_i\big(\mathrm{vec}(L_i)\big) = \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\| \theta^{\star}_j - \big(s_j^T \otimes I_{d\times d}\big)\mathrm{vec}(L_i) \big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \Big)
\]
is convex. We also need the following assumptions.
Assumption 1 (Existence of a Saddle Point). The Lagrangian function L^L_Aug has a saddle point, that is, there is a solution (\overline{vec}(L*), λ*_L) such that
\[
\mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L^{*}), \lambda_L\big) \le \mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L^{*}), \lambda^{*}_L\big) \le \mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda^{*}_L\big) \tag{A.2}
\]
for all \overline{vec}(L) ∈ R^{dp|V|} and λ_L ∈ R^{dp|E|}.
Assumption 2. The penalty term ‖A \overline{vec}(L)‖_2^2 is bounded by a constant γ > 0.
We extend Wei and Ozdaglar's results [25] to the multi-dimensional case. Before we provide the proofs, we need the following lemmas.
Lemma 1. Consider the sequence {\overline{vec}(L^{(k)}), λ_L^{(k)}}_{k≥0} with
\[
\overline{\mathrm{vec}}\big(L^{(k)}\big) = \Big[ \mathrm{vec}\big(L^{(k)}_1\big)^{T}, \ldots, \mathrm{vec}\big(L^{(k)}_{|V|}\big)^{T} \Big]^T \quad \text{and} \quad \lambda^{(k)}_L = \big[ \lambda^{(k),T}_{L,1}, \ldots, \lambda^{(k),T}_{L,|E|} \big]^T,
\]
where λ^{(k)}_{L,i} ∈ R^{dp} is a dual vector. Then, for all k ≥ 0, we have the relation
\[
F\big(\overline{\mathrm{vec}}(L)\big) - F\big(\overline{\mathrm{vec}}(L^{(k+1)})\big) + \big( \overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}(L^{(k+1)}) \big)^T \cdot
\Big( -A^T \lambda^{(k+1)}_L - \rho A^T B\big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big) + \rho B^T B\big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big) \Big) \ge 0
\]
for all \overline{vec}(L) ∈ R^{dp|V|}, where A is the general edge-node incidence matrix and B = min{0, A}, with the min taken element-wise.
Proof. Let
\[
g^{(k)}_i\big(\mathrm{vec}(L_i)\big) = \frac{\rho}{2} \sum_{l \in P_i} \Big\| \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}(L_i) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \Big\|_2^2
+ \frac{\rho}{2} \sum_{l \in S_i} \Big\| \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,il} \Big\|_2^2.
\]
From Equation (4), the minimizer of the function g^{(k)}_i + f_i is vec(L^{(k+1)}_i), which implies that there exists a subgradient sg(vec(L^{(k+1)}_i)) ∈ ∂f_i(vec(L^{(k+1)}_i)) such that sg(vec(L^{(k+1)}_i)) + ∇g^{(k)}_i(vec(L^{(k+1)}_i)) = 0, and we obtain
\[
\big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \Big( sg\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \nabla g^{(k)}_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) \Big) = 0 \tag{A.3}
\]
for all vec(L_i) ∈ R^{dp}. By the definition of the subgradient, we have
\[
f_i\big(\mathrm{vec}(L_i)\big) \ge f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot sg\big(\mathrm{vec}(L^{(k+1)}_i)\big). \tag{A.4}
\]
Combining Equations (A.3) and (A.4) yields:
\[
f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot \nabla g^{(k)}_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) \ge 0. \tag{A.5}
\]
The gradient of g^{(k)}_i at vec(L^{(k+1)}_i) is
\[
\nabla g^{(k)}_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) = -\rho \sum_{l \in P_i} \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \Big)
+ \rho \sum_{l \in S_i} \Big( \mathrm{vec}\big(L^{(k+1)}_i\big) - \mathrm{vec}\big(L^{(k)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,il} \Big).
\]
Substituting this gradient into Equation (A.5), we have
\[
f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot
\Big( -\rho \sum_{l \in P_i} \big( \mathrm{vec}(L^{(k+1)}_l) - \mathrm{vec}(L^{(k+1)}_i) - \tfrac{1}{\rho}\lambda^{(k)}_{L,li} \big) + \rho \sum_{l \in S_i} \big( \mathrm{vec}(L^{(k+1)}_i) - \mathrm{vec}(L^{(k)}_l) - \tfrac{1}{\rho}\lambda^{(k)}_{L,il} \big) \Big) \ge 0.
\]
Since the dual variables λ^{(k)}_{L,li} satisfy the relation in Equation (5), this becomes
\[
f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot
\Big( \sum_{l \in P_i} \lambda^{(k+1)}_{L,li} - \sum_{l \in S_i} \lambda^{(k+1)}_{L,il} + \sum_{l \in S_i} \rho\big( \mathrm{vec}(L^{(k+1)}_l) - \mathrm{vec}(L^{(k)}_l) \big) \Big) \ge 0.
\]
Then, by the definition of matrix A, we can rewrite the above as
\[
f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot
\Big( -A(\cdot, i)^T \lambda^{(k+1)}_{L} + \sum_{l \in S_i} \rho\big( \mathrm{vec}(L^{(k+1)}_l) - \mathrm{vec}(L^{(k)}_l) \big) \Big) \ge 0,
\]
where A(·, i) denotes the ith (block) column of matrix A. Summing the above inequality over i = 1 to |V|,
\[
\sum_{i=1}^{|V|} f_i\big(\mathrm{vec}(L_i)\big) - \sum_{i=1}^{|V|} f_i\big(\mathrm{vec}(L^{(k+1)}_i)\big) + \sum_{i=1}^{|V|} \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \cdot
\Big( -A(\cdot, i)^T \lambda^{(k+1)}_{L} + \sum_{l \in S_i} \rho\big( \mathrm{vec}(L^{(k+1)}_l) - \mathrm{vec}(L^{(k)}_l) \big) \Big) \ge 0. \tag{A.6}
\]
By the definition of matrices A and B, we obtain
\[
-\sum_{i=1}^{|V|} \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T A(\cdot, i)^T \lambda^{(k+1)}_{L} = -\big[ \overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}(L^{(k+1)}) \big]^T A^T \lambda^{(k+1)}_{L}
\]
and
\[
\sum_{i=1}^{|V|} \big[ \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big) \big]^T \sum_{l \in S_i} \rho\big( \mathrm{vec}(L^{(k+1)}_l) - \mathrm{vec}(L^{(k)}_l) \big)
= \rho \big[ \overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}(L^{(k+1)}) \big]^T \big( -A + B \big)^T B \big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big).
\]
The result follows by plugging the preceding two equations into Equation (A.6).
Lemma 2. Consider the sequence {\overline{vec}(L^{(k)}), λ_L^{(k)}}_{k≥0} defined as in Lemma 1, where λ^{(k)}_{L,i} ∈ R^{dp} is a dual vector. Assume that (\overline{vec}(L*), λ*_L) is a solution of problem (A.1). Let A be the general edge-node incidence matrix and B = min{0, A}, with the min taken element-wise. Then, we have the following relation:
\[
\overline{\mathrm{vec}}\big(L^{(k+1)}\big)^T A^T \big( \lambda^{(k+1)}_L - \lambda^{*}_L \big)
+ \rho\, \overline{\mathrm{vec}}\big(L^{(k+1)}\big)^T A^T B \big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big)
+ \rho \big( \overline{\mathrm{vec}}(L^{*}) - \overline{\mathrm{vec}}(L^{(k+1)}) \big)^T B^T B \big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big)
\]
\[
= \frac{1}{2\rho} \Big( \big\| \lambda^{(k)}_L - \lambda^{*}_L \big\|^2 - \big\| \lambda^{(k+1)}_L - \lambda^{*}_L \big\|^2 \Big)
+ \frac{\rho}{2} \Big( \big\| B\big( \overline{\mathrm{vec}}(L^{(k)}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 - \big\| B\big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 \Big)
- \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{(k+1)}) - \overline{\mathrm{vec}}(L^{(k)}) \big) - A\, \overline{\mathrm{vec}}\big(L^{(k+1)}\big) \big\|^2.
\]
Proof. See the proof of Lemma 4.2 in [25].
Lemma 3. Let (\overline{vec}(L*), λ*_L) be a saddle point of Equation (A.1). Then A \overline{vec}(L*) = 0.
Proof. The result follows from Assumption 1.
With the aforementioned lemmas, we are able to show the main theoretical result of the paper.
Proof of Theorem 2. To show convergence with rate O(1/k), we introduce the additional variable y^k = (1/k) Σ_{s=0}^{k−1} \overline{vec}(L^{(s+1)}), the average of the iterates \overline{vec}(L^{(s+1)}) up to time k. We then need to bound
\[
\mathcal{L}^{L}_{\text{Aug}}\big(y^{k}, \lambda^{*}_L\big) - \mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L^{*}), \lambda^{*}_L\big).
\]
First, L^L_Aug(y^k, λ*_L) − L^L_Aug(\overline{vec}(L*), λ*_L) ≥ 0, since (\overline{vec}(L*), λ*_L) is a saddle point of Equation (A.1). Next, we derive an upper bound on this quantity. Setting \overline{vec}(L) = \overline{vec}(L*) in Lemma 1, we have the relation
\[
F\big(\overline{\mathrm{vec}}(L^{*})\big) - F\big(\overline{\mathrm{vec}}(L^{(s+1)})\big) + \big( \overline{\mathrm{vec}}(L^{*}) - \overline{\mathrm{vec}}(L^{(s+1)}) \big)^T \cdot
\Big( -A^T \lambda^{(s+1)}_L - \rho A^T B\big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) + \rho B^T B\big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) \Big) \ge 0.
\]
We have A \overline{vec}(L*) = 0 from Lemma 3, so \overline{vec}(L*)^T A^T = 0. Hence, the above relation can be reduced to
\[
F\big(\overline{\mathrm{vec}}(L^{*})\big) - F\big(\overline{\mathrm{vec}}(L^{(s+1)})\big) + \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^T A^T \lambda^{(s+1)}_L
+ \Big( \rho\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^T A^T B + \rho \big( \overline{\mathrm{vec}}(L^{*}) - \overline{\mathrm{vec}}(L^{(s+1)}) \big)^T B^T B \Big) \cdot \big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) \ge 0.
\]
By adding and subtracting (λ*_L)^T A \overline{vec}(L^{(s+1)}), we obtain
\[
F\big(\overline{\mathrm{vec}}(L^{*})\big) - F\big(\overline{\mathrm{vec}}(L^{(s+1)})\big) + \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^T A^T \lambda^{*}_L
+ \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^T A^T \big( \lambda^{(s+1)}_L - \lambda^{*}_L \big)
+ \rho\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^T A^T B \big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big)
+ \rho \big( \overline{\mathrm{vec}}(L^{*}) - \overline{\mathrm{vec}}(L^{(s+1)}) \big)^T B^T B \big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) \ge 0.
\]
Using Lemma 2 and Lemma 3, this yields
\[
F\big(\overline{\mathrm{vec}}(L^{*})\big) - F\big(\overline{\mathrm{vec}}(L^{(s+1)})\big) + (\lambda^{*}_L)^T A\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big)
+ \frac{1}{2\rho} \Big( \big\| \lambda^{(s)}_L - \lambda^{*}_L \big\|^2 - \big\| \lambda^{(s+1)}_L - \lambda^{*}_L \big\|^2 \Big)
+ \frac{\rho}{2} \Big( \big\| B\big( \overline{\mathrm{vec}}(L^{(s)}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 - \big\| B\big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 \Big)
- \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) - A\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big) \big\|^2 \ge 0.
\]
Summing the above inequality from s = 0, …, k−1 and telescoping, we have
\[
k\, F\big(\overline{\mathrm{vec}}(L^{*})\big) - \sum_{s=0}^{k-1} F\big(\overline{\mathrm{vec}}(L^{(s+1)})\big) + (\lambda^{*}_L)^T A \sum_{s=0}^{k-1} \overline{\mathrm{vec}}\big(L^{(s+1)}\big)
+ \frac{1}{2\rho} \big\| \lambda^{0}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{0}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2
\]
\[
\ge \frac{1}{2\rho} \big\| \lambda^{k}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{k}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2
+ \sum_{s=0}^{k-1} \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{(s+1)}) - \overline{\mathrm{vec}}(L^{(s)}) \big) - A\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big) \big\|^2 \ge 0.
\]
Since F is a convex function (due to the convexity of the f_i), we have Σ_{s=0}^{k−1} F(\overline{vec}(L^{(s+1)})) ≥ k F(y^k), and together with the definition of y^k = (1/k) Σ_{s=0}^{k−1} \overline{vec}(L^{(s+1)}),
\[
k\, F\big(\overline{\mathrm{vec}}(L^{*})\big) - k\, F(y^{k}) + k\, (\lambda^{*}_L)^T A\, y^{k}
+ \frac{1}{2\rho} \big\| \lambda^{0}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{0}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 \ge 0.
\]
Multiplying both sides by −1/k and using (λ*_L)^T A \overline{vec}(L*) = 0 from Lemma 3, we obtain
\[
F(y^{k}) - (\lambda^{*}_L)^T A\, y^{k} - F\big(\overline{\mathrm{vec}}(L^{*})\big) + (\lambda^{*}_L)^T A\, \overline{\mathrm{vec}}(L^{*})
\le \frac{1}{k} \left( \frac{1}{2\rho} \big\| \lambda^{0}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{0}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 \right).
\]
By Assumption 2 and the definition of y^k,
\[
\frac{\rho}{2} \big\| A\, y^{k} \big\|_2^2 \le \frac{\rho}{2k} \sum_{s=0}^{k-1} \big\| A\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big) \big\|_2^2 \le \frac{\rho}{2k}\, k\gamma = \frac{\gamma\rho}{2}. \tag{A.7}
\]
Then,
\[
F(y^{k}) - (\lambda^{*}_L)^T A\, y^{k} + \frac{\rho}{2} \big\| A\, y^{k} \big\|_2^2
- F\big(\overline{\mathrm{vec}}(L^{*})\big) + (\lambda^{*}_L)^T A\, \overline{\mathrm{vec}}(L^{*}) - \frac{\rho}{2} \big\| A\, \overline{\mathrm{vec}}(L^{*}) \big\|_2^2
\le \frac{1}{k} \left( \frac{1}{2\rho} \big\| \lambda^{0}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{0}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 + \frac{\gamma\rho}{2} \right). \tag{A.8}
\]
Combining Equations (A.8) and (A.7), we have
\[
\mathcal{L}^{L}_{\text{Aug}}\big(y^{k}, \lambda^{*}_L\big) - \mathcal{L}^{L}_{\text{Aug}}\big(\overline{\mathrm{vec}}(L^{*}), \lambda^{*}_L\big)
\le \frac{1}{k} \left( \frac{1}{2\rho} \big\| \lambda^{0}_L - \lambda^{*}_L \big\|^2 + \frac{\rho}{2} \big\| B\big( \overline{\mathrm{vec}}(L^{0}) - \overline{\mathrm{vec}}(L^{*}) \big) \big\|^2 + \frac{\gamma\rho}{2} \right).
\]
Due to the convexity of F, the linearity of (λ*_L)^T A \overline{vec}(L), and the convexity of (ρ/2)‖A \overline{vec}(L)‖_2^2, the function L^L_Aug(\overline{vec}(L), λ*_L) is strictly convex and has a unique minimizer, i.e., \overline{vec}(L*). Hence, as k → ∞, the sequence {\overline{vec}(L^{(k)}), λ_L^{(k)}}_{k≥0} converges to the optimal minimizer \overline{vec}(L*). In practice, we can use the additional variable y^k = (1/k) Σ_{s=0}^{k−1} \overline{vec}(L^{(s+1)}) to achieve the O(1/k) convergence rate.
Appendix B. Update Rules
In this section, we provide the derivation of the update rules in Section 5.1. The derivation for Section 5.2 can be obtained by similar steps. Using the Leibniz integration rule, the gradient of the objective on line 3 of Algorithm 1 can be computed as:
\[
\nabla_{\theta^{(j)}_i} \Bigg[ \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \, d\tau^{(j)}
+ \frac{\rho}{2} \sum_{l \in P_i} \Big\| \theta^{(k+1),(j)}_l - \theta^{(j)}_i - \frac{1}{\rho}\lambda^{(k)}_{li} \Big\|_2^2
+ \frac{\rho}{2} \sum_{l \in S_i} \Big\| \theta^{(j)}_i - \theta^{(k),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{il} \Big\|_2^2 \Bigg]
\]
\[
= \underbrace{ \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \, d\tau^{(j)} }_{f_{RL^{(j)}_i}}
+ \rho\, \mathrm{deg}_i\, \theta^{(j)}_i - \rho \left( \sum_{l \in S_i} \Big[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \Big] + \sum_{l \in P_i} \Big[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \Big] \right),
\]
where deg_i is the degree of processing unit i (the number of its neighbors). Assuming a Gaussian policy for the ith split of task j, we can write the θ^{(j)}_i-dependent part of f_{RL^{(j)}_i} as:
\[
f_{RL^{(j)}_i} = p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\Big[ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) \Big] - \text{const}
= p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} -\frac{1}{2} \Big[ u^{(j)}_m - \Phi^{(j)}(x_m)\theta^{(j)}_i \Big]^T \Sigma^{-1} \Big[ u^{(j)}_m - \Phi^{(j)}(x_m)\theta^{(j)}_i \Big],
\]
with Φ^{(j)}(x_m) ∈ R^{m×d} being a feature representation of the state and Σ ∈ R^{m×m} the covariance matrix of the Gaussian policy. Hence:
\[
\int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \, d\tau^{(j)}
\]
\[
= -p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \left[ \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, u^{(j),T}_m \Sigma^{-1} u^{(j)}_m
- 2 \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, u^{(j),T}_m \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i
+ \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, \theta^{(j),T}_i \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i \right].
\]
At this stage, we handle each of the above three gradients separately. The first is zero, as the expression does not depend on θ^{(j)}_i. For the second, we have:
\[
\nabla_{\theta^{(j)}_i}\, u^{(j),T}_m \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i = \Phi^{(j),T}(x_m)\, \Sigma^{-1} u^{(j)}_m.
\]
Furthermore, we have:
\[
\nabla_{\theta^{(j)}_i}\, \theta^{(j),T}_i \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i = (A + A^T)\, \theta^{(j)}_i,
\]
with A = Φ^{(j),T}(x_m) Σ^{-1} Φ^{(j)}(x_m). Since A = A^T, we have:
\[
\nabla_{\theta^{(j)}_i}\, \theta^{(j),T}_i \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i = 2 A\, \theta^{(j)}_i = 2\, \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i.
\]
Plugging these results back into the original gradient equation yields:
\[
\int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log\frac{ p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) }{ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) } \, d\tau^{(j)}
= -p^{(j)}_{\theta}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \left[ -2 \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1} u^{(j)}_m + 2 \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i \right]
\]
\[
= 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)} \left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1} u^{(j)}_m \right]
- 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)} \left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i \right].
\]
Considering the penalty terms of ADMM and setting the full gradient to zero, we can compute θ^{(j)}_i in closed form from:
\[
-2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)} \left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m)\, \theta^{(j)}_i \right]
+ 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)} \left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1} u^{(j)}_m \right]
+ \rho\, \mathrm{deg}_i\, \theta^{(j)}_i - \rho \left( \sum_{l \in S_i} \Big[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \Big] + \sum_{l \in P_i} \Big[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \Big] \right) = 0.
\]
Solving the above yields:
\[
\theta^{(k+1),(j)}_i = \left[ -\frac{\rho\, \mathrm{deg}_i}{2} I_{d\times d} + \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)}\!\left( \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m) \right) \right]^{-1}
\times \left[ \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta}(\cdot)}\!\left( \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),T}(x_m)\, \Sigma^{-1}\, u^{(j)}_m \right)
- \frac{\rho}{2} \left( \sum_{l \in S_i} \Big[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \Big] + \sum_{l \in P_i} \Big[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \Big] \right) \right].
\]
Similar steps can be performed to determine the update equation for the shared repository by taking the gradient with respect to vec(L).
[1] J. Kober, J. R. Peters, Policy search for motor primitives in robotics, in: Advances in Neural Information Processing Systems, 2009, pp. 849–856.
[2] S. A. Murphy, D. W. Oslin, A. J. Rush, J. Zhu, Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders, Neuropsychopharmacology 32 (2007) 257–262.
[3] J. Pineau, M. G. Bellemare, A. J. Rush, A. Ghizaru, S. A. Murphy, Constructing evidence-based treatment strategies using methods from computer science, Drug and Alcohol Dependence 88 (2007) S52–S60.
[4] R. S. Sutton, A. G. Barto, Introduction to Reinforcement Learning, 1st ed., MIT Press, Cambridge, MA, USA, 1998.
[5] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: a hierarchical Bayesian approach, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 1015–1022.
[6] M. E. Taylor, P. Stone, Transfer learning for reinforcement learning domains: a survey, Journal of Machine Learning Research 10 (2009) 1633–1685.
[7] A. Lazaric, M. Ghavamzadeh, Bayesian multi-task reinforcement learning, in: Proceedings of the 27th International Conference on Machine Learning, 2010.
[8] H. Li, X. Liao, L. Carin, Multi-task reinforcement learning in partially observable stochastic environments, Journal of Machine Learning Research 10 (2009) 1131–1186.
[9] H. Bou-Ammar, E. Eaton, P. Ruvolo, M. Taylor, Online multi-task learning for policy gradient methods, in: Proceedings of the 31st International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, 2014.
[10] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992) 229–256.
[11] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, M. Lee, Natural actor–critic algorithms, Automatica 45 (2009) 2471–2482.
[12] J. Peters, S. Schaal, Natural actor-critic, Neurocomputing 71 (2008) 1180–1190.
[13] P. Ruvolo, E. Eaton, ELLA: an efficient lifelong learning algorithm, in: Proceedings of the 30th International Conference on Machine Learning, 2013.
[14] S. Thrun, J. O'Sullivan, Discovering structure in multiple learning tasks: the TC algorithm, in: International Conference on Machine Learning, Morgan Kaufmann, 1996.
[15] W. Caarls, E. Schuitema, Parallel online temporal difference learning for motor control, IEEE Transactions on Neural Networks and Learning Systems 27 (2016) 1457–1468.
[16] S. Gu, E. Holly, T. Lillicrap, S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, arXiv preprint arXiv:1610.00633 (2016).
[17] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, S. Levine, Collective robot reinforcement learning with distributed asynchronous guided policy search, arXiv preprint arXiv:1610.00673 (2016).
[18] S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies, Journal of Machine Learning Research 17 (2016) 1–40.
[19] M. P. Deisenroth, P. Englert, J. Peters, D. Fox, Multi-task policy search for robotics, in: Robotics and Automation (ICRA), 2014 IEEE International Conference on, 2014, pp. 3876–3881.
[20] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: a hierarchical Bayesian approach, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, 2007.
[21] M. Snel, S. Whiteson, Learning potential functions and their representations for multi-task reinforcement learning, Autonomous Agents and Multi-Agent Systems 28 (2014) 637–681.
[22] A. Kumar, H. Daume, Learning task grouping and overlap in multi-task learning, in: Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1383–1390.
[23] H. Bou Ammar, E. Eaton, J. M. Luna, P. Ruvolo, Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2015.
[24] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 3 (2011) 1–122.
[25] E. Wei, A. Ozdaglar, Distributed alternating direction method of multipliers, in: Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, IEEE, 2012, pp. 5445–5450.
[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B (Methodological) 58 (1996) 267–288.
[27] J. Peters, S. Schaal, Natural actor-critic, Neurocomputing 71 (2008).