
Learning Parameterized Skills

Bruno Castro da Silva [email protected]

Autonomous Learning Laboratory, Computer Science Dept., University of Massachusetts Amherst, 01003 USA.

George Konidaris [email protected]

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge MA 02139, USA.

Andrew G. Barto [email protected]

Autonomous Learning Laboratory, Computer Science Dept., University of Massachusetts Amherst, 01003 USA.

Abstract

We introduce a method for constructing skills capable of solving tasks drawn from a distribution of parameterized reinforcement learning problems. The method draws example tasks from a distribution of interest and uses the corresponding learned policies to estimate the topology of the lower-dimensional piecewise-smooth manifold on which the skill policies lie. This manifold models how policy parameters change as task parameters vary. The method identifies the number of charts that compose the manifold and then applies non-linear regression in each chart to construct a parameterized skill by predicting policy parameters from task parameters. We evaluate our method on an underactuated simulated robotic arm tasked with learning to accurately throw darts at a parameterized target location.

1. Introduction

One approach to dealing with the complexity of applying reinforcement learning to high-dimensional control problems is to specify or discover hierarchically structured policies. The most widely used hierarchical reinforcement learning formalism is the options framework (Sutton et al., 1999), where high-level options (also called skills) define temporally extended policies that can be used directly in learning and planning but abstract away the details of low-level control. One of the motivating principles underlying hierarchical reinforcement learning is the idea that subproblems recur, so that acquired or designed options can be reused in a variety of tasks and contexts.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

However, the options framework as usually formulated defines an option as a single policy. An agent may instead wish to define a parameterized policy that can be applied across a class of related tasks. For example, consider a soccer playing agent. During a game the agent might wish to kick the ball with varying amounts of force, towards various different locations on the field; for such an agent to be truly competent it should be able to execute such kicks whenever necessary, even with a particular combination of force and target location that it has never had direct experience with. In such cases, learning a single policy for each possible variation of the task is clearly infeasible. The agent might therefore wish to learn good policies for a few specific kicks, and then use this experience to synthesize a single general skill for kicking the ball, parameterized by the amount of force desired and the target location, that it can execute on demand.

We propose a method for constructing parameterized skills from experience. The agent learns to solve a few instances of the parameterized task and uses these to estimate the topology of the lower-dimensional manifold on which the skill policies lie. This manifold models how policy parameters change as task parameters vary. The method identifies the number of charts that compose the manifold and then applies non-linear regression in each chart to construct a parameterized skill by predicting policy parameters from task parameters. We evaluate the method on an underactuated simulated robotic arm tasked with learning to accurately throw darts at a parameterized target location.


2. Setting

In what follows we assume an agent which is presented with a set of tasks drawn from some task distribution. Each task is modeled by a Markov Decision Process (MDP) and the agent must maximize the expected reward over the whole distribution of possible MDPs. We assume that the MDPs have dynamics and reward functions similar enough that they can be considered variations of the same task. Formally, the goal of such an agent is to maximize:

    \int P(\tau) \, J(\pi_\theta, \tau) \, d\tau,    (1)

where πθ is a policy parameterized by a vector θ ∈ R^N, τ is a task parameter vector drawn from a |T|-dimensional continuous space T, J(π, τ) = E{∑_{t=0}^{K} rt | π, τ} is the expected return obtained when executing policy π in task τ, and P(τ) is a probability density function describing the probability of task τ occurring. Furthermore, we define a parameterized skill as a function

    \Theta : T \to \mathbb{R}^N,

mapping task parameters to policy parameters. When using a parameterized skill to solve a distribution of tasks, the specific policy parameters to be used depend on the task currently being solved and are specified by Θ. Under this definition, our goal is to construct a parameterized skill Θ which maximizes:

    \int P(\tau) \, J(\pi_{\Theta(\tau)}, \tau) \, d\tau.    (2)
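To make the objective in Equation 2 concrete, the sketch below estimates it by Monte Carlo sampling of tasks. The names `sample_task`, `skill`, and `evaluate_return` are hypothetical stand-ins for the task distribution P(τ), the parameterized skill Θ, and the return J; they are not part of the paper.

```python
import numpy as np

def estimate_skill_objective(sample_task, skill, evaluate_return, n_tasks=100):
    """Monte Carlo estimate of Eq. (2): E_{tau ~ P}[ J(pi_{Theta(tau)}, tau) ].

    sample_task:     draws a task parameter vector tau ~ P(tau)        (assumed)
    skill:           callable Theta(tau) -> policy parameters theta    (assumed)
    evaluate_return: callable J(theta, tau) -> expected return         (assumed)
    """
    returns = []
    for _ in range(n_tasks):
        tau = sample_task()                 # tau ~ P(tau)
        theta = skill(tau)                  # policy parameters chosen by the skill
        returns.append(evaluate_return(theta, tau))
    return float(np.mean(returns))
```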

2.1. Assumptions

We assume the agent must solve tasks drawn from a distribution P(τ). Suppose we are given a set K of pairs {τ, θτ}, where τ is a |T|-dimensional vector of task parameters sampled from P(τ) and θτ is the corresponding policy parameter vector that maximizes return for task τ. We would like to use K to construct a parameterized skill which (at least approximately) maximizes the quantity in Equation 2.

We start by highlighting the fact that the probability density function P induces a (possibly infinite) set of skill policies for solving tasks in the support of P, each one corresponding to a vector θτ ∈ R^N. These policies lie in an N-dimensional space containing sample policies that can be used to solve tasks drawn from P. Since the tasks in the support of P are assumed to be related, it is reasonable to further assume that there exists some structure in this space; specifically, that the policies for solving tasks drawn from the distribution lie on a lower-dimensional surface embedded in R^N and that their parameters vary smoothly as we vary the task parameters.

This assumption is reasonable in a variety of situations, especially in the common case where the policy is differentiable with respect to its parameters. In this case, the natural gradient of the performance J(πθ, τ) is well-defined and indicates the direction (in policy space) that locally maximizes J but which does not change the distribution of paths induced by the policy by much. Consider, for example, problems in which performance is directly correlated to how close the agent gets to a goal state; in this case one can interpret a small perturbation to the policy as defining a new policy which solves a similar task but with a slightly different goal. Since under these conditions small policy changes induce a smoothly-varying set of goals, one can imagine that the goals themselves parameterize the space of policies: that is, that by varying the goal or task one moves over the lower-dimensional surface of corresponding policies.

Note that it is possible to find points in policy space at which the corresponding policy cannot be further locally modified in order to obtain a solution to a new, related goal. This implies that the set of skill policies of interest might in fact be distributed over several charts of a piecewise-smooth manifold. Our method can automatically detect when this is the case and construct separate models for each manifold, essentially discovering how many different skills exist and creating a unified model by which they are integrated.

3. Overview

Our method proceeds by collecting example task instances and their solution policies and using them to train a family of independent non-linear regression models mapping task parameters to policy parameters. However, because policies for different subsets of T might lie in different, disjoint manifolds, it is necessary to first estimate how many such lower-dimensional surfaces exist before separately training a set of regression models for each one.

More formally, our method consists of four steps: 1) draw |K| sample tasks from P and construct K, the set of task instances τ and their corresponding learned policy parameters θτ; 2) use K to estimate the geometry and topology of the policy space, specifically the number D of lower-dimensional surfaces embedded in R^N on which skill policies lie; 3) train a classifier χ mapping elements of T to [1, . . . , D], that is, to one of the D lower-dimensional manifolds; 4) train a set of (N × D) independent non-linear regression models Φi,j, i ∈ [1, . . . , D], j ∈ [1, . . . , N], each one mapping elements of T to an individual skill policy parameter θj, j ∈ [1, . . . , N]. Each subset [Φi,1, . . . , Φi,N] of regression models is trained over all tasks τ in K where χ(τ) = i.¹

We therefore define a parameterized skill as a vector function:

    \Theta(\tau) \equiv [\Phi_{\chi(\tau),1}, \ldots, \Phi_{\chi(\tau),N}]^{T}.    (3)
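As one illustration of Equation 3, a parameterized skill can be represented as a chart classifier plus a bank of per-chart regressors. The sketch below assumes generic callables for χ and Φi,j and is not the paper's implementation; it simply composes whatever models were trained in steps 2-4.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class ParameterizedSkill:
    """Sketch of Eq. (3): Theta(tau) = [Phi_{chi(tau),1}, ..., Phi_{chi(tau),N}]^T.

    `chart_classifier` and `regressors` are hypothetical stand-ins for the trained
    classifier chi and the per-chart regression models Phi_{i,j}.
    """
    chart_classifier: Callable[[np.ndarray], int]   # chi: task parameters -> chart index (0-based here)
    regressors: Sequence[Sequence[Callable[[np.ndarray], float]]]   # regressors[i][j] approximates Phi_{i,j}

    def __call__(self, tau: np.ndarray) -> np.ndarray:
        i = self.chart_classifier(tau)                               # pick the chart for this task
        return np.array([phi(tau) for phi in self.regressors[i]])    # predict all N policy parameters
```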

Figure 1. Steps involved in executing a parameterized skill: a task τ is drawn from the distribution P(τ); the classifier χ identifies the manifold to which the policy for that task belongs; the corresponding regression models Φi,1, . . . , Φi,N for that manifold map task parameters to policy parameters θ1, . . . , θN.

Figure 1 depicts the above-mentioned steps. Note that we have described our method without specifying a particular choice of policy representation, learning algorithm, classifier, or non-linear regression model, since these design decisions are best made in light of the characteristics of the application at hand. In the following sections we present a control problem whose goal is to accurately throw darts at a variety of targets and describe one possible instantiation of our approach.

4. The Dart Throwing Domain

In the dart throwing domain, a simulated planar underactuated robotic arm is tasked with learning a parameterized policy to accurately throw darts at targets around it (Figure 4). The base of the arm is affixed to a wall in the center of a 3-meter high and 4-meter wide room. The arm is composed of three connected links and a single motor which applies torque only to the second joint, making this a difficult non-linear and underactuated control problem. At the end of its third link, the arm possesses an actuator capable of holding and releasing a dart. The state of the system is a 7-dimensional vector composed of 6 continuous features corresponding to the angle and angular velocity of each link and a seventh binary feature specifying whether or not the dart is still being held. The goal of the system is to control the arm so that it executes a throwing movement and accurately hits a target of interest. In this domain the space T of tasks consists of a 2-dimensional Euclidean space containing all (x, y) coordinates at which a target can be placed; a target can be affixed anywhere on the walls or ceiling surrounding the agent.

¹ This last step assumes that the policy features are approximately independent conditioned on the task; if this is known not to be the case, it is possible to alternatively train a set of D multivariate non-linear regression models Φi, i ∈ [1, . . . , D], each one mapping elements of T to complete policy parameterizations θ ∈ R^N, and use them to construct Θ. Again, the i-th such model should be trained only over tasks τ in K such that χ(τ) = i.

5. Learning Parameterized Skills for Dart Throwing

To implement the method outlined in Section 3 we need to specify methods to 1) represent a policy; 2) learn a policy from experience; 3) analyze the topology of the policy space and estimate D, the number of lower-dimensional surfaces on which skill policies lie; 4) construct the non-linear classifier χ; and 5) construct the non-linear regression models Φ. In this section we describe the specific algorithms and techniques chosen in order to tackle the dart-throwing domain. We discuss our results in Section 6.

Our choices of methods are directly guided by the characteristics of the domain. Because the following experiments involve a multi-joint simulated robotic arm, we chose a policy representation that is particularly well-suited to robotics: Dynamic Movement Primitives (Schaal et al., 2004), or DMPs. DMPs are a framework for modular motor control based on a set of linearly-parameterized autonomous non-linear differential equations. The time evolution of these equations defines a smooth kinematic control policy which can be used to drive the controlled system. The specific trajectory in joint space that needs to be followed is obtained by integrating the following set of differential equations:

    \kappa \dot{v} = K(g - x) - Qv + (g - x_0)f,
    \kappa \dot{x} = v,

where x and v are the position and velocity of the system, respectively; x0 and g denote the start and goal positions; κ is a temporal scaling factor; and K and Q act like a spring constant and a damping term, respectively. Finally, f is a non-linear function which can be learned in order to allow the system to generate arbitrarily complex movements and is defined as

    f(s) = \frac{\sum_i w_i \psi_i(s)}{\sum_i \psi_i(s)},

where ψi(s) = exp(−hi(s − ci)²) are Gaussian basis functions with adjustable weights wi and which depend on a phase variable s. The phase variable is constructed so that it monotonically decreases from 1 to 0 during the execution of the movement and is typically computed by integrating κṡ = −αs, where α is a pre-determined constant. In our experiments we used a PID controller to track the trajectories induced by the above-mentioned system of equations.
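A minimal sketch of how these equations can be integrated numerically is given below, using simple Euler steps. The gains, time step, and basis-function placement are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def rollout_dmp(x0, g, w, centers, widths, kappa=1.0, K=100.0, Q=20.0, alpha=4.0, dt=0.001):
    """Euler integration of the DMP equations above (a sketch, not the paper's settings)."""
    x, v, s = x0, 0.0, 1.0
    trajectory = [x]
    while s > 1e-3:
        psi = np.exp(-widths * (s - centers) ** 2)         # Gaussian basis activations psi_i(s)
        f = np.dot(w, psi) / (np.sum(psi) + 1e-10)         # forcing term f(s)
        v += dt * (K * (g - x) - Q * v + (g - x0) * f) / kappa
        x += dt * v / kappa
        s += dt * (-alpha * s) / kappa                     # phase: kappa * s_dot = -alpha * s
        trajectory.append(x)
    return np.array(trajectory)
```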

This results in a 37-dimensional policy vector θ = [λ, g, w1, . . . , w35]^T, where λ specifies the value of the phase variable s at which the arm should let go of the dart; g is the goal parameter of the DMP; and w1, . . . , w35 are the weights of each Gaussian basis function in the movement primitive.
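For clarity, the policy vector can be unpacked as in the following sketch; the helper name and ordering simply mirror the description above and are otherwise an assumption.

```python
import numpy as np

def unpack_policy(theta):
    """Split the 37-dimensional policy vector theta = [lambda, g, w_1, ..., w_35]."""
    theta = np.asarray(theta)
    release_phase = theta[0]   # lambda: phase value at which the dart is released
    goal = theta[1]            # g: goal parameter of the DMP
    weights = theta[2:]        # w_1..w_35: basis-function weights
    return release_phase, goal, weights
```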

We combine DMPs with PoWER (Kober & Peters, 2008), a policy search technique that collects sample path executions and updates the policy's parameters towards ones that induce a new success-weighted path distribution. PoWER works by executing rollouts ρ constructed based on slightly perturbed versions of the current policy parameters; perturbations to the policy parameters consist of a structured, state-dependent exploration term εt^T φ(s, t), where εt ∼ N(0, Σ), Σ is a meta-parameter of the exploration, and φ(s, t) is the vector of policy feature activations at time t. By adding this type of perturbation to θ we induce a stochastic policy whose actions are a = (θ + εt)^T φ(s, t) ∼ N(θ^T φ(s, t), φ(s, t)^T Σ φ(s, t)). After performing rollouts using such a stochastic policy, the policy parameters are updated as follows:

    \theta_{k+1} = \theta_k + \left\langle \sum_{t=1}^{T} W(s,t)\, Q^{\pi}(s, a, t) \right\rangle_{\omega(\rho)}^{-1} \left\langle \sum_{t=1}^{T} W(s,t)\, \varepsilon_t\, Q^{\pi}(s, a, t) \right\rangle_{\omega(\rho)},

where Q^π(s, a, t) = ∑_{t′=t}^{T} r(s_{t′}, a_{t′}, s_{t′+1}, t′) is an unbiased estimate of the return, W(s, t) = φ(s, t) φ(s, t)^T (φ(s, t)^T Σ φ(s, t))^{−1}, and ⟨·⟩_{ω(ρ)} denotes an importance sampler which can be chosen depending on the domain. A useful heuristic when defining ω is to discard sample rollouts with very small importance weights; importance weights, in our experiments, are proportional to the relative performance of the rollout in comparison to others.
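The sketch below shows a simplified, episode-based variant of this update, in which a single perturbation vector is drawn per rollout and the returns of the best rollouts serve as importance weights. It omits the state-dependent weighting matrix W(s, t) and is only an approximation of the full PoWER update.

```python
import numpy as np

def power_update(theta, rollout_epsilons, rollout_returns, n_best=10):
    """Simplified, episode-based PoWER-style update (a sketch, not the full algorithm).

    Assumes one perturbation vector epsilon per rollout rather than the structured,
    state-dependent exploration of Kober & Peters (2008).
    """
    idx = np.argsort(rollout_returns)[-n_best:]            # keep only the high-return rollouts
    eps = np.asarray(rollout_epsilons)[idx]                # their perturbations
    Q = np.asarray(rollout_returns)[idx]                   # their returns, used as weights
    # Return-weighted average of the perturbations, cf. the update equation above.
    return np.asarray(theta) + (Q[:, None] * eps).sum(axis=0) / (Q.sum() + 1e-10)
```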

To analyze the geometry and topology of the policy space and estimate the number D of lower-dimensional surfaces on which skill policies lie we used the ISOMAP algorithm (Tenenbaum et al., 2000). ISOMAP is a technique for learning the underlying global geometry of high-dimensional spaces and the number of non-linear degrees of freedom that underlie it. This information provides us with an estimate of D, the number of disjoint lower-dimensional manifolds where policies are located; ISOMAP also specifies to which of these disconnected manifolds a given input policy belongs. This information is used to train the classifier χ, which learns a mapping from task parameters to numerical identifiers specifying one of the lower-dimensional surfaces embedded in policy space. For this domain we have implemented χ by means of a simple linear classifier. In general, however, more powerful classifiers could be used.
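One way to approximate this analysis with off-the-shelf tools is sketched below: the connected components of the ISOMAP neighborhood graph give an estimate of D, and each component is embedded separately. The neighborhood size and embedding dimensionality are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components
from sklearn.manifold import Isomap

def analyze_policy_space(policy_params, n_neighbors=5, n_components=2):
    """Estimate D from the neighborhood graph over sampled policy vectors and
    embed each disconnected component separately (an approximation of the analysis)."""
    policy_params = np.asarray(policy_params)
    graph = kneighbors_graph(policy_params, n_neighbors=n_neighbors, include_self=False)
    D, labels = connected_components(graph, directed=False)   # D disjoint surfaces, one label per policy
    embeddings = {}
    for i in range(D):
        members = np.where(labels == i)[0]
        if len(members) > n_neighbors:                         # need enough points to embed
            iso = Isomap(n_neighbors=n_neighbors, n_components=n_components)
            embeddings[i] = iso.fit_transform(policy_params[members])
    return D, labels, embeddings
```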

Finally, we must choose a non-linear regression algorithm for constructing Φi,j. We use standard Support Vector Machines (SVMs) (Vapnik, 1995) due to their good generalization capabilities and relatively low dependence on parameter tuning. In the experiments presented in Section 6 we use SVMs with Gaussian kernels and an inverse variance width of 5.0. As previously mentioned, if important correlations between policy and task parameters are known to exist, multivariate regression models might be preferable; one possibility in such cases is Structured Support Vector Machines (Tsochantaridis et al., 2005).
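A sketch of steps 3 and 4 under these choices follows. It interprets the reported inverse variance width as the RBF kernel's gamma and uses a logistic-regression classifier as a stand-in for the simple linear classifier, so the details are assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression

def train_parameterized_skill(tasks, policies, chart_labels, gamma=5.0):
    """Fit a chart classifier chi and per-chart, per-parameter regressors Phi_{i,j} (a sketch)."""
    tasks = np.asarray(tasks)            # shape (num_examples, |T|)
    policies = np.asarray(policies)      # shape (num_examples, N)
    chart_labels = np.asarray(chart_labels)

    chi = LogisticRegression().fit(tasks, chart_labels)   # assumes at least two charts were found
    regressors = {}
    for i in np.unique(chart_labels):
        in_chart = chart_labels == i
        # one independent SVR per policy parameter, trained only on the tasks in chart i
        regressors[i] = [
            SVR(kernel="rbf", gamma=gamma).fit(tasks[in_chart], policies[in_chart, j])
            for j in range(policies.shape[1])
        ]
    return chi, regressors

def predict_policy(chi, regressors, tau):
    """Evaluate the parameterized skill Theta(tau) with the fitted models."""
    tau = np.asarray(tau).reshape(1, -1)
    i = chi.predict(tau)[0]
    return np.array([phi.predict(tau)[0] for phi in regressors[i]])
```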

6. Experiments

Before discussing the performance of parameterized skill learning in this domain, we present some empirically measured properties of its policy space. Specifically, we describe topological characteristics of the induced space of policies generated as we vary the task. We sampled 60 tasks (target positions) uniformly at random and placed target boards at the corresponding positions. Policies for solving each one of these tasks were computed using PoWER; a policy update was performed every 20 rollouts and the search ran until a minimum performance threshold was reached. In our simulations, this criterion corresponded to the moment when the robotic arm first executed a policy that landed the dart within 5 centimeters of the intended target. In order to speed up the sampling process we initialized policies for subsequent targets with ones computed for previously sampled tasks.
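This data-collection loop can be summarized as in the sketch below, where `sample_task` and `learn_policy` are hypothetical placeholders for target sampling and the PoWER search (one policy update every 20 rollouts until the 5 cm accuracy threshold is met).

```python
import numpy as np

def collect_training_set(sample_task, initial_policy, learn_policy, n_tasks=60):
    """Collect K = {(tau, theta_tau)}, warm-starting each task from a previous solution (a sketch)."""
    K = []
    theta = np.asarray(initial_policy)
    for _ in range(n_tasks):
        tau = sample_task()
        theta = learn_policy(theta, tau)        # warm-start from the previously found solution
        K.append((tau, np.array(theta)))        # store (task parameters, solution policy)
    return K
```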

We first analyze the structure of the policy manifold by estimating how each dimension of a policy varies as we smoothly vary the task. Figure 2a presents this information for a representative subset of policy parameters. On each subgraph of Figure 2a the x axis corresponds to a 1-dimensional representation of the task obtained by computing the angle at which the target is located with respect to the arm; this is done for ease of visualization, since using x, y coordinates would require a 3-D figure. The y axis corresponds to the value of a selected policy parameter. The first important observation to be made is that as we vary the task, not only do the policy parameters vary smoothly, but they tend to remain confined to one of two disjoint but smoothly varying lower-dimensional surfaces. A discontinuity exists, indicating that after a certain point in task space a qualitatively different type of policy parameterization is required. Another interesting observation is that this discontinuity occurs approximately at the task parameter values corresponding to hitting targets directly above the robotic arm; this implies that skills for hitting targets to the left of the arm lie on a different manifold than policies for hitting targets to its right. This information is relevant for two reasons: 1) it confirms both that the manifold assumption is reasonable and that smooth task variations induce smooth, albeit non-linear, policy changes; and 2) it shows that the policies for solving a distribution of tasks are generally confined to one of several lower-dimensional surfaces, and that the way in which they are distributed among these surfaces is correlated to the qualitatively different strategies that they implement.

Figure 2. Analysis of the variation of a subset of policy parameters as a function of smooth changes in the task (panels a, b, and c).

Figures 2b and 2c show, similarly, how a selected subset of policy parameters changes as we vary the task, but now with the two resulting manifolds analyzed separately. Figure 2b shows the variations in policy parameters induced by smoothly modifying tasks for hitting targets anywhere in the interval of 1.57 to 3.5 radians; that is, targets placed roughly at angles between 90° (directly above the agent) and 200° (lowest part of the right wall). Figure 2c shows the same information but for targets located on one of the other two quadrants; that is, targets to the left of the arm.

We superimposed on Figures 2a-c a red curve representing the non-linear fit constructed by Φ while modeling the relation between task and policy parameters in each manifold. Note also how a clear linear separation exists between which task policies lie on which manifold: this separation indicates that two qualitatively distinct types of movement are required for solving different subsets of the tasks. Because we empirically observe that a linear separation exists, we implement χ using a simple linear classifier mapping task parameters to the numerical identifier of the manifold to which the task belongs.

We can also analyze the characteristics of the lower-dimensional, quasi-isometric embedding of policies produced by ISOMAP. Figure 3 shows the 2-dimensional embedding of a set of policies sampled from one of the manifolds. Embeddings for the other manifold have similar properties. Analysis of the residual variance of ISOMAP allows us to conclude that the intrinsic dimensionality of the skill manifold is 2; this is expected since we are essentially parameterizing a high-dimensional policy space by task parameters, which are drawn from the 2-dimensional space T. This implies that even though skill policies themselves are part of a 37-dimensional space, because there are just two degrees-of-freedom with which we can vary tasks, the policies themselves remain confined to a 2-dimensional manifold. In Figure 3 we use lighter-colored points to identify embeddings of policies for hitting targets at higher locations. From this observation it is possible to note how policies for similar tasks tend to remain geometrically close in the space of solutions.

Figure 3. 2-dimensional embedding of policy parameters.

Figure 4 shows some types of movements the arm is capable of executing when throwing the dart at specific targets. Figure 4a and Figure 4b present trajectories corresponding to policies aiming at targets high on the ceiling and low on the right wall, respectively; these were presented as training examples to the parameterized skill. Note that the link trajectories required to accurately hit a target are complex because we are using just a single actuated joint to control an arm with three joints.

Figure 4c shows a policy that was predicted by the parameterized skill for a new, unknown task corresponding to a target in the middle of the right wall. A total of five sample trajectories were presented to the parameterized skill and the corresponding predicted policy was further improved by two policy updates, after which the arm was capable of hitting the intended target perfectly.

Figure 4. Learned arm movements (a, b) presented as training examples to the parameterized skill; (c) predicted movement for a novel target.

Figure 5 shows the predicted policy parameter error, averaged over the parameters of 15 unknown tasks sampled uniformly at random, as a function of the number of examples used to learn the parameterized skill. This is a measure of the relative error between the policy parameters predicted by Θ and the parameters of a known good solution for the same task. The lower the error, the closer the predicted policy is (in norm) to the correct solution. After 6 samples are presented to the parameterized skill it is capable of predicting policies whose parameters are within 6% of the correct ones; with approximately 15 samples, this error stabilizes around 3%. Note that this type of accuracy is only possible because even though the spaces analyzed are high-dimensional, they are also highly structured; specifically, solutions to similar tasks lie on a lower-dimensional manifold whose regular topology can be exploited when generalizing known solutions to new problems.
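As a concrete reading of this measure, one plausible instantiation (an assumption, since the exact formula is not given in the text) is the norm of the parameter difference relative to the norm of the known solution:

```python
import numpy as np

def relative_parameter_error(theta_predicted, theta_reference):
    """One plausible instantiation of the relative policy parameter error (assumed)."""
    theta_predicted = np.asarray(theta_predicted)
    theta_reference = np.asarray(theta_reference)
    return np.linalg.norm(theta_predicted - theta_reference) / np.linalg.norm(theta_reference)
```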

Figure 5. Average predicted policy parameter error as a function of the number of sampled training tasks. (y-axis: average feature relative error; x-axis: sampled training task instances.)

Since some policy representations might be particularly sensitive to noise, we additionally measured the actual effectiveness of the predicted policy when directly applied to novel tasks. Specifically, we measure the distance between the position where the dart hits and the intended target; this measurement is obtained by executing the predicted policy directly, before any further learning takes place. Figure 6 shows that after 10 samples are presented to the parameterized skill, the average distance is 70cm. This is a reasonable error if we consider that targets can be placed anywhere on a surface that extends for a total of 10 meters. If the parameterized skill is presented with a total of 24 samples the average error decreases to 30cm, which roughly corresponds to the dart being thrown from 2 meters away and landing one dartboard away from the intended center.

Figure 6. Average distance to target (before learning) as a function of the number of sampled training tasks. (y-axis: average distance to target before learning, in cm; x-axis: sampled training task instances.)

Although these initial solutions are good, especially considering that no learning with the target task parameters took place, they are not perfect. We might therefore want to further improve them. Figure 7 shows how many additional policy updates are required to improve the policy predicted by the parameterized skill up to a point where it reaches a performance threshold. The dashed line in Figure 7 shows that on average 22 policy updates are required for finding a good policy when the agent has to learn from scratch. On the other hand, by using a parameterized skill trained with 9 examples it is already possible to decrease this number to just 4 policy updates. With 20 examples or more it takes the agent an average of 2 additional policy updates to meet the performance threshold.

Figure 7. Average number of policy updates required to improve the solution predicted by the parameterized skill as a function of the number of sampled training tasks (with vs. without a parameterized skill, averaged over tasks). (y-axis: policy updates to performance threshold; x-axis: sampled training task instances.)

7. Related Work

The simplest solution for learning a distribution of tasks in RL is to include τ, the task parameter vector, as part of the state descriptor and treat the entire class of tasks as a single MDP. This approach has several shortcomings: 1) learning and generalizing over tasks is slow since the state features corresponding to task parameters remain constant throughout episodes; 2) the number of basis functions required to approximate the value function or policy needs to be increased to capture all the non-trivial correlations caused by the added task features; 3) sample task policies cannot be collected in parallel and combined to accelerate the construction of the skill; and 4) if the distribution of tasks is non-stationary, there is no simple way of adapting a single estimated policy in order to deal with a new pattern of tasks.

Alternative approaches have been proposed under the general heading of skill transfer. Konidaris and Barto (2007) learn reusable options by representing them in an agent-centered state space but do not address the problem of how to construct skills for solving a family of related tasks. Soni and Singh (2006) create options whose termination criteria can be adapted on-the-fly to deal with changing aspects of a task. They do not, however, predict a complete parameterization of the policy for new tasks.

Other methods have been proposed to transfer a model or value function between given pairs of tasks, but not necessarily to reuse learned tasks and construct a parameterized solution. It is often assumed that a mapping between the features and actions of a source and target task exists and is known a priori, as in Taylor and Stone (2007). Liu and Stone (2006) propose a method for transferring a value function between pairs of tasks but require prior knowledge of the task dynamics in the form of a Dynamic Bayes Network. Hausknecht and Stone (2011) present a method for learning a parameterized skill for kicking a ball with varying amounts of force. They exhaustively test variations of one of the policy parameters known a priori to be relevant for the skill and then measure the resulting effect on the distance traveled by the ball. By assuming a quadratic relation between these variables, they are able to construct a regression model and invert it, thereby obtaining a closed-form expression for the policy parameter value required for a desired kick. This is an interesting example of the type of parameterized skill that we would like to construct, albeit a very domain-dependent one.

The work most closely related to ours is by Kober, Wilhelm, Oztop, and Peters (2012), who learn a mapping from task description to metaparameters of a DMP, estimating the mean value of each metaparameter given a task and the uncertainty with which it solves that task. Their method uses a parameterized skill framework similar to ours but requires the use of DMPs for policies and assumes that their metaparameters are sufficient to represent the class of tasks of interest. Bitzer, Havoutis, and Vijayakumar (2008) synthesize novel movements by modulating DMPs learned in a latent space and projecting them back onto the original pose space. Neither approach supports arbitrary task parameterizations or discontinuities in the skill manifold, which are important, for instance, for classes of movements whose description requires more than one DMP.

Finally, Braun et al. (2010) discuss how Bayesian modeling can be used to explain experimental data from cognitive and motor neuroscience that supports the idea of structure learning in humans, a concept very similar to the one of parameterized skills. The authors do not, however, propose a concrete method for identifying and constructing such skills.

8. Conclusions and Future Work

We have presented a general framework for constructing parameterized skills. The idea underlying our method is to sample a small number of task instances and generalize them to new problems by combining classifiers and non-linear regression models. This approach is effective in practice because it exploits the intrinsic structure of the policy space and because skill policies for similar tasks typically lie on a lower-dimensional manifold. Our framework allows for the construction of effective parameterized skills and is able to identify the number of qualitatively different strategies required for a given distribution of tasks.

This work can be extended in several important directions. First, the question of how to actively select training tasks in order to improve the overall readiness of a parameterized skill, given a distribution of tasks expected in the future, needs to be addressed. Another important open problem is how to properly deal with a non-stationary distribution of tasks. If a new task distribution is known exactly it might be possible to use it to resample instances from K and thus reconstruct the parameterized skill. However, more general strategies are needed if the task distribution changes in a way that is not known to the agent.

Another important question is how to analyze the topology and geometry of the policy space more efficiently. Methods for discovering the underlying global geometry of high-dimensional spaces typically require dense sampling of the manifold, which could require solving an unreasonable number of training tasks. Note, however, that most local policy search methods, like the Natural Actor-Critic and PoWER, move smoothly over the manifold of policies while searching for locally optimal solutions. Therefore, at each policy update during learning they provide us with a new sample which can be used to further train the parameterized skill; each task instance therefore results in a trajectory through policy parameter space. Integrating this type of sampling into the construction of the skill corresponds to a type of off-policy learning method, since samples collected while estimating one policy could be used to generalize it to different tasks.

Acknowledgments

This research was partially funded by the European Union under the FP7-ICT program (IM-CLeVeR project, grant agreement no. ICT-IP-231722).

References

Bitzer, S., Havoutis, I., and Vijayakumar, S. Synthesising novel movements through latent space modulation of scalable control policies. In Proceedings of the Tenth International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 199–209, 2008.

Braun, D., Waldert, S., Aertsen, A., Wolpert, D., and Mehring, C. Structure learning in a sensorimotor association task. PLoS ONE, 5(1):e8973, 2010.

Hausknecht, M. and Stone, P. Learning powerful kicks on the Aibo ERS-7: The quest for a striker. In RoboCup 2010: Robot Soccer World Cup XIV, volume 6556 of Lecture Notes in Artificial Intelligence, pp. 254–265. Springer Verlag, 2011.

Kober, J. and Peters, J. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems 21, pp. 849–856, 2008.

Kober, J., Wilhelm, A., Oztop, E., and Peters, J. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012. ISSN 0929-5593.

Konidaris, G. and Barto, A. Building portable options: Skill transfer in reinforcement learning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pp. 895–900, 2007.

Liu, Y. and Stone, P. Value-function-based transfer for reinforcement learning using structure mapping. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pp. 415–420, 2006.

Schaal, S., Peters, J., Nakanishi, J., and Ijspeert, A. Learning movement primitives. In Proceedings of the Eleventh International Symposium on Robotics Research. Springer, 2004.

Soni, V. and Singh, S. Reinforcement learning of hierarchical skills on the Sony Aibo robot. In Proceedings of the Fifth International Conference on Development and Learning, 2006.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

Taylor, M. and Stone, P. Cross-domain transfer for reinforcement learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.

Tenenbaum, J., de Silva, V., and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Dec):1453–1484, 2005.

Vapnik, V. The Nature of Statistical Learning Theory. Springer New York Inc., New York, NY, USA, 1995. ISBN 0-387-94559-8.