RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research

Alborz Geramifard*, Christoph Dann, Robert H. Klein, William Dabney*, Jonathan P. How


[Figure: linear value-function approximation. A simulator provides states whose features φ1, …, φn are weighted by learned parameters θ1, …, θn to form the value function and policy: V^π(s) ≈ θ^⊤ φ(s).]
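As a concrete illustration of the linear form in the figure, here is a minimal NumPy sketch; the one-hot feature map is only a stand-in for RLPy's richer representations (e.g., tile coding or radial basis functions).

import numpy as np

# Toy instance of the linear form above: V(s) ~= theta . phi(s).
n_features = 12
theta = np.zeros(n_features)          # learned weights theta_1 .. theta_n

def phi(s):
    """Feature vector phi(s) for a discrete state index s (one-hot here)."""
    features = np.zeros(n_features)
    features[s] = 1.0
    return features

def V(s):
    """Approximate state value: inner product of weights and features."""
    return theta @ phi(s)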

Interest in value-based RL with linear function approximation [1-4]

Increased specialization in RL ➙ need for a more granular framework

Comparison with state-of-the-art techniques in various domains [3-4]

An easy-to-use RL framework for both research and education

Easy Installation

Rapid Prototyping in Python

Grid World

Existing Gap

Lack of granularity to accommodate recent advances in RL [6]

Challenging for entry-level users due to the implementation language (e.g., C++) [6-7]

Not self-contained [5]

Examples

Conclusion

RLPy is an object-oriented reinforcement learning (RL) software package with a focus on value-function-based methods using linear function approximation and discrete actions. The framework was designed for both education and research purposes. It provides a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions), facilitating the recent increased specialization in RL. RLPy is written in Python to allow fast prototyping, but is also suitable for large-scale experiments through its built-in support for optimized numerical libraries and parallelization. Code profiling, domain visualizations, and data analysis are integrated in a self-contained package available under the Modified BSD License. All these properties allow users to compare various RL algorithms with little effort.
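The exchangeable components described above are assembled explicitly in user code. Below is a condensed sketch following the Grid World example pattern documented at rlpy.readthedocs.org; the class names come from that documentation, and exact constructor arguments may differ between versions.

from rlpy.Domains import GridWorld
from rlpy.Agents import Q_Learning
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment

def make_experiment(exp_id=1, path="./Results/gridworld"):
    """Assemble exchangeable components into a reproducible experiment."""
    domain = GridWorld()                                  # the MDP / simulator
    representation = Tabular(domain)                      # value-function features (tabular here)
    policy = eGreedy(representation, epsilon=0.1)         # exploration policy
    agent = Q_Learning(policy=policy,                     # learning algorithm
                       representation=representation,
                       discount_factor=domain.discount_factor)
    return Experiment(exp_id=exp_id, path=path,
                      domain=domain, agent=agent,
                      max_steps=10000, num_policy_checks=10,
                      checks_per_policy=10)

if __name__ == "__main__":
    experiment = make_experiment()
    experiment.run()     # learn, periodically evaluating the greedy policy
    experiment.plot()    # built-in plotting of return, steps, features, ...
    experiment.save()    # write results to `path` for later analysis

Swapping in a different representation, policy, or learning algorithm only changes the corresponding line, which is the fine-grained exchangeability the abstract refers to.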

RLPy is a new Python-based, open-source RL framework.

Simplifies the construction of new RL ideas

Accessible for both novice and expert users

Enables reproducible experiments


Problem

Abstract

References

[1] A. Tamar, H. Xu, S. Mannor, "Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations", In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 127-135, 2014.
[2] D. Calandriello, A. Lazaric, M. Restelli, "Sparse Multi-Task Reinforcement Learning", In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 819-827, Quebec, Canada, 2014.
[3] J. Z. Kolter and A. Y. Ng, "Regularization and Feature Selection in Least-Squares Temporal Difference Learning", In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 521-528, New York, NY, USA, 2009.
[4] A. Geramifard, T. Walsh, N. Roy, and J. How, "Batch iFDD: A Scalable Matching Pursuit Algorithm for Solving MDPs", In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
[5] D. Aberdeen, "LibPGRL: A High Performance Reinforcement Learning Library in C++", 2007. URL https://code.google.com/p/libpgrl.
[6] F. de Comite, "PIQLE: A Platform for Implementation of Q-Learning Experiments", 2006. URL http://piqle.sourceforge.net.
[7] G. Neumann, "The Reinforcement Learning Toolbox, Reinforcement Learning for Optimal Control Tasks", MSc thesis, TU Graz, 2005.

* Author is currently employed at Amazon.

[Figure: RLPy architecture. An Experiment couples a Learning Agent with a Domain: the agent's Policy π and Representation (value functions Q, V) choose action a_t, and the domain returns the next state s_{t+1} and reward r_{t+1} to the RL agent.]
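In code, the data flow in this figure is roughly the loop below; the method names (s0, step, pi, learn) are illustrative and not RLPy's exact internal interfaces.

def run_episode(domain, agent, max_steps=1000):
    """Schematic agent/domain interaction loop; interface names are illustrative."""
    s = domain.s0()                              # initial state from the domain
    episode_return = 0.0
    for _ in range(max_steps):
        a = agent.policy.pi(s)                   # policy picks a_t from the representation's Q-values
        r, s_next, terminal = domain.step(a)     # domain returns r_{t+1}, s_{t+1}
        agent.learn(s, a, r, s_next, terminal)   # update the representation's weights
        episode_return += r
        s = s_next
        if terminal:
            break
    return episode_return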

> pip install -U RLPy

[Figure: Cart Pole domain visualization (simulator view).]

Plotting

[Figure: built-in plotting of experiment results, e.g. return, number of features, and termination plotted against learning steps, learning episodes, or learning time.]

[Figure: call graph produced by the built-in profiler for a Pendulum learning run (total time 84.49 s). Dominant costs include feature computation (Representation:phi ≈ 48.6%, tiles:phi_nonTerminal ≈ 48.3%, tiles:_hash ≈ 15.7%), action selection (Representation:bestActions ≈ 35.4%, eGreedy:pi ≈ 33.9%), and the Greedy_GQ learning update (≈ 46.1%).]
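RLPy ships its own profiling support (the call graph above). As a rough stand-alone equivalent, a run can also be profiled with only the Python standard library; this sketch uses cProfile/pstats, not RLPy's built-in profiling interface, and make_experiment refers to the hypothetical helper sketched earlier.

import cProfile
import pstats

# Profile a learning run with the standard library (generic sketch).
profiler = cProfile.Profile()
profiler.enable()
make_experiment(exp_id=1).run()                  # code under measurement
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)  # top 15 by cumulative time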

RLPy (rlpy.readthedocs.org)

Sponsored by:

FA9550-09-1-0522 N000141110688

Built-in Hyperparameter Optimization

Improved Experimentation

Reproducible

Parallel Execution (see the sketch after this list)

Optimized Implementation (Cython, C++)

Batteries Included!

20 Domains, 8 Learning Algorithms

4 Policies, 7 Representations

Built-in Profiling

Improved Granularity of Agent Components via Object-Oriented Python
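A minimal sketch of how the reproducibility and parallel-execution points combine: each run is identified by an integer exp_id (used as the seed in the experiment sketch earlier), so independent runs can be farmed out to worker processes and still be reproduced individually. multiprocessing here is a generic stand-in, not RLPy's own job-distribution tooling.

from multiprocessing import Pool

def run_one(exp_id):
    """One seeded, independent run; make_experiment is the hypothetical helper sketched above."""
    exp = make_experiment(exp_id=exp_id, path="./Results/gridworld")
    exp.run()
    exp.save()
    return exp_id

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        pool.map(run_one, range(1, 11))   # 10 reproducible runs executed in parallel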
