RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research

Alborz Geramifard*, Christoph Dann, Robert H. Klein, William Dabney*, Jonathan P. How

Problem

[Figure: several simulators feeding a value function and policy; the value function is represented linearly as V^π(s) ≈ θᵀφ(s), with features φ_1, ..., φ_n and weights θ_1, ..., θ_n.]

- Interest in value-based RL with linear function approximation [1-4]
- Increased specialization in RL → a more granular framework is needed [3-4]
- Comparison with state-of-the-art techniques across various domains
- An easy-to-use RL framework for both research and education
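
To make the linear value-function approximation in the figure above concrete, here is a minimal, self-contained Python sketch. The feature map, weights, and one-dimensional state are invented for illustration and are not part of RLPy.

import numpy as np

def phi(s, n=4):
    # Toy feature map: n Gaussian bumps over a 1-D state in [0, 1]
    # (purely illustrative; RLPy ships its own representations).
    centers = np.linspace(0.0, 1.0, n)
    return np.exp(-((s - centers) ** 2) / 0.05)

theta = np.array([0.2, -0.5, 1.0, 0.3])  # weights, one per feature

def V(s):
    # Linear value estimate: V(s) ~= theta^T phi(s)
    return float(theta @ phi(s))

print(V(0.25))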


Existing Gap

- Lack of granularity to accommodate recent advances in RL [6]
- Challenging for entry-level users because of the implementation languages (e.g., C++) [6-7]
- Not self-contained [5]

Abstract

RLPy is an object-oriented reinforcement learning (RL) software package focused on value-function-based methods with linear function approximation and discrete actions. The framework is designed for both education and research. It provides a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions), facilitating the recent increase in specialization within RL. RLPy is written in Python to allow fast prototyping, but it is also suitable for large-scale experiments through built-in support for optimized numerical libraries and parallelization. Code profiling, domain visualizations, and data analysis are integrated into a self-contained package available under the Modified BSD License. Together, these properties allow users to compare RL algorithms with little effort.

Conclusion

RLPy is a new Python-based, open-source RL framework that:

- Simplifies the construction of new RL ideas
- Is accessible to both novice and expert users
- Enables reproducible experiments

References

[1] A. Tamar, H. Xu, and S. Mannor, "Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations," in Proceedings of the 31st International Conference on Machine Learning (ICML), pages 127-135, 2014.
[2] D. Calandriello, A. Lazaric, and M. Restelli, "Sparse Multi-Task Reinforcement Learning," in Advances in Neural Information Processing Systems (NIPS), pages 819-827, Quebec, Canada, 2014.
[3] J. Z. Kolter and A. Y. Ng, "Regularization and Feature Selection in Least-Squares Temporal Difference Learning," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 521-528, New York, NY, USA, 2009.
[4] A. Geramifard, T. Walsh, N. Roy, and J. How, "Batch iFDD: A Scalable Matching Pursuit Algorithm for Solving MDPs," in Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
[5] D. Aberdeen, "LibPGRL: A High Performance Reinforcement Learning Library in C++," 2007. URL https://code.google.com/p/libpgrl.
[6] F. de Comite, "PIQLE: A Platform for Implementation of Q-Learning Experiments," 2006. URL http://piqle.sourceforge.net.
[7] G. Neumann, "The Reinforcement Learning Toolbox: Reinforcement Learning for Optimal Control Tasks," MSc thesis, TU Graz, 2005.

* Author is currently employed at Amazon.

Examples

[Architecture diagram: an Experiment runs an RL (learning) agent against a Domain. The agent combines a Representation of the value function (Q, V) with a Policy π, sends action a_t to the domain, and receives the next state and reward s_{t+1}, r_{t+1}.]

Easy Installation

> pip install -U RLPy

Rapid Prototyping in Python

[Domain visualizations for the Grid World and Cart Pole simulators, together with plots of learning steps, learning episodes, learning time, return, number of features, and termination.]
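
As a rough sketch of how these components compose in code, the snippet below assembles a Grid World experiment in the spirit of the RLPy tutorials. The class names follow the public RLPy documentation, but exact constructor arguments differ between versions, so treat the parameter names here as assumptions rather than the definitive API.

from rlpy.Domains import GridWorld
from rlpy.Agents import Q_Learning
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment

domain = GridWorld()                                   # simulator
representation = Tabular(domain)                       # value-function features
policy = eGreedy(representation, epsilon=0.1)          # exploration policy
agent = Q_Learning(policy, representation,             # learning algorithm
                   discount_factor=domain.discount_factor)

experiment = Experiment(agent=agent, domain=domain, exp_id=1,
                        max_steps=10000, num_policy_checks=10,
                        checks_per_policy=10, path="./Results/GridWorld")
experiment.run()   # learn, with periodic policy evaluations
experiment.save()  # write results to disk for later plotting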

Built-in Profiling

[Profiler call-graph for a sample run, total time 84.49 s. The largest shares of runtime are Representation:phi (48.6%), Greedy_GQ:learn (46.1%), Representation:bestActions (35.4%), Experiment:performanceRun (26.3%), and the low-level tile-coding hashing routines.]
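
The call graph summarized above is the kind of output a deterministic Python profiler produces. As a generic illustration (independent of RLPy's built-in profiling support), the following self-contained sketch uses the standard-library cProfile and pstats modules; the workload function is a stand-in for an actual experiment run.

import cProfile
import pstats

def run_experiment():
    # Stand-in workload; in practice this would build and run an
    # experiment as in the earlier sketch.
    total = 0.0
    for i in range(1, 200000):
        total += 1.0 / i
    return total

cProfile.run("run_experiment()", "experiment.prof")   # record per-call timings
stats = pstats.Stats("experiment.prof")
stats.sort_stats("cumulative").print_stats(20)        # top 20 by cumulative time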

RLPy (rlpy.readthedocs.org)

Sponsored by: FA9550-09-1-0522, N000141110688

- Built-in hyperparameter optimization
- Improved experimentation: reproducible, parallel execution, optimized implementation (Cython, C++)
- Batteries included! 20 domains, 8 learning algorithms, 4 policies, 7 representations
- Built-in profiling
- Improved granularity of the agent through object-oriented Python
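
A common way to obtain the reproducible, parallel execution listed above is to key every run on an integer experiment id that seeds all randomness and to fan the ids out across processes. The sketch below uses only the Python standard library with a toy workload; it is not RLPy's own job launcher, just an illustration of the seeding pattern.

from multiprocessing import Pool
import random

def run_one(exp_id):
    # Seed all randomness from exp_id alone so that re-running the same id
    # reproduces the same results. (A toy workload stands in for an
    # actual experiment here.)
    rng = random.Random(exp_id)
    returns = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return exp_id, sum(returns) / len(returns)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(run_one, range(1, 11))  # 10 independent, seeded runs
    for exp_id, avg in results:
        print(f"run {exp_id}: mean return {avg:+.3f}")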