
The Nuts and Bolts of Deep RL Research

John Schulman

December 9th, 2016


Outline

Approaching New Problems

Ongoing Development and Tuning

General Tuning Strategies for RL

Policy Gradient Strategies

Q-Learning Strategies

Miscellaneous Advice


Approaching New Problems


New Algorithm? Use Small Test Problems

- Run experiments quickly

- Do hyperparameter search

- Interpret and visualize the learning process: state visitation, value function, etc.

- Counterpoint: don't overfit the algorithm to a contrived problem

- Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)


New Task? Make It Easier Until Signs of Life

- Provide good input features

- Shape the reward function


POMDP Design

- Visualize a random policy: does it sometimes exhibit the desired behavior?

- Human control
  - Atari: can you see the game features in the downsampled image?

- Plot time series of observations and rewards. Are they on a reasonable scale?
  - hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta_x / delta_t

- Histogram observations and rewards (see the sketch below)
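
As a rough illustration of the last two bullets, here is a minimal sketch, assuming the classic gym API and matplotlib; the environment id, episode budget, and plotted dimension are arbitrary choices:

    import gym
    import numpy as np
    import matplotlib.pyplot as plt

    # Roll out a random policy and collect raw observations and rewards so we
    # can eyeball their scales before any learning happens.
    env = gym.make("Hopper-v1")  # substitute any environment id you care about
    obs_list, rew_list = [], []
    ob = env.reset()
    for _ in range(1000):
        ob, rew, done, _ = env.step(env.action_space.sample())
        obs_list.append(ob)
        rew_list.append(rew)
        if done:
            ob = env.reset()
    obs, rews = np.array(obs_list), np.array(rew_list)

    plt.figure(); plt.plot(rews); plt.title("reward per step")            # time series
    plt.figure(); plt.hist(rews, bins=50); plt.title("reward histogram")
    plt.figure(); plt.hist(obs[:, 0], bins=50); plt.title("obs dim 0 histogram")
    plt.show()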


Run Your Baselines

- Don't expect them to work with default parameters

- Recommended:
  - Cross-entropy method [1] (minimal sketch below)
  - Well-tuned policy gradient method [2]
  - Well-tuned Q-learning + SARSA method

[1] Istvan Szita and Andras Lorincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural Computation.

[2] https://github.com/openai/rllab
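
For reference, a minimal cross-entropy-method baseline might look like the following. This is a sketch only: the linear policy, CartPole-v0 environment, population size, elite count, and added noise term are illustrative choices, not the tuned setup from the Tetris paper.

    import gym
    import numpy as np

    def run_episode(env, theta, max_steps=500):
        """Score one parameter vector: a linear policy, action = argmax(obs . W)."""
        n_obs, n_act = env.observation_space.shape[0], env.action_space.n
        W = theta.reshape(n_obs, n_act)
        ob, total = env.reset(), 0.0
        for _ in range(max_steps):
            ob, rew, done, _ = env.step(int(np.argmax(ob.dot(W))))
            total += rew
            if done:
                break
        return total

    env = gym.make("CartPole-v0")
    dim = env.observation_space.shape[0] * env.action_space.n
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_samples, n_elite = 50, 10

    for it in range(20):
        # Sample a population, evaluate it, keep the elites, refit the Gaussian.
        thetas = mu + sigma * np.random.randn(n_samples, dim)
        scores = np.array([run_episode(env, th) for th in thetas])
        elite = thetas[np.argsort(scores)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-2  # extra noise, in the spirit of [1]
        print("iter %d  mean %.1f  max %.1f" % (it, scores.mean(), scores.max()))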


Run with More Samples Than Expected

- Early in the tuning process, you may need a huge number of samples
  - Don't be deterred by published work

- Examples:
  - TRPO on Atari: 100K timesteps per batch for KL = 0.01
  - DQN on Atari: update freq = 10K, replay buffer size = 1M


Ongoing Development and Tuning


It Works! But Don’t Be Satisfied

- Explore sensitivity to each parameter
  - If it's too sensitive, it doesn't really work; you just got lucky

- Look for health indicators
  - VF fit quality
  - Policy entropy
  - Update size in output space and parameter space
  - Standard diagnostics for deep networks


Continually Benchmark Your Code

- If reusing code, regressions occur

- Run a battery of benchmarks occasionally


Always Use Multiple Random Seeds
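
One way to bake this into your workflow (a sketch; train is a hypothetical stand-in for your actual training run):

    import numpy as np

    def train(seed):
        """Hypothetical stand-in for a full training run: seed the environment,
        the framework RNG, and exploration noise from `seed`, train, and return
        the final average return."""
        rng = np.random.RandomState(seed)
        return float(rng.randn())  # placeholder result

    results = [train(seed) for seed in range(5)]
    print("mean %.2f  std %.2f  min %.2f  max %.2f" %
          (np.mean(results), np.std(results), np.min(results), np.max(results)))

Report the spread across seeds, not a single cherry-picked curve.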


Always Be Ablating

- Different tricks may substitute for one another
  - Especially whitening

- "Regularize" to favor simplicity in algorithm design space
  - As usual, simplicity → generalization


Automate Your Experiments

- Don't spend all day watching your code print out numbers

- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)


General Tuning Strategies for RL


Whitening / Standardizing Data

- If observations have an unknown range, standardize them (see the sketch below)
  - Compute a running estimate of the mean and standard deviation
  - x' = clip((x − μ)/σ, −10, 10)

- Rescale the rewards, but don't shift the mean, as that affects the agent's will to live

- Standardize prediction targets (e.g., value functions) the same way
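
A minimal running standardizer along these lines (a sketch, not the rllab implementation; the epsilon constant is an arbitrary choice, while the ±10 clip range follows the bullet above):

    import numpy as np

    class RunningStandardizer:
        """Running mean/std estimate with clipping, applied element-wise."""
        def __init__(self, clip=10.0):
            self.n = 0
            self.mean = 0.0
            self.mean_sq = 0.0
            self.clip = clip

        def __call__(self, x):
            x = np.asarray(x, dtype=np.float64)
            self.n += 1
            # Running estimates of E[x] and E[x^2]; std follows from the pair.
            self.mean = self.mean + (x - self.mean) / self.n
            self.mean_sq = self.mean_sq + (x ** 2 - self.mean_sq) / self.n
            std = np.sqrt(np.maximum(self.mean_sq - self.mean ** 2, 1e-8))
            return np.clip((x - self.mean) / std, -self.clip, self.clip)

For rewards, per the second bullet, divide by a running standard deviation but skip the mean subtraction.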


Generally Important Parameters

- Discount (see the sketch below)
  - Return_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...
  - Effective time horizon: 1 + γ + γ² + ... = 1/(1 − γ)
  - I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
  - A low γ works well for a well-shaped reward
  - In TD(λ) methods, you can get away with a high γ when λ < 1

- Action frequency
  - Solvable with human control (if possible)
  - View random exploration
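
To make the discount arithmetic concrete, here is a small sketch (function and variable names are illustrative) computing discounted returns and the effective horizon:

    import numpy as np

    def discounted_returns(rewards, gamma):
        """Return_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., computed backwards."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Effective time horizon 1/(1 - gamma): gamma = 0.99 -> about 100 steps
    for gamma in (0.9, 0.99, 0.999):
        print(gamma, 1.0 / (1.0 - gamma))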


General RL Diagnostics

- Look at the min/max/stdev of episode returns, along with the mean (helper sketch below)

- Look at episode lengths: they sometimes provide additional information
  - Solving the problem faster, losing the game more slowly
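
A tiny helper along these lines (illustrative; assumes you already collect per-episode returns and lengths):

    import numpy as np

    def print_episode_stats(returns, lengths):
        """Report spread, not just the mean; length trends often explain return changes."""
        returns, lengths = np.asarray(returns), np.asarray(lengths)
        print("return  mean %8.2f  std %8.2f  min %8.2f  max %8.2f" %
              (returns.mean(), returns.std(), returns.min(), returns.max()))
        print("length  mean %8.1f  min %5d  max %5d" %
              (lengths.mean(), lengths.min(), lengths.max()))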


Policy Gradient Strategies


Entropy as Diagnostic

- Premature drop in policy entropy ⇒ no learning

- Alleviate it with an entropy bonus or KL penalty (see the sketch below)
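
For a discrete policy the entropy diagnostic is a one-liner (a sketch; the ent_coef value of roughly 0.01 is a common starting point and an assumption here, not a number from the slides):

    import numpy as np

    def mean_entropy(probs):
        """Mean entropy of a batch of discrete action distributions, shape (N, n_actions)."""
        return float(-np.sum(probs * np.log(probs + 1e-8), axis=1).mean())

    # Log this every iteration. A fast drop toward zero means the policy became
    # near-deterministic before learning anything useful. One remedy is to subtract
    # an entropy bonus from the surrogate loss:
    #     loss = policy_loss - ent_coef * entropy    # ent_coef ~ 0.01 is a common starting point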


KL as Diagnostic

- Compute KL[π_old(· | s), π(· | s)] (sketch below)

- A KL spike ⇒ drastic loss of performance

- No learning progress might mean the steps are too large
  - batchsize = 100K converges to a different result than batchsize = 20K
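
For discrete action distributions the KL diagnostic can be computed over a batch of sampled states like this (a sketch):

    import numpy as np

    def mean_kl(p_old, p_new):
        """Mean KL[pi_old(.|s), pi(.|s)] over a batch of states; probs have shape (N, n_actions)."""
        return float(np.sum(p_old * (np.log(p_old + 1e-8) - np.log(p_new + 1e-8)), axis=1).mean())

    # Log this every update: spikes usually precede a drastic loss of performance.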


Baseline Explained Variance

- explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]
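
In code, with the same convention (1 is a perfect value-function fit; values at or below 0 mean the baseline is no better than predicting the mean):

    import numpy as np

    def explained_variance(predicted_values, empirical_returns):
        """1 - Var[return - value] / Var[return]."""
        var_ret = np.var(empirical_returns)
        return np.nan if var_ret == 0 else 1.0 - np.var(empirical_returns - predicted_values) / var_ret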


Policy Initialization

- More important than in supervised learning: it determines the initial state visitation

- Use a zero or tiny final layer, to maximize entropy
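
One way to realize the "tiny final layer" idea for an MLP policy (a NumPy sketch; the 1/sqrt(fan-in) hidden-layer init, the 0.01 output scale, and the layer sizes are illustrative assumptions):

    import numpy as np

    def init_mlp_params(sizes, final_scale=0.01, rng=np.random):
        """Ordinary init for hidden layers, but scale the final layer toward zero so
        the initial action distribution is close to uniform (maximum entropy)."""
        params = []
        for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
            W = rng.randn(n_in, n_out) / np.sqrt(n_in)
            if i == len(sizes) - 2:      # final (output) layer
                W *= final_scale         # or use exactly zero
            params.append((W, np.zeros(n_out)))
        return params

    params = init_mlp_params([4, 64, 64, 2])  # e.g., a CartPole-sized policy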


Q-Learning Strategies

- Optimize memory usage carefully: you'll need it for the replay buffer

- Learning rate schedules

- Exploration schedules (see the sketch below)

- Be patient: DQN converges slowly
  - On Atari, it often takes 10-40M frames to get a policy much better than random
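
A simple linear schedule covers both the learning-rate and exploration bullets (a sketch; the endpoints and durations below are illustrative, not prescribed settings):

    def linear_schedule(start, end, duration):
        """Anneal linearly from `start` to `end` over `duration` steps, then hold."""
        def value(step):
            frac = min(float(step) / duration, 1.0)
            return start + frac * (end - start)
        return value

    epsilon = linear_schedule(1.0, 0.1, 1000000)    # exploration (epsilon-greedy)
    lr = linear_schedule(1e-4, 5e-5, 10000000)      # learning rate
    print(epsilon(0), epsilon(500000), epsilon(2000000))  # 1.0, 0.55, 0.1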

Thanks to Szymon Sidor for suggestions


Miscellaneous Advice

- Read older textbooks and theses, not just conference papers

- Don't get stuck on problems; you can't solve everything at once
  - Exploration problems like cart-pole swing-up
  - DQN on Atari vs. CartPole


Thanks!