Page 1:

Distributed Deep Q-Learning

Hao Yi Ong

joint work with K. Chavez, A. Hong

Stanford University

Box, 6/3/15

Page 2:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Introduction 2/39

Page 3:

Motivation

▶ long-standing challenge of reinforcement learning (RL)

– control with high-dimensional sensory inputs (e.g., vision, speech)
– shift away from reliance on hand-crafted features

▶ utilize breakthroughs in deep learning for RL [M+13, M+15]

– extract high-level features from raw sensory data
– learn better representations than handcrafted features with neural network architectures used in supervised and unsupervised learning

▶ create fast learning algorithm

– train efficiently with stochastic gradient descent (SGD)
– distribute training process to accelerate learning [DCM+12]

Introduction 3/39

Page 4:

Success with Atari games

Introduction 4/39

Page 5:

Theoretical complications

deep learning algorithms require

▶ huge training datasets

– sparse, noisy, and delayed reward signal in RL
– delay of ∼10^3 time steps between actions and resulting rewards
– cf. direct association between inputs and targets in supervised learning

▶ independence between samples

– sequences of highly correlated states in RL problems

▶ fixed underlying data distribution

– distribution changes as RL algorithm learns new behaviors

Introduction 5/39

Page 6:

Goals

distributed deep RL algorithm

▶ robust neural network agent

– must succeed in challenging test problems

▶ control policies with high-dimensional sensory input

– obtain better internal representations than handcrafted features

▶ fast training algorithm

– efficiently produce, use, and process training data

Introduction 6/39

Page 7:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Reinforcement learning 7/39

Page 8:

Playing games

[Diagram: the agent interacts with the environment (a game emulator). Action: game input; state: series of screens and inputs; reward: game score change.]

objective: learned policy maximizes future rewards

R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'},

▶ discount factor γ
▶ reward (game score change) at time t′, r_{t′}
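To make the return concrete, here is a minimal Python sketch (names and values are illustrative, not from the talk) that computes the discounted return R_t for a recorded reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0.

    rewards: list of per-step rewards r_t, ..., r_T
    gamma:   discount factor in [0, 1]
    """
    R = 0.0
    # Accumulate from the last reward backwards: R <- r + gamma * R.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: three +1 rewards with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], 0.9))
```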

Reinforcement learning 8/39

Page 9:

State-action value function

▶ basic idea behind RL is to estimate

Q^\star(s, a) = \max_\pi \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right],

where π maps states to actions (or distributions over actions)

▶ optimal value function obeys the Bellman equation

Q^\star(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^\star(s', a') \mid s, a \right],

where \mathcal{E} is the MDP environment
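A one-step derivation of the Bellman equation (standard, and only implicit on the slide), written in LaTeX using the return definition from the previous slide:

```latex
\begin{align*}
R_t &= r_t + \gamma R_{t+1} \\
Q^\star(s, a)
  &= \max_\pi \mathbb{E}\!\left[ r_t + \gamma R_{t+1} \mid s_t = s,\, a_t = a,\, \pi \right] \\
  &= \mathbb{E}_{s' \sim \mathcal{E}}\!\left[ r + \gamma \max_{a'} Q^\star(s', a') \,\middle|\, s, a \right],
\end{align*}
```

where the last step uses the fact that an optimal policy acts greedily with respect to Q^\star in the successor state s'.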

Reinforcement learning 9/39

Page 10:

Value approximation

▶ typically, a linear function approximator is used to estimate Q^\star

Q(s, a; \theta) \approx Q^\star(s, a),

which is parameterized by θ

▶ we introduce the Q-network

– nonlinear neural network state-action value function approximator
– “Q” for Q-learning

Reinforcement learning 10/39

Page 11:

Q-network

▶ trained by minimizing a sequence of loss functions

L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ (y_i - Q(s, a; \theta_i))^2 \right],

with

– iteration number i
– target y_i = E_{s′∼E}[ r + γ max_{a′} Q(s′, a′; θ_{i−1}) | s, a ] (see the numerical sketch after this list)
– “behavior distribution” (exploration policy) ρ(s, a)

▶ architecture varies according to application
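As an illustration, a minimal numpy sketch of the sampled target and squared loss for one minibatch (made-up shapes and random stand-ins for the network outputs, not the talk's Caffe code):

```python
import numpy as np

gamma = 0.99
batch = 32
num_actions = 4

# Stand-ins for network outputs on a sampled minibatch:
# q_next comes from the older target parameters theta_{i-1}.
q_pred = np.random.randn(batch, num_actions)   # Q(s_j, . ; theta_i)
q_next = np.random.randn(batch, num_actions)   # Q(s_{j+1}, . ; theta_{i-1})
rewards = np.random.randn(batch)
actions = np.random.randint(num_actions, size=batch)
terminal = np.random.rand(batch) < 0.1         # episode-ended flags

# y_j = r_j                                   if s_{j+1} is terminal
#     = r_j + gamma * max_a' Q(s_{j+1}, a')   otherwise
targets = rewards + gamma * q_next.max(axis=1) * (~terminal)

# Squared TD error, averaged over the minibatch.
td_error = targets - q_pred[np.arange(batch), actions]
loss = np.mean(td_error ** 2)
print(loss)
```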

Reinforcement learning 11/39

Page 12:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Serial algorithm 12/39

Page 13:

Preprocessing

[Pipeline: raw screen → downsample + grayscale → final input]
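A rough Python sketch of this kind of preprocessing (assuming an RGB frame stored as a numpy array; the exact crop and output resolution used in the talk are not shown, so the 84 × 84 size here is an assumption):

```python
import numpy as np

def preprocess(frame, out_size=84):
    """Convert an RGB game frame (H, W, 3) to a small grayscale image.

    A simple stand-in for the slide's downsample + grayscale step:
    luminance conversion, then naive striding down to out_size x out_size.
    """
    gray = frame @ np.array([0.299, 0.587, 0.114])      # RGB -> luminance
    h, w = gray.shape
    ys = np.linspace(0, h - 1, out_size).astype(int)    # row indices to keep
    xs = np.linspace(0, w - 1, out_size).astype(int)    # column indices to keep
    return gray[np.ix_(ys, xs)].astype(np.float32)

# Example: a fake 210 x 160 RGB frame, as in Atari.
print(preprocess(np.random.randint(0, 256, (210, 160, 3))).shape)  # (84, 84)
```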

Serial algorithm 13/39

Page 14:

Network architecture

Serial algorithm 14/39

Page 15:

Convolutional neural network

▶ biologically inspired by the visual cortex

▶ CNN example: single layer, single frame to single filter, stride = 1

Serial algorithm 15/39

Page 16:

Stochastic gradient descent

▶ optimize the Q-network loss function by gradient descent

\theta := \theta - \alpha \nabla_\theta L_i(\theta_i),

with

– learning rate α

▶ for computational expedience

– update weights after every time step
– avoid computing full expectations
– replace with single samples from ρ and E (see the sketch below)
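For intuition, a single-sample semi-gradient step with a toy linear approximator Q(s, a; θ) = θᵀφ(s, a) (illustrative features and step size, not the talk's Q-network):

```python
import numpy as np

def sgd_step(theta, phi_sa, reward, phi_next_all, alpha=0.01, gamma=0.99):
    """One semi-gradient step on the squared TD error for a linear Q.

    phi_sa:       feature vector of the taken (s, a) pair
    phi_next_all: feature vectors of (s', a') for every action a'
    """
    q_sa = theta @ phi_sa
    target = reward + gamma * max(theta @ phi for phi in phi_next_all)
    # For linear Q, grad_theta Q(s, a; theta) = phi(s, a).
    return theta + alpha * (target - q_sa) * phi_sa

theta = np.zeros(8)
theta = sgd_step(theta, np.random.randn(8), 1.0,
                 [np.random.randn(8) for _ in range(4)])
print(theta)
```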

Serial algorithm 16/39

Page 17:

Q-learning

Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)

▶ model-free RL

– avoids estimating the environment E

▶ off-policy

– learns the greedy policy a = argmax_a Q(s, a; θ)
– uses a behavior distribution selected by an ε-greedy strategy

Serial algorithm 17/39

Page 18:

Experience replay

a kind of short-term memory

▶ trains the optimal policy using a “behavior policy” (off-policy)

– learns the policy π*(s) = argmax_a Q(s, a; θ)
– uses an ε-greedy strategy (behavior policy) for state-space exploration

▶ store the agent’s experiences at each time step

e_t = (s_t, a_t, r_t, s_{t+1})

– experiences form a replay memory dataset with fixed capacity (a minimal buffer sketch follows this list)
– execute Q-learning updates with random samples of experience
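A minimal replay memory sketch in Python (illustrative only; the capacity and batch size here are arbitrary, not the values used in the talk):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off the left

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Uniform sampling breaks correlations between consecutive states.
        return random.sample(self.buffer, batch_size)
```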

Serial algorithm 18/39

Page 19:

Serial deep Q-learning

given replay memory D with capacity N

initialize Q-networks Q, Q̂ with the same random weights θ

repeat until timeout
  initialize frame sequence s_1 = {x_1} and preprocessed state φ_1 = φ(s_1)
  for t = 1, . . . , T

    1. select action a_t = argmax_a Q(φ(s_t), a; θ) with probability 1 − ε, or a random action otherwise

    2. execute action a_t and observe reward r_t and frame x_{t+1}

    3. append s_{t+1} = (s_t, a_t, x_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1})
    4. store experience (φ_t, a_t, r_t, φ_{t+1}) in D
    5. uniformly sample minibatch (φ_j, a_j, r_j, φ_{j+1}) ∼ D

    6. set y_j = r_j if φ_{j+1} is terminal, otherwise y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ)

    7. perform a gradient descent step for Q on the minibatch

    8. every C steps reset Q̂ = Q
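Putting the steps together, here is a compact Python sketch of the serial loop. It is a toy stand-in: a linear Q over random features and a dummy one-step environment replace the emulator, CNN, and preprocessing from the talk, and all constants are illustrative.

```python
import random
from collections import deque

import numpy as np

gamma, eps, alpha, C = 0.99, 0.1, 0.01, 100
n_features, n_actions = 16, 4

theta = np.zeros((n_actions, n_features))   # weights of Q
theta_hat = theta.copy()                    # weights of the target network Q-hat
D = deque(maxlen=10_000)                    # replay memory with fixed capacity

def env_step(phi, a):
    """Dummy emulator: random next features, random reward, ~5% chance of terminal."""
    return np.random.randn(n_features), float(np.random.randn()), random.random() < 0.05

phi = np.random.randn(n_features)           # preprocessed initial state
for t in range(1, 5001):
    # 1. epsilon-greedy action selection
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = int(np.argmax(theta @ phi))
    # 2.-4. act, observe, and store the experience
    phi_next, r, terminal = env_step(phi, a)
    D.append((phi, a, r, phi_next, terminal))
    # 5.-7. sample a minibatch and take a semi-gradient step on each sample
    if len(D) >= 32:
        for pj, aj, rj, pj1, term in random.sample(D, 32):
            yj = rj if term else rj + gamma * float(np.max(theta_hat @ pj1))
            theta[aj] += alpha * (yj - theta[aj] @ pj) * pj
    # 8. every C steps reset the target network
    if t % C == 0:
        theta_hat = theta.copy()
    phi = np.random.randn(n_features) if terminal else phi_next
```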

Serial algorithm 19/39

Page 20:

Theoretical complications

deep learning algorithms require

▶ huge training datasets

▶ independence between samples

▶ fixed underlying data distribution

Serial algorithm 20/39

Page 21:

Deep Q-learning

avoids theoretical complications

▶ greater data efficiency

– each experience is potentially used in many weight updates

▶ reduced correlations between samples

– randomizing samples breaks correlations from consecutive samples

▶ experience replay averages the behavior distribution over states

– smooths out learning
– avoids oscillations or divergence in gradient descent

Serial algorithm 21/39

Page 22:

Cat video

Mini-break 22/39

Page 23:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Distributed algorithm 23/39

Page 24:

Data parallelism

downpour SGD: generic asynchronous distributed SGD

θ := θ − α Δθ

[Diagram: workers fetch parameters θ from the parameter server and push gradient updates Δθ back.]
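A minimal sketch of the downpour-style pattern in Python (threads plus a lock stand in for separate machines, and the gradient is a random placeholder rather than the talk's Caffe-computed minibatch gradient):

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters and applies asynchronous gradient updates."""

    def __init__(self, dim, alpha=0.01):
        self.theta = np.zeros(dim)
        self.alpha = alpha
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.theta.copy()

    def push(self, grad):
        with self.lock:
            self.theta -= self.alpha * grad     # theta := theta - alpha * delta-theta

def worker(server, steps=100):
    for _ in range(steps):
        theta = server.fetch()                  # possibly stale parameters
        grad = np.random.randn(theta.size)      # placeholder for a minibatch gradient
        server.push(grad)

server = ParameterServer(dim=1000)
threads = [threading.Thread(target=worker, args=(server,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(server.theta))
```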

Distributed algorithm 24/39

Page 25:

Model parallelism

on each worker machine

▶ computation of the gradient is pushed down to hardware

– parallelized according to available CPU/GPU resources
– uses the Caffe deep learning framework

▶ complexity scales linearly with the number of parameters

– GPU provides a speedup, but limits model size
– CPU is slower, but the model can be much larger

Distributed algorithm 25/39

Page 26:

Implementation

▶ data shards are generated locally on each model worker in real time

– data is stored independently for each worker
– since game emulation is simple, generating data is fast
– simple fault-tolerance approach: regenerate data if a worker dies

▶ the algorithm scales very well with data

– since data lives locally on the workers, no training data is sent over the network

▶ parameters are updated with gradients using RMSprop or AdaGrad

▶ communication pattern: multiple asynchronous all-reduces

– one-to-all and all-to-one, but asynchronous for every minibatch

Distributed algorithm 26/39

Page 27:

Implementation

▶ bottleneck is the parameter update time on the parameter server

– e.g., if a parameter-server gradient update takes 10 ms, then we can only do up to 100 updates per second (using buffers, etc.)

▶ trade-off between parallel updates and model staleness

– because a worker is likely using a stale model, the updates are “noisy” and not of the same quality as in the serial implementation

Distributed algorithm 27/39

Page 28:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Numerical experiments 28/39

Page 29:

Evaluation

Numerical experiments 29/39

Page 30:

Snake

▶ parameters

– snake length grows with the number of apples eaten (+1 reward)
– one apple at any time, regenerated once eaten
– n × n array, with a walled-off world (−1 if the snake dies)
– want to maximize the score, equal to apples eaten (minus 1)

▶ complexity

– four possible states for each cell: {empty, head, body, apple} (one possible encoding is sketched below)
– state-space cardinality is roughly O(n^4 · 2^{n^2})
– four possible actions: {north, south, east, west}
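To make the state representation concrete, here is one plausible encoding of the n × n board as Q-network input (an illustrative choice; the talk does not spell out its exact encoding): one binary channel per cell type.

```python
import numpy as np

CELL_TYPES = ("empty", "head", "body", "apple")

def encode_board(board):
    """Turn an (n, n) array of cell-type indices into an (n, n, 4) one-hot tensor."""
    n = board.shape[0]
    x = np.zeros((n, n, len(CELL_TYPES)), dtype=np.float32)
    for c in range(len(CELL_TYPES)):
        x[..., c] = (board == c)
    return x

# Example: a 5 x 5 board with the snake's head at (2, 2) and an apple at (0, 4).
board = np.zeros((5, 5), dtype=int)
board[2, 2] = CELL_TYPES.index("head")
board[0, 4] = CELL_TYPES.index("apple")
print(encode_board(board).shape)  # (5, 5, 4)
```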

Numerical experiments 30/39

Page 31:

Software

▶ at initialization, broadcast the neural network architecture

– each worker spawns Caffe with the architecture
– populates a replay dataset with experiences via a random policy

▶ for some number of iterations:

– workers fetch the latest parameters for the Q-network from the server
– compute and send a gradient update
– parameters are updated on the server with RMSprop or AdaGrad, which requires O(p) memory and time (see the sketch after this list)

▶ lightweight use of Spark

– shipping required files and serialized code to worker machines
– partitioning and scheduling the number of updates to do on each worker
– coordinating identities of worker/server machines
– partial implementation of a generic interface between Caffe and Spark

▶ ran on a dual-core Intel i7 clocked at 2.2 GHz with 12 GB RAM
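For reference, a minimal AdaGrad-style server update in numpy (one of the two adaptive rules named above; the step size and dimensions are illustrative). The accumulated squared gradients are the O(p) extra state mentioned on the slide.

```python
import numpy as np

class AdaGradServer:
    """Parameter-server state: parameters theta plus per-coordinate gradient history."""

    def __init__(self, p, alpha=0.01, eps=1e-8):
        self.theta = np.zeros(p)
        self.g2_sum = np.zeros(p)   # running sum of squared gradients, O(p) memory
        self.alpha, self.eps = alpha, eps

    def apply_gradient(self, grad):
        # Per-coordinate adaptive step: larger history -> smaller effective step.
        self.g2_sum += grad ** 2
        self.theta -= self.alpha * grad / (np.sqrt(self.g2_sum) + self.eps)

server = AdaGradServer(p=10)
server.apply_gradient(np.random.randn(10))
print(server.theta)
```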

Numerical experiments 31/39

Page 32:

Complexity analysis

▶ model complexity

– determined by the architecture; roughly on the order of the number of parameters

▶ gradient calculation via backpropagation

– distributed across each worker’s CPU/GPU; linear in model size

▶ communication time and cost

– for each update, linear in model size

Numerical experiments 32/39

Page 33:

Compute/communicate times

▶ compute/communicate time scales linearly with model size

[Plot: experimental times vs. number of parameters (up to 8 × 10^6); series: comms (×1 ms), gradient (×100 ms), latency (×1 ms).]

– the process is compute-bound by gradient calculations
– the upper bound on the update rate is inversely proportional to model size
– with many workers in parallel, independent of batch size

Numerical experiments 33/39

Page 34:

Serial vs. distributed

▶ performance scales linearly with the number of workers

[Plot: average reward vs. wall clock time (min); legend: serial, double.]

Numerical experiments 34/39

Page 35:

Example game play

Figure: Dumb snake.

Figure: Smart snake.

Numerical experiments 35/39

Page 36:

Outline

Introduction

Reinforcement learning

Serial algorithm

Distributed algorithm

Numerical experiments

Conclusion

Conclusion 36/39

Page 37:

Summary

▶ deep Q-learning [M+13, M+15] scales well via DistBelief [DCM+12]

▶ asynchronous model updates accelerate training despite lower update quality (vs. serial)

Conclusion 37/39

Page 38:

Contact

questions, code, ideas, go-karting, swing dancing, . . .

[email protected]

Conclusion 38/39

Page 39:

References

▶ [DCM+12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

▶ [M+13] V. Mnih et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

▶ [M+15] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

39/39