Asynchronous Deep Q-Learning for Breakout
Edgard Bonilla (edgard), Jiaming Zeng (jiaming), Jennie Zheng (jenniezh)
CS229 Machine Learning, Stanford University

Introduction
Reinforcement learning allows machines to iteratively determine the best behavior in a specific environment based on feedback about their performance, making it highly adaptable and applicable to a wide range of domains. We apply deep Q-learning [1] to teach an artificial agent to play the Atari game Breakout using RAM states.

MDP Formulation
§ States: 1x128 RAM state of the game
§ Actions: Left, Right, Do nothing
§ Policy: linearly annealed ε-greedy action selection [2]
§ Rewards: the score of the game

Deep Q-Network (DQN)
A DQN is a neural network that, given an input state s, estimates Q(s, a) for all possible actions a. Our network feeds the 1x128 RAM state through k dense layers of n units each into an output layer with one unit per action (see the sketch in the appendix).
[Figure: network diagram. Inputs: 1x128 RAM states → Dense Layer 1 (n) → … → Dense Layer k (n) → Output: # actions.]

Methodology
We implemented asynchronous Q-learning with a target network and reward clipping [1,2]. After initializing the shared DQN and the number of threads, each thread repeatedly gets a frame, acts on the environment, stores the experience, updates the DQN using the parameters shared by the other threads, and shares its updated DQN with them; the episode restarts when the game reaches a terminal state (see the worker sketch in the appendix).
[Figure: flowchart. Initialize DQN network and # of threads → start episode → threads 1…t: get frame, act on environment, store experiences, update DQN from other threads, share DQN with other threads; restart if game terminal.]
Default training setting: RAM-state inputs to a DQN of 2 layers, each with 256 nodes, and an ε-greedy policy. Each game episode has 5 lives. Discounted-reward factor γ = 0.99.

Results
§ We varied settings to compare performance, including a change in training policy
§ Reward statistics over 100 testing episodes:

Setup                # Frames   Average   Max   Min   Median
1 life, ε-greedy     80M        10.57     39    0     8
5 lives, ε-greedy    80M        12.55     35    2     9
5 lives, Boltzmann   42M        22.15     51    8     22.5

§ Comparison to image inputs after 84 hours of training*:

Input   # Frames   Average   Max   Min   Median
RAM     55M        8.31      36    1     7
Image    8M        1.09       4    0     1

Results (cont.)
Network architecture experimentation:
§ Network settings: k = 2, 3, 4; n = 128, 256
§ Architectures compared after 42 hours of training*
§ Network complexity increases training time
§ Most architectures show similar performance for the first 40 million frames
[Figure: moving averages of max Q and rewards over training.]

Discussion
Testing over 100 episodes:

Changes   Observations
Policy    Boltzmann-Q improves more quickly than ε-greedy
Lives     No noticeable difference
Input     RAM trains much faster than images

Our implementation of Boltzmann-Q encourages exploration by placing an upper bound on the probability of choosing the greedy action (see the appendix sketch).

Future Work
§ Better accommodation of image inputs
§ Implement other variations on the Boltzmann-Q policy to further explore the exploration vs. exploitation trade-off

*Computing info: all times are based on wall time on Stanford Farmshare's Barley cluster. Tools used: OpenAI Gym, Keras, TensorFlow.

References
1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
2. Mnih, V., et al. (2016). Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.
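
Appendix: Implementation Sketches
The Python sketches below are illustrative reconstructions of the components described on the poster, not the authors' code. First, a minimal Keras version of the dense DQN; the poster fixes only the 1x128 input, the k dense layers of n units each, and the action-count output, so the ReLU activations, mean-squared-error loss, and Adam learning rate here are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn(input_dim=128, num_actions=3, k=2, n=256):
    """Dense DQN: 1x128 RAM state in, one Q(s, a) estimate per action out."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(k):
        # k hidden dense layers of n units each (poster tried k = 2, 3, 4
        # and n = 128, 256). ReLU is an assumption.
        model.add(layers.Dense(n, activation="relu"))
    # Linear output head: one estimated Q-value per action.
    model.add(layers.Dense(num_actions, activation="linear"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")  # loss and learning rate are assumptions
    return model

dqn = build_dqn()  # default poster setting: k=2 layers of n=256 units
```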
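
Action selection with the linearly annealed ε-greedy policy [2] might look like the following; the start and end values of ε and the annealing horizon are assumed for illustration, since the poster does not report the schedule.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, frame, eps_start=1.0, eps_end=0.1,
                   anneal_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_frames."""
    eps = max(eps_end,
              eps_start - (eps_start - eps_end) * frame / anneal_frames)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```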
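
The poster describes its Boltzmann-Q variant only as bounding the probability of the greedy action. The sketch below assumes one plausible mechanism: a softmax over Q-values whose greedy mass is clipped at p_max, with the excess redistributed uniformly over the other actions. The temperature and p_max values, and this particular capping rule, are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_q(q_values, temperature=1.0, p_max=0.8):
    """Sample an action from a softmax over Q, capping the greedy probability."""
    z = np.asarray(q_values, dtype=np.float64) / temperature
    p = np.exp(z - z.max())        # numerically stable softmax
    p /= p.sum()
    greedy = int(np.argmax(p))
    if p[greedy] > p_max and len(p) > 1:
        # Clip the greedy action at p_max and spread the excess mass
        # uniformly over the non-greedy actions (assumed mechanism).
        excess = p[greedy] - p_max
        p[greedy] = p_max
        others = [i for i in range(len(p)) if i != greedy]
        p[others] += excess / len(others)
    return int(rng.choice(len(p), p=p))
```

To drop this in for ε-greedy in the worker sketch below, wrap it so the unused frame argument is ignored, e.g. `policy = lambda q, frame: boltzmann_q(q)`.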
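
Finally, one worker thread of the asynchronous Q-learning loop (get frame, act, store experiences, update, share), with reward clipping and a periodically refreshed target network. This assumes the classic gym 4-tuple step API of the project's era and the `Breakout-ram-v0` environment id; the batch size, target-sync interval, and the omitted cross-thread gradient plumbing are simplifications.

```python
import numpy as np
import gym

def worker(shared_dqn, target_dqn, policy, gamma=0.99,
           update_every=32, target_sync=10_000, max_frames=1_000_000):
    """One asynchronous Q-learning thread acting on its own environment copy."""
    env = gym.make("Breakout-ram-v0")
    state, batch = env.reset(), []
    for frame in range(max_frames):
        q = shared_dqn.predict(state[None], verbose=0)[0]
        action = policy(q, frame)
        next_state, reward, done, _ = env.step(action)
        # Store the experience, with rewards clipped to [-1, 1].
        batch.append((state, action, np.clip(reward, -1.0, 1.0),
                      next_state, done))
        if len(batch) == update_every:
            states = np.array([b[0] for b in batch])
            targets = shared_dqn.predict(states, verbose=0)
            # Bootstrap from the slow-moving target network.
            next_q = target_dqn.predict(
                np.array([b[3] for b in batch]), verbose=0).max(axis=1)
            for i, (_, a, r, _, d) in enumerate(batch):
                targets[i, a] = r + (0.0 if d else gamma * next_q[i])
            shared_dqn.train_on_batch(states, targets)  # shared update
            batch = []
        if frame % target_sync == 0:
            # Refresh the target network from the shared online network.
            target_dqn.set_weights(shared_dqn.get_weights())
        state = env.reset() if done else next_state
```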