Asynchronous Deep Q-Learning for Breakout
Edgard Bonilla (edgard), Jiaming Zeng (jiaming), Jennie Zheng (jenniezh)
CS229 Machine Learning, Stanford University

Introduction
Reinforcement learning allows machines to iteratively determine the best behavior in a specific environment based on feedback about their performance, making it highly adaptable and applicable to a wide range of domains. We apply deep Q-learning [1] to teach an artificial agent to play the Atari game Breakout using RAM states.

MDP Formulation
§ States: 1x128 RAM state of the game
§ Actions: Left, Right, Do nothing
§ Policy: linearly annealed ε-greedy action selection [2]
§ Rewards: the score of the game

Deep Q-Network (DQN)
A DQN is a neural network that, given an input state s, estimates Q(s, a) for all possible actions a. Our network feeds the 1x128 RAM state through k dense layers of n units each into an output layer with one unit per action (see the sketch in the appendix).
[Figure: network diagram. Inputs: 1x128 RAM states → Dense Layer 1 (n) → … → Dense Layer k (n) → Output: # actions.]

Methodology
We implemented asynchronous Q-learning with a target network and reward clipping [1,2]. After initializing the shared DQN and the number of threads, each thread repeatedly gets a frame, acts on the environment, stores the experience, updates the DQN using the parameters shared by the other threads, and shares its updated DQN with them; the episode restarts when the game reaches a terminal state (see the worker sketch in the appendix).
[Figure: flowchart. Initialize DQN network and # of threads → start episode → threads 1…t: get frame, act on environment, store experiences, update DQN from other threads, share DQN with other threads; restart if game terminal.]
Default training setting: RAM-state inputs to a DQN of 2 layers, each with 256 nodes, and an ε-greedy policy. Each game episode has 5 lives. Discounted-reward factor γ = 0.99.

Results
§ We varied settings to compare performance, including a change in training policy
§ Reward statistics over 100 testing episodes:

Setup                # Frames   Average   Max   Min   Median
1 life, ε-greedy     80M        10.57     39    0     8
5 lives, ε-greedy    80M        12.55     35    2     9
5 lives, Boltzmann   42M        22.15     51    8     22.5

§ Comparison to image inputs after 84 hours of training*:

Input   # Frames   Average   Max   Min   Median
RAM     55M        8.31      36    1     7
Image    8M        1.09       4    0     1

Results (cont.)
Network architecture experimentation:
§ Network settings: k = 2, 3, 4; n = 128, 256
§ Architectures compared after 42 hours of training*
§ Network complexity increases training time
§ Most architectures show similar performance for the first 40 million frames
[Figure: moving averages of max Q and rewards over training.]

Discussion
Testing over 100 episodes:

Changes   Observations
Policy    Boltzmann-Q improves more quickly than ε-greedy
Lives     No noticeable difference
Input     RAM trains much faster than images

Our implementation of Boltzmann-Q encourages exploration by placing an upper bound on the probability of choosing the greedy action (see the appendix sketch).

Future Work
§ Better accommodation of image inputs
§ Implement other variations on the Boltzmann-Q policy to further explore the exploration vs. exploitation trade-off

*Computing info: all times are based on wall time on Stanford Farmshare's Barley cluster. Tools used: OpenAI Gym, Keras, TensorFlow.

References
1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
2. Mnih, V., et al. (2016). Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.
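
Appendix: Implementation Sketches
The Python sketches below are illustrative reconstructions of the components described on the poster, not the authors' code. First, a minimal Keras version of the dense DQN; the poster fixes only the 1x128 input, the k dense layers of n units each, and the action-count output, so the ReLU activations, mean-squared-error loss, and Adam learning rate here are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn(input_dim=128, num_actions=3, k=2, n=256):
    """Dense DQN: 1x128 RAM state in, one Q(s, a) estimate per action out."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(k):
        # k hidden dense layers of n units each (poster tried k = 2, 3, 4
        # and n = 128, 256). ReLU is an assumption.
        model.add(layers.Dense(n, activation="relu"))
    # Linear output head: one estimated Q-value per action.
    model.add(layers.Dense(num_actions, activation="linear"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")  # loss and learning rate are assumptions
    return model

dqn = build_dqn()  # default poster setting: k=2 layers of n=256 units
```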
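
Action selection with the linearly annealed ε-greedy policy [2] might look like the following; the start and end values of ε and the annealing horizon are assumed for illustration, since the poster does not report the schedule.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, frame, eps_start=1.0, eps_end=0.1,
                   anneal_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_frames."""
    eps = max(eps_end,
              eps_start - (eps_start - eps_end) * frame / anneal_frames)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```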
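
The poster describes its Boltzmann-Q variant only as bounding the probability of the greedy action. The sketch below assumes one plausible mechanism: a softmax over Q-values whose greedy mass is clipped at p_max, with the excess redistributed uniformly over the other actions. The temperature and p_max values, and this particular capping rule, are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_q(q_values, temperature=1.0, p_max=0.8):
    """Sample an action from a softmax over Q, capping the greedy probability."""
    z = np.asarray(q_values, dtype=np.float64) / temperature
    p = np.exp(z - z.max())        # numerically stable softmax
    p /= p.sum()
    greedy = int(np.argmax(p))
    if p[greedy] > p_max and len(p) > 1:
        # Clip the greedy action at p_max and spread the excess mass
        # uniformly over the non-greedy actions (assumed mechanism).
        excess = p[greedy] - p_max
        p[greedy] = p_max
        others = [i for i in range(len(p)) if i != greedy]
        p[others] += excess / len(others)
    return int(rng.choice(len(p), p=p))
```

To drop this in for ε-greedy in the worker sketch below, wrap it so the unused frame argument is ignored, e.g. `policy = lambda q, frame: boltzmann_q(q)`.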
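
Finally, one worker thread of the asynchronous Q-learning loop (get frame, act, store experiences, update, share), with reward clipping and a periodically refreshed target network. This assumes the classic gym 4-tuple step API of the project's era and the `Breakout-ram-v0` environment id; the batch size, target-sync interval, and the omitted cross-thread gradient plumbing are simplifications.

```python
import numpy as np
import gym

def worker(shared_dqn, target_dqn, policy, gamma=0.99,
           update_every=32, target_sync=10_000, max_frames=1_000_000):
    """One asynchronous Q-learning thread acting on its own environment copy."""
    env = gym.make("Breakout-ram-v0")
    state, batch = env.reset(), []
    for frame in range(max_frames):
        q = shared_dqn.predict(state[None], verbose=0)[0]
        action = policy(q, frame)
        next_state, reward, done, _ = env.step(action)
        # Store the experience, with rewards clipped to [-1, 1].
        batch.append((state, action, np.clip(reward, -1.0, 1.0),
                      next_state, done))
        if len(batch) == update_every:
            states = np.array([b[0] for b in batch])
            targets = shared_dqn.predict(states, verbose=0)
            # Bootstrap from the slow-moving target network.
            next_q = target_dqn.predict(
                np.array([b[3] for b in batch]), verbose=0).max(axis=1)
            for i, (_, a, r, _, d) in enumerate(batch):
                targets[i, a] = r + (0.0 if d else gamma * next_q[i])
            shared_dqn.train_on_batch(states, targets)  # shared update
            batch = []
        if frame % target_sync == 0:
            # Refresh the target network from the shared online network.
            target_dqn.set_weights(shared_dqn.get_weights())
        state = env.reset() if done else next_state
```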