Page 1

Deep Reinforcement Learning with Shallow Trees

Matineh Shaker, AI Scientist (Bonsai)

MLConf San Francisco

10 November 2017

Page 2

Outline

● Introduction to RL (Reinforcement Learning)

● Markov decision processes

● Value-based methods

● Concept-Network Reinforcement Learning (CNRL)

● Use cases

Page 3

A Reinforcement Learning Example


Rocket Trajectory Optimization: OpenAI Gym’s LunarLander Simulator

Page 4

A Reinforcement Learning Example


State:

x_position, y_position, x_velocity, y_velocity, angle, angular_velocity, left_leg, right_leg

Action (Discrete):

do nothing (0), fire left engine (1), fire main engine (2), fire right engine (3)

Action (Continuous):

main engine power, left/right engine power

Reward: Moving from the top of the screen to the landing pad with zero speed is worth about 100-140 points. The episode finishes if the lander crashes or comes to rest, with an additional -100 or +100 points. Each leg-ground contact is +10 points. Firing the main engine costs -0.3 points per frame.
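For concreteness, here is a minimal interaction loop with this simulator; this is a sketch assuming the classic OpenAI Gym step/reset API and the LunarLander-v2 environment, not code from the talk:

```python
# Minimal LunarLander interaction loop (classic gym API, random actions).
import gym

env = gym.make("LunarLander-v2")        # discrete actions 0..3, as listed above
state = env.reset()

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy, just to show the loop
    state, reward, done, info = env.step(action)
    total_reward += reward              # cumulative episode reward

print("episode return:", total_reward)
```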

Page 5

Basic RL Concepts


Reward Hypothesis: Goals can be described by maximizing the expected cumulative reward.

Sequential Decision Making: Actions may have long-term consequences. Rewards may be delayed, like a financial investment. Sometimes the agent sacrifices instant rewards to maximize long-term reward (just like life!)

State Data: Sequential and non-i.i.d. The agent’s actions affect the next data samples.

Page 6

Definitions

Policy: Dictates the agent’s behavior, mapping from state to action.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)

Value function: Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s ]
Action-value function: Q_π(s, a)

Model: Predicts what the environment will do next (the simulator’s job, for instance).
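As a toy illustration of these definitions (not from the slides; states, actions, and numbers are made up), a deterministic policy, a stochastic policy, and a tabular state-value function might look like:

```python
# Toy policy and value-function representations for a two-state problem.
import random

states  = ["s0", "s1"]
actions = ["left", "right"]

def deterministic_policy(s):             # a = pi(s)
    return "left" if s == "s0" else "right"

def stochastic_policy(s):                # pi(a|s) = P(A_t = a | S_t = s)
    probs = {"s0": [0.8, 0.2], "s1": [0.3, 0.7]}[s]
    return random.choices(actions, weights=probs)[0]

V = {"s0": 1.5, "s1": -0.2}              # V_pi(s): expected discounted return from s
```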

Page 7

Agent and Environment

At each time step, the agent: receives an observation, receives a reward, and takes an action.

The environment: receives the action, sends the next observation, and sends the next reward.

Page 8

Markov Decision Processes (MDP)


A mathematical framework for sequential decision making. An environment in which all states are Markovian: P(S_{t+1} | S_t) = P(S_{t+1} | S_1, …, S_t).

A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩: a set of states S, a set of actions A, a state-transition probability function P, a reward function R, and a discount factor γ.

Pictures from David Silver’s Slides
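A small, made-up finite MDP written out explicitly can make the tuple concrete (illustrative only; the states, actions, probabilities, and rewards are arbitrary):

```python
# Toy finite MDP <S, A, P, R, gamma>.
S = ["s0", "s1"]
A = ["a0", "a1"]
gamma = 0.9

# P[s][a] -> list of (next_state, probability); rows sum to 1 (Markov property:
# the next state depends only on the current state and action).
P = {
    "s0": {"a0": [("s0", 0.7), ("s1", 0.3)], "a1": [("s1", 1.0)]},
    "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]},
}
# R[s][a] -> expected immediate reward
R = {
    "s0": {"a0": 0.0, "a1": 1.0},
    "s1": {"a0": 0.5, "a1": 0.0},
}
```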

Page 9

Exploration vs. Exploitation

Exploration vs. Exploitation Dilemma

● Reinforcement learning (especially model-free) is like trial-and-error learning.

● The agent should find a good policy that maximizes future rewards from its experiences of the environment, in a potentially very large state space.

● Exploration finds more information about the environment, while exploitation exploits known information to maximize reward (ε-greedy, sketched below, is one common way to balance the two).
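A common recipe for this trade-off is ε-greedy action selection over estimated Q-values; the sketch below is illustrative and not prescribed by the slides:

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```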

Page 10

Value Based Methods: Q-Learning

What are the problems?

● The iterative update is not scalable enough: computing Q(s, a) for every state-action pair is not feasible most of the time.

Solution:

● Use a function approximator to estimate Q(s, a), such as a (differentiable) neural network!


Use the Bellman equation as an iterative update to find the optimal policy: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]
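A tabular sketch of this iterative update (illustrative; the learning rate and discount factor are arbitrary):

```python
# One tabular Q-learning step:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```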

Page 11

Value Based Methods: Q-Learning

Use a function approximator to estimate the action-value function:

Q(s, a; θ) ≈ Q*(s, a)

θ is the function parameter (the weights of the neural network).

Function approximator can be a deep neural network: DQN


Loss Function: L(θ) = E[ ( r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) )² ]
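A sketch of that loss with a small PyTorch Q-network follows; the layer sizes are made up, and the separate target network used in the full DQN algorithm is omitted for brevity:

```python
# TD loss for Q(s, a; theta): (r + gamma * max_a' Q(s', a'; theta) - Q(s, a; theta))^2
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # 8 state dims, 4 actions

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                      # target is not differentiated
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)
```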

Page 12

Value Based Methods: DQN

Learning from batches of consecutive samples is problematic and costly:

- Sample correlation: consecutive samples are correlated, which in turn makes learning inefficient.

- Bad feedback loops: the current Q-network parameters dictate the next training samples and can lead to bad feedback loops (e.g., if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side).

To solve them, use Experience Replay

- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}).

- Train the Q-network on random mini-batches of transitions sampled from the replay memory (a minimal buffer sketch follows below).
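A minimal replay memory along these lines might look like the following (an illustrative sketch, not the implementation behind the slides):

```python
# Store transitions, then sample decorrelated random mini-batches for training.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are dropped

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```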

Page 13

Concept Network Reinforcement Learning

● Solving complex tasks by decomposing them into high-level actions, or “concepts”.

● A “multi-level hierarchical RL” approach, inspired by Sutton’s Options:
○ enables efficient exploration through abstractions over low-level actions,
○ improves sample efficiency significantly,
○ especially in “sparse reward” settings.

● Allows existing solutions to sub-problems to be composed into an overall solution without requiring re-training.

Page 14

Temporal Abstractions

● At each time t, for each state s_t, a higher-level “selector” chooses a concept c_t among all possible concepts available to the selector.

● Each concept remains active for some time, until a predefined terminal state is reached.

● An internal critic evaluates how close the agent is to satisfying a terminal condition of c_t, and sends a reward r_c(t) to the selector.

● Similar to baseline RL, except that an extra layer of abstraction is defined on the set of “primitive” actions, forming a concept, so that the execution of each concept corresponds to a certain action (see the sketch below).
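The control flow can be sketched roughly as below; the selector, concept, and critic objects are hypothetical placeholders meant only to illustrate the selector/concept split, not the paper’s implementation:

```python
# A high-level selector picks a concept; the concept's low-level policy runs
# until its terminal condition holds; an internal critic scores the outcome.
def run_with_concepts(env, select_concept, concepts, critic):
    state, done = env.reset(), False
    while not done:
        c = select_concept(state)                 # selector chooses concept c_t
        policy, is_terminal = concepts[c]         # concept = (low-level policy, terminal test)
        while not (done or is_terminal(state)):
            state, reward, done, _ = env.step(policy(state))
        r_c = critic(state, c)                    # reward r_c(t) sent back to the selector
        # ... update the selector from (state, c, r_c) here
```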

Page 15

LunarLander with Concepts

Page 16

LunarLander with Concepts

Page 17

Robotics Pick and Place with Concepts


Concepts: Lift, Orient, Stack

Page 18

Robotics Pick and Place with Concepts

Page 19

Robotics Pick and Place with Concepts


Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks: https://arxiv.org/abs/1709.06977

Page 20

Thank you!

Page 21

Backup Slides for Q/A:

Page 22

Definitions
State: The agent’s internal representation of the environment; the information the agent uses to pick the next action.

Policy: Dictates the agent’s behavior, mapping from state to action.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)

Value function: Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s ]
Action-value function: Q_π(s, a)

Model: Predicts what the environment will do next (the simulator’s job, for instance).

Page 23

RL’s Main Loop

Page 24

Value Based Methods: DQN with Experience Replay (2)

Page 25

Learning vs Planning


Learning (Model-Free Reinforcement Learning):
The environment is initially unknown.
The agent interacts with the environment without knowing its dynamics.
The agent improves its policy based on previous interactions.

Planning (Model-Based Reinforcement Learning):
A model of the environment is known or acquired.
The agent performs computations with the model, without any external interaction.
The agent improves its policy based on those computations with the model.

Page 26

LunarLander with Concept Network

Page 27

Introduction to RL: Challenges


Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind

Page 28

Policy-Based Methods

● The Q-function can be complex and unnecessary; all we want is the best action!

● Example: In a very high-dimensional state space, it is wasteful and costly to learn the exact value of every (state, action) pair.


● Define parameterized policies: π_θ(a|s)

● For each policy, define its value: J(θ) = E[ Σ_t γ^t r_t | π_θ ]

● Gradient ascent on the policy parameters to find the optimal policy (sketched below)!
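A REINFORCE-style sketch of that gradient ascent in PyTorch (layer sizes and names are made up; shown only to make the idea concrete):

```python
# Ascend J(theta) ~= E[ log pi_theta(a|s) * return ] by minimizing its negative.
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```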