Page 1

Deep Reinforcement Learning with Shallow Trees

Matineh Shaker, AI Scientist (Bonsai)

MLConf San Francisco

10 November 2017

Page 2

Outline

● Introduction to RL (Reinforcement Learning)

● Markov decision processes

● Value-based methods

● Concept-Network Reinforcement Learning (CNRL)

● Use cases

Page 3

A Reinforcement Learning Example


Rocket Trajectory Optimization: OpenAI Gym’s LunarLander Simulator

Page 4

A Reinforcement Learning Example


State:

x_position, y_position, x_velocity, y_velocity, angle, angular_velocity, left_leg, right_leg

Action (Discrete):

do nothing (0), fire left engine (1), fire main engine (2), fire right engine (3)

Action (Continuous):

main engine power, left/right engine power

Reward: Moving from the top of the screen to the landing pad with zero speed is worth about 100-140 points. The episode finishes if the lander crashes or comes to rest, with an additional -100 or +100 points. Each leg-ground contact is +10 points. Firing the main engine costs -0.3 points per frame.
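For concreteness, here is a minimal interaction loop with this simulator; this is a sketch assuming the classic OpenAI Gym step/reset API and the LunarLander-v2 environment, not code from the talk:

```python
# Minimal LunarLander interaction loop (classic gym API, random actions).
import gym

env = gym.make("LunarLander-v2")        # discrete actions 0..3, as listed above
state = env.reset()

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy, just to show the loop
    state, reward, done, info = env.step(action)
    total_reward += reward              # cumulative episode reward

print("episode return:", total_reward)
```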

Page 5

Basic RL Concepts


Reward Hypothesis: Goals can be described by maximizing the expected cumulative reward.

Sequential Decision Making: Actions may have long-term consequences. Rewards may be delayed, like a financial investment. Sometimes the agent sacrifices instant rewards to maximize long-term reward (just like life!)

State Data: Sequential and non-i.i.d. The agent’s actions affect the next data samples.

Page 6

Definitions

Policy: Dictates the agent’s behavior, mapping from state to action.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)

Value function: Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s ]
Action-value function: Q_π(s, a)

Model: Predicts what the environment will do next (the simulator’s job, for instance).
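As a toy illustration of these definitions (not from the slides; states, actions, and numbers are made up), a deterministic policy, a stochastic policy, and a tabular state-value function might look like:

```python
# Toy policy and value-function representations for a two-state problem.
import random

states  = ["s0", "s1"]
actions = ["left", "right"]

def deterministic_policy(s):             # a = pi(s)
    return "left" if s == "s0" else "right"

def stochastic_policy(s):                # pi(a|s) = P(A_t = a | S_t = s)
    probs = {"s0": [0.8, 0.2], "s1": [0.3, 0.7]}[s]
    return random.choices(actions, weights=probs)[0]

V = {"s0": 1.5, "s1": -0.2}              # V_pi(s): expected discounted return from s
```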

Page 7

Agent and Environment

At each time step, the agent: receives an observation, receives a reward, and takes an action.

The environment: receives the action, sends the next observation, and sends the next reward.

Page 8

Markov Decision Processes (MDP)


A mathematical framework for sequential decision making. An environment in which all states are Markovian: P(S_{t+1} | S_t) = P(S_{t+1} | S_1, …, S_t).

A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩: a set of states S, a set of actions A, a state-transition probability function P, a reward function R, and a discount factor γ.

Pictures from David Silver’s Slides
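A small, made-up finite MDP written out explicitly can make the tuple concrete (illustrative only; the states, actions, probabilities, and rewards are arbitrary):

```python
# Toy finite MDP <S, A, P, R, gamma>.
S = ["s0", "s1"]
A = ["a0", "a1"]
gamma = 0.9

# P[s][a] -> list of (next_state, probability); rows sum to 1 (Markov property:
# the next state depends only on the current state and action).
P = {
    "s0": {"a0": [("s0", 0.7), ("s1", 0.3)], "a1": [("s1", 1.0)]},
    "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]},
}
# R[s][a] -> expected immediate reward
R = {
    "s0": {"a0": 0.0, "a1": 1.0},
    "s1": {"a0": 0.5, "a1": 0.0},
}
```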

Page 9

Exploration vs. Exploitation

Exploration vs. Exploitation Dilemma

● Reinforcement learning (especially model-free) is like trial-and-error learning.

● The agent should find a good policy that maximizes future rewards from its experiences of the environment, in a potentially very large state space.

● Exploration finds more information about the environment, while exploitation exploits known information to maximize reward (ε-greedy, sketched below, is one common way to balance the two).
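A common recipe for this trade-off is ε-greedy action selection over estimated Q-values; the sketch below is illustrative and not prescribed by the slides:

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```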

Page 10

Value Based Methods: Q-Learning

What are the problems?

● The iterative update is not scalable enough: computing Q(s, a) for every state-action pair is not feasible most of the time.

Solution:

● Use a function approximator to estimate Q(s, a), such as a (differentiable) neural network!


Use the Bellman equation as an iterative update to find the optimal policy: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]
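A tabular sketch of this iterative update (illustrative; the learning rate and discount factor are arbitrary):

```python
# One tabular Q-learning step:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```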

Page 11

Value Based Methods: Q-Learning

Use a function approximator to estimate the action-value function:

Q(s, a; θ) ≈ Q*(s, a)

θ is the function parameter (the weights of the neural network).

Function approximator can be a deep neural network: DQN


Loss Function: L(θ) = E[ ( r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) )² ]
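A sketch of that loss with a small PyTorch Q-network follows; the layer sizes are made up, and the separate target network used in the full DQN algorithm is omitted for brevity:

```python
# TD loss for Q(s, a; theta): (r + gamma * max_a' Q(s', a'; theta) - Q(s, a; theta))^2
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # 8 state dims, 4 actions

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                      # target is not differentiated
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)
```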

Page 12

Value Based Methods: DQN

Learning from batches of consecutive samples is problematic and costly:

- Sample correlation: consecutive samples are correlated, which in turn makes learning inefficient.

- Bad feedback loops: the current Q-network parameters dictate the next training samples and can lead to bad feedback loops (e.g., if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side).

To solve them, use Experience Replay

- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}).

- Train the Q-network on random mini-batches of transitions sampled from the replay memory (a minimal buffer sketch follows below).
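A minimal replay memory along these lines might look like the following (an illustrative sketch, not the implementation behind the slides):

```python
# Store transitions, then sample decorrelated random mini-batches for training.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are dropped

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```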

Page 13

Concept Network Reinforcement Learning

● Solving complex tasks by decomposing them into high-level actions, or “concepts”.

● A “multi-level hierarchical RL” approach, inspired by Sutton’s Options:
○ enables efficient exploration through abstractions over low-level actions,
○ improves sample efficiency significantly,
○ especially in “sparse reward” settings.

● Allows existing solutions to sub-problems to be composed into an overall solution without requiring re-training.

Page 14

Temporal Abstractions

● At each time t, for each state s_t, a higher-level “selector” chooses a concept c_t among all possible concepts available to the selector.

● Each concept remains active for some time, until a predefined terminal state is reached.

● An internal critic evaluates how close the agent is to satisfying a terminal condition of c_t, and sends a reward r_c(t) to the selector.

● Similar to baseline RL, except that an extra layer of abstraction is defined on the set of “primitive” actions, forming a concept, so that the execution of each concept corresponds to a certain action (see the sketch below).
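The control flow can be sketched roughly as below; the selector, concept, and critic objects are hypothetical placeholders meant only to illustrate the selector/concept split, not the paper’s implementation:

```python
# A high-level selector picks a concept; the concept's low-level policy runs
# until its terminal condition holds; an internal critic scores the outcome.
def run_with_concepts(env, select_concept, concepts, critic):
    state, done = env.reset(), False
    while not done:
        c = select_concept(state)                 # selector chooses concept c_t
        policy, is_terminal = concepts[c]         # concept = (low-level policy, terminal test)
        while not (done or is_terminal(state)):
            state, reward, done, _ = env.step(policy(state))
        r_c = critic(state, c)                    # reward r_c(t) sent back to the selector
        # ... update the selector from (state, c, r_c) here
```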

Page 15

LunarLander with Concepts

Page 16

LunarLander with Concepts

Page 17

Robotics Pick and Place with Concepts


Concepts: Lift, Orient, Stack

Page 18

Robotics Pick and Place with Concepts

Page 19

Robotics Pick and Place with Concepts


Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks: https://arxiv.org/abs/1709.06977

Page 20

Thank you!

Page 21

Backup Slides for Q/A:

Page 22

Definitions
State: The agent’s internal representation of the environment; the information the agent uses to pick the next action.

Policy: Dictates the agent’s behavior, mapping from state to action.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)

Value function: Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s ]
Action-value function: Q_π(s, a)

Model: Predicts what the environment will do next (the simulator’s job, for instance).

Page 23

RL’s Main Loop

Page 24

Value Based Methods: DQN with Experience Replay (2)

Page 25

Learning vs Planning


Learning (Model-Free Reinforcement Learning):
The environment is initially unknown.
The agent interacts with the environment without knowing its dynamics.
The agent improves its policy based on previous interactions.

Planning (Model-Based Reinforcement Learning):
A model of the environment is known or acquired.
The agent performs computations with the model, without any external interaction.
The agent improves its policy based on those computations with the model.

Page 26

LunarLander with Concept Network

Page 27

Introduction to RL: Challenges


Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind

Page 28

Policy-Based Methods

● The Q-function can be complex and unnecessary; all we want is the best action!

● Example: In a very high-dimensional state space, it is wasteful and costly to learn the exact value of every (state, action) pair.


● Define parameterized policies: π_θ(a|s)

● For each policy, define its value: J(θ) = E[ Σ_t γ^t r_t | π_θ ]

● Gradient ascent on the policy parameters to find the optimal policy (sketched below)!
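A REINFORCE-style sketch of that gradient ascent in PyTorch (layer sizes and names are made up; shown only to make the idea concrete):

```python
# Ascend J(theta) ~= E[ log pi_theta(a|s) * return ] by minimizing its negative.
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```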