Hierarchical Reinforcement Learning (Part II)

Mayank Mittal

May 02, 2022

Transcript
Page 1: Hierarchical Reinforcement Learning (Part II)

Hierarchical Reinforcement Learning (Part II)

Mayank Mittal

Page 2: Hierarchical Reinforcement Learning (Part II)

What are humans good at?

Page 3: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

Page 4: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

1. Exit ETZ building
2. Cross the street
3. Eat at mensa

Page 5: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

1. Exit ETZ building
➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …

2. Cross the street
➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …

3. Eat at mensa
➔ Open door
➔ Wait in a queue
➔ Take food
➔ …

Page 6: Hierarchical Reinforcement Learning (Part II)

What are humans good at?

Temporal abstraction

Page 7: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

1. Exit ETZ building
➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …

2. Cross the street
➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …

3. Eat at mensa
➔ Open door
➔ Wait in a queue
➔ Take food
➔ …

Page 8: Hierarchical Reinforcement Learning (Part II)

What are humans good at?

Transfer/Reusability of Skills

Temporal abstraction

Page 9: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

1. Exit ETZ building
➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …

2. Cross the street
➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …

3. Eat at mensa
➔ Open door
➔ Wait in a queue
➔ Take food
➔ …

How to represent these different goals?

Page 10: Hierarchical Reinforcement Learning (Part II)

What are humans good at?

Powerful/meaningful state abstraction

Transfer/Reusability of Skills

Temporal abstraction

Page 11: Hierarchical Reinforcement Learning (Part II)

What are humans good at?

Can a learning-based agent do the same?

Powerful/meaningful state abstraction

Transfer/Reusability of Skills

Temporal abstraction

Page 12: Hierarchical Reinforcement Learning (Part II)

Promise of Hierarchical RL

Structured exploration

Transfer learning

Long-term credit assignment (and memory)

Page 13: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

Environment

Agent

Manager

Worker(s)
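For intuition, the diagram can be read as the following control loop, sketched here in Python; every name (the manager/worker objects, select_goal, select_action, horizon_c) is an illustrative placeholder rather than code from any of the papers that follow.

```python
# Minimal sketch of the generic Manager/Worker loop shown in the diagram.
# Assumptions: the manager picks a goal every `horizon_c` steps, the worker
# conditions its action on the latest goal, and `env` follows the classic
# gym-style step() API. All names are illustrative, not from the papers.

def hierarchical_rollout(env, manager, worker, horizon_c, max_steps=1000):
    obs = env.reset()
    goal = None
    total_reward = 0.0
    for t in range(max_steps):
        if t % horizon_c == 0:                    # manager acts on a coarser time scale
            goal = manager.select_goal(obs)
        action = worker.select_action(obs, goal)  # worker acts at every step
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```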

Page 14: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 15: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 16: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Page 17: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Level of Abstraction

Temporal Resolution

Dayan, Peter and Geoffrey E. Hinton. “Feudal Reinforcement Learning.” NIPS (1992).

Page 18: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Page 19: Hierarchical Reinforcement Learning (Part II)

Detour: Dilated RNN

▪ Able to preserve memories over longer periods

For more details: Chang, S. et al. (2017). Dilated Recurrent Neural Networks, NIPS.
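To make the dilation idea concrete (the FuN manager uses a dilated LSTM that keeps several recurrent states and advances only one of them per time step, so each state effectively spans a longer horizon), here is a rough PyTorch sketch; the class and its details are illustrative, not the paper's implementation.

```python
import torch.nn as nn

class DilatedRNNCell(nn.Module):
    """Round-robin over `dilation` recurrent states: slot t % dilation is the
    only one updated at step t, so memories persist over longer periods."""

    def __init__(self, input_size, hidden_size, dilation=10):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.dilation = dilation
        self.states = None  # one (h, c) pair per slot, created lazily

    def forward(self, x, t):
        if self.states is None:
            zeros = x.new_zeros(x.size(0), self.cell.hidden_size)
            self.states = [(zeros.clone(), zeros.clone()) for _ in range(self.dilation)]
        i = t % self.dilation                      # only one slot advances per step
        h, c = self.cell(x, self.states[i])
        self.states[i] = (h, c)
        return h
```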

Page 20: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Worker

Agent

Page 21: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Page 22: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Page 23: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Page 24: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Worker

Manager

Page 25: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Absolute Goal

(-3, 1)

(3, 9)

c : Manager’s Horizon

Page 26: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Directional Goal

Page 27: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Directional Goal

Idea: A single sub-goal (direction) can be reused in many different locations in state space

Page 28: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

▪ Intrinsic reward

Page 29: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

▪ Intrinsic reward
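The expression itself is an equation image in the slides; as defined in the FuN paper (reproduced here as a best-effort transcription), the worker's intrinsic reward is the average cosine similarity between the realized state change and the manager's recent goals:

```latex
r^{I}_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\bigl(s_t - s_{t-i},\, g_{t-i}\bigr),
\qquad
d_{\cos}(\alpha,\beta) = \frac{\alpha^{\top}\beta}{\lVert\alpha\rVert\,\lVert\beta\rVert}
```

The worker then maximizes the combined return R_t + α R^I_t, where α weights the intrinsic reward against the environment reward.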

Page 30: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Worker

Manager

Page 31: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Page 32: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

▪ Action: Stochastic Policy!
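The equation on this slide is also an image; in the paper the worker's stochastic policy is a softmax over primitive actions, combining an action-embedding matrix with a linear embedding of the pooled recent goals (again a best-effort transcription):

```latex
w_t = \phi\Bigl(\sum_{i=t-c}^{t} g_i\Bigr),
\qquad
\pi_t = \mathrm{SoftMax}\bigl(U_t\, w_t\bigr)
```

Here U_t is the worker's action-embedding matrix (one row per primitive action) and φ is a learned linear projection with no bias.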

Page 33: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Worker

Agent

Page 34: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Worker

Agent

Why not do end-to-end learning?

Page 35: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Worker

Agent

Manager & Worker: Separate Actor-Critic

No gradient propagated between Worker and Manager

Transition Policy Gradient (Manager)

Policy Gradient (Worker)

Page 36: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Qualitative Analysis

Page 37: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Ablative Analysis

Page 38: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Ablative Analysis

Page 39: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Comparison

Page 40: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Action Repeat Transfer

Page 41: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

On-Policy Learning: experiences are used for learning once and then discarded. Wastage!

Page 42: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning: experiences are stored in a replay buffer and reused for learning. Reuse!

Page 43: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Unstable Learning

Page 44: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Unstable Learning ➔ To-Be-Disclosed

Page 45: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 46: Hierarchical Reinforcement Learning (Part II)

Manager

Worker

Data-Efficient HRL (HIRO)

Page 47: Hierarchical Reinforcement Learning (Part II)

Input Goal Action

Raw Observation Space

Data-Efficient HRL (HIRO)

Page 48: Hierarchical Reinforcement Learning (Part II)

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 49: Hierarchical Reinforcement Learning (Part II)

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 50: Hierarchical Reinforcement Learning (Part II)

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 51: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

c : Manager’s Horizon

Page 52: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Page 53: Hierarchical Reinforcement Learning (Part II)

▪ Intrinsic reward

Data-Efficient HRL (HIRO)
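The slide's formula is an image; in the HIRO paper the worker's intrinsic reward penalizes the distance between the state actually reached and the state targeted by the absolute goal, and the goal is carried between manager decisions by a fixed transition function (transcribed from the paper):

```latex
r(s_t, g_t, a_t, s_{t+1}) = -\,\bigl\lVert s_t + g_t - s_{t+1} \bigr\rVert_2,
\qquad
h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}
```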

Page 54: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 55: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 56: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 57: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 58: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Unstable Learning ➔ To-Be-Disclosed

Page 59: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Unstable Learning ➔ Manager's past experience might become useless

Page 60: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

t = 12 yrs

Goal: “wear a shirt”

Page 61: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Same goal induces different behavior

t = 22 yrs

Goal: “wear a shirt”

Page 62: Hierarchical Reinforcement Learning (Part II)

Can we do better?

Off-Policy Learning

Goal relabelling required!

t = 22 yrs

Goal: “wear a dress”

Goal: “wear a shirt”

Page 63: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO): Off-Policy Correction for Manager


Page 64: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO): Off-Policy Correction for Manager

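The correction on these two slides relabels the goal stored with a manager transition by a goal that would most likely have caused the worker's observed actions; per the HIRO paper, the relabelled goal maximizes (up to a constant) the worker's action log-likelihood:

```latex
\log \mu^{\mathrm{lo}}\bigl(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde g_{t:t+c-1}\bigr)
\;\propto\;
-\tfrac{1}{2}\sum_{i=t}^{t+c-1} \bigl\lVert a_i - \mu^{\mathrm{lo}}(s_i, \tilde g_i) \bigr\rVert_2^{2} + \mathrm{const}
```

where μ^lo denotes the worker policy and the candidate goal g̃_i is propagated through the goal transition h between steps. The candidate goals are listed in the appendix (Pages 113 and 114).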

Page 65: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 66: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Ant Push

Page 67: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Qualitative Analysis

Page 68: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Ablative Analysis

[Plots: Performance vs. Experience Samples (in millions)]

Page 69: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Comparison

Page 70: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO)

Comparison

[Plot: Performance vs. Experience Samples (in millions)]

Page 71: Hierarchical Reinforcement Learning (Part II)

Can we do better?

What is missing?

Structured exploration

Page 72: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 73: Hierarchical Reinforcement Learning (Part II)

Meta-Learning Shared Hierarchies (MLSH)

Page 74: Hierarchical Reinforcement Learning (Part II)

Master policy action: taken after every N steps

Meta-Learning Shared Hierarchies (MLSH)

Page 75: Hierarchical Reinforcement Learning (Part II)

Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task

Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 76: Hierarchical Reinforcement Learning (Part II)

Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task

How to generalize this to behavior learning?

Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 77: Hierarchical Reinforcement Learning (Part II)

Environment A

Environment B

…

Meta-RL Algorithm

“Fast” RL Agent

Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 78: Hierarchical Reinforcement Learning (Part II)

Environment A

Environment B

…

Meta-RL Algorithm

“Fast” RL Agent

Environment F, …

Testing environments

Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 79: Hierarchical Reinforcement Learning (Part II)

GOAL: Find sub-policies that enable fast learning of master policy

Meta-Learning Shared Hierarchies (MLSH)

Page 80: Hierarchical Reinforcement Learning (Part II)

GOAL: Find sub-policies that enable fast learning of master policy

Meta-Learning Shared Hierarchies (MLSH)
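As the next slides illustrate, training alternates between adapting a freshly initialized master policy to a sampled task and then updating the shared sub-policies jointly with it. A hedged pseudocode sketch of that loop (helper names such as sample_task, collect_rollouts and ppo_update are placeholders, not the authors' code):

```python
# Pseudocode sketch of MLSH training, written from the paper's description.
# The master picks one of the shared sub-policies every N environment steps.

def train_mlsh(sub_policies, num_tasks, warmup_iters, joint_iters, N):
    for _ in range(num_tasks):
        task = sample_task()              # draw a task from the task distribution
        master = init_master_policy()     # master is re-initialized for each task

        # Warm-up: update only the master, so sub-policies are scored by how
        # quickly a fresh master learns to sequence them on a new task.
        for _ in range(warmup_iters):
            rollouts = collect_rollouts(task, master, sub_policies, select_every=N)
            ppo_update(master, rollouts)

        # Joint phase: update the master and the shared sub-policies together.
        for _ in range(joint_iters):
            rollouts = collect_rollouts(task, master, sub_policies, select_every=N)
            ppo_update(master, rollouts)
            ppo_update(sub_policies, rollouts)

    return sub_policies
```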

Page 81: Hierarchical Reinforcement Learning (Part II)

Meta-Learning Shared Hierarchies (MLSH)

Page 82: Hierarchical Reinforcement Learning (Part II)

Meta-Learning Shared Hierarchies (MLSH)

Page 83: Hierarchical Reinforcement Learning (Part II)

Meta-Learning Shared Hierarchies (MLSH)

Page 84: Hierarchical Reinforcement Learning (Part II)

Ant Two-walks

Meta-Learning Shared Hierarchies (MLSH)

Page 85: Hierarchical Reinforcement Learning (Part II)

Ant Obstacle Course

Meta-Learning Shared Hierarchies (MLSH)

Page 86: Hierarchical Reinforcement Learning (Part II)

Movement Bandits

Meta-Learning Shared Hierarchies (MLSH)

Page 87: Hierarchical Reinforcement Learning (Part II)

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 88: Hierarchical Reinforcement Learning (Part II)

Ablative Analysis

Meta-Learning Shared Hierarchies (MLSH)

Page 89: Hierarchical Reinforcement Learning (Part II)

Ablative Analysis

Meta-Learning Shared Hierarchies (MLSH)

Page 90: Hierarchical Reinforcement Learning (Part II)

Four Rooms

Meta-Learning Shared Hierarchies (MLSH)

Page 91: Hierarchical Reinforcement Learning (Part II)

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 92: Hierarchical Reinforcement Learning (Part II)

Summary

FUN
● Directional goals
● Dilated RNN
● Transition Policy Gradient

MLSH
● Generalization in RL algorithm
● Inspired by the “Options” framework

HIRO
● Absolute goals in observation space
● Data-efficient
● Off-policy label correction

Page 93: Hierarchical Reinforcement Learning (Part II)

Discussion

▪ How to decide temporal resolution (i.e. c, N)?

▪ Do we need a discrete number of sub-policies?

▪ Future prospects of HRL? More levels of hierarchy?

Page 94: Hierarchical Reinforcement Learning (Part II)

Thank you for your attention!

Page 95: Hierarchical Reinforcement Learning (Part II)

Any Questions?

Page 96: Hierarchical Reinforcement Learning (Part II)

Let’s go and have lunch!

Page 97: Hierarchical Reinforcement Learning (Part II)

References

▪ Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). FeUdal Networks for Hierarchical Reinforcement Learning. ICML.

▪ Nachum, O., Gu, S., Lee, H., & Levine, S. (2018). Data-Efficient Hierarchical Reinforcement Learning. NeurIPS.

▪ Frans, K., Ho, J., Chen, X., Abbeel, P., & Schulman, J. (2018). Meta Learning Shared Hierarchies. CoRR, abs/1710.09767.

Page 98: Hierarchical Reinforcement Learning (Part II)

Appendix

Page 99: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

Environment

Manager

Worker(s)

Agent

Page 100: Hierarchical Reinforcement Learning (Part II)

Hierarchical RL

Image Credits: Levy, A., et al. (2019). Learning Multi-Level Hierarchies with Hindsight. ICLR.

Page 101: Hierarchical Reinforcement Learning (Part II)

Detour: A2C

Image Credits: Sergey Levine (2018), CS 294-112 (Lecture 6)

Page 102: Hierarchical Reinforcement Learning (Part II)

Advantage Function:

Update Rule:

FeUdal Networks (FUN)

Worker

Policy Gradient
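Both expressions are equation images in the slides; following the paper, the worker is trained with an advantage actor-critic (A2C-style) update on the environment return mixed with the intrinsic return (best-effort transcription):

```latex
A^{D}_t = R_t + \alpha R^{I}_t - V^{D}_t(x_t;\theta),
\qquad
\nabla \pi_t = A^{D}_t\, \nabla_{\theta} \log \pi(a_t \mid x_t;\theta)
```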

Page 103: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Manager

Advantage Function:

Update Rule:

Transition Policy Gradient
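Transcribed from the paper rather than the slide image, the manager's advantage and its transition policy gradient update with respect to the emitted goal are:

```latex
A^{M}_t = R_t - V^{M}_t(x_t;\theta),
\qquad
\nabla g_t = A^{M}_t\, \nabla_{\theta}\, d_{\cos}\bigl(s_{t+c} - s_t,\; g_t(\theta)\bigr)
```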

Page 104: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Transition Policy Gradient

Assumption:

● Worker will eventually learn to follow the goal directions
● Direction in state-space follows a von Mises-Fisher distribution

Page 105: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Learnt sub-goals by Manager

Page 106: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Memory Task: Non-Match

Page 107: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Memory Task: T-Maze

Page 108: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Memory Task: Water-Maze

Page 109: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Comparison

Page 110: Hierarchical Reinforcement Learning (Part II)

FeUdal Networks (FUN)

Comparison

Page 111: Hierarchical Reinforcement Learning (Part II)

Network Structure: TD3

Data-Efficient HRL (HIRO)

Manager: Actor-Critic, 2-layer MLPs each; goal output has the dimension of the raw observation space

Worker: Actor-Critic, 2-layer MLPs each; action output has the dimension of the action space

For more details: Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.

Page 112: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO): Off-Policy Correction for Manager


Page 113: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO): Off-Policy Correction for Manager

Approximately solved by generating candidate goals

Page 114: Hierarchical Reinforcement Learning (Part II)

Data-Efficient HRL (HIRO): Off-Policy Correction for Manager

Approximately solved by generating candidate goals:

● Original goal given: g_t

● Absolute goal based on the transition observed: s_{t+c} − s_t

● Randomly sampled candidates: drawn from a Gaussian centred at s_{t+c} − s_t
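Putting the three candidate types together, here is a hedged Python sketch of the relabelling step; worker_mean_action and goal_std are hypothetical placeholders, and the 1 + 1 + 8 candidate scheme follows the paper.

```python
import numpy as np

def relabel_goal(states, actions, original_goal, worker_mean_action,
                 goal_std=1.0, num_random=8, seed=0):
    """Pick the candidate goal that best explains the worker's observed actions.

    states:  (c + 1, obs_dim) array for one manager transition
    actions: (c, act_dim) array of the worker's actions
    worker_mean_action(state, goal): placeholder for the worker policy's mean action
    """
    rng = np.random.default_rng(seed)
    delta = states[-1] - states[0]                 # s_{t+c} - s_t
    candidates = [np.asarray(original_goal), delta]
    candidates += list(rng.normal(loc=delta, scale=goal_std,
                                  size=(num_random,) + delta.shape))

    def log_prob(goal):
        # Unnormalized log-likelihood of the observed actions:
        # -1/2 * sum_i ||a_i - mu(s_i, g_i)||^2, with the goal propagated by
        # HIRO's transition g <- s_i + g - s_{i+1} between steps.
        g, total = goal, 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            total -= 0.5 * float(np.sum((a - worker_mean_action(s, g)) ** 2))
            g = s + g - s_next
        return total

    return max(candidates, key=log_prob)
```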

Page 115: Hierarchical Reinforcement Learning (Part II)

Training

Data-Efficient HRL (HIRO)

Page 117: Hierarchical Reinforcement Learning (Part II)

Network Structure: PPO

Meta-Learning Shared Hierarchies (MLSH)

Manager: 2-layer MLP with 64 hidden units; output has the dimension of the number of sub-policies

Each sub-policy: 2-layer MLP with 64 hidden units; output has the dimension of the action space

For more details: Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347.

Page 118: Hierarchical Reinforcement Learning (Part II)

Training

Meta-Learning Shared Hierarchies (MLSH)

Page 119: Hierarchical Reinforcement Learning (Part II)

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 120: Hierarchical Reinforcement Learning (Part II)

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 121: Hierarchical Reinforcement Learning (Part II)

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 122: Hierarchical Reinforcement Learning (Part II)

▪ Useful when input data is sequential (such as in speech recognition, language modelling)

Recurrent Neural Network

For more details: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 123: Hierarchical Reinforcement Learning (Part II)

Stochastic NN for HRL (SNN4HRL)

For more details: Florensa, C., et al. (2017). Stochastic Neural Networks for Hierarchical Reinforcement Learning. ICLR.

Aims to learn useful skills during pre-training and then leverage them for learning faster in future tasks

Page 124: Hierarchical Reinforcement Learning (Part II)

Variational Information Maximizing Exploration (VIME)

For more details: Houthooft, R., et al. (2016). VIME: Variational Information Maximizing Exploration. NIPS.

Exploration based on maximizing information gain about the agent's belief of the environment