
Continuous control with deep reinforcement learning (DDPG)

Apr 11, 2017



Taehoon Kim
Transcript
Page 1: Continuous control with deep reinforcement learning (DDPG)

Continuous control with deep reinforcement learning

2016-06-28

Taehoon Kim

Page 2: Continuous control with deep reinforcement learning (DDPG)

Motivation

• DQN can only handle

• discrete (not continuous)

• low-dimensional action spaces

• Simple approach to adapt DQN to the continuous domain is discretizing

• a 7 degree-of-freedom system with discretization $a_i \in \{-k, 0, k\}$ for each joint

• Now the action space dimensionality becomes $3^7 = 2187$

• explosion of the number of discrete actions

2

Page 3: Continuous control with deep reinforcement learning (DDPG)

Contribution

• Present a model-free, off-policy actor-critic algorithm

• learn policies in high-dimensional, continuous action spaces

• Work based on DPG (Deterministic Policy Gradient)

3

Page 4: Continuous control with deep reinforcement learning (DDPG)

Background

• actions $a_t \in \mathbb{R}^N$, action space $\mathcal{A} = \mathbb{R}^N$

• history of observation-action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$

• assume fully observable, so $s_t = x_t$

• policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$

• Model the environment as a Markov decision process

• initial state distribution $p(s_1)$

• transition dynamics $p(s_{t+1}|s_t, a_t)$

4

Page 5: Continuous control with deep reinforcement learning (DDPG)

Background

• Discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ (see the sketch below)

• Goal of RL is to learn a policy $\pi$ which maximizes the expected return

• from the start distribution: $J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \pi}[R_1]$

• Discounted state visitation distribution for a policy $\pi$: $\rho^\pi$

5
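A minimal NumPy sketch of the discounted return above; the reward sequence and the default $\gamma$ are illustrative assumptions, not values from this slide.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i) for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0]))    # [2.9701, 1.99, 1.0]
```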

Page 6: Continuous control with deep reinforcement learning (DDPG)

Background

• action-value function $Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E, a_{i > t} \sim \pi}[R_t | s_t, a_t]$

• expected return after taking an action $a_t$ in state $s_t$ and following policy $\pi$

• Bellman equation

• $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q^\pi(s_{t+1}, a_{t+1})]\big]$

• With a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$

• $Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1}))\big]$

6

Page 7: Continuous control with deep reinforcement learning (DDPG)

Background

• Expectation only depends on the environment

• possible to learn $Q^\mu$ off-policy, where transitions are generated from a different stochastic policy $\beta$

• Q-learning (a commonly used off-policy algorithm) uses the greedy policy $\mu(s) = \arg\max_a Q(s, a)$

• $L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}\big[(Q(s_t, a_t|\theta^Q) - y_t)^2\big]$

• where $y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1})|\theta^Q)$ (see the sketch below)

• To scale Q-learning to large non-linear function approximators:

• a replay buffer, a separate target network

7
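A minimal PyTorch sketch of the Q-learning loss above. The `critic` and `actor` networks, the minibatch layout, and the `done` mask for terminal states are assumptions for illustration; with a separate target network (introduced on the later slides), the target copies would be used inside `y`.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, actor, batch, gamma=0.99):
    s, a, r, s_next, done = batch             # tensors sampled from a replay buffer
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)
        y = r + gamma * (1.0 - done) * critic(s_next, actor(s_next)).squeeze(-1)
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(critic(s, a).squeeze(-1), y)
```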


Page 8: Continuous control with deep reinforcement learning (DDPG)

Deterministic Policy Gradient (DPG)

• In continuous spaces, finding the greedy policy requires an optimization of $a_t$ at every timestep

• too slow for large, unconstrained function approximators and nontrivial action spaces

• Instead, use an actor-critic approach based on the DPG algorithm (both networks are sketched below)

• actor: $\mu(s|\theta^\mu): \mathcal{S} \rightarrow \mathcal{A}$

• critic: $Q(s, a|\theta^Q)$

8
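A minimal PyTorch sketch of the two function approximators. The 400/300 layer sizes and the point where the action enters the critic follow the experiment-details slide later in the deck; everything else is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): maps a state to a deterministic action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)

    def forward(self, s):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(h))
        return torch.tanh(self.out(h))        # tanh bounds the actions

class Critic(nn.Module):
    """Q(s, a | theta^Q): scores a state-action pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)   # action enters at the 2nd hidden layer
        self.out = nn.Linear(300, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```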

Page 9: Continuous control with deep reinforcement learning (DDPG)

Learning algorithm

• The actor is updated by applying the chain rule to the expected return from the start distribution $J$ w.r.t. $\theta^\mu$ (see the sketch below)

• $\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a|\theta^Q)\big|_{s=s_t, a=\mu(s_t|\theta^\mu)}\big] = \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_a Q(s, a|\theta^Q)\big|_{s=s_t, a=\mu(s_t)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_t}\big]$

• Silver et al. (2014) proved this is the policy gradient

• the gradient of the policy's performance

9
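A minimal sketch of the actor update above; `actor`, `critic`, the optimizer and the state batch `s` are assumed to exist (see the earlier sketches). Minimizing $-Q(s, \mu(s|\theta^\mu))$ ascends this gradient, and autograd applies the chain rule $\nabla_a Q \cdot \nabla_{\theta^\mu}\mu$ from the equation above.

```python
def actor_update(actor, critic, actor_opt, s):
    loss = -critic(s, actor(s)).mean()        # maximize E[Q(s, mu(s | theta^mu))]
    actor_opt.zero_grad()
    loss.backward()                           # chain rule: grad_a Q * grad_theta^mu mu
    actor_opt.step()
```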

Page 10: Continuous control with deep reinforcement learning (DDPG)

Contributions

• Introducing non-linear function approximators means that convergence is no longer guaranteed

• but they are essential to learn and generalize on large state spaces

• Contribution

• To provide modifications to DPG, inspired by the success of DQN

• Allowing neural network function approximators to learn in large state and action spaces online

10

Page 11: Continuous control with deep reinforcement learning (DDPG)

Challenges 1

• NNs for RL usually assume that the samples are i.i.d.

• but when the samples are generated from exploring sequentially in an environment, this assumption no longer holds

• As in DQN, we use a replay buffer to address this issue

• As in DQN, we use a target network for stable learning, but with "soft" target updates (see the sketch below)

• $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, with $\tau \ll 1$

• Target networks change slowly, which greatly improves the stability of learning

11
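A minimal sketch of the two ingredients above. The buffer capacity and $\tau$ default to the values on the experiment-details slide; the minibatch size of 64 and the transition layout are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and samples near-i.i.d. minibatches."""
    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```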

Page 12: Continuous control with deep reinforcement learning (DDPG)

Challenges 2

• When learning from low-dimensional feature vectors, observations may have different physical units (e.g. positions and velocities)

• this makes it difficult to learn effectively and also to find hyper-parameters which generalize across environments

• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to unit mean and variance

• It also maintains a running average of the mean and variance for normalization during testing (exploration or evaluation)

• Used on all layers of $\mu$ and all layers of $Q$ prior to the action input (see the sketch below)

• Can train on tasks with different units without needing to manually ensure they were within a set range

12
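A sketch of where batch normalization could sit in the actor; the exact placement relative to the activations is an assumption here, not specified on the slide. `nn.BatchNorm1d` keeps the running mean and variance that are used in evaluation mode, matching the testing bullet above.

```python
import torch
import torch.nn as nn

class BatchNormActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.bn0 = nn.BatchNorm1d(state_dim)   # normalizes the raw observation units
        self.fc1, self.bn1 = nn.Linear(state_dim, 400), nn.BatchNorm1d(400)
        self.fc2, self.bn2 = nn.Linear(400, 300), nn.BatchNorm1d(300)
        self.out = nn.Linear(300, action_dim)

    def forward(self, s):
        h = torch.relu(self.bn1(self.fc1(self.bn0(s))))
        h = torch.relu(self.bn2(self.fc2(h)))
        return torch.tanh(self.out(h))
```

Calling `.eval()` on the module switches the BatchNorm layers to their running statistics, which is the "normalization during testing" behaviour described above.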


Page 13: Continuous control with deep reinforcement learning (DDPG)

Challenges 3

• An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm

• Constructed an exploration policy $\mu'$ by adding noise sampled from a noise process $\mathcal{N}$

• $\mu'(s_t) = \mu(s_t|\theta_t^\mu) + \mathcal{N}$

• Use an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia (see the sketch below)

13
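A minimal sketch of an Ornstein-Uhlenbeck noise process; $\theta$ and $\sigma$ default to the values on the experiment-details slide, while the Euler step `dt` and the `action_dim` parameter are assumptions.

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x = np.full(self.x.shape, self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```

The exploration policy then adds a sample to the actor's output at each step, i.e. $\mu'(s_t) = \mu(s_t|\theta_t^\mu) + \mathcal{N}$.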

Page 14: Continuous control with deep reinforcement learning (DDPG)

14

Page 15: Continuous control with deep reinforcement learning (DDPG)

Experiment details

• Adam. $lr_\mu = 10^{-4}$, $lr_Q = 10^{-3}$

• $Q$ includes an $L_2$ weight decay of $10^{-2}$, and $\gamma = 0.99$

• $\tau = 0.001$

• ReLU for hidden layers, tanh for the output layer of the actor to bound the actions

• NN: 2 hidden layers with 400 and 300 units

• Action is not included until the 2nd hidden layer of $Q$

• The final layer weights and biases are initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ (see the sketch below)

• to ensure the initial outputs for the policy and value estimates were near zero

• The other layers are initialized from uniform distributions $[-1/\sqrt{f}, 1/\sqrt{f}]$, where $f$ is the fan-in of the layer

• Replay buffer size $|\mathcal{R}| = 10^6$, Ornstein-Uhlenbeck process: $\theta = 0.15$, $\sigma = 0.2$

15
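A sketch of the initialization and optimizer settings above; the 17-dimensional state and 6-dimensional action are arbitrary placeholder sizes, not values from the slides.

```python
import torch
import torch.nn as nn

def fan_in_uniform_(layer):
    bound = 1.0 / layer.weight.size(1) ** 0.5        # 1 / sqrt(fan-in of the layer)
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

actor = nn.Sequential(
    nn.Linear(17, 400), nn.ReLU(),
    nn.Linear(400, 300), nn.ReLU(),
    nn.Linear(300, 6), nn.Tanh(),                    # tanh output bounds the actions
)
fan_in_uniform_(actor[0])
fan_in_uniform_(actor[2])
nn.init.uniform_(actor[4].weight, -3e-3, 3e-3)       # final layer in [-3e-3, 3e-3]
nn.init.uniform_(actor[4].bias, -3e-3, 3e-3)

# Learning rates from the slide; the critic optimizer would add weight_decay=1e-2.
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
```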
