
Continuous control with deep reinforcement learning (DDPG)

Apr 11, 2017



Taehoon Kim
Transcript
Page 1: Continuous control with deep reinforcement learning (DDPG)

Continuous control with deep reinforcement learning

2016-06-28

Taehoon Kim

Page 2: Continuous control with deep reinforcement learning (DDPG)

Motivation

• DQN can only handle

• discrete (not continuous)

• low-dimensional action spaces

• Simple approach to adapt DQN to the continuous domain is discretizing

• a 7 degree-of-freedom system with discretization $a_i \in \{-k, 0, k\}$ for each joint

• Now the action space dimensionality becomes $3^7 = 2187$

• explosion of the number of discrete actions

2

Page 3: Continuous control with deep reinforcement learning (DDPG)

Contribution

• Present a model-free, off-policy actor-critic algorithm

• learn policies in high-dimensional, continuous action spaces

• Work based on DPG (Deterministic Policy Gradient)

3

Page 4: Continuous control with deep reinforcement learning (DDPG)

Background

• actions $a_t \in \mathbb{R}^N$, action space $\mathcal{A} = \mathbb{R}^N$

• history of observation-action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$

• assume fully observable, so $s_t = x_t$

• policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$

• Model the environment as a Markov decision process

• initial state distribution $p(s_1)$

• transition dynamics $p(s_{t+1}|s_t, a_t)$

4

Page 5: Continuous control with deep reinforcement learning (DDPG)

Background

• Discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ (see the sketch below)

• Goal of RL is to learn a policy $\pi$ which maximizes the expected return

• from the start distribution: $J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \pi}[R_1]$

• Discounted state visitation distribution for a policy $\pi$: $\rho^\pi$

5
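A minimal NumPy sketch of the discounted return above; the reward sequence and the default $\gamma$ are illustrative assumptions, not values from this slide.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i) for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0]))    # [2.9701, 1.99, 1.0]
```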

Page 6: Continuous control with deep reinforcement learning (DDPG)

Background

• action-value function $Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E, a_{i > t} \sim \pi}[R_t | s_t, a_t]$

• expected return after taking an action $a_t$ in state $s_t$ and following policy $\pi$

• Bellman equation

• $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q^\pi(s_{t+1}, a_{t+1})]\big]$

• With a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$

• $Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1}))\big]$

6

Page 7: Continuous control with deep reinforcement learning (DDPG)

Background

• Expectation only depends on the environment

• possible to learn $Q^\mu$ off-policy, where transitions are generated from a different stochastic policy $\beta$

• Q-learning (a commonly used off-policy algorithm) uses the greedy policy $\mu(s) = \arg\max_a Q(s, a)$

• $L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}\big[(Q(s_t, a_t|\theta^Q) - y_t)^2\big]$

• where $y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1})|\theta^Q)$ (see the sketch below)

• To scale Q-learning to large non-linear function approximators:

• a replay buffer, a separate target network

7
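A minimal PyTorch sketch of the Q-learning loss above. The `critic` and `actor` networks, the minibatch layout, and the `done` mask for terminal states are assumptions for illustration; with a separate target network (introduced on the later slides), the target copies would be used inside `y`.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, actor, batch, gamma=0.99):
    s, a, r, s_next, done = batch             # tensors sampled from a replay buffer
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)
        y = r + gamma * (1.0 - done) * critic(s_next, actor(s_next)).squeeze(-1)
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(critic(s, a).squeeze(-1), y)
```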


Page 8: Continuous control with deep reinforcement learning (DDPG)

Deterministic Policy Gradient (DPG)

• In continuous spaces, finding the greedy policy requires an optimization of $a_t$ at every timestep

• too slow for large, unconstrained function approximators and nontrivial action spaces

• Instead, use an actor-critic approach based on the DPG algorithm (both networks are sketched below)

• actor: $\mu(s|\theta^\mu): \mathcal{S} \rightarrow \mathcal{A}$

• critic: $Q(s, a|\theta^Q)$

8
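A minimal PyTorch sketch of the two function approximators. The 400/300 layer sizes and the point where the action enters the critic follow the experiment-details slide later in the deck; everything else is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): maps a state to a deterministic action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)

    def forward(self, s):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(h))
        return torch.tanh(self.out(h))        # tanh bounds the actions

class Critic(nn.Module):
    """Q(s, a | theta^Q): scores a state-action pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)   # action enters at the 2nd hidden layer
        self.out = nn.Linear(300, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```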

Page 9: Continuous control with deep reinforcement learning (DDPG)

Learning algorithm

• The actor is updated by applying the chain rule to the expected return from the start distribution $J$ w.r.t. $\theta^\mu$ (see the sketch below)

• $\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a|\theta^Q)\big|_{s=s_t, a=\mu(s_t|\theta^\mu)}\big] = \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_a Q(s, a|\theta^Q)\big|_{s=s_t, a=\mu(s_t)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_t}\big]$

• Silver et al. (2014) proved this is the policy gradient

• the gradient of the policy's performance

9
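A minimal sketch of the actor update above; `actor`, `critic`, the optimizer and the state batch `s` are assumed to exist (see the earlier sketches). Minimizing $-Q(s, \mu(s|\theta^\mu))$ ascends this gradient, and autograd applies the chain rule $\nabla_a Q \cdot \nabla_{\theta^\mu}\mu$ from the equation above.

```python
def actor_update(actor, critic, actor_opt, s):
    loss = -critic(s, actor(s)).mean()        # maximize E[Q(s, mu(s | theta^mu))]
    actor_opt.zero_grad()
    loss.backward()                           # chain rule: grad_a Q * grad_theta^mu mu
    actor_opt.step()
```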

Page 10: Continuous control with deep reinforcement learning (DDPG)

Contributions

• Introducing non-linear function approximators means that convergence is no longer guaranteed

• but they are essential to learn and generalize on large state spaces

• Contribution

• To provide modifications to DPG, inspired by the success of DQN

• Allowing neural network function approximators to learn in large state and action spaces online

10

Page 11: Continuous control with deep reinforcement learning (DDPG)

Challenges 1

• NNs for RL usually assume that the samples are i.i.d.

• but when the samples are generated from exploring sequentially in an environment, this assumption no longer holds

• As in DQN, we use a replay buffer to address this issue

• As in DQN, we use a target network for stable learning, but with "soft" target updates (see the sketch below)

• $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, with $\tau \ll 1$

• Target networks change slowly, which greatly improves the stability of learning

11
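A minimal sketch of the two ingredients above. The buffer capacity and $\tau$ default to the values on the experiment-details slide; the minibatch size of 64 and the transition layout are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and samples near-i.i.d. minibatches."""
    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```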

Page 12: Continuous control with deep reinforcement learning (DDPG)

Challenges 2

• When learning from low-dimensional feature vectors, observations may have different physical units (e.g. positions and velocities)

• this makes it difficult to learn effectively and also to find hyper-parameters which generalize across environments

• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to unit mean and variance

• It also maintains a running average of the mean and variance for normalization during testing (exploration or evaluation)

• Used on all layers of $\mu$ and all layers of $Q$ prior to the action input (see the sketch below)

• Can train on tasks with different units without needing to manually ensure they were within a set range

12
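A sketch of where batch normalization could sit in the actor; the exact placement relative to the activations is an assumption here, not specified on the slide. `nn.BatchNorm1d` keeps the running mean and variance that are used in evaluation mode, matching the testing bullet above.

```python
import torch
import torch.nn as nn

class BatchNormActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.bn0 = nn.BatchNorm1d(state_dim)   # normalizes the raw observation units
        self.fc1, self.bn1 = nn.Linear(state_dim, 400), nn.BatchNorm1d(400)
        self.fc2, self.bn2 = nn.Linear(400, 300), nn.BatchNorm1d(300)
        self.out = nn.Linear(300, action_dim)

    def forward(self, s):
        h = torch.relu(self.bn1(self.fc1(self.bn0(s))))
        h = torch.relu(self.bn2(self.fc2(h)))
        return torch.tanh(self.out(h))
```

Calling `.eval()` on the module switches the BatchNorm layers to their running statistics, which is the "normalization during testing" behaviour described above.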


Page 13: Continuous control with deep reinforcement learning (DDPG)

Challenges 3

• An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm

• Constructed an exploration policy $\mu'$ by adding noise sampled from a noise process $\mathcal{N}$

• $\mu'(s_t) = \mu(s_t|\theta_t^\mu) + \mathcal{N}$

• Use an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia (see the sketch below)

13
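A minimal sketch of an Ornstein-Uhlenbeck noise process; $\theta$ and $\sigma$ default to the values on the experiment-details slide, while the Euler step `dt` and the `action_dim` parameter are assumptions.

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x = np.full(self.x.shape, self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```

The exploration policy then adds a sample to the actor's output at each step, i.e. $\mu'(s_t) = \mu(s_t|\theta_t^\mu) + \mathcal{N}$.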

Page 14: Continuous control with deep reinforcement learning (DDPG)

14

Page 15: Continuous control with deep reinforcement learning (DDPG)

Experiment details

• Adam. $lr_\mu = 10^{-4}$, $lr_Q = 10^{-3}$

• $Q$ includes an $L_2$ weight decay of $10^{-2}$, and $\gamma = 0.99$

• $\tau = 0.001$

• ReLU for hidden layers, tanh for the output layer of the actor to bound the actions

• NN: 2 hidden layers with 400 and 300 units

• Action is not included until the 2nd hidden layer of $Q$

• The final layer weights and biases are initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ (see the sketch below)

• to ensure the initial outputs for the policy and value estimates were near zero

• The other layers are initialized from uniform distributions $[-1/\sqrt{f}, 1/\sqrt{f}]$, where $f$ is the fan-in of the layer

• Replay buffer size $|\mathcal{R}| = 10^6$, Ornstein-Uhlenbeck process: $\theta = 0.15$, $\sigma = 0.2$

15
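A sketch of the initialization and optimizer settings above; the 17-dimensional state and 6-dimensional action are arbitrary placeholder sizes, not values from the slides.

```python
import torch
import torch.nn as nn

def fan_in_uniform_(layer):
    bound = 1.0 / layer.weight.size(1) ** 0.5        # 1 / sqrt(fan-in of the layer)
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

actor = nn.Sequential(
    nn.Linear(17, 400), nn.ReLU(),
    nn.Linear(400, 300), nn.ReLU(),
    nn.Linear(300, 6), nn.Tanh(),                    # tanh output bounds the actions
)
fan_in_uniform_(actor[0])
fan_in_uniform_(actor[2])
nn.init.uniform_(actor[4].weight, -3e-3, 3e-3)       # final layer in [-3e-3, 3e-3]
nn.init.uniform_(actor[4].bias, -3e-3, 3e-3)

# Learning rates from the slide; the critic optimizer would add weight_decay=1e-2.
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
```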
