Lecture 7: DQN
Reinforcement Learning with TensorFlow & OpenAI Gym
Post on 13-Mar-2018
Q-function Approximation: Q-Nets
(1) Input: state s
(2) Output: quality (expected reward) for every action, e.g. [0.5, 0.1, 0.0, 0.8] means LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8
[Diagram: network mapping the state input to one Q-value per action.]
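As a minimal sketch of the idea above (weights, shapes, and the hidden-layer size are illustrative, not from the lecture), a Q-net takes a state vector and returns one Q-value per action; the greedy action is the one with the largest Q-value:

```python
import numpy as np

# Action layout follows the slide's example output [0.5, 0.1, 0.0, 0.8].
ACTIONS = ["LEFT", "RIGHT", "UP", "DOWN"]

def q_net(state, w1, b1, w2, b2):
    """Return estimated Q-values for every action given one state."""
    hidden = np.maximum(0.0, state @ w1 + b1)  # ReLU hidden layer
    return hidden @ w2 + b2                    # one output per action

rng = np.random.default_rng(0)
state = rng.normal(size=4)                      # toy 4-dim state
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # illustrative shapes
w2, b2 = rng.normal(size=(8, 4)), np.zeros(4)

q = q_net(state, w1, b1, w2, b2)
greedy = ACTIONS[int(np.argmax(q))]             # greedy action = largest Q
print(q.shape, greedy)
```

One forward pass per state yields all four Q-values at once, which is why the network output has one unit per action rather than taking the action as an input.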
Convergence
Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2$$
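As a worked numeric example of this objective (toy numbers of my choosing, not from the lecture): for one time step, the loss is the squared gap between the current estimate Q̂(s_t, a_t|θ) and the bootstrapped target r_t + γ·max_a' Q̂(s_{t+1}, a'|θ).

```python
import numpy as np

gamma = 0.9
q_sa = 0.5                                # Q-hat(s_t, a_t | theta) for the chosen action
q_next = np.array([0.5, 0.1, 0.0, 0.8])  # Q-hat(s_{t+1}, . | theta) over all actions
r = 1.0                                   # reward received at step t

target = r + gamma * q_next.max()         # 1.0 + 0.9 * 0.8 = 1.72
td_error = q_sa - target                  # 0.5 - 1.72 = -1.22
loss = td_error ** 2                      # squared TD error for this step
print(round(loss, 4))                     # -> 1.4884
```

The full objective sums this per-step squared error over t = 0..T before minimizing with respect to θ.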
Reinforcement + Neural Net
http://stackoverflow.com/questions/10722064/training-a-neural-network-with-reinforcement-learning
[Nature cover, 26 February 2015, Vol. 518, No. 7540: "Self-taught AI software attains human-level performance in video games", pages 486 & 529.]
1. Correlations between samples
Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.
2. Non-stationary targets
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2$$

Prediction: $Y_{\text{pred}} = \hat{Q}(s_t, a_t \mid \theta)$   Target: $Y = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)$

Both the prediction and the target depend on the same parameters θ, so the target moves whenever θ is updated.
DQN’s three solutions
1. Go deep
2. Capture and replay → fixes correlations between samples
3. Separate networks: create a target network → fixes non-stationary targets
Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
Solution 1: go deep
Solution 2: experience replay

Deep Q-Networks (DQN): Experience Replay
To remove correlations, build a data-set from the agent's own experience:
(s_1, a_1, r_2, s_2)
(s_2, a_2, r_3, s_3)  →  (s, a, r, s')
(s_3, a_3, r_4, s_4)
...
(s_t, a_t, r_{t+1}, s_{t+1})  →  (s_t, a_t, r_{t+1}, s_{t+1})
Sample experiences from the data-set and apply the update
$$l = \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$$
To deal with non-stationarity, the target parameters $w^-$ are held fixed.
Capture a random sample & replay:
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2$$
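The "capture and replay" step above can be sketched as a bounded buffer of (s, a, r, s') tuples that is sampled uniformly at random (class and method names here are mine, not from the lecture); random minibatches break the correlation between consecutive samples:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # Bounded deque: oldest experiences fall off when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sample without replacement decorrelates the batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(200):                 # overfill to exercise the capacity cap
    buf.add(t, t % 4, 1.0, t + 1)
batch = buf.sample(8)
print(len(buf), len(batch))          # buffer keeps only the newest 100
```

Training then draws a fresh random minibatch from the buffer at each gradient step instead of learning from the most recent transition alone.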
ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
Solution 2: experience replay
Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.
Problem 2: correlations between samples

Deep Q-Networks (DQN): Experience Replay
Problem 3: non-stationary targets
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2$$
Prediction: $Y_{\text{pred}} = \hat{Q}(s_t, a_t \mid \theta)$   Target: $Y = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)$
The target is computed with the very parameters θ being optimized, so it shifts after every update.
Solution 3: separate target network

Before (target uses the online parameters θ):
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2$$

After (target uses a frozen copy θ̄):
$$\min_\theta \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \bar{\theta}) \right) \right]^2$$
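The mechanics of the separate target network can be sketched as follows (parameter names and the sync interval are illustrative, not from the lecture): the online parameters θ change every step, while the target copy θ̄ stays frozen and is only synced to θ every few steps.

```python
import numpy as np

theta = {"w": np.zeros(3)}                            # online network theta
theta_bar = {k: v.copy() for k, v in theta.items()}   # target network theta-bar
sync_every = 10                                       # illustrative interval

for step in range(1, 26):
    theta["w"] += 0.1                                 # stand-in for a gradient step
    if step % sync_every == 0:
        # theta-bar <- theta : targets stay fixed between syncs
        theta_bar = {k: v.copy() for k, v in theta.items()}

# After 25 steps: theta has moved 25 times, but theta_bar reflects
# only the last sync (step 20), so targets lag the online network.
print(theta["w"][0], theta_bar["w"][0])
```

Because θ̄ is held fixed between syncs, the regression target stops chasing every gradient step, which stabilizes training.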
[Diagram: (1) state s feeds the prediction network with weights W; (2) a separate target network produces Y (target).]
Understanding the Nature paper (2015)
Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html