
RL Lecture 7: Eligibility Traces

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction


N-step TD Prediction

❐  Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)


Mathematics of N-step TD Prediction

❐  Monte Carlo:
      R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^(T−t−1) r_T

❐  TD:
      R_t^(1) = r_{t+1} + γ V_t(s_{t+1})
    – Use V to estimate the remaining return

❐  n-step TD:
    – 2-step return:
      R_t^(2) = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})
    – n-step return:
      R_t^(n) = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^(n−1) r_{t+n} + γ^n V_t(s_{t+n})
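For concreteness, a made-up example: with γ = 0.9, r_{t+1} = 1, r_{t+2} = 0, and V_t(s_{t+2}) = 0.5, the 2-step return is R_t^(2) = 1 + 0.9·0 + 0.9^2·0.5 = 1.405.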


Learning with N-step Backups

❐  Backup (on-line or off-line; see the sketch below):
      ΔV_t(s_t) = α [ R_t^(n) − V_t(s_t) ]

❐  Error reduction property of n-step returns:
      max_s | E_π{ R_t^(n) | s_t = s } − V^π(s) |  ≤  γ^n max_s | V(s) − V^π(s) |
      (maximum error using the n-step return)         (maximum error using V)

❐  Using this, you can show that n-step methods converge
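To make the n-step backup concrete, here is a minimal Python sketch of tabular n-step TD prediction. It is not from the lecture: the environment interface (env.reset(), env.step(a) returning (next_state, reward, done)), the fixed policy, and integer-coded states are assumptions for illustration.

import numpy as np

def n_step_td_prediction(env, policy, n, num_states,
                         alpha=0.1, gamma=0.9, episodes=100):
    """Tabular n-step TD prediction of V^pi (sketch; interfaces are assumed)."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        states, rewards = [env.reset()], []
        T = np.inf                       # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))   # assumed interface
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1              # the time whose state estimate is updated
            if tau >= 0:
                # n-step return: up to n discounted rewards, then bootstrap from V
                G = sum(gamma**i * rewards[tau + i] for i in range(min(n, T - tau)))
                if tau + n < T:
                    G += gamma**n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])   # n-step backup
            if tau == T - 1:
                break
            t += 1
    return V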


Random Walk Examples

❐  How does 2-step TD work here?
❐  How about 3-step TD?


A Larger Example

❐  Task: 19 state random walk

❐  Do you think there is an optimal n (for everything)?


Averaging N-step Returns

❐  n-step methods were introduced to help with understanding TD(λ)

❐  Idea: backup an average of several returns
    – e.g., backup half of the 2-step return and half of the 4-step return:
      R_t^avg = (1/2) R_t^(2) + (1/2) R_t^(4)

❐  Called a complex backup
    – Draw each component
    – Label with the weights for that component
    [Backup diagram: one backup]


Forward View of TD(λ)

❐  TD(λ) is a method for averaging all n-step backups
    – weight by λ^(n−1) (time since visitation)
    – λ-return:
      R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^(n−1) R_t^(n)

❐  Backup using λ-return (see the sketch below):
      ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
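A minimal sketch of the λ-return computation for a finished episode, assuming integer-coded states and a fixed value table V; the function and variable names are illustrative, not from the book.

import numpy as np

def lambda_return(rewards, states, V, t, lam, gamma):
    """Forward-view λ-return R_t^λ; rewards[k] is r_{k+1}, states[k] is s_k."""
    T = len(rewards)                      # episode ends at time T

    def n_step(n):                        # the n-step return R_t^(n)
        G = sum(gamma**i * rewards[t + i] for i in range(min(n, T - t)))
        if t + n < T:
            G += gamma**n * V[states[t + n]]
        return G

    G_lam = (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, T - t))
    return G_lam + lam**(T - t - 1) * n_step(T - t)   # remaining weight on R_t

# Backup using the λ-return (forward view):
#   V[states[t]] += alpha * (lambda_return(rewards, states, V, t, lam, gamma) - V[states[t]])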


λ-return Weighting Function


Relation to TD(0) and MC

❐  λ-return can be rewritten as:
      R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n)  +  λ^(T−t−1) R_t
              (until termination)                         (after termination)

❐  If λ = 1, you get MC:
      R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^(n−1) R_t^(n) + 1^(T−t−1) R_t = R_t

❐  If λ = 0, you get TD(0):
      R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^(n−1) R_t^(n) + 0^(T−t−1) R_t = R_t^(1)


Forward View of TD(λ) II

❐  Look forward from each state to determine update from future states and rewards:


λ-return on the Random Walk

❐  Same 19 state random walk as before
❐  Why do you think intermediate values of λ are best?


Backward View of TD(λ)

❐  The forward view was for theory
❐  The backward view is for mechanism

❐  New variable called eligibility trace, e_t(s) ∈ ℝ^+
    – On each step, decay all traces by γλ and increment the trace for the current state by 1
    – Accumulating trace:

      e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
               γλ e_{t−1}(s) + 1    if s = s_t


On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s'
    Until s is terminal
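Below is a minimal Python transcription of the boxed algorithm, using accumulating traces. The gym-style environment interface (env.reset(), env.step(a) → (next_state, reward, done)), the fixed policy, integer-coded states, and resetting the traces at the start of each episode are assumptions of this sketch.

import numpy as np

def online_tabular_td_lambda(env, policy, num_states,
                             alpha=0.1, gamma=0.9, lam=0.9, episodes=100):
    """On-line tabular TD(λ) with accumulating traces (sketch)."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        e = np.zeros(num_states)              # eligibility traces
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # action given by π for s
            s_next, r, done = env.step(a)     # assumed interface
            # treat the terminal state's value as 0
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                       # accumulating trace for current state
            V += alpha * delta * e            # update every state
            e *= gamma * lam                  # decay every trace
            s = s_next
    return V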


Backward View

❐  Shout δ_t backwards over time
❐  The strength of your voice decreases with temporal distance by γλ

      δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)


Relation of Backwards View to MC & TD(0)

❐  Using update rule:
      ΔV_t(s) = α δ_t e_t(s)

❐  As before, if you set λ to 0, you get TD(0)
❐  If you set λ to 1, you get MC, but in a better way
    – Can apply TD(1) to continuing tasks
    – Works incrementally and on-line (instead of waiting until the end of the episode)


Forward View = Backward View

❐  The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

❐  The book shows (algebra in the book):

      Backward updates:   Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k

      Forward updates:    Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t} = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k

      Therefore:          Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

❐  On-line updating with small α is similar (a numerical check of the off-line case follows below)
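The equivalence is easy to check numerically for the off-line case. The sketch below uses a small made-up episode and a fixed value table (all numbers are illustrative) and confirms that the summed backward-view increments equal the summed forward-view (λ-return) increments.

import numpy as np

gamma, lam, alpha = 0.9, 0.8, 0.1
states  = [0, 1, 2, 1, 3]            # s_0 .. s_T; state 3 is terminal
rewards = [1.0, 0.0, -1.0, 2.0]      # rewards[t] = r_{t+1}
V = np.array([0.2, -0.1, 0.5, 0.0])  # fixed (off-line) estimates; V[terminal] = 0
T = len(rewards)

delta = [rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)]

def n_step(t, n):                    # R_t^(n), with bootstrapping from V
    G = sum(gamma**i * rewards[t + i] for i in range(min(n, T - t)))
    if t + n < T:
        G += gamma**n * V[states[t + n]]
    return G

def lam_return(t):                   # R_t^λ
    G = (1 - lam) * sum(lam**(n - 1) * n_step(t, n) for n in range(1, T - t))
    return G + lam**(T - t - 1) * n_step(t, T - t)

forward = np.zeros_like(V)           # Σ_t ΔV_t^λ(s_t) I_{s s_t}
backward = np.zeros_like(V)          # Σ_t ΔV_t^TD(s)
for t in range(T):
    forward[states[t]] += alpha * (lam_return(t) - V[states[t]])
    backward[states[t]] += alpha * sum((gamma * lam)**(k - t) * delta[k]
                                       for k in range(t, T))

assert np.allclose(forward, backward)   # identical for off-line updating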


On-line versus Off-line on Random Walk

❐  Same 19 state random walk
❐  On-line performs better over a broader range of parameters


Control: Sarsa(λ)

❐  Save eligibility for state-action pairs instead of just states

      e_t(s,a) = γλ e_{t−1}(s,a) + 1    if s = s_t and a = a_t
                 γλ e_{t−1}(s,a)        otherwise

      Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)

      δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s'; a ← a'
    Until s is terminal
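A minimal Python sketch of the algorithm above, with ε-greedy action selection and accumulating traces; the environment interface and integer-coded states and actions are assumptions.

import numpy as np

def sarsa_lambda(env, num_states, num_actions, alpha=0.1, gamma=0.9,
                 lam=0.9, epsilon=0.05, episodes=200):
    """Tabular Sarsa(λ) with accumulating traces (sketch)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                  # traces reset each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)     # assumed interface
            a_next = eps_greedy(s_next)
            delta = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]
            e[s, a] += 1.0                    # accumulating trace
            Q += alpha * delta * e            # update all state-action pairs
            e *= gamma * lam                  # decay all traces
            s, a = s_next, a_next
    return Q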


Sarsa(λ) Gridworld Example

❐  With one trial, the agent has much more information about how to get to the goal
    – not necessarily the best way

❐  Can considerably accelerate learning


Three Approaches to Q(λ)

❐  How can we extend this to Q-learning?

❐  If you mark every state-action pair as eligible, you back up over the non-greedy policy
    – Watkins: Zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice.

      e_t(s,a) = 1 + γλ e_{t−1}(s,a)    if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
                 0                       if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
                 γλ e_{t−1}(s,a)         otherwise

      Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)

      δ_t = r_{t+1} + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t)


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s', b)   (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s', a*) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a' = a*, then e(s,a) ← γλe(s,a)
                        else e(s,a) ← 0
        s ← s'; a ← a'
    Until s is terminal
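A minimal Python sketch of Watkins's Q(λ), mirroring the pseudocode above; the ε-greedy policy and environment interface are assumptions of the sketch.

import numpy as np

def watkins_q_lambda(env, num_states, num_actions, alpha=0.1, gamma=0.9,
                     lam=0.9, epsilon=0.05, episodes=200):
    """Tabular Watkins's Q(λ): traces are cut after exploratory actions (sketch)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)          # assumed interface
            a_next = eps_greedy(s_next)
            a_star = int(np.argmax(Q[s_next]))     # greedy action in s'
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                    # a' ties for the max
            delta = r + gamma * Q[s_next, a_star] * (not done) - Q[s, a]
            e[s, a] += 1.0
            Q += alpha * delta * e
            if a_next == a_star:
                e *= gamma * lam                   # greedy step: decay traces
            else:
                e[:] = 0.0                         # exploratory step: cut all traces
            s, a = s_next, a_next
    return Q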


Peng’s Q(λ)

❐  Disadvantage of Watkins's method:
    – Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces

❐  Peng:
    – Backup max action except at end
    – Never cut traces

❐  Disadvantage:
    – Complicated to implement


Naïve Q(λ)

❐  Idea: is it really a problem to back up exploratory actions?
    – Never zero traces
    – Always backup max at current action (unlike Peng's or Watkins's)

❐  Is this truly naïve?
❐  Works well in preliminary empirical studies

What is the backup diagram?


Comparison Task

From McGovern and Sutton (1997). Towards a better Q(λ)

❐  Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks
    – See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)

❐  Deterministic gridworld with obstacles
    – 10x10 gridworld
    – 25 randomly generated obstacles
    – 30 runs
    – α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces


Comparison Results

From McGovern and Sutton (1997). Towards a better Q(λ)


Convergence of the Q(λ)’s

❐  None of the methods are proven to converge
    – Much extra credit if you can prove any of them

❐  Watkins’s is thought to converge to Q*

❐  Peng’s is thought to converge to a mixture of Qπ and Q*

❐  Naïve - Q*?


Eligibility Traces for Actor-Critic Methods

❐  Critic: On-policy learning of Vπ. Use TD(λ) as described before.

❐  Actor: Needs eligibility traces for each state-action pair
❐  We change the update equation from

      p_{t+1}(s,a) = p_t(s,a) + α δ_t    if a = a_t and s = s_t
                     p_t(s,a)            otherwise

    to

      p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a)

❐  Can change the other actor-critic update from

      p_{t+1}(s,a) = p_t(s,a) + α δ_t [1 − π(s,a)]    if a = a_t and s = s_t
                     p_t(s,a)                          otherwise

    to

      p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a),   where

      e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)    if s = s_t and a = a_t
                 γλ e_{t−1}(s,a)                        otherwise


Replacing Traces

❐  Using accumulating traces, frequently visited states can have eligibilities greater than 1
    – This can be a problem for convergence

❐  Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1 (see the sketch below):

      e_t(s) = γλ e_{t−1}(s)    if s ≠ s_t
               1                 if s = s_t
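In code, the difference is a single line. A small sketch in Python (names illustrative):

import numpy as np

def update_traces(e, s_t, gamma, lam, replacing=False):
    """Decay all traces, then bump the trace of the visited state (sketch)."""
    e *= gamma * lam          # every trace decays by γλ
    if replacing:
        e[s_t] = 1.0          # replacing trace: set to 1
    else:
        e[s_t] += 1.0         # accumulating trace: add 1
    return e

e = update_traces(np.zeros(19), s_t=7, gamma=0.9, lam=0.9, replacing=True)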


Replacing Traces Example

❐  Same 19 state random walk task as before
❐  Replacing traces perform better than accumulating traces over more values of λ


Why Replacing Traces?

❐  Replacing traces can significantly speed learning

❐  They can make the system perform well for a broader set of parameters

❐  Accumulating traces can do poorly on certain types of tasks

Why is this task particularly onerous for accumulating traces?


More Replacing Traces

❐  Off-line replacing trace TD(1) is identical to first-visit MC

❐  Extension to action-values:
    – When you revisit a state, what should you do with the traces for the other actions?
    – Singh and Sutton say to set them to zero (see the sketch below):

      e_t(s,a) = 1                   if s = s_t and a = a_t
                 0                   if s = s_t and a ≠ a_t
                 γλ e_{t−1}(s,a)     if s ≠ s_t
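A sketch of this Singh and Sutton variant for state-action traces in Python (names illustrative):

import numpy as np

def update_action_traces(e, s_t, a_t, gamma, lam):
    """Replacing traces for state-action pairs, zeroing the other actions
    of the revisited state (sketch)."""
    e *= gamma * lam          # traces of unvisited states decay by γλ
    e[s_t, :] = 0.0           # zero the traces of the other actions of s_t
    e[s_t, a_t] = 1.0         # replace the trace of the taken action
    return e

e = update_action_traces(np.zeros((19, 2)), s_t=7, a_t=1, gamma=0.9, lam=0.9)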


Implementation Issues

❐  Could require much more computation
    – But most eligibility traces are VERY close to zero

❐  If you implement it in Matlab, backup is only one line of code and is very fast (Matlab is optimized for matrices)
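The same point holds in any array language. For example, in Python/NumPy (with made-up illustrative values) the whole TD(λ) backup for one time step is two vectorized lines:

import numpy as np

V, e = np.zeros(19), np.zeros(19)               # e.g. the 19-state random walk
alpha, gamma, lam, delta = 0.1, 0.9, 0.9, 0.5   # made-up values for illustration

V += alpha * delta * e     # update every state at once
e *= gamma * lam           # decay every trace at once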


Variable λ

❐  Can generalize to variable λ:

      e_t(s) = γλ_t e_{t−1}(s)        if s ≠ s_t
               γλ_t e_{t−1}(s) + 1    if s = s_t

❐  Here λ is a function of time
    – Could define λ_t = λ(s_t)  or  λ_t = λ_τ  (see the sketch below)
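A sketch of the trace update with a state-dependent λ; lam_of_state is a hypothetical mapping, not something defined in the lecture.

import numpy as np

def update_traces_variable_lambda(e, s_t, gamma, lam_of_state):
    """Accumulating-trace update with λ_t = λ(s_t) (sketch)."""
    lam_t = lam_of_state(s_t)     # hypothetical state-dependent λ
    e *= gamma * lam_t            # decay all traces by γλ_t
    e[s_t] += 1.0                 # increment the current state's trace
    return e

e = update_traces_variable_lambda(np.zeros(19), s_t=3, gamma=0.9,
                                  lam_of_state=lambda s: 0.9)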


Conclusions

❐  Provides an efficient, incremental way to combine MC and TD
    – Includes advantages of MC (can deal with lack of Markov property)
    – Includes advantages of TD (using TD error, bootstrapping)

❐  Can significantly speed learning
❐  Does have a cost in computation


Something Here is Not Like the Other