
RL Lecture 7: Eligibility Traces

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction


N-step TD Prediction

❐  Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)


Mathematics of N-step TD Prediction

❐  Monte Carlo:
      R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^(T−t−1) r_T

❐  TD:
      R_t^(1) = r_{t+1} + γ V_t(s_{t+1})
    – Use V to estimate the remaining return

❐  n-step TD:
    – 2-step return:
      R_t^(2) = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})
    – n-step return:
      R_t^(n) = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^(n−1) r_{t+n} + γ^n V_t(s_{t+n})
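For concreteness, a made-up example: with γ = 0.9, r_{t+1} = 1, r_{t+2} = 0, and V_t(s_{t+2}) = 0.5, the 2-step return is R_t^(2) = 1 + 0.9·0 + 0.9^2·0.5 = 1.405.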


Learning with N-step Backups

❐  Backup (on-line or off-line; see the sketch below):
      ΔV_t(s_t) = α [ R_t^(n) − V_t(s_t) ]

❐  Error reduction property of n-step returns:
      max_s | E_π{ R_t^(n) | s_t = s } − V^π(s) |  ≤  γ^n max_s | V(s) − V^π(s) |
      (maximum error using the n-step return)         (maximum error using V)

❐  Using this, you can show that n-step methods converge
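To make the n-step backup concrete, here is a minimal Python sketch of tabular n-step TD prediction. It is not from the lecture: the environment interface (env.reset(), env.step(a) returning (next_state, reward, done)), the fixed policy, and integer-coded states are assumptions for illustration.

import numpy as np

def n_step_td_prediction(env, policy, n, num_states,
                         alpha=0.1, gamma=0.9, episodes=100):
    """Tabular n-step TD prediction of V^pi (sketch; interfaces are assumed)."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        states, rewards = [env.reset()], []
        T = np.inf                       # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))   # assumed interface
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1              # the time whose state estimate is updated
            if tau >= 0:
                # n-step return: up to n discounted rewards, then bootstrap from V
                G = sum(gamma**i * rewards[tau + i] for i in range(min(n, T - tau)))
                if tau + n < T:
                    G += gamma**n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])   # n-step backup
            if tau == T - 1:
                break
            t += 1
    return V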


Random Walk Examples

❐  How does 2-step TD work here?
❐  How about 3-step TD?


A Larger Example

❐  Task: 19 state random walk

❐  Do you think there is an optimal n (for everything)?


Averaging N-step Returns

❐  n-step methods were introduced to help with understanding TD(λ)

❐  Idea: backup an average of several returns
    – e.g., backup half of the 2-step return and half of the 4-step return:
      R_t^avg = (1/2) R_t^(2) + (1/2) R_t^(4)

❐  Called a complex backup
    – Draw each component
    – Label with the weights for that component
    [Backup diagram: one backup]


Forward View of TD(λ)

❐  TD(λ) is a method for averaging all n-step backups
    – weight by λ^(n−1) (time since visitation)
    – λ-return:
      R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^(n−1) R_t^(n)

❐  Backup using λ-return (see the sketch below):
      ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
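A minimal sketch of the λ-return computation for a finished episode, assuming integer-coded states and a fixed value table V; the function and variable names are illustrative, not from the book.

import numpy as np

def lambda_return(rewards, states, V, t, lam, gamma):
    """Forward-view λ-return R_t^λ; rewards[k] is r_{k+1}, states[k] is s_k."""
    T = len(rewards)                      # episode ends at time T

    def n_step(n):                        # the n-step return R_t^(n)
        G = sum(gamma**i * rewards[t + i] for i in range(min(n, T - t)))
        if t + n < T:
            G += gamma**n * V[states[t + n]]
        return G

    G_lam = (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, T - t))
    return G_lam + lam**(T - t - 1) * n_step(T - t)   # remaining weight on R_t

# Backup using the λ-return (forward view):
#   V[states[t]] += alpha * (lambda_return(rewards, states, V, t, lam, gamma) - V[states[t]])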


λ-return Weighting Function


Relation to TD(0) and MC

❐  λ-return can be rewritten as:
      R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n)  +  λ^(T−t−1) R_t
              (until termination)                         (after termination)

❐  If λ = 1, you get MC:
      R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^(n−1) R_t^(n) + 1^(T−t−1) R_t = R_t

❐  If λ = 0, you get TD(0):
      R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^(n−1) R_t^(n) + 0^(T−t−1) R_t = R_t^(1)


Forward View of TD(λ) II

❐  Look forward from each state to determine update from future states and rewards:


λ-return on the Random Walk

❐  Same 19 state random walk as before
❐  Why do you think intermediate values of λ are best?


Backward View of TD(λ)

❐  The forward view was for theory
❐  The backward view is for mechanism

❐  New variable called eligibility trace, e_t(s) ∈ ℝ^+
    – On each step, decay all traces by γλ and increment the trace for the current state by 1
    – Accumulating trace:

      e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
               γλ e_{t−1}(s) + 1    if s = s_t


On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s'
    Until s is terminal
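Below is a minimal Python transcription of the boxed algorithm, using accumulating traces. The gym-style environment interface (env.reset(), env.step(a) → (next_state, reward, done)), the fixed policy, integer-coded states, and resetting the traces at the start of each episode are assumptions of this sketch.

import numpy as np

def online_tabular_td_lambda(env, policy, num_states,
                             alpha=0.1, gamma=0.9, lam=0.9, episodes=100):
    """On-line tabular TD(λ) with accumulating traces (sketch)."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        e = np.zeros(num_states)              # eligibility traces
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # action given by π for s
            s_next, r, done = env.step(a)     # assumed interface
            # treat the terminal state's value as 0
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                       # accumulating trace for current state
            V += alpha * delta * e            # update every state
            e *= gamma * lam                  # decay every trace
            s = s_next
    return V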


Backward View

❐  Shout δ_t backwards over time
❐  The strength of your voice decreases with temporal distance by γλ

      δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)


Relation of Backwards View to MC & TD(0)

❐  Using update rule:
      ΔV_t(s) = α δ_t e_t(s)

❐  As before, if you set λ to 0, you get TD(0)
❐  If you set λ to 1, you get MC, but in a better way
    – Can apply TD(1) to continuing tasks
    – Works incrementally and on-line (instead of waiting until the end of the episode)


Forward View = Backward View

❐  The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

❐  The book shows (algebra in the book):

      Backward updates:   Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k

      Forward updates:    Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t} = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k

      Therefore:          Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

❐  On-line updating with small α is similar (a numerical check of the off-line case follows below)
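The equivalence is easy to check numerically for the off-line case. The sketch below uses a small made-up episode and a fixed value table (all numbers are illustrative) and confirms that the summed backward-view increments equal the summed forward-view (λ-return) increments.

import numpy as np

gamma, lam, alpha = 0.9, 0.8, 0.1
states  = [0, 1, 2, 1, 3]            # s_0 .. s_T; state 3 is terminal
rewards = [1.0, 0.0, -1.0, 2.0]      # rewards[t] = r_{t+1}
V = np.array([0.2, -0.1, 0.5, 0.0])  # fixed (off-line) estimates; V[terminal] = 0
T = len(rewards)

delta = [rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)]

def n_step(t, n):                    # R_t^(n), with bootstrapping from V
    G = sum(gamma**i * rewards[t + i] for i in range(min(n, T - t)))
    if t + n < T:
        G += gamma**n * V[states[t + n]]
    return G

def lam_return(t):                   # R_t^λ
    G = (1 - lam) * sum(lam**(n - 1) * n_step(t, n) for n in range(1, T - t))
    return G + lam**(T - t - 1) * n_step(t, T - t)

forward = np.zeros_like(V)           # Σ_t ΔV_t^λ(s_t) I_{s s_t}
backward = np.zeros_like(V)          # Σ_t ΔV_t^TD(s)
for t in range(T):
    forward[states[t]] += alpha * (lam_return(t) - V[states[t]])
    backward[states[t]] += alpha * sum((gamma * lam)**(k - t) * delta[k]
                                       for k in range(t, T))

assert np.allclose(forward, backward)   # identical for off-line updating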


On-line versus Off-line on Random Walk

❐  Same 19 state random walk
❐  On-line performs better over a broader range of parameters


Control: Sarsa(λ)

❐  Save eligibility for state-action pairs instead of just states

      e_t(s,a) = γλ e_{t−1}(s,a) + 1    if s = s_t and a = a_t
                 γλ e_{t−1}(s,a)        otherwise

      Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)

      δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s'; a ← a'
    Until s is terminal
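A minimal Python sketch of the algorithm above, with ε-greedy action selection and accumulating traces; the environment interface and integer-coded states and actions are assumptions.

import numpy as np

def sarsa_lambda(env, num_states, num_actions, alpha=0.1, gamma=0.9,
                 lam=0.9, epsilon=0.05, episodes=200):
    """Tabular Sarsa(λ) with accumulating traces (sketch)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                  # traces reset each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)     # assumed interface
            a_next = eps_greedy(s_next)
            delta = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]
            e[s, a] += 1.0                    # accumulating trace
            Q += alpha * delta * e            # update all state-action pairs
            e *= gamma * lam                  # decay all traces
            s, a = s_next, a_next
    return Q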


Sarsa(λ) Gridworld Example

❐  With one trial, the agent has much more information about how to get to the goal
    – not necessarily the best way

❐  Can considerably accelerate learning


Three Approaches to Q(λ)

❐  How can we extend this to Q-learning?

❐  If you mark every state-action pair as eligible, you back up over the non-greedy policy
    – Watkins: Zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice.

      e_t(s,a) = 1 + γλ e_{t−1}(s,a)    if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
                 0                       if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
                 γλ e_{t−1}(s,a)         otherwise

      Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)

      δ_t = r_{t+1} + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t)


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s', b)   (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s', a*) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a' = a*, then e(s,a) ← γλe(s,a)
                        else e(s,a) ← 0
        s ← s'; a ← a'
    Until s is terminal
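A minimal Python sketch of Watkins's Q(λ), mirroring the pseudocode above; the ε-greedy policy and environment interface are assumptions of the sketch.

import numpy as np

def watkins_q_lambda(env, num_states, num_actions, alpha=0.1, gamma=0.9,
                     lam=0.9, epsilon=0.05, episodes=200):
    """Tabular Watkins's Q(λ): traces are cut after exploratory actions (sketch)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)          # assumed interface
            a_next = eps_greedy(s_next)
            a_star = int(np.argmax(Q[s_next]))     # greedy action in s'
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                    # a' ties for the max
            delta = r + gamma * Q[s_next, a_star] * (not done) - Q[s, a]
            e[s, a] += 1.0
            Q += alpha * delta * e
            if a_next == a_star:
                e *= gamma * lam                   # greedy step: decay traces
            else:
                e[:] = 0.0                         # exploratory step: cut all traces
            s, a = s_next, a_next
    return Q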


Peng’s Q(λ)

❐  Disadvantage of Watkins's method:
    – Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces

❐  Peng:
    – Backup max action except at end
    – Never cut traces

❐  Disadvantage:
    – Complicated to implement


Naïve Q(λ)

❐  Idea: is it really a problem to back up exploratory actions?
    – Never zero traces
    – Always backup max at current action (unlike Peng's or Watkins's)

❐  Is this truly naïve?
❐  Works well in preliminary empirical studies

What is the backup diagram?


Comparison Task

From McGovern and Sutton (1997). Towards a better Q(λ)

❐  Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks
    – See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)

❐  Deterministic gridworld with obstacles
    – 10x10 gridworld
    – 25 randomly generated obstacles
    – 30 runs
    – α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces


Comparison Results

From McGovern and Sutton (1997). Towards a better Q(λ)


Convergence of the Q(λ)’s

❐  None of the methods are proven to converge
    – Much extra credit if you can prove any of them

❐  Watkins’s is thought to converge to Q*

❐  Peng’s is thought to converge to a mixture of Qπ and Q*

❐  Naïve - Q*?


Eligibility Traces for Actor-Critic Methods

❐  Critic: On-policy learning of Vπ. Use TD(λ) as described before.

❐  Actor: Needs eligibility traces for each state-action pair
❐  We change the update equation from

      p_{t+1}(s,a) = p_t(s,a) + α δ_t    if a = a_t and s = s_t
                     p_t(s,a)            otherwise

    to

      p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a)

❐  Can change the other actor-critic update from

      p_{t+1}(s,a) = p_t(s,a) + α δ_t [1 − π(s,a)]    if a = a_t and s = s_t
                     p_t(s,a)                          otherwise

    to

      p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a),   where

      e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)    if s = s_t and a = a_t
                 γλ e_{t−1}(s,a)                        otherwise


Replacing Traces

❐  Using accumulating traces, frequently visited states can have eligibilities greater than 1
    – This can be a problem for convergence

❐  Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1 (see the sketch below):

      e_t(s) = γλ e_{t−1}(s)    if s ≠ s_t
               1                 if s = s_t
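In code, the difference is a single line. A small sketch in Python (names illustrative):

import numpy as np

def update_traces(e, s_t, gamma, lam, replacing=False):
    """Decay all traces, then bump the trace of the visited state (sketch)."""
    e *= gamma * lam          # every trace decays by γλ
    if replacing:
        e[s_t] = 1.0          # replacing trace: set to 1
    else:
        e[s_t] += 1.0         # accumulating trace: add 1
    return e

e = update_traces(np.zeros(19), s_t=7, gamma=0.9, lam=0.9, replacing=True)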


Replacing Traces Example

❐  Same 19 state random walk task as before
❐  Replacing traces perform better than accumulating traces over more values of λ


Why Replacing Traces?

❐  Replacing traces can significantly speed learning

❐  They can make the system perform well for a broader set of parameters

❐  Accumulating traces can do poorly on certain types of tasks

Why is this task particularly onerous for accumulating traces?


More Replacing Traces

❐  Off-line replacing trace TD(1) is identical to first-visit MC

❐  Extension to action-values:
    – When you revisit a state, what should you do with the traces for the other actions?
    – Singh and Sutton say to set them to zero (see the sketch below):

      e_t(s,a) = 1                   if s = s_t and a = a_t
                 0                   if s = s_t and a ≠ a_t
                 γλ e_{t−1}(s,a)     if s ≠ s_t
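A sketch of this Singh and Sutton variant for state-action traces in Python (names illustrative):

import numpy as np

def update_action_traces(e, s_t, a_t, gamma, lam):
    """Replacing traces for state-action pairs, zeroing the other actions
    of the revisited state (sketch)."""
    e *= gamma * lam          # traces of unvisited states decay by γλ
    e[s_t, :] = 0.0           # zero the traces of the other actions of s_t
    e[s_t, a_t] = 1.0         # replace the trace of the taken action
    return e

e = update_action_traces(np.zeros((19, 2)), s_t=7, a_t=1, gamma=0.9, lam=0.9)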


Implementation Issues

❐  Could require much more computation
    – But most eligibility traces are VERY close to zero

❐  If you implement it in Matlab, backup is only one line of code and is very fast (Matlab is optimized for matrices)
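The same point holds in any array language. For example, in Python/NumPy (with made-up illustrative values) the whole TD(λ) backup for one time step is two vectorized lines:

import numpy as np

V, e = np.zeros(19), np.zeros(19)               # e.g. the 19-state random walk
alpha, gamma, lam, delta = 0.1, 0.9, 0.9, 0.5   # made-up values for illustration

V += alpha * delta * e     # update every state at once
e *= gamma * lam           # decay every trace at once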


Variable λ

❐  Can generalize to variable λ:

      e_t(s) = γλ_t e_{t−1}(s)        if s ≠ s_t
               γλ_t e_{t−1}(s) + 1    if s = s_t

❐  Here λ is a function of time
    – Could define λ_t = λ(s_t)  or  λ_t = λ_τ  (see the sketch below)
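A sketch of the trace update with a state-dependent λ; lam_of_state is a hypothetical mapping, not something defined in the lecture.

import numpy as np

def update_traces_variable_lambda(e, s_t, gamma, lam_of_state):
    """Accumulating-trace update with λ_t = λ(s_t) (sketch)."""
    lam_t = lam_of_state(s_t)     # hypothetical state-dependent λ
    e *= gamma * lam_t            # decay all traces by γλ_t
    e[s_t] += 1.0                 # increment the current state's trace
    return e

e = update_traces_variable_lambda(np.zeros(19), s_t=3, gamma=0.9,
                                  lam_of_state=lambda s: 0.9)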


Conclusions

❐  Provides an efficient, incremental way to combine MC and TD
    – Includes advantages of MC (can deal with lack of Markov property)
    – Includes advantages of TD (using TD error, bootstrapping)

❐  Can significantly speed learning
❐  Does have a cost in computation


Something Here is Not Like the Other