
Chapter 7: Eligibility Traces

Transcript
Page 1: Chapter 7: Eligibility Traces


Chapter 7: Eligibility Traces

Page 2: Chapter 7: Eligibility Traces


N-step TD Prediction

Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)

Page 3: Chapter 7: Eligibility Traces


Mathematics of N-step TD Prediction

Monte Carlo:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

TD (use V to estimate the remaining return):
$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

n-step TD, 2-step return:
$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$

n-step return:
$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
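As a concrete illustration (not part of the original slides), here is a minimal MATLAB sketch of computing the n-step return from a recorded episode; the function name, argument layout, and indexing convention are assumptions made for this example.

function Rn = nstep_return(rewards, states, V, t, n, gamma)
% Sketch: n-step return R_t^(n) for a recorded episode.
% rewards(i) = r_i (reward received at time i); states(i+1) = s_i, so
% states(1) is the start state; V is a vector of value estimates; t is the
% (0-based) time the return starts from; the episode terminates at time T.
T = length(rewards);
n = min(n, T - t);                        % past the end, this is the full return
Rn = sum(gamma.^(0:n-1) .* rewards(t+1 : t+n));
if t + n < T                              % bootstrap from V unless the episode ended
    Rn = Rn + gamma^n * V(states(t+n+1));
end
end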

Page 4: Chapter 7: Eligibility Traces


Learning with N-step Backups

Backup (on-line or off-line):
$\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$

Error reduction property of n-step returns (the maximum error using the n-step return is bounded by $\gamma^n$ times the maximum error using V):
$\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$

Using this, you can show that n-step methods converge.

Page 5: Chapter 7: Eligibility Traces


Random Walk Examples

How does 2-step TD work here? How about 3-step TD?

Page 6: Chapter 7: Eligibility Traces


A Larger Example

Task: 19 state random walk

Do you think there is an optimal n (for everything)?

Page 7: Chapter 7: Eligibility Traces


Averaging N-step Returns

n-step methods were introduced to help understand TD(λ).

Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return:

$R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$

This is called a complex backup: draw each component and label it with the weight for that component. The result is still one backup.
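As a usage note (an assumption, not from the slides), the averaged target above could be formed with the hypothetical nstep_return helper sketched earlier:

Ravg = 0.5*nstep_return(rewards, states, V, t, 2, gamma) + 0.5*nstep_return(rewards, states, V, t, 4, gamma);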

Page 8: Chapter 7: Eligibility Traces


Forward View of TD(λ)

TD(λ) is a method for averaging all n-step backups, weighted by λ^(n-1) (time since visitation).

λ-return:
$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

Backup using the λ-return:
$\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$

Page 9: Chapter 7: Eligibility Traces


λ-return Weighting Function

Page 10: Chapter 7: Eligibility Traces


Relation to TD(0) and MC

The λ-return can be rewritten with separate terms until termination and after termination:

$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$

If λ = 1, you get MC:

$R_t^{\lambda} = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$

If λ = 0, you get TD(0):

$R_t^{\lambda} = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$
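A quick numeric check (not part of the slides) that these weights form a proper average: the n-step weights (1-λ)λ^(n-1) plus the tail weight λ^(T-t-1) on the full return sum to 1. In MATLAB:

lambda = 0.8;
N = 5;                                % N = T - t, steps remaining until termination
w = (1 - lambda) * lambda.^(0:N-2);   % weights on R_t^(1), ..., R_t^(N-1)
w_tail = lambda^(N-1);                % weight on the full return R_t
disp(sum(w) + w_tail)                 % prints 1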

Page 11: Chapter 7: Eligibility Traces


Forward View of TD(λ) II

Look forward from each state to determine update from future states and rewards:

Page 12: Chapter 7: Eligibility Traces


λ-return on the Random Walk

Same 19-state random walk as before. Why do you think intermediate values of λ are best?

Page 13: Chapter 7: Eligibility Traces


Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism.

New variable called the eligibility trace. On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace):

$e_t(s) = \gamma\lambda e_{t-1}(s)$  if $s \ne s_t$
$e_t(s) = \gamma\lambda e_{t-1}(s) + 1$  if $s = s_t$

Page 14: Chapter 7: Eligibility Traces


On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s'
    Until s is terminal
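Here is a minimal MATLAB sketch of this algorithm on a 19-state random walk (not from the slides; the reward scheme of -1/+1 at the two ends, the random policy, and all names and constants are assumptions for illustration, following the book's example):

% On-line tabular TD(lambda) with accumulating traces on a 19-state random walk
num_states = 19; alpha = 0.1; gamma = 1.0; lambda = 0.8;
V = zeros(1, num_states + 2);              % indices 1 and 21 are terminal states
for episode = 1:100
    e = zeros(1, num_states + 2);          % reset traces at the start of each episode
    s = (num_states + 3) / 2;              % start in the middle (state index 11)
    while s ~= 1 && s ~= num_states + 2
        s_next = s + 2*(rand > 0.5) - 1;   % move left or right with equal probability
        if s_next == 1
            r = -1;
        elseif s_next == num_states + 2
            r = 1;
        else
            r = 0;
        end
        delta = r + gamma*V(s_next) - V(s);   % TD error (terminal values stay 0)
        e(s) = e(s) + 1;                      % accumulating trace for the visited state
        V = V + alpha*delta*e;                % back up every state in proportion to its trace
        e = gamma*lambda*e;                   % decay all traces
        s = s_next;
    end
end
disp(V(2:num_states+1))                       % learned values of the 19 nonterminal states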

Page 15: Chapter 7: Eligibility Traces


Backward View

Shout δ_t backwards over time. The strength of your voice decreases with temporal distance by γλ.

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

Page 16: Chapter 7: Eligibility Traces


Relation of Backwards View to MC & TD(0)

Using the update rule

$\Delta V_t(s) = \alpha \delta_t e_t(s)$

as before, if you set λ to 0, you get TD(0). If you set λ to 1, you get MC, but in a better way:

Can apply TD(1) to continuing tasks
Works incrementally and on-line (instead of waiting until the end of the episode)

Page 17: Chapter 7: Eligibility Traces


Forward View = Backward View

The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.

The book shows that the total backward-view and forward-view updates are equal:

Backward updates:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha \, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$

Forward updates:
$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) \, I_{s s_t} = \sum_{t=0}^{T-1} \alpha \, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$

(algebra shown in the book)

On-line updating with small α is similar.

Page 18: Chapter 7: Eligibility Traces


On-line versus Off-line on Random Walk

Same 19-state random walk. On-line performs better over a broader range of parameters.

Page 19: Chapter 7: Eligibility Traces


Control: Sarsa(λ)

Save eligibility for state-action pairs instead of just states

$e_t(s,a) = \gamma\lambda e_{t-1}(s,a) + 1$  if $s = s_t$ and $a = a_t$
$e_t(s,a) = \gamma\lambda e_{t-1}(s,a)$  otherwise

$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$

$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

Page 20: Chapter 7: Eligibility Traces


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s'; a ← a'
    Until s is terminal

Page 21: Chapter 7: Eligibility Traces


Sarsa(λ) Gridworld Example

With one trial, the agent has much more information about how to get to the goal (not necessarily the best way). This can considerably accelerate learning.

Page 22: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example

N states and two types of actions, right and wrong. A wrong action gives no reward and does not change the state; a right action moves towards the goal. A reward of +1 is received when the terminal state is reached.

Consider Sarsa(λ) with eligibility traces.

Page 23: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example

% Example 7.7
% Reluctant walk - choosing the right action moves you towards the goal
% States are numbered from 1 to number_states
clear all; close all;
lambda = 0.9; alpha = 0.05; gamma = 0.95;

% problem size and experiment setup; the reward (+1 only on reaching the goal)
% is computed inside the episode loop
number_states = 15; number_actions = 2; number_runs = 50;

% sweep the learning rate alpha
jjj = 0;
alpha_range = 0.05:0.05:0.4;
for alpha = alpha_range
    jjj = jjj + 1;
    all_episodes = [];
    for kk = 1:number_runs
        S_prime(1,:) = 1:number_states;
        S_prime(2,:) = 1:number_states;
        % randomly pick which action is the "right" one in each state
        right_action = (rand(1,number_states) > 0.5) + 1;
        % the right action moves to the next state
        for i = 1:number_states
            S_prime(right_action(i), i) = i + 1;
        end
        % eligibility traces and action values (extra column = terminal state)
        e = zeros(number_actions, number_states + 1);
        Qsa = rand(number_actions, number_states);
        Qsa(:, number_states + 1) = 0;
        num_episodes = 10;
        t = 1;
        % repeat for each episode
        for episode = 1:num_episodes
            epsi = 1/t;
            % initialize state
            s = 1;

Page 24: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example (continued)

            % choose action a from s using the epsilon-greedy policy
            chose_policy = rand > epsi;
            if chose_policy
                [val, a] = max(Qsa(:, s));
            else
                a = ceil(rand*(1-eps)*number_actions);
            end
            % repeat for each step of the episode
            episode_size(episode) = 0;
            path = [];
            random = [];
            while s ~= number_states + 1
                epsi = 1/t;
                % take action a, observe r and the next state s_pr
                s_pr = S_prime(a, s);
                % choose action a_pr from s_pr using the epsilon-greedy policy
                chose_policy_pr = rand > epsi;
                [val, a_star] = max(Qsa(:, s_pr));
                if chose_policy_pr
                    random = [random 0];
                    a_pr = a_star;
                else
                    a_pr = ceil(rand*(1-eps)*number_actions);
                    random = [random 1];
                end
                % reward: +1 only on reaching the terminal state
                if s_pr == number_states + 1
                    r = 1;
                else
                    r = 0;
                end
                delta = r + gamma*Qsa(a_pr, s_pr) - Qsa(a, s);
                % eligibility traces (accumulating)
                e(a, s) = e(a, s) + 1;
                % Sarsa(lambda) update
                Qsa = Qsa + alpha*delta*e;
                e = gamma*lambda*e;
                s = s_pr;
                a = a_pr;
                episode_size(episode) = episode_size(episode) + 1;
                path = [path s];
                t = t + 1;
            end % while s
        end % for episode

Page 25: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example (continued)

        all_episodes = [all_episodes; episode_size];
    end % kk (runs)
    last_episode = cumsum(mean(all_episodes));
    all_episode_size(jjj) = last_episode(num_episodes);
end % alpha

episode_size = mean(all_episodes);
plot([0 cumsum(episode_size)], 0:num_episodes);
xlabel('Time step'); ylabel('Episode index');
title('Eligibility traces-SARSA random walk 15 states learning rate with eps=1/t, lambda=0.9');
figure(2)
plot(alpha_range, all_episode_size);
xlabel('Alpha'); ylabel('Number of steps for 10 episodes');
title('Eligibility traces-SARSA random walk 15 states learning rate with eps=1/t, lambda=0.9');
figure(3)
plot(episode_size(1:num_episodes))
xlabel('Episode index'); ylabel('Episode length in steps');
title('Eligibility traces-SARSA random walk 15 states learning rate with eps=1/t, lambda=0.9');

Page 26: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example

(Figure: learning with eligibility traces is unstable for alpha > 0.5.)

Page 27: Chapter 7: Eligibility Traces


Sarsa(λ) Reluctant Walk Example

(Figure: learning rate in eligibility traces, alpha = 0.4.)

Page 28: Chapter 7: Eligibility Traces


Three Approaches to Q(λ)

How can we extend this to Q-learning?

If you mark every state-action pair as eligible, you back up over a non-greedy policy.

Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice.

$e_t(s,a) = \gamma\lambda e_{t-1}(s,a) + 1$  if $s = s_t$, $a = a_t$, and $Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a)$
$e_t(s,a) = 0$  if $Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a)$
$e_t(s,a) = \gamma\lambda e_{t-1}(s,a)$  otherwise

$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$

$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$

Page 29: Chapter 7: Eligibility Traces


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← arg max_b Q(s', b)   (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s', a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a' = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    Until s is terminal
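For comparison with the earlier Sarsa(λ) MATLAB script, here is a sketch (an assumption, not code from the slides) of how the script's inner-loop update would change to Watkins's Q(λ). It reuses the script's own variables (Qsa, e, a, s, a_pr, s_pr, a_star, r, alpha, gamma, lambda), where a_star is the greedy action at s_pr already computed by the epsilon-greedy block:

delta = r + gamma*Qsa(a_star, s_pr) - Qsa(a, s);   % back up the greedy (max) action
e(a, s) = e(a, s) + 1;
Qsa = Qsa + alpha*delta*e;
if a_pr == a_star
    e = gamma*lambda*e;          % next action is greedy: decay traces as usual
else
    e = zeros(size(e));          % exploratory action: cut all traces
end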

Page 30: Chapter 7: Eligibility Traces


Peng's Q(λ)

Disadvantage of Watkins's method: early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces.

Peng: back up the max action except at the end; never cut traces.

Disadvantage: complicated to implement.

Page 31: Chapter 7: Eligibility Traces


Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions?

Never zero traces; always back up the max at the current action (unlike Peng's or Watkins's).

Is this truly naïve? It works well in preliminary empirical studies.

What is the backup diagram?

Page 32: Chapter 7: Eligibility Traces


Comparison Task

From McGovern and Sutton (1997), Towards a Better Q(λ).

Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks.

See McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.).

Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces.

Page 33: Chapter 7: Eligibility Traces


Comparison Results

From McGovern and Sutton (1997), Towards a Better Q(λ).

Page 34: Chapter 7: Eligibility Traces


Convergence of the Q(λ)'s

None of the methods are proven to converge. Much extra credit if you can prove any of them.

Watkins’s is thought to converge to Q*

Peng's is thought to converge to a mixture of Q^π and Q*

Naïve: Q*?

Page 35: Chapter 7: Eligibility Traces


Eligibility Traces for Actor-Critic Methods

Critic: on-policy learning of V. Use TD(λ) as described before.

Actor: needs eligibility traces for each state-action pair. We change the update equation from

$p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t$  if $a = a_t$ and $s = s_t$
$p_{t+1}(s,a) = p_t(s,a)$  otherwise

to

$p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a)$

Can change the other actor-critic update from

$p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t [1 - \pi_t(s,a)]$  if $a = a_t$ and $s = s_t$
$p_{t+1}(s,a) = p_t(s,a)$  otherwise

to

$p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a)$

where

$e_t(s,a) = \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t)$  if $s = s_t$ and $a = a_t$
$e_t(s,a) = \gamma\lambda e_{t-1}(s,a)$  otherwise

Page 36: Chapter 7: Eligibility Traces


Replacing Traces

Using accumulating traces, frequently visited states can have eligibilities greater than 1

This can be a problem for convergence

Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1

$e_t(s) = \gamma\lambda e_{t-1}(s)$  if $s \ne s_t$
$e_t(s) = 1$  if $s = s_t$
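A small numeric illustration (not from the slides; the values are made up) of the difference between the two trace updates for a state s that already has a large trace:

gamma = 1; lambda = 0.9; s = 3;
e = [0 0.5 2.0 0 0];                  % traces before the update; state 3 is revisited
e = gamma*lambda*e;                   % decay every trace
e_acc = e; e_acc(s) = e_acc(s) + 1;   % accumulating: 1.8 + 1 = 2.8, can keep growing
e_rep = e; e_rep(s) = 1;              % replacing: reset to 1, never exceeds 1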

Page 37: Chapter 7: Eligibility Traces


Replacing Traces Example

Same 19-state random walk task as before. Replacing traces perform better than accumulating traces over more values of λ.

Page 38: Chapter 7: Eligibility Traces


Why Replacing Traces?

Replacing traces can significantly speed learning

They can make the system perform well for a broader set of parameters

Accumulating traces can do poorly on certain types of tasks

Why is this task particularly onerous for accumulating traces?

Page 39: Chapter 7: Eligibility Traces


Replacing traces

Sarsa(λ) Reluctant Walk Example

delta = r + gamma*Qsa(a_pr, s_pr) - Qsa(a, s);
% replacing traces
% e(:,s) = 0;
e(a, s) = 1;
% Sarsa(lambda) update
Qsa = Qsa + alpha*delta*e;
e = gamma*lambda*e;

Page 40: Chapter 7: Eligibility Traces


Replacing traces

Sarsa(λ) Reluctant Walk Example

Page 41: Chapter 7: Eligibility Traces


More Replacing Traces

Off-line replacing trace TD(1) is identical to first-visit MC

Extension to action-values: when you revisit a state, what should you do with the traces for the other actions? Singh and Sutton say to set them to zero:

$e_t(s,a) = 1$  if $s = s_t$ and $a = a_t$
$e_t(s,a) = 0$  if $s = s_t$ and $a \ne a_t$
$e_t(s,a) = \gamma\lambda e_{t-1}(s,a)$  if $s \ne s_t$

Page 42: Chapter 7: Eligibility Traces


Implementation Issues

Could require much more computation, but most eligibility traces are VERY close to zero.

If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).
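One common way to exploit the near-zero traces (a sketch and an assumption, not something shown in the slides) is to restrict the full-vector backup, e.g. the V and e updates in the TD(λ) MATLAB sketch given earlier, to the states whose traces are not negligible:

threshold = 1e-4;
active = find(e > threshold);                   % states with non-negligible traces
V(active) = V(active) + alpha*delta*e(active);  % back up only those states
e(active) = gamma*lambda*e(active);
e(e <= threshold) = 0;                          % drop tiny traces entirely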

Page 43: Chapter 7: Eligibility Traces


Variable λ

Can generalize to a variable λ.

Here λ is a function of time. Could define, for example, λ_t = λ(s_t) or λ_t = λ(t).

$e_t(s) = \gamma\lambda_t e_{t-1}(s)$  if $s \ne s_t$
$e_t(s) = \gamma\lambda_t e_{t-1}(s) + 1$  if $s = s_t$

Page 44: Chapter 7: Eligibility Traces


Conclusions

Provides an efficient, incremental way to combine MC and TD

Includes advantages of MC (can deal with lack of Markov property)

Includes advantages of TD (using TD error, bootstrapping)

Can significantly speed learning, but does have a cost in computation

Page 45: Chapter 7: Eligibility Traces


Something Here is Not Like the Other