Reinforcement Learning Eligibility Traces

Page 1: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Lecturer: 虞台文

Intelligent Multimedia Research Lab, Institute of Computer Science and Engineering, Tatung University

Page 2: Reinforcement Learning Eligibility Traces

Content:
– n-step TD prediction
– Forward View of TD(λ)
– Backward View of TD(λ)
– Equivalence of the Forward and Backward Views
– Sarsa(λ)
– Q(λ)
– Eligibility Traces for Actor-Critic Methods
– Replacing Traces
– Implementation Issues

Page 3: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

n-Step

TD Prediction


Page 4: Reinforcement Learning Eligibility Traces

Elementary Methods

Dynamic Programming

Monte Carlo Methods

TD(0)

Page 5: Reinforcement Learning Eligibility Traces

Monte Carlo vs. TD(0)

Monte Carlo: observes the rewards for all remaining steps of the episode,
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

TD(0): observes only the next reward and bootstraps from the current estimate,
$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

Page 6: Reinforcement Learning Eligibility Traces

n-Step TD Prediction

[Backup diagrams spanning from 1-step TD through 2-step, 3-step, ..., n-step backups up to Monte Carlo, with targets $R_t^{(1)}, R_t^{(2)}, R_t^{(3)}, \ldots, R_t^{(n)}, \ldots, R_t$ respectively.]

Page 7: Reinforcement Learning Eligibility Traces

n-Step TD Prediction

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$

$R_t^{(3)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})$

$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

$R_t^{(n)}$ is called the corrected n-step truncated return.
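As a concrete illustration of the definition above, here is a minimal Python sketch that computes the corrected n-step truncated return from a recorded episode. The array conventions (rewards[k] holding $r_{k+1}$ and values[k] holding $V(s_k)$) are assumptions made for this example, not something stated on the slides.

    def n_step_return(rewards, values, t, n, gamma):
        # Corrected n-step truncated return R_t^(n) for one recorded episode.
        # Assumed layout: rewards[k] = r_{k+1}, values[k] = V(s_k), T = len(rewards).
        T = len(rewards)
        horizon = min(n, T - t)                      # truncate at the terminal state
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        if t + n < T:                                # bootstrap only if the episode did not end first
            ret += gamma ** n * values[t + n]
        return ret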

Page 8: Reinforcement Learning Eligibility Traces

Backups

Monte Carlo: $V_t(s_t) \leftarrow V_t(s_t) + \alpha \left[ R_t - V_t(s_t) \right]$

TD(0): $V_t(s_t) \leftarrow V_t(s_t) + \alpha \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] = V_t(s_t) + \alpha \left[ R_t^{(1)} - V_t(s_t) \right]$

n-step TD: $V_t(s_t) \leftarrow V_t(s_t) + \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$

Written as an increment over all states:

$\Delta V_t(s) = \begin{cases} \alpha \left[ R_t^{(n)} - V_t(s_t) \right] & \text{if } s = s_t \\ 0 & \text{if } s \neq s_t \end{cases}$

Page 9: Reinforcement Learning Eligibility Traces

n-Step TD Backup

Online updating: $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$, applied at every step.

Offline updating: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$; the increments are accumulated and applied only at the end of the episode, so the new $V(s)$ takes effect in the next episode.

In both cases,

$\Delta V_t(s) = \begin{cases} \alpha \left[ R_t^{(n)} - V_t(s_t) \right] & \text{if } s = s_t \\ 0 & \text{if } s \neq s_t \end{cases}$

Page 10: Reinforcement Learning Eligibility Traces

Error Reduction Property

$\max_s \left| E_\pi\!\left\{ R_t^{(n)} \mid s_t = s \right\} - V^\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$

The left-hand side is the maximum error when the n-step return is used as the target; the right-hand side is $\gamma^n$ times the maximum error of the current value function $V$. The property holds for both online and offline updating.

Page 11: Reinforcement Learning Eligibility Traces

Example (Random Walk)

States A, B, C, D, E; every episode starts at C (the middle state). All transitions give reward 0 except the transition from E into the right terminal state, which gives reward 1.

True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6.

Consider 2-step TD, 3-step TD, and so on: which n is optimal?
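For readers who want to try the question numerically, here is a small Python sketch of this 5-state random walk with n-step TD updates; run it with different n and step sizes to see which works best. The state encoding, step size, episode count, and seed are illustrative assumptions.

    import numpy as np

    # 5-state random walk: states 0..4 (A..E), start in the middle, terminate off
    # either end; only stepping off the right end gives reward +1.
    N_STATES, START, GAMMA = 5, 2, 1.0

    def run_episode(rng):
        s, states, rewards = START, [START], []
        while True:
            s += rng.choice([-1, 1])
            rewards.append(1.0 if s == N_STATES else 0.0)
            if s < 0 or s == N_STATES:
                return states, rewards               # reached a terminal state
            states.append(s)

    def n_step_td(n=2, alpha=0.1, episodes=1000, seed=0):
        rng = np.random.default_rng(seed)
        V = np.zeros(N_STATES)
        for _ in range(episodes):
            states, rewards = run_episode(rng)
            T = len(rewards)
            for t in range(T):                       # sweep the recorded episode
                horizon = min(n, T - t)
                G = sum(GAMMA ** k * rewards[t + k] for k in range(horizon))
                if t + n < T:
                    G += GAMMA ** n * V[states[t + n]]   # bootstrap from the current estimate
                V[states[t]] += alpha * (G - V[states[t]])
        return V

    print(n_step_td(n=2))    # should approach [1/6, 2/6, 3/6, 4/6, 5/6]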

Page 12: Reinforcement Learning Eligibility Traces

Example (19-state Random Walk): the same task with 19 states, starting in the middle; the reward is −1 for stepping off the left end and +1 for stepping off the right end.

[Plot: average RMS error over the first 10 trials, for online and offline n-step TD methods.]

Page 13: Reinforcement Learning Eligibility Traces

Exercise (Random Walk)

[Grid world with terminal rewards +1 and −1; standard moves.]

Page 14: Reinforcement Learning Eligibility Traces

Exercise (Random Walk)

[The same grid world: terminal rewards +1 and −1; standard moves.]

1. Evaluate the value function for the random policy.
2. Approximate the value function using n-step TD (try different n's and α's), and compare their performance.
3. Find the optimal policy.

Page 15: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

The Forward View of TD(λ)


Page 16: Reinforcement Learning Eligibility Traces

Averaging n-step Returns

We are not limited to using a single n-step return as the backup target.

For example, we could average two n-step returns:

$R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$

Any such average is still one backup, as long as the component weights are nonnegative and sum to 1.

Page 17: Reinforcement Learning Eligibility Traces

TD(λ): the λ-Return

TD(λ) is a method for averaging all n-step backups, with the n-step return weighted in proportion to $\lambda^{n-1}$ (so the weight decays with the time since visitation). The weights are $w_n = (1-\lambda)\lambda^{n-1}$, $n = 1, 2, \ldots$, and they sum to 1. The resulting target is called the λ-return:

$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

For an episodic task, all n-step returns with $n \ge T - t$ equal the complete return $R_t$, so the remaining weight $\lambda^{T-t-1}$ falls on $R_t$.

Backup using the λ-return:

$\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$

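To make the weighting concrete, here is a minimal Python sketch that computes the λ-return for one time step of a recorded episode. The array conventions (rewards[k] = $r_{k+1}$, values[k] = $V(s_k)$) are illustrative assumptions; note how all weight beyond the end of the episode falls on the full Monte Carlo return.

    def lambda_return(rewards, values, t, gamma, lam):
        # Forward-view lambda-return R_t^lambda for one recorded episode.
        T = len(rewards)
        G = 0.0                     # running sum r_{t+1} + ... + gamma^{n-1} r_{t+n}
        ret = 0.0
        for n in range(1, T - t):   # n-step returns that still bootstrap
            G += gamma ** (n - 1) * rewards[t + n - 1]
            R_n = G + gamma ** n * values[t + n]          # corrected n-step return
            ret += (1 - lam) * lam ** (n - 1) * R_n
        G += gamma ** (T - t - 1) * rewards[T - 1]        # complete the Monte Carlo return
        return ret + lam ** (T - t - 1) * G               # remaining weight on the full return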

Page 19: Reinforcement Learning Eligibility Traces

Forward View of TD(λ)

A theoretical view

Page 20: Reinforcement Learning Eligibility Traces

TD(λ) on the Random Walk

Page 21: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

The Backward View of TD(λ)


Page 22: Reinforcement Learning Eligibility Traces

Why Backward View?

The forward view is acausal (it uses rewards that have not yet been observed), so it is not directly implementable.

The backward view is causal and implementable, and in the offline case it achieves exactly the same result as the forward view.

Page 23: Reinforcement Learning Eligibility Traces

Eligibility Traces

Each state is associated with an additional memory variable, its eligibility trace, defined by:

$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$


Page 26: Reinforcement Learning Eligibility Traces

Eligibility Traces Record Recency of Visits

At any time the traces record which states have recently been visited, where "recently" is defined in terms of $\gamma\lambda$:

$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

The traces indicate the degree to which each state is eligible for undergoing a learning change should a reinforcing event occur.

Reinforcing event: the moment-by-moment 1-step TD error,

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

Page 27: Reinforcement Learning Eligibility Traces

Reinforcing Event

The moment-by-moment 1-step TD error:

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

Each state is then updated in proportion to its eligibility:

$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$

Page 28: Reinforcement Learning Eligibility Traces

TD(λ)

Eligibility traces:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \end{cases}$

Reinforcing events:
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

Value updates:
$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$

Page 29: Reinforcement Learning Eligibility Traces

Online TD(λ)

    Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of episode):
            a ← action given by π for s
            Take action a, observe reward r and next state s′
            δ ← r + γV(s′) − V(s)
            e(s) ← e(s) + 1
            For all s:
                V(s) ← V(s) + αδe(s)
                e(s) ← γλe(s)
            s ← s′
        Until s is terminal
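The boxed algorithm above translates almost line for line into Python. The sketch below assumes a simple environment interface (env.reset() returning a state index, env.step(a) returning (s_next, r, done)) and a policy(s) function; these names and the parameter values are assumptions for illustration only.

    import numpy as np

    def td_lambda(env, policy, n_states, alpha=0.1, gamma=1.0, lam=0.8, episodes=100):
        # Online tabular TD(lambda) with accumulating traces.
        V = np.zeros(n_states)
        for _ in range(episodes):
            e = np.zeros(n_states)              # eligibility traces, reset each episode
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                target = r if done else r + gamma * V[s_next]
                delta = target - V[s]           # reinforcing event (1-step TD error)
                e[s] += 1.0                     # accumulating trace for the visited state
                V += alpha * delta * e          # update every state in proportion to its trace
                e *= gamma * lam                # decay all traces
                s = s_next
        return V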

Page 30: Reinforcement Learning Eligibility Traces

Backward View of TD(λ)

Page 31: Reinforcement Learning Eligibility Traces

Backwards View vs. MC & TD(0)

Setting λ = 0 gives TD(0). Setting λ = 1 gives Monte Carlo, but in a better way:
– TD(1) can be applied to continuing tasks.
– It works incrementally and online, instead of waiting until the end of the episode.

What about 0 < λ < 1?

Page 32: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Equivalence of the Forward and Backward Views


Page 33: Reinforcement Learning Eligibility Traces

Offline TD(λ)'s

Offline forward-view TD(λ) (the λ-return algorithm):

$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

$\Delta V_t^f(s) = \begin{cases} \alpha \left[ R_t^{\lambda} - V_t(s_t) \right] & \text{if } s = s_t \\ 0 & \text{if } s \neq s_t \end{cases}$

Offline backward-view TD(λ):

$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

$\Delta V_t^b(s) = \alpha\, \delta_t\, e_t(s)$

Page 34: Reinforcement Learning Eligibility Traces

Forward View = Backward View

$\sum_{t=0}^{T-1} \Delta V_t^b(s) \;=\; \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t}$

The left-hand side sums the backward updates and the right-hand side the forward updates, where

$I_{ss_t} = \begin{cases} 1 & \text{if } s = s_t \\ 0 & \text{if } s \neq s_t \end{cases}$

See the proof (Pages 68 to 71).


Page 36: Reinforcement Learning Eligibility Traces

TD(λ) on the Random Walk

[Plot: average RMS error over the first 10 trials, comparing the offline λ-return algorithm (forward view) with online TD(λ) (backward view).]

Page 37: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Sarsa(λ)


Page 38: Reinforcement Learning Eligibility Traces

Sarsa(λ)

TD(λ) uses eligibility traces for policy evaluation.

How can eligibility traces be used for control? Learn $Q_t(s, a)$ rather than $V_t(s)$.

Page 39: Reinforcement Learning Eligibility Traces

Sarsa(λ)

Eligibility traces:
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

Reinforcing events:
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

Updates:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

Page 40: Reinforcement Learning Eligibility Traces

Sarsa(λ)

    Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
    Repeat (for each episode):
        Initialize s, a
        Repeat (for each step of episode):
            Take action a, observe r, s′
            Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
            δ ← r + γQ(s′, a′) − Q(s, a)
            e(s, a) ← e(s, a) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + αδe(s, a)
                e(s, a) ← γλe(s, a)
            s ← s′; a ← a′
        Until s is terminal
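A minimal Python sketch of the boxed Sarsa(λ) algorithm with accumulating traces. The env.reset()/env.step() interface, the ε-greedy helper, the parameter values, and the assumption that terminal states have valid row indices in Q are all illustrative choices, not part of the slides.

    import numpy as np

    def sarsa_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.99, lam=0.9,
                     epsilon=0.05, episodes=500, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))     # terminal states assumed to have valid indices

        def eps_greedy(s):
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            e = np.zeros_like(Q)                # one trace per state-action pair
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                target = r if done else r + gamma * Q[s_next, a_next]
                delta = target - Q[s, a]        # reinforcing event
                e[s, a] += 1.0                  # accumulating trace
                Q += alpha * delta * e
                e *= gamma * lam
                s, a = s_next, a_next
        return Q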

Page 41: Reinforcement Learning Eligibility Traces

Sarsa(λ) Traces in a Grid World

With one trial, the agent has much more information about how to get to the goal (not necessarily the best way). Eligibility traces can considerably accelerate learning.

Page 42: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Q(λ)


Page 43: Reinforcement Learning Eligibility Traces

Q-Learning

Q-learning is an off-policy method:
– it breaks from the estimation policy from time to time to take exploratory actions, so
– a simple time-decaying trace cannot be implemented easily.

How can eligibility traces be combined with Q-learning?

Three methods:
– Watkins's Q(λ)
– Peng's Q(λ)
– Naïve Q(λ)

Page 44: Reinforcement Learning Eligibility Traces

Watkins's Q(λ)

[Diagram: behavior policy (e.g., ε-greedy) versus estimation policy (e.g., greedy); the backup follows the greedy path and stops at the first non-greedy action.]

Page 45: Reinforcement Learning Eligibility Traces

Backups in Watkins's Q(λ)

Two cases:
1. Both the behavior and estimation policies take the greedy path.
2. The behavior path takes a non-greedy action before the episode ends.

How should the eligibility traces be defined?

Page 46: Reinforcement Learning Eligibility Traces

Watkins's Q(λ)

Eligibility traces (cut to zero on exploratory actions):
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } (s,a) = (s_t,a_t) \text{ and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } (s,a) \neq (s_t,a_t) \text{ and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{otherwise} \end{cases}$

Reinforcing events:
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$

Updates:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

Page 47: Reinforcement Learning Eligibility Traces

Watkins's Q(λ)

    Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
    Repeat (for each episode):
        Initialize s, a
        Repeat (for each step of episode):
            Take action a, observe r, s′
            Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
            a* ← arg max_b Q(s′, b)   (if a′ ties for the max, then a* ← a′)
            δ ← r + γQ(s′, a*) − Q(s, a)
            e(s, a) ← e(s, a) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + αδe(s, a)
                If a′ = a*, then e(s, a) ← γλe(s, a)
                else e(s, a) ← 0
            s ← s′; a ← a′
        Until s is terminal
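A minimal Python sketch of the boxed Watkins's Q(λ) algorithm; the key difference from Sarsa(λ) is that the backup bootstraps from the greedy action and all traces are cut to zero whenever an exploratory action is taken. The environment interface, parameter values, and the assumption that terminal states have valid indices are illustrative.

    import numpy as np

    def watkins_q_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.99,
                         lam=0.9, epsilon=0.05, episodes=500, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))

        def eps_greedy(s):
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            e = np.zeros_like(Q)
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                a_star = int(np.argmax(Q[s_next]))
                if Q[s_next, a_next] == Q[s_next, a_star]:
                    a_star = a_next             # if a' ties for the max, treat it as greedy
                target = r if done else r + gamma * Q[s_next, a_star]
                delta = target - Q[s, a]
                e[s, a] += 1.0
                Q += alpha * delta * e
                if a_next == a_star:
                    e *= gamma * lam            # greedy action: decay traces as usual
                else:
                    e[:] = 0.0                  # exploratory action: cut all traces
                s, a = s_next, a_next
        return Q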

Page 48: Reinforcement Learning Eligibility Traces

Peng's Q(λ)

Cutting off traces loses much of the advantage of using eligibility traces. If exploratory actions are frequent, as they often are early in learning, then backups of more than one or two steps will only rarely be done, and learning may be little faster than 1-step Q-learning.

Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this.

Page 49: Reinforcement Learning Eligibility Traces

Backups in Peng's Q(λ)

Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).

– Never cut the traces.
– Back up the max action, except at the end.
– The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
– Disadvantage: it is difficult to implement.

Page 50: Reinforcement Learning Eligibility Traces

Peng's Q(λ)

See Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.

Page 51: Reinforcement Learning Eligibility Traces

Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions?
– Never zero the traces.
– Always back up the max at the current action (unlike Peng's or Watkins's).

Is this truly naïve? It works well in preliminary empirical studies.

Page 52: Reinforcement Learning Eligibility Traces

Naïve Q(λ)

Eligibility traces (never cut to zero):
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } (s,a) = (s_t,a_t) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

Reinforcing events:
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$

Updates:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

Page 53: Reinforcement Learning Eligibility Traces

Comparisons

McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.

Deterministic grid world with obstacles:
– 10x10 grid world
– 25 randomly generated obstacles
– 30 runs
– α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
– accumulating traces

Page 54: Reinforcement Learning Eligibility Traces

Comparisons

Page 55: Reinforcement Learning Eligibility Traces

Convergence of the Q(λ)'s

None of the methods has been proven to converge (much extra credit if you can prove any of them).
– Watkins's is thought to converge to Q*.
– Peng's is thought to converge to a mixture of Q^π and Q*.
– Naïve: to Q*?

Page 56: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Eligibility Traces for Actor-Critic Methods


Page 57: Reinforcement Learning Eligibility Traces

Actor-Critic Methods

[Diagram: actor-critic architecture. The environment sends states and rewards; the critic (value function) computes the TD error, which trains both the critic and the actor (policy); the actor selects the actions.]

Critic: on-policy learning of V; use TD(λ) as described before.

Actor: needs an eligibility trace for each state-action pair.

Page 58: Reinforcement Learning Eligibility Traces

Policy Parameter Updates (Method 1)

[Actor-critic diagram as before.]

Without traces, only the state-action pair just taken is strengthened:

$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\, \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

With eligibility traces:

$p_{t+1}(s,a) = p_t(s,a) + \beta\, \delta_t\, e_t(s,a)$

where $e_t(s,a)$ is the accumulating state-action trace used in Sarsa(λ).

Page 59: Reinforcement Learning Eligibility Traces

Policy Parameter Updates (Method 2)

[Actor-critic diagram as before.]

Weight the update by how unlikely the chosen action was, $1 - \pi_t(s_t, a_t)$:

$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\, \delta_t \left[ 1 - \pi_t(s_t, a_t) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

With eligibility traces:

$p_{t+1}(s,a) = p_t(s,a) + \beta\, \delta_t\, e_t(s,a)$

$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
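A minimal Python sketch of an actor-critic agent with eligibility traces: the critic runs TD(λ) on V, and the actor keeps preferences p(s, a) updated with Method 2's traces. The softmax policy, the env interface, and the step sizes (α for the critic, β for the actor) are assumptions made for this illustration.

    import numpy as np

    def actor_critic_lambda(env, n_states, n_actions, alpha=0.1, beta=0.1,
                            gamma=0.99, lam=0.8, episodes=500, seed=0):
        rng = np.random.default_rng(seed)
        V = np.zeros(n_states)
        p = np.zeros((n_states, n_actions))     # actor preferences

        def pi(s):
            prefs = p[s] - p[s].max()           # softmax (Gibbs) policy from preferences
            probs = np.exp(prefs)
            return probs / probs.sum()

        for _ in range(episodes):
            e_v = np.zeros(n_states)            # critic traces
            e_p = np.zeros((n_states, n_actions))   # actor traces (Method 2)
            s, done = env.reset(), False
            while not done:
                probs = pi(s)
                a = rng.choice(n_actions, p=probs)
                s_next, r, done = env.step(a)
                delta = (r if done else r + gamma * V[s_next]) - V[s]  # TD error from the critic
                e_v[s] += 1.0                   # critic: accumulating state trace
                e_p[s, a] += 1.0 - probs[a]     # actor: Method 2 increment, 1 - pi(s_t, a_t)
                V += alpha * delta * e_v
                p += beta * delta * e_p
                e_v *= gamma * lam
                e_p *= gamma * lam
                s = s_next
        return V, p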

Page 60: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Replacing Traces


Page 61: Reinforcement Learning Eligibility Traces

Accumulating vs. Replacing Traces

Accumulating traces:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

Replacing traces:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ 1 & \text{if } s = s_t \end{cases}$
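The difference between the two schemes is a one-line change in the trace update, as this small Python sketch shows (written for a list of per-state traces; purely illustrative).

    def decay_and_mark(e, s_t, gamma, lam, replacing=False):
        # One trace update: decay every state's trace, then mark the visited state.
        e = [gamma * lam * x for x in e]
        if replacing:
            e[s_t] = 1.0                 # replacing trace: reset to 1
        else:
            e[s_t] += 1.0                # accumulating trace: can exceed 1 on quick revisits
        return e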

Page 62: Reinforcement Learning Eligibility Traces

Why Replacing Traces?

With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.

Replacing traces can significantly speed learning, and they can make the system perform well over a broader range of parameters.

Accumulating traces can do poorly on certain types of tasks.

Page 63: Reinforcement Learning Eligibility Traces

Example (19-State Random Walk)

Page 64: Reinforcement Learning Eligibility Traces

Extension to action-values

When you revisit a state, what should you do with the traces of the other actions?

Singh and Sutton (1996) suggest setting the traces of all the other actions from the revisited state to 0:

$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \neq a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \neq s_t \end{cases}$
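A small Python sketch of the corresponding state-action trace update, with the other actions of the revisited state cleared as Singh and Sutton suggest; the 2-D NumPy array shape is an illustrative assumption.

    import numpy as np

    def replacing_trace_update(e, s_t, a_t, gamma, lam):
        # e is an |S| x |A| array of traces: decay everything, clear the other
        # actions of the revisited state, and set the taken pair's trace to 1.
        e = gamma * lam * e              # e_t(s,a) = gamma*lam*e_{t-1}(s,a) for s != s_t
        e[s_t, :] = 0.0                  # zero the traces of the other actions in s_t
        e[s_t, a_t] = 1.0                # the pair actually taken gets a full (replacing) trace
        return e

    e = replacing_trace_update(np.zeros((5, 2)), s_t=3, a_t=1, gamma=1.0, lam=0.9)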

Page 65: Reinforcement Learning Eligibility Traces

Reinforcement Learning

Eligibility Traces

Implementation Issues


Page 66: Reinforcement Learning Eligibility Traces

Implementation Issues

For practical use we cannot afford to keep updating every trace, no matter how small it has decayed.

Dropping very small trace values is recommended and encouraged.

If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).

Used with neural networks and backpropagation, eligibility traces generally only double the needed computation.
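One way to act on the "drop very small values" advice is to store the traces sparsely and prune entries that fall below a threshold, so each update only touches the handful of recently visited states. A minimal Python sketch, with an illustrative threshold:

    def sparse_trace_update(traces, s_t, gamma, lam, threshold=1e-4):
        # traces is a dict mapping state -> trace value.
        for s in list(traces):                # list() so we can delete while iterating
            traces[s] *= gamma * lam
            if traces[s] < threshold:
                del traces[s]                 # drop traces that have decayed away
        traces[s_t] = traces.get(s_t, 0.0) + 1.0   # accumulating trace for the visited state
        return traces

    traces = sparse_trace_update({}, s_t=7, gamma=0.99, lam=0.9)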

Page 67: Reinforcement Learning Eligibility Traces

Variable λ

The methods generalize to a variable λ, where λ is a function of time, e.g. $\lambda_t = \lambda(s_t)$:

$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

Page 68: Reinforcement Learning Eligibility Traces

Proof that $\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t}$.

An accumulating eligibility trace can be written explicitly (non-recursively) as

$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{ss_k}$

so the total backward update is

$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \alpha\, \delta_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{ss_k} = \sum_{k=0}^{T-1} \alpha \sum_{t=k}^{T-1} (\gamma\lambda)^{t-k}\, \delta_t\, I_{ss_k} = \sum_{t=0}^{T-1} \alpha\, I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\, \delta_k$

The second equality exchanges the order of summation over the region $0 \le k \le t \le T-1$; the third simply swaps the names of the indices $t$ and $k$.

Page 69: Reinforcement Learning Eligibility Traces

Proof (continued). Now expand the forward-view update:

$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = R_t^{\lambda} - V_t(s_t) = -V_t(s_t) + (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

$= -V_t(s_t) + (1-\lambda)\lambda^0 \left[ r_{t+1} + \gamma V_t(s_{t+1}) \right] + (1-\lambda)\lambda^1 \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) \right] + (1-\lambda)\lambda^2 \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3}) \right] + \cdots$

Page 70: Reinforcement Learning Eligibility Traces

Proof (continued). Regroup the expansion, first by rewards and then by 1-step TD errors:

$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = -V_t(s_t) + (\gamma\lambda)^0 \left[ r_{t+1} + \gamma V_t(s_{t+1}) - \gamma\lambda V_t(s_{t+1}) \right] + (\gamma\lambda)^1 \left[ r_{t+2} + \gamma V_t(s_{t+2}) - \gamma\lambda V_t(s_{t+2}) \right] + (\gamma\lambda)^2 \left[ r_{t+3} + \gamma V_t(s_{t+3}) - \gamma\lambda V_t(s_{t+3}) \right] + \cdots$

$= (\gamma\lambda)^0 \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] + (\gamma\lambda)^1 \left[ r_{t+2} + \gamma V_t(s_{t+2}) - V_t(s_{t+1}) \right] + (\gamma\lambda)^2 \left[ r_{t+3} + \gamma V_t(s_{t+3}) - V_t(s_{t+2}) \right] + \cdots$

$= (\gamma\lambda)^0\,\delta_t + (\gamma\lambda)^1\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots$

(The last step uses the fact that $V$ is held fixed within the episode under offline updating; with online updating it holds only approximately.)

Page 71: Reinforcement Learning Eligibility Traces

Proof (continued). Hence

$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t}\,\delta_k = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$

since $\delta_k = 0$ for all $k \ge T$. Therefore

$\sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t} = \sum_{t=0}^{T-1} \alpha\, I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k = \sum_{t=0}^{T-1} \Delta V_t^b(s)$

where the last equality is the expression derived on the first proof slide. This establishes the equivalence of the offline forward and backward views. ∎