Page 1: Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Greg Grudic
University of Colorado at Boulder
[email protected]

Lyle Ungar
University of Pennsylvania
[email protected]

Page 2: Reinforcement Learning (MDP)

• Policy: $\pi(s, a; \theta) = \Pr\{a_t = a \mid s_t = s; \theta\}$
• Reinforcement feedback $r_t$ from the environment
• Goal: modify the policy to maximize the expected discounted reward

$$\rho(\pi) = E\left\{ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0, \pi \right\}$$

• State-action value function

$$Q^{\pi}(s, a) = E\left\{ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \;\Big|\; s_t = s,\; a_t = a,\; \pi \right\}$$
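As a concrete illustration (not from the slides), $Q^{\pi}(s, a)$ can be estimated by Monte Carlo rollouts. A minimal sketch, assuming a hypothetical environment interface with `reset_to` and `step` methods and a `policy.sample` method:

```python
import numpy as np

def mc_q_estimate(env, policy, s, a, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q^pi(s, a): average the discounted return
    over rollouts that start in state s, take action a, then follow pi."""
    returns = []
    for _ in range(n_rollouts):
        env.reset_to(s)                      # hypothetical: put env in state s
        action, g, discount = a, 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = env.step(action)  # hypothetical interface
            g += discount * reward
            discount *= gamma
            if done:
                break
            action = policy.sample(state)    # a_t ~ pi(s_t, . ; theta)
        returns.append(g)
    return np.mean(returns)
```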

Page 3: Policy Gradient Formulation

• Policy parameterized by $\theta$: $\pi(s, a; \theta)$
• Searching $\theta$ space implies searching policy space
• The performance function $\rho$ implicitly depends on $\theta$

Page 4: RL Policy Gradient Learning

Update equation for the parameters:

$$\theta_{t+1} = \theta_t + \alpha \frac{\partial \rho}{\partial \theta}$$

where $\frac{\partial \rho}{\partial \theta}$ is the performance gradient and $\alpha$ is a small positive step size.
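A minimal sketch of this update loop (illustrative only; `estimate_gradient` is a placeholder standing in for any of the three gradient estimators introduced later):

```python
def policy_gradient_ascent(theta, estimate_gradient, alpha=0.01, n_steps=1000):
    """Generic policy-gradient ascent: theta_{t+1} = theta_t + alpha * grad."""
    for _ in range(n_steps):
        grad = estimate_gradient(theta)   # estimate of d rho / d theta
        theta = theta + alpha * grad      # alpha: small positive step size
    return theta
```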

Page 5: Why Policy Gradient RL?

• Computation is linear in the number of parameters, avoiding the blow-up caused by discretizing the state space
• Generalization in state space is implicitly defined by the parametric representation $\pi(s, a; \theta)$

Page 6: Estimating the Performance Gradient

• REINFORCE (Williams, 1992) gives an unbiased estimate of $\partial \rho / \partial \theta$
  – HOWEVER: convergence is slow because the estimate has high variance
• GOAL: find PG algorithms with low-variance estimates of $\partial \rho / \partial \theta$

Page 7: Performance Gradient Formulation

$$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{i=1}^{M} \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \left[ Q^{\pi}(s, a_i) - b(s) \right]$$

Where:

$$d^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} \Pr\{ s_t = s \mid s_0, \pi \}$$

and $b(s) \in \Re$ is arbitrary.

[Sutton, McAllester, Singh, Mansour, 2000] and [Konda and Tsitsiklis, 2000]
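On a small finite MDP this expression can be evaluated directly. A sketch under assumed array shapes (none of this code is from the paper):

```python
import numpy as np

def performance_gradient(d_pi, dpi_dtheta, Q, b=None):
    """Policy-gradient theorem on a finite MDP (illustrative sketch).

    d_pi:       (S,)      discounted state-visitation weights d^pi(s)
    dpi_dtheta: (S, M, P) d pi(s, a_i; theta) / d theta for P parameters
    Q:          (S, M)    state-action values Q^pi(s, a_i)
    b:          (S,)      optional baseline b(s)
    """
    adv = Q if b is None else Q - b[:, None]   # Q^pi(s, a_i) - b(s)
    # sum_s d^pi(s) sum_i dpi(s, a_i)/dtheta * [Q^pi(s, a_i) - b(s)]
    return np.einsum('s,smp,sm->p', d_pi, dpi_dtheta, adv)
```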

Page 8: Two Open Questions for Improving Convergence of PG Estimates

• How should observations of $Q^{\pi}(s, a_i)$ be used to reduce the variance in estimates of the performance gradient?
• Does there exist a $b(s) \neq 0$ that reduces the variance in estimating the performance gradient?

Page 9: Assumptions

• Observations $Q_{obs}(s, a_i)$ are independently distributed (MDP)
• Each observation is the true value plus zero-mean noise:

$$Q_{obs}(s, a_i) = Q^{\pi}(s, a_i) + \Delta_i(s, a_i)$$

Where:

$$E\{\Delta_i(s, a_i)\} = 0, \qquad V\{\Delta_i(s, a_i)\} = \sigma^2_{s, a_i}$$

Therefore, after $N$ observations, the sample mean satisfies

$$E\{\hat{Q}^{\pi}(s, a_i)\} = Q^{\pi}(s, a_i), \qquad V\{\hat{Q}^{\pi}(s, a_i)\} = \frac{\sigma^2_{s, a_i}}{N}$$
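A quick numerical check of these assumptions (my own illustration, not from the slides): averaging $N$ noisy observations leaves the estimate unbiased and shrinks its variance by $1/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
q_true, sigma, N, trials = 1.0, 2.0, 50, 100_000

# trials independent experiments, each averaging N noisy Q observations
q_hat = (q_true + sigma * rng.standard_normal((trials, N))).mean(axis=1)

print(q_hat.mean())                 # ~ 1.0: unbiased
print(q_hat.var(), sigma**2 / N)    # both ~ 0.08: variance = sigma^2 / N
```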

Page 10: PG Model 1: Direct Q Estimates

$$\widehat{\left(\frac{\partial \rho}{\partial \theta}\right)} = \sum_{s} d^{\pi}(s) \sum_{i=1}^{M} \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \, \hat{Q}^{\pi}(s, a_i)$$

Where, for $N$ observations:

$$\hat{Q}^{\pi}(s, a_i) = \frac{1}{N} \sum_{N} Q_{obs}(s, a_i)$$
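In terms of the `performance_gradient` sketch above, Model 1 is simply (assuming a hypothetical array `q_obs` holding the $N$ observations per state-action pair):

```python
q_hat = q_obs.mean(axis=2)   # (S, M): average the N observations per (s, a_i)
grad_direct = performance_gradient(d_pi, dpi_dtheta, q_hat)   # b(s) = 0
```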

Page 11: PG Model 2: PIFA

Policy Iteration with Function Approximation [Sutton, McAllester, Singh, Mansour, 2000]

$$\widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_F = \sum_{s} d^{\pi}(s) \sum_{i=1}^{M} \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \, \hat{Q}^{\pi}(s, a_i)$$

Where:

$$\hat{Q}^{\pi}(s, a_i) = f^{\pi}_{a_i}(s) = \sum_{l=1}^{L} w_{a_i, l} \, \phi_{a_i, l}(s)$$

with the weights $w_{a_i, l}$ chosen using the $N$ observations of $Q_{obs}(s, a_i)$.
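One common way to choose such weights (a sketch; the slides do not specify the fitting procedure) is least squares over the observed Q values, one weight vector per action:

```python
import numpy as np

def fit_linear_q(phi, q_obs):
    """Fit Q^pi(., a_i) with a linear basis-function model by least squares.

    phi:   (S, L)  basis values phi_{a_i, l}(s) at every state, for action a_i
    q_obs: (S, N)  N noisy observations of Q^pi(s, a_i)
    Returns (L,) weights w_{a_i, l}; predictions are phi @ w.
    """
    targets = q_obs.mean(axis=1)   # averaging the N observations first gives
                                   # the same lstsq solution as stacking them
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w
```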

Page 12: PG Model 3: Non-Zero Bias

$$\widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_b = \sum_{s} d^{\pi}(s) \sum_{i=1}^{M} \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \left[ \hat{Q}^{\pi}(s, a_i) - b(s) \right]$$

Where, for $N$ observations:

$$\hat{Q}^{\pi}(s, a_i) = \frac{1}{N} \sum_{N} Q_{obs}(s, a_i)$$

and $b(s)$ is the average of the Q estimates in $s$:

$$b(s) = \frac{1}{M} \sum_{i=1}^{M} \hat{Q}^{\pi}(s, a_i)$$
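Again in terms of the earlier sketch, Model 3 only adds the per-state average as the baseline:

```python
q_hat = q_obs.mean(axis=2)     # (S, M) sample means, as in Model 1
b = q_hat.mean(axis=1)         # b(s): average of the M action values in s
grad_biased = performance_gradient(d_pi, dpi_dtheta, q_hat, b=b)
```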

Page 13: Theoretical Results

$$C_{min} \frac{ML}{N} \;\le\; V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_F \right] \;\le\; C_{max} \frac{ML}{N}$$

$$C_{min} \frac{1}{N} \;\le\; V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)} \right] \;\le\; C_{max} \frac{1}{N}$$

$$C_{min} \left(1 - \frac{1}{M}\right) \frac{1}{N} \;\le\; V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_b \right] \;\le\; C_{max} \left(1 - \frac{1}{M}\right) \frac{1}{N}$$

The PIFA estimate's variance bounds grow with the number of basis functions $L$ and actions $M$, while the biased estimate's bounds are smaller than the direct estimate's by a factor of $(1 - 1/M)$.

Page 14: Theoretical Results (definitions)

Where:

$$\sigma^2_{max} = \max_{s \in S,\; i \in \{1, \ldots, M\}} \sigma^2_{s, a_i}, \qquad \sigma^2_{min} = \min_{s \in S,\; i \in \{1, \ldots, M\}} \sigma^2_{s, a_i}$$

$$C_{min} = \sum_{s} \left( d^{\pi}(s) \right)^2 \sum_{i=1}^{M} \left( \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \right)^2 \sigma^2_{min}$$

$$C_{max} = \sum_{s} \left( d^{\pi}(s) \right)^2 \sum_{i=1}^{M} \left( \frac{\partial \pi(s, a_i; \theta)}{\partial \theta} \right)^2 \sigma^2_{max}$$
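A quick Monte Carlo check of the $1/N$ rate and the variance reduction from the baseline (my own illustration, with randomly generated quantities standing in for a real MDP):

```python
import numpy as np

rng = np.random.default_rng(1)
S, M, P, sigma = 4, 3, 2, 1.0
d_pi = rng.random(S); d_pi /= d_pi.sum()        # stand-in for d^pi(s)
dpi_dtheta = rng.standard_normal((S, M, P))     # stand-in for d pi / d theta
Q = rng.standard_normal((S, M))                 # stand-in for Q^pi

def grad(q_hat, b=None):
    adv = q_hat if b is None else q_hat - b[:, None]
    return np.einsum('s,smp,sm->p', d_pi, dpi_dtheta, adv)

for N in (10, 100):
    direct, biased = [], []
    for _ in range(20_000):
        # sample mean of N observations: noise std = sigma / sqrt(N)
        q_hat = Q + sigma / np.sqrt(N) * rng.standard_normal((S, M))
        direct.append(grad(q_hat))
        biased.append(grad(q_hat, b=q_hat.mean(axis=1)))
    print(N, np.var(direct, axis=0).sum(), np.var(biased, axis=0).sum())
# Both variances shrink like 1/N; the baselined estimator's is smaller.
```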

Page 15: Experimental Result 1: Convergence of Three Algorithms

[Figure omitted from transcript]

Page 16: Experimental Result 2

Ratio of estimate variances:

$$\frac{V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_F \right]}{V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)} \right]}$$

[Figure omitted from transcript]

Page 17: Experimental Result 3

Ratio of estimate variances:

$$\frac{V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)}_b \right]}{V\left[ \widehat{\left(\frac{\partial \rho}{\partial \theta}\right)} \right]}$$

[Figure omitted from transcript]

Page 18: Conclusion

• The implementation of PG algorithms significantly affects convergence
• Linear basis function representations of Q can substantially degrade convergence
• Appropriately chosen bias terms can improve convergence