13 Direct Learning by Reinforcement*

Jennie Si
Department of Electrical Engineering, Arizona State University, Tempe, Arizona, USA

13.1 Introduction
13.2 A General Framework for Direct Learning Through Association and Reinforcement
    13.2.1 The Critic Network
    13.2.2 The Action Network
    13.2.3 Online Learning Algorithms
13.3 Analytical Characteristics of an Online NDP Learning Process
    13.3.1 Stochastic Approximation Algorithms
    13.3.2 Convergence in Statistical Average for Action and Critic Networks
13.4 Example 1
    13.4.1 The Cart-Pole Balancing Problem
    13.4.2 Simulation Results
13.5 Example 2
13.6 Conclusion
References

13.1 Introduction

This chapter focuses on a systematic treatment for developing a generic online learning control system based on the fundamental principle of reinforcement learning or, more specifically, neural dynamic programming. This online learning system improves its performance over time in two aspects. First, it learns from its own mistakes through the reinforcement signal from the external environment and tries to reinforce its action to improve future performance. Second, system states associated with positive reinforcement are memorized through a network learning process wherein, in the future, similar states will be more positively associated with a control action leading to a positive reinforcement. This discussion also introduces a successful candidate of online learning control design. Real-time learning algorithms are derived for individual components in the learning system. Some analytical insights are provided to give guidelines on the learning process that takes place in each module of the online learning control system. The performance of the online learning controller is measured by its learning speed, success rate of learning, and the degree to which it meets the learning control objective. The overall learning control system performance is tested on a single cart-pole balancing problem and a pendulum swing-up and balancing task.

* Portions reprinted, with permission, from IEEE Transactions on Neural Networks 12(2), 264-276, © 2001 IEEE. Supported by NSF under grant ECS-0002098. Copyright 2005 by Academic Press. All rights of reproduction in any form reserved.

13.2 A General Framework for Direct Learning Through Association and Reinforcement

Consider a class of learning decision and control problems in terms of optimizing a performance measure over time with the following constraints. First, a model of the environment or the system that interacts with the learner is not available a priori. The environment/system can be stochastic, nonlinear, and subject to change. Second, learning takes place "on-the-fly" while interacting with the environment. Third, even though measurements from the environment are available from one decision and control step to the next, a final outcome of the learning process from a generated sequence of decisions and controls comes as a delayed signal in an indicative "win or lose" format. Figure 13.1 is a schematic diagram of an online learning control scheme. The binary reinforcement signal r(t) is provided from the external environment and is either a 0 or a −1, corresponding to success or failure, respectively.



    " / t J(t-O-r(OX(t) Uc(t )

    Primaryreinforcement

    r ( t ) . . . . . . . . . .x ( t )

    t , o n ] LCr ' , t ,c -" ' [ :7 5 ' :Schematic Diagram for Implementat ions of Neural

    present sign al flow , and the dashed lines are the paths foreter tuning.

In our direct online learning control design, the controller is tuned online toward the goal of optimality. This set of system operations will be reinforced through memory or association between states and control actions, using the critic network to make the equation of the principle of optimality be balanced.

To be more quantitative, consider the critic network as shown in Figure 13.1. The output of the critic element, the J(t) function, approximates the discounted total reward-to-go. Specifically, it approximates R(t) at time t given by:

$$R(t) = r(t+1) + \alpha r(t+2) + \cdots, \qquad (13.1)$$

where R(t) is the future accumulative reward-to-go value at time t, and α is a discount factor for the infinite-horizon problem (0 < α < 1). We have used α = 0.95 in our implementations; r(t+1) is the external reinforcement value at time t+1.

13.2.1 The Critic Network

The critic network is used to provide J(t) as an approximation of R(t) in equation 13.1. The prediction error for the critic element is defined as:

$$e_c(t) = \alpha J(t) - [J(t-1) - r(t)], \qquad (13.2)$$

and the objective function to be minimized in the critic network is as follows:

$$E_c(t) = \tfrac{1}{2} e_c^2(t). \qquad (13.3)$$

The weight update rule for the critic network is a gradient-based adaptation given by:

$$w_c(t+1) = w_c(t) + \Delta w_c(t). \qquad (13.4)$$

$$\Delta w_c(t) = l_c(t) \left[ -\frac{\partial E_c(t)}{\partial w_c(t)} \right]. \qquad (13.5)$$

$$\frac{\partial E_c(t)}{\partial w_c(t)} = \frac{\partial E_c(t)}{\partial J(t)} \frac{\partial J(t)}{\partial w_c(t)}. \qquad (13.6)$$

In equations 13.4 to 13.6, l_c(t) > 0 is the learning rate of the critic network at time t, which usually decreases with time to a small value, and w_c is the weight vector in the critic network.
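Read as code, the critic's residual is one line; a minimal sketch of equation 13.2:

```python
def critic_prediction_error(J_t, J_prev, r_t, alpha=0.95):
    """Equation 13.2: e_c(t) = alpha*J(t) - [J(t-1) - r(t)]."""
    return alpha * J_t - (J_prev - r_t)
```

Driving e_c(t) to zero makes J(t−1) ≈ r(t) + αJ(t), which is exactly the reward-to-go balance of equation 13.1 shifted by one step.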

13.2.2 The Action Network

The principle in adapting the action network is to indirectly back-propagate the error between the desired ultimate objective, denoted by U_c, and the approximate J function from the critic network. Since 0 is defined as the reinforcement signal for success, U_c is set to 0 in this design paradigm and in the following case studies. In the action network, the state measurements are used as inputs to create a control as the output of the network. In turn, the action network can be implemented by either a linear or a nonlinear network, depending on the complexity of the problem. The weight updating in the action network can be formulated as follows. Let the following be true:

$$e_a(t) = J(t) - U_c(t). \qquad (13.7)$$

The weights in the action network are updated to minimize the following performance error measure:

$$E_a(t) = \tfrac{1}{2} e_a^2(t). \qquad (13.8)$$

The update algorithm is then similar to the one in the critic network. By a gradient descent rule:

$$w_a(t+1) = w_a(t) + \Delta w_a(t). \qquad (13.9)$$

$$\Delta w_a(t) = l_a(t) \left[ -\frac{\partial E_a(t)}{\partial w_a(t)} \right]. \qquad (13.10)$$

$$\frac{\partial E_a(t)}{\partial w_a(t)} = \frac{\partial E_a(t)}{\partial J(t)} \frac{\partial J(t)}{\partial u(t)} \frac{\partial u(t)}{\partial w_a(t)}. \qquad (13.11)$$

In equations 13.9 to 13.11, l_a(t) > 0 is the learning rate of the action network at time t, which usually decreases with time to a small value, and w_a is the weight vector in the action network.
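For a single scalar weight, the chain in equation 13.11 multiplies out as below; this micro-sketch only spells out the factors (the derivative arguments are placeholders):

```python
def action_gradient(e_a, dJ_du, du_dw):
    """Equation 13.11: dE_a/dw_a = (dE_a/dJ)(dJ/du)(du/dw_a),
    with dE_a/dJ = e_a(t) from equations 13.7 and 13.8."""
    return e_a * dJ_du * du_dw
```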


13.2.3 Online Learning Algorithms

The online learning configuration introduced in the previous subsections involves two major components in the learning system: the action network and the critic network. The following devises learning algorithms and elaborates on how learning takes place in each of the two modules. In this discussion's NDP design, both the action network and the critic network are nonlinear multilayer feed-forward networks. In these designs, one hidden layer is used in each network. The neural network structure for the nonlinear, multilayer critic network is shown in Figure 13.2. In the critic network, the output J(t) will be of the form:

$$J(t) = \sum_{i=1}^{N_h} w_{c,i}^{(2)}(t)\, p_i(t). \qquad (13.12)$$

$$p_i(t) = \frac{1 - \exp(-q_i(t))}{1 + \exp(-q_i(t))}, \quad i = 1, \ldots, N_h. \qquad (13.13)$$

$$q_i(t) = \sum_{j=1}^{n+1} w_{c,ij}^{(1)}(t)\, x_j(t), \quad i = 1, \ldots, N_h. \qquad (13.14)$$

Here q_i is the ith hidden node input of the critic network, and p_i is the corresponding output of the hidden node. N_h is the total number of hidden nodes in the critic network, and n + 1 is the total number of inputs into the critic network, including the analog action value u(t) from the action network. By applying the chain rule, the adaptation of the critic network is summarized below.

(1) For $\Delta w_{c}^{(2)}$ (hidden to output layer):

$$\Delta w_{c,i}^{(2)}(t) = l_c(t) \left[ -\frac{\partial E_c(t)}{\partial w_{c,i}^{(2)}(t)} \right]. \qquad (13.15)$$

$$\frac{\partial E_c(t)}{\partial w_{c,i}^{(2)}(t)} = \frac{\partial E_c(t)}{\partial J(t)} \frac{\partial J(t)}{\partial w_{c,i}^{(2)}(t)} = \alpha e_c(t)\, p_i(t). \qquad (13.16)$$

(2) For $\Delta w_{c}^{(1)}$ (input to hidden layer):

$$\Delta w_{c,ij}^{(1)}(t) = l_c(t) \left[ -\frac{\partial E_c(t)}{\partial w_{c,ij}^{(1)}(t)} \right]. \qquad (13.17)$$

$$\frac{\partial E_c(t)}{\partial w_{c,ij}^{(1)}(t)} = \frac{\partial E_c(t)}{\partial J(t)} \frac{\partial J(t)}{\partial p_i(t)} \frac{\partial p_i(t)}{\partial q_i(t)} \frac{\partial q_i(t)}{\partial w_{c,ij}^{(1)}(t)} \qquad (13.18)$$

$$= \alpha e_c(t)\, w_{c,i}^{(2)}(t) \left[ \tfrac{1}{2} \left( 1 - p_i^2(t) \right) \right] x_j(t). \qquad (13.19)$$
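Equations 13.12 to 13.19 map directly onto a one-hidden-layer network. The sketch below is one possible NumPy reading; the shapes, the initialization, and the class interface are assumptions, not the chapter's code:

```python
import numpy as np

def bipolar_sigmoid(z):
    """The node nonlinearity of equation 13.13: (1 - e^-z)/(1 + e^-z)."""
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

class CriticNetwork:
    def __init__(self, n_inputs, n_hidden, rng=np.random.default_rng(0)):
        # n_inputs = n states + 1 for the action u(t), per equation 13.14
        self.w1 = rng.uniform(-1.0, 1.0, (n_hidden, n_inputs))  # w_c^(1)
        self.w2 = rng.uniform(-1.0, 1.0, n_hidden)              # w_c^(2)

    def forward(self, x):
        self.x = x
        self.q = self.w1 @ x              # equation 13.14
        self.p = bipolar_sigmoid(self.q)  # equation 13.13
        return self.w2 @ self.p           # equation 13.12, J(t)

    def update(self, e_c, lr, alpha=0.95):
        grad_w2 = alpha * e_c * self.p                           # equation 13.16
        delta = alpha * e_c * self.w2 * 0.5 * (1.0 - self.p**2)
        grad_w1 = np.outer(delta, self.x)                        # equations 13.18-13.19
        self.w2 -= lr * grad_w2                                  # equation 13.15
        self.w1 -= lr * grad_w1                                  # equation 13.17
```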

Now, investigate the adaptation in the action network, which is implemented by a feed-forward network similar to the one in Figure 13.2, except that the inputs are the n measured states and the output is the action u(t). The following are the associated equations for the action network:

$$u(t) = \frac{1 - \exp(-v(t))}{1 + \exp(-v(t))}. \qquad (13.20)$$

$$v(t) = \sum_{i=1}^{N_h} w_{a,i}^{(2)}(t)\, g_i(t). \qquad (13.21)$$

$$g_i(t) = \frac{1 - \exp(-h_i(t))}{1 + \exp(-h_i(t))}, \quad i = 1, \ldots, N_h. \qquad (13.22)$$

$$h_i(t) = \sum_{j=1}^{n} w_{a,ij}^{(1)}(t)\, x_j(t), \quad i = 1, \ldots, N_h. \qquad (13.23)$$

Here v is the input to the action node, and g_i and h_i are the output and the input of the hidden nodes of the action network, respectively. Since the action network inputs the state measurements only, there is no (n + 1)th term in equation 13.23 as in the critic network (see equation 13.14 for comparison). The update rule for the nonlinear multilayer action network also contains two sets of equations:

FIGURE 13.2 Schematic Diagram for the Implementation of a Nonlinear Critic Network Using a Feed-Forward Network with One Hidden Layer

(1) For $\Delta w_{a}^{(2)}$ (hidden to output layer):

$$\Delta w_{a,i}^{(2)}(t) = l_a(t) \left[ -\frac{\partial E_a(t)}{\partial w_{a,i}^{(2)}(t)} \right]. \qquad (13.24)$$

$$\frac{\partial E_a(t)}{\partial w_{a,i}^{(2)}(t)} = \frac{\partial E_a(t)}{\partial J(t)} \frac{\partial J(t)}{\partial u(t)} \frac{\partial u(t)}{\partial v(t)} \frac{\partial v(t)}{\partial w_{a,i}^{(2)}(t)} \qquad (13.25)$$

$$= e_a(t) \left[ \tfrac{1}{2} \left( 1 - u^2(t) \right) \right] g_i(t) \sum_{k=1}^{N_h} \left[ w_{c,k}^{(2)}(t)\, \tfrac{1}{2} \left( 1 - p_k^2(t) \right) w_{c,k,n+1}^{(1)}(t) \right]. \qquad (13.26)$$

In equations 13.24 to 13.26, ∂J(t)/∂u(t) is obtained by changing variables and by a chain rule. The result is the summation term, and $w_{c,k,n+1}^{(1)}$ is the weight associated with the input element u(t) from the action network.

(2) For $\Delta w_{a}^{(1)}$ (input to hidden layer):

$$\Delta w_{a,ij}^{(1)}(t) = l_a(t) \left[ -\frac{\partial E_a(t)}{\partial w_{a,ij}^{(1)}(t)} \right]. \qquad (13.27)$$

$$\frac{\partial E_a(t)}{\partial w_{a,ij}^{(1)}(t)} = \frac{\partial E_a(t)}{\partial J(t)} \frac{\partial J(t)}{\partial u(t)} \frac{\partial u(t)}{\partial v(t)} \frac{\partial v(t)}{\partial g_i(t)} \frac{\partial g_i(t)}{\partial h_i(t)} \frac{\partial h_i(t)}{\partial w_{a,ij}^{(1)}(t)} \qquad (13.28)$$

$$= e_a(t) \left[ \tfrac{1}{2} \left( 1 - u^2(t) \right) \right] w_{a,i}^{(2)}(t) \left[ \tfrac{1}{2} \left( 1 - g_i^2(t) \right) \right] x_j(t) \sum_{k=1}^{N_h} \left[ w_{c,k}^{(2)}(t)\, \tfrac{1}{2} \left( 1 - p_k^2(t) \right) w_{c,k,n+1}^{(1)}(t) \right]. \qquad (13.29)$$

In implementation, equations 13.16 and 13.19 are used to update the weights in the critic network, and equations 13.26 and 13.29 are used to update the weights in the action network.
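A matching sketch for the action network of equations 13.20 to 13.29 (same assumptions as the critic sketch). The factor `dJ_du` is the summation term of equations 13.26 and 13.29; with the `CriticNetwork` sketch above it can be computed as `critic.w2 @ (0.5 * (1 - critic.p**2) * critic.w1[:, -1])`, taking the last input column as the one fed by u(t):

```python
import numpy as np

def bipolar_sigmoid(z):
    # same node nonlinearity as in the critic sketch
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

class ActionNetwork:
    def __init__(self, n_states, n_hidden, rng=np.random.default_rng(1)):
        self.w1 = rng.uniform(-1.0, 1.0, (n_hidden, n_states))  # w_a^(1)
        self.w2 = rng.uniform(-1.0, 1.0, n_hidden)              # w_a^(2)

    def forward(self, x):
        self.x = x
        self.h = self.w1 @ x              # equation 13.23
        self.g = bipolar_sigmoid(self.h)  # equation 13.22
        self.v = self.w2 @ self.g         # equation 13.21
        self.u = bipolar_sigmoid(self.v)  # equation 13.20
        return self.u

    def update(self, e_a, dJ_du, lr):
        # common factor e_a * dJ/du * du/dv of equations 13.26 and 13.29
        s = e_a * dJ_du * 0.5 * (1.0 - self.u**2)
        grad_w2 = s * self.g                                     # equation 13.26
        delta = s * self.w2 * 0.5 * (1.0 - self.g**2)
        grad_w1 = np.outer(delta, self.x)                        # equation 13.29
        self.w2 -= lr * grad_w2                                  # equation 13.24
        self.w1 -= lr * grad_w1                                  # equation 13.27
```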

13.3 Analytical Characteristics of an Online NDP Learning Process

This section is dedicated to expositions of analytical properties of the online learning algorithms in the context of neural dynamic programming (NDP). It is important to note that there are no available training sets of input/output pairs to be used for approximating J* in the sense of least-squares-fit in NDP applications. Both the control action u and the approximated J function are updated according to an error function that changes from one time-step to the next. Therefore, the convergence argument for the steepest descent algorithm does not hold valid for either of the two networks, action or critic. This results in a simulation approach to evaluate the cost-to-go function J for a given control action u. The online learning takes place as the system runs, aiming at iteratively improving the control policies, which raises computational difficulties that do not arise in a more typical network training context. The closest analytical results in terms of approximating the J function were obtained by Tsitsiklis and Van Roy (1997), where a linear-in-parameter function approximator was used to approximate the J function. The limit of convergence was characterized as the solution to a set of interpretable linear equations, and a bound was placed on the resulting approximation error.

It is worth pointing out that the existing implementations of NDP are usually computationally very intensive (Bertsekas and Tsitsiklis, 1996) and often require a considerable amount of trial and error. Most of the computations and experimentations with different approaches were conducted off-line. The following paragraphs provide some analytical insight on the online learning process for the NDP designs proposed in this chapter. Specifically, the stochastic approximation argument is used to reveal the asymptotic performance of this discussion's online NDP learning algorithms in an averaged sense for the action and the critic networks under certain conditions.

13.3.1 Stochastic Approximation Algorithms

The original work in recursive stochastic approximation algorithms was introduced by Robbins and Monro (1951), who developed and analyzed a recursive procedure for finding the root of a real-valued function g(w) of a real variable w. The function is not known, but noise-corrupted observations could be taken at values of w selected by the experimenter.

A function g(w) with the form g(w) = E_x[f(w)] (E_x[·] is the expectation operator) is called a regression function of f(w) and, conversely, f(w) is called a sample function of g(w). The following conditions are needed to obtain the Robbins-Monro algorithm (Robbins and Monro, 1951):

C1: g(w) has a single root w*, g(w*) = 0, and:

g(w) < 0 if w < w*,
g(w) > 0 if w > w*.

This first condition is assumed with little loss of generality since most functions with a single root that do not satisfy this condition can be made to do so by multiplying the function by −1.

C2: The variance of f(w) from g(w) is finite:

$$\sigma^2(w) = E_x\left[ (g(w) - f(w))^2 \right] < \infty. \qquad (13.30)$$

C3:

$$|g(w)| < B_1 |w - w^*| + B_0 < \infty. \qquad (13.31)$$

This third condition is a very mild one. The values of B_1 and B_0 need not be known to prove the validity of the algorithm. As long as the root lies in some finite interval, the existence of B_1 and B_0 can always be assumed.

If the conditions C1 through C3 are satisfied, the algorithm from Robbins and Monro (1951) can be used to iteratively seek the root w* of the function g(w):

$$w(t+1) = w(t) - l(t) f[w(t)], \qquad (13.32)$$

where l(t) is a sequence of positive numbers that satisfies the following conditions:

$$\lim_{t \to \infty} l(t) = 0, \qquad \sum_{t=0}^{\infty} l(t) = \infty, \qquad \sum_{t=0}^{\infty} l^2(t) < \infty. \qquad (13.33)$$


Furthermore, w(t) will converge toward w* in the mean square error sense and with probability 1:

$$\lim_{t \to \infty} E_x\left[ \|w(t) - w^*\|^2 \right] = 0. \qquad (13.34)$$

$$\mathrm{Prob}\left\{ \lim_{t \to \infty} w(t) = w^* \right\} = 1. \qquad (13.35)$$

The convergence with probability 1 in equation 13.35 is also called convergence almost surely. In this chapter, the Robbins-Monro algorithm is applied to optimization problems (Kushner and Yin, 1997). In that setting, g(w) = ∂E/∂w, where E is an objective function to be optimized. If E has a local optimum at w*, g(w) will satisfy the condition C1 locally at w*. If E has a quadratic form, g(w) will satisfy the condition C1 globally.
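As a self-contained illustration of the iteration in equation 13.32, the sketch below uses step sizes l(t) = 1/(t+1), which satisfy all three conditions of equation 13.33; the noisy sample function is a toy assumption, not from the chapter:

```python
import numpy as np

def robbins_monro(sample_f, w0, steps=10_000):
    """Equation 13.32: w(t+1) = w(t) - l(t) * f[w(t)]."""
    w = w0
    for t in range(steps):
        w -= (1.0 / (t + 1)) * sample_f(w)
    return w

# Toy regression function g(w) = w - 3 observed through additive noise;
# the iterate converges toward the root w* = 3.
rng = np.random.default_rng(2)
print(robbins_monro(lambda w: (w - 3.0) + rng.normal(0.0, 0.5), w0=0.0))
```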

13.3.2 Convergence in Statistical Average for Action and Critic Networks

Convergence analysis of online neural dynamic programming is still in its early stage of development. The problem is not trivial due to several consecutive learning segments being updated simultaneously. A practically effective online learning mechanism and a step-by-step analytical guide for the learning process do not coexist at this time. This chapter is dedicated to reliable implementations of NDP algorithms for solving a general class of online learning control problems. As demonstrated in previous sections, experimental results in this direction are very encouraging. This section provides some asymptotic convergence results for each component of the NDP system. The Robbins-Monro algorithm provided in the previous section is the main tool to obtain the results. Throughout this chapter, it is implied that the state measurements are samples of a continuous state-space. Specifically, the discussion assumes without loss of generality that the input $X_j \in X \subset \mathbb{R}^n$ has discrete probability density $p(X) = \sum_j p_j \delta(X - X_j)$, where δ(·) is the delta function.

The following paragraphs analyze one component of the NDP system at a time. To examine the learning process taking place in the action network, an averaged error measure for the action network is defined as:

$$E_a = \frac{1}{2} \sum_i p_i \left[ J(X_i) - U_c \right]^2 = \frac{1}{2} E_x\left[ (J - U_c)^2 \right]. \qquad (13.36)$$

It can be seen that equation 13.36 is an "averaged" error measure between the estimated J and a final desired value U_c instead of an instantaneous error square between the two. To obtain a (local) minimum for the averaged error measure in equation 13.36, the Robbins-Monro algorithm can be applied by first taking a derivative of this error with respect to the parameters, which are the weights in the action network in this case. Let:

$$\tilde{e}_a = J - U_c. \qquad (13.37)$$

Since J is smooth in $\tilde{w}_a$, and $\tilde{w}_a$ belongs to a bounded set, the derivative of E_a with respect to the weights of the action network is then of the form:

$$\frac{\partial E_a}{\partial \tilde{w}_a} = E_x\left[ \tilde{e}_a \frac{\partial J}{\partial \tilde{w}_a} \right]. \qquad (13.38)$$

According to the Robbins-Monro algorithm, the root (which can be a local root) of $\partial E_a / \partial \tilde{w}_a$ as a function of $\tilde{w}_a$ can be obtained by the following recursive procedure, if the root exists and if the step size $l_a(t)$ meets all the requirements described in equation 13.33:

$$\tilde{w}_a(t+1) = \tilde{w}_a(t) - l_a(t) \left[ \tilde{e}_a \frac{\partial J}{\partial \tilde{w}_a} \right]. \qquad (13.39)$$

Equation 13.37 may be considered as an instantaneous error between a sample of the J function and the desired value U_c. Therefore, equation 13.39 is equivalent to the update equation for the action network given in equations 13.9 to 13.11. From this viewpoint, the online action network updating rule of equations 13.9 to 13.11 is actually converging to a (local) minimum of the error square between the J function and the desired value U_c in a statistical average sense. In other words, even if these equations represent a reduction in instantaneous error square at each iterative time-step, the action network updating rule asymptotically reaches a (local) minimum of the statistical average of $(J - U_c)^2$.

By the same token, a similar framework can be constructed to describe the convergence of the critic network. Recall that the residual of the principle of optimality equation to be balanced by the critic network is of the following form:

$$e_c(t) = \alpha J(t) - J(t-1) + r(t). \qquad (13.40)$$

The instantaneous error square of this residual is given as:

$$E_c(t) = \tfrac{1}{2} e_c^2(t). \qquad (13.41)$$

Instead of the instantaneous error square, let:

$$E_c = E_x[E_c(t)], \qquad (13.42)$$

and assume that the expectation is well-defined over the discrete state measurements. The derivative of E_c with respect to the weights of the critic network is then of the form:

$$\frac{\partial E_c}{\partial \tilde{w}_c} = E_x\left[ e_c \frac{\partial e_c}{\partial \tilde{w}_c} \right]. \qquad (13.43)$$

13.4 Example 1

13.4.1 The Cart-Pole Balancing Problem

The force F applied to the cart is binary and is obtained from the continuous output u(t) of the action network through F = 10 sgn(u(t)), where:

$$\mathrm{sgn}(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{if } x = 0, \\ -1, & \text{if } x < 0. \end{cases}$$

The nonlinear differential equations 13.45 and 13.46 of the cart-pole model are numerically solved by a fourth-order Runge-Kutta method. This model provides four state variables: (1) x(t), position of the cart on the track; (2) θ(t), angle of the pole with respect to the vertical position; (3) ẋ(t), cart velocity; and (4) θ̇(t), angular velocity.
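The cart-pole equations themselves are not recoverable here, so the sketch below shows only the fourth-order Runge-Kutta stepping named in the text, against a generic dynamics function `f(state, u)` whose signature is an assumption:

```python
import numpy as np

def rk4_step(f, state, u, dt=0.02):
    """One fourth-order Runge-Kutta step; `f(state, u)` returns the
    time derivative of the state vector (a NumPy array). The default
    dt matches the chapter's 0.02 sec time-step."""
    k1 = f(state, u)
    k2 = f(state + 0.5 * dt * k1, u)
    k3 = f(state + 0.5 * dt * k2, u)
    k4 = f(state + dt * k3, u)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```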

In the current study, a run consists of a maximum of 1,000 consecutive trials. It is considered successful if the last trial (trial number less than 1,000) of the run has lasted 600,000 time-steps. Otherwise, if the controller is unable to learn to balance the cart-pole within 1,000 trials (i.e., none of the 1,000 trials has lasted over 600,000 time-steps), then the run is considered unsuccessful. This chapter's simulations have used 0.02 sec for each time-step, and a trial is a complete process from start to fall. A pole is considered fallen when the pole is outside the range of [−12°, 12°] and/or the cart is beyond the range of [−2.4, 2.4] meters in reference to the central position on the track. Note that although the force F applied to the cart is binary, the control u(t) fed into the critic network as shown in Figure 13.1 is continuous.

13.4.2 Simulation Results

Several experiments were conducted to evaluate the effectiveness of this chapter's learning control designs. The parameters used in the simulations are summarized in Table 13.1 with the proper notations defined in the following:

l_c(0): Initial learning rate of the critic network
l_a(0): Initial learning rate of the action network
l_c(t): Learning rate of the critic network at time t, which is decreased by 0.05 every 5 time-steps until it reaches 0.005 and stays at l_c(f) = 0.005 thereafter
l_a(t): Learning rate of the action network at time t, which is decreased by 0.05 every 5 time-steps until it reaches 0.005 and stays at l_a(f) = 0.005 thereafter
N_c: Internal cycle of the critic network
N_a: Internal cycle of the action network
T_c: Internal training error threshold for the critic network
T_a: Internal training error threshold for the action network
N_h: Number of hidden nodes

TABLE 13.1 Summary of Parameters Used in Obtaining the Results Given in Table 13.2

Parameter    l_c(0)    l_a(0)    l_c(f)    l_a(f)
Value        0.3       0.3       0.005     0.005

Parameter    N_c    N_a    T_c     T_a     N_h
Value        50     100    0.05    0.005   6

Note that the weights in the action and the critic networks were trained using their internal cycles, N_a and N_c, respectively. That is, within each time-step, the weights of the two networks were updated for at most N_a and N_c times, respectively, or stopped once the internal training error thresholds T_a and T_c had been met.

To be more realistic, both a sensor noise and an actuator noise have been added to the state measurements and the action network output. Specifically, the actuator noise has been implemented through u(t) = u(t) + ρ, where ρ is a uniformly distributed random variable. For the sensor noise, both uniform and Gaussian random variables were added to the angle measurements θ. The uniform state sensor noise was implemented through θ = (1 + noise percentage) × θ. Gaussian sensor noise was zero mean with specified variance.
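A minimal sketch of the two noise models just described; the noise magnitudes here are placeholders, not the chapter's settings:

```python
import numpy as np

def corrupt(u, theta, rng, actuator_mag=0.1, sensor_pct=0.05):
    """Actuator noise u(t) = u(t) + rho (rho uniform) and uniform
    multiplicative sensor noise on the angle measurement."""
    u_noisy = u + rng.uniform(-actuator_mag, actuator_mag)
    theta_noisy = (1.0 + rng.uniform(-sensor_pct, sensor_pct)) * theta
    return u_noisy, theta_noisy
```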

The proposed configuration of neural dynamic programming has been evaluated, and the results are summarized in Table 13.2. The simulation results summarized in Table 13.2 were obtained through averaged runs. Specifically, 100 runs were performed to obtain the results reported here. Each run was initialized to random conditions in terms of network weights. If a run is successful, the number of trials it took to balance the cart-pole is then recorded. The number of trials listed in the table corresponds to the one averaged over all of the successful runs. Therefore, there is a need to record the percentage of successful runs out of 100. This number is also recorded in the table. A good configuration is one with a high percentage of successful runs as well as a low average number of trials needed to learn to perform the balancing task.

TABLE 13.2 Performance Evaluation of the NDP Learning Controller when Balancing the Cart-Pole. The second column represents the percentage of successful runs out of 100. The third column depicts the average number of trials it took to learn to balance the cart-pole. The averages are taken over the successful runs.

Noise type                  Success rate    Number of trials
None (noise-free)           100%            6
Uniform 5% actuator         100%            8
Uniform 10% actuator        100%            14
Uniform 5% sensor           100%            32
Uniform 10% sensor          100%            54
Gaussian σ² = 0.1 sensor    100%            164
Gaussian σ² = 0.2 sensor    100%            193

FIGURE 13.3 A Typical Angle Trajectory (in degrees, versus time-steps) During a Successful Learning Trial for the NDP Controller when the System is Free of Noise

Figure 13.3 shows a typical movement or trajectory of the pendulum angle under an NDP controller for a successful learning trial. The system under consideration is not subject to any noise. Figure 13.4 represents a summary of typical statistics of the learning process in histograms. It contains vertical angle histograms when the system learns to balance the cart-pole using ideal state measurements without noise corruption.

FIGURE 13.4 Histogram of Angle Variations (in degrees) Under the Control of the NDP Online Learning Mechanism in the Single Cart-Pole Problem. The system is free of noise in this case.

13.5 Example 2

This section examines the performance of the proposed NDP design in a pendulum swing-up and balancing task. The case under study is identical to the one in Santamaria et al. (1996). The pendulum is held by one end and can swing in a vertical plane. The pendulum is actuated by a motor that applies a torque at the held end. The dynamics of the pendulum are as follows:

$$\frac{d\omega}{dt} = \frac{3}{4ml^2} \left( F + mlg \sin\theta \right). \qquad (13.47)$$

$$\frac{d\theta}{dt} = \omega. \qquad (13.48)$$

In equations 13.47 and 13.48, m = 1/3 and l = 3/2 are the mass and length of the pendulum bar, respectively, and g = 9.8 is the gravitational acceleration. The action is the angular acceleration F, and it is bounded between −3 and 3, namely F_min = −3 and F_max = 3. The action is applied every four time-steps.
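Equations 13.47 and 13.48 drop straight into the Runge-Kutta stepper sketched in Example 1; the state layout below is an assumption:

```python
import numpy as np

def pendulum_dynamics(state, F, m=1.0/3.0, l=3.0/2.0, g=9.8):
    """Equations 13.47-13.48 with state = (theta, omega)."""
    theta, omega = state
    domega = 3.0 / (4.0 * m * l**2) * (F + m * l * g * np.sin(theta))
    return np.array([omega, domega])
```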

The system states are the current angle θ and the angular velocity ω. This task requires the controller not only to swing up the bar but also to balance it at the top position. The pendulum initially sits still at θ = π. This task is considered difficult in the sense that (1) there is no closed-form optimal solution, and complex numerical methods are required to compute it, and (2) the maximum and minimum angular acceleration values are not strong enough to move the pendulum straight up from the starting state without first creating angular momentum (Santamaria et al., 1996).

In this study, a run consists of a maximum of 100 consecutive trials. It is considered successful if the last trial (trial number less than 100) of the run has lasted 800 time-steps

(with a step size of 0.05 sec). Otherwise, if the NDP controller is unable to swing up and keep the pendulum balanced at the top within 100 trials (i.e., none of the 100 trials has lasted over 800 time-steps), then the run is considered unsuccessful. In this discussion's simulations, a trial is either terminated at the end of the 800 time-steps or when the angular velocity of the pendulum is greater than 2π (i.e., ω > 2π).

The following paragraphs explain two implementation scenarios with different settings of the reinforcement signal r. In setting 1, r = 0 when the angle displacement is within 90° from the position θ = 0; r = −0.4 when the angle is in the lower half of the plane; and r = −1 when the angular velocity ω > 2π. In setting 2, r = 0 when the angle displacement is within 10° from the position θ = 0; r = −0.4 when the angle is in the remaining area of the plane; and r = −1 when the angular velocity ω > 2π.
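The two reward settings can be read as one piecewise function; the angle normalization to (−π, π] is an interpretation of "within 90°/10° of θ = 0," not the chapter's code:

```python
import numpy as np

def reinforcement(theta, omega, setting=1):
    """r = 0 near upright, -0.4 otherwise, -1 on excessive spin."""
    if abs(omega) > 2.0 * np.pi:
        return -1.0
    band = np.pi / 2.0 if setting == 1 else np.pi / 18.0  # 90 or 10 degrees
    angle = np.arctan2(np.sin(theta), np.cos(theta))      # normalize to (-pi, pi]
    return 0.0 if abs(angle) <= band else -0.4
```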

This chapter's proposed NDP configuration is then used to perform the described task. The same configuration and the same learning parameters as those in the first case study are used. NDP controller performance is summarized in Table 13.3.

TABLE 13.3 Performance Evaluation of the NDP Learning Controller to Swing Up and Then Balance a Pendulum

Reinforcement implementation    Success rate    Number of trials
Setting 1                       100%            4.2
Setting 2                       96%             3.5

FIGURE 13.5 A Typical Angle Trajectory During a Successful Learning Trial for the NDP Controller in the Pendulum Swing-Up and Balancing Task. (A) Entire Trial. (B) Portion of Entire Trajectory. Both panels plot angle (degrees) versus time-steps.

The simulation results summarized in the table were obtained through averaged runs. Specifically, 60 runs were performed to obtain the results reported here. Note that more runs have been used than those in Santamaria et al.

(1996) (which was 36) to generate the final result statistics. Every other simulation condition has been kept the same as in Santamaria et al. (1996). Each run was initialized to θ = −π and ω = 0. The number of trials listed in the table corresponds to the one averaged over all of the successful runs; the percentage of successful runs out of 60 was also recorded. In Table 13.3, the second column represents the percentage of successful runs out of 60, and the third column depicts the average number of trials it took to learn to successfully perform the task, with the averages taken over the successful runs. Figure 13.5 shows a typical trajectory of the pendulum angle under an NDP controller for a successful learning trial under setting 1 and setting 2.

13.6 Conclusion

This chapter is an introduction to a learning control scheme that can be implemented in real time. It may be viewed as a model-independent approach to the adaptive critic designs. The chapter demonstrates the implementation details and learning results using two illustrative examples. The chapter also provides a view on the convergence property of the learning process under the assumption that the two network models, the critic and the action, can be learned separately. The analysis of the complete learning process remains an open issue.

References

Barto, A.G., Sutton, R.S., and Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 834-846.
Bertsekas, D.P., and Tsitsiklis, J.N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Kushner, H.J., and Yin, G.G. (1997). Stochastic approximation algorithms and applications. New York: Springer.
Robbins, H., and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400-407.
Santamaria, J.C., Sutton, R.S., and Ram, A. (1996). Experiments with reinforcement learning in problems with continuous state and action spaces. COINS Technical Report 96-88, University of Massachusetts, Amherst.
Tsitsiklis, J.N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5), 674-690.
Werbos, P.J. (1990). A menu of designs for reinforcement learning over time. In Neural Networks for Control, W.T. Miller III, R.S. Sutton, and P.J. Werbos (eds.). Cambridge, MA: MIT Press.
Werbos, P.J. (1992). Approximate dynamic programming for real-time control and neural modeling. In Handbook of Intelligent Control, D. White and D. Sofge (eds.). New York: Van Nostrand Reinhold.