Reinforcement Learning (1) Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)
Page 1:

Reinforcement Learning (1)

Bob Durrant
School of Computer Science

University of Birmingham

(Slides: Dr Ata Kabán)

Page 2:

• Learning by reinforcement
• State-action rewards
• Markov Decision Process
• Policies and Value functions
• Q-learning

Reinforcement Learning (1)

Page 3:

Learning by reinforcement

• Examples:
  – Learning to play Backgammon
  – Robot learning to dock on a battery charger

• Characteristics:
  – No direct training examples – delayed rewards instead
  – Need for exploration & exploitation
  – The environment is stochastic and unknown
  – The actions of the learner affect future rewards

Page 4:

[Diagram contrasting Reinforcement Learning with Supervised Learning]

Page 5:

Brief history & successes

• Minsky’s PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer

• Analogies with animal learning and psychology
• TD-Gammon (Tesauro, 1992) – big success story
• Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997)
• Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach
• ‘An approximate solution to a complex problem can be better than a perfect solution to a simplified problem’

Page 6:

The RL problem

• States
• Actions
• Immediate rewards
• Eventual reward: the (discounted) sum of immediate rewards, to be maximised from any starting state
• Discount factor

Page 7:

Markov Decision Process (MDP)

• An MDP is a formal model of the RL problem
• At each discrete time point:
  – The agent observes state s_t and chooses action a_t
  – It receives reward r_t from the environment, and the state changes to s_{t+1}
• Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action
  – In general, the functions r and δ may not be deterministic and are not necessarily known to the agent
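To make the deterministic MDP concrete, here is a minimal sketch (not from the slides) that represents the reward function r and the transition function δ as lookup tables; the states and actions are invented for illustration.

```python
# Minimal sketch of a deterministic MDP as lookup tables.
# The states/actions below are invented for illustration; they are not
# the grid world used later in the slides.

# delta[(s, a)] = next state, r[(s, a)] = immediate reward
delta = {("s1", "right"): "s2", ("s2", "right"): "s1"}
r     = {("s1", "right"): 0,    ("s2", "right"): 100}

def step(s, a):
    """One discrete time step: the agent in state s takes action a,
    receives the reward r(s, a), and the state changes to delta(s, a)."""
    return r[(s, a)], delta[(s, a)]

print(step("s1", "right"))   # -> (0, 's2')
```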

Page 8:

Agent’s Learning Task

Execute actions in the environment, observe the results, and
• Learn an action policy π : S → A that maximises

  E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ]

  from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards.
• Note:
  – The target function is π : S → A
  – There are no training examples of the form (s, a), only of the form ((s, a), r)
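As a small illustration of the objective above, the sketch below (mine, not from the slides) computes the discounted sum of a finite reward sequence.

```python
# Discounted return of a finite reward sequence: sum_i gamma**i * r_{t+i}
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** i * reward for i, reward in enumerate(rewards))

# Rewards of 0, 0, then 100, discounted by gamma = 0.9:
print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0
```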

Page 9:

Example: TD-Gammon

• Immediate reward:
  +100 if win
  −100 if lose
  0 for all other states
• Trained by playing 1.5 million games against itself
• Now approximately equal to the best human player

Page 10:

Example: Mountain-Car

• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Rewards:
  – Reward = −1 for every step, until the car reaches the top
  – Reward = 1 at the top, 0 otherwise, with γ < 1

• The eventual reward will be maximised by minimising the number of steps to the top of the hill
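A hedged sketch of the first reward scheme above (−1 per step until the top); the goal-position threshold used here is a made-up placeholder, not the actual Mountain-Car constant.

```python
# Sketch of the "-1 per step until the top" reward scheme.
# GOAL_POSITION is a placeholder value, not the real Mountain-Car constant.
GOAL_POSITION = 0.5

def reward(position):
    """-1 on every step before the goal, 0 once the car reaches the top."""
    return 0.0 if position >= GOAL_POSITION else -1.0
```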

Page 11:

Value function

We will consider deterministic worlds first.
• Given a policy π (adopted by the agent), define an evaluation function over states:

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i}

• Property:

  V^π(s_t) = r_t + γ V^π(s_{t+1})
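A minimal sketch of evaluating V^π in a deterministic world by repeatedly applying the property above; it assumes the lookup-table representation of r and δ from the earlier MDP sketch, plus an absorbing goal state, which are my assumptions rather than anything stated on the slides.

```python
# Iterative policy evaluation in a deterministic world, using the property
#   V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
# r and delta are dicts keyed by (state, action); pi maps state -> action.
def evaluate_policy(states, pi, r, delta, gamma=0.9, sweeps=100, absorbing=()):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s in absorbing:          # no reward is collected after the goal
                continue
            a = pi[s]
            V[s] = r[(s, a)] + gamma * V[delta[(s, a)]]
    return V
```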

Page 12:

Example

Grid world environment
• Six possible states
• Arrows represent possible actions
• G: goal state

What is the best thing to do when in each state?
• One optimal policy – denoted π*
• Compute the values of the states for this policy – denoted V*

Page 13:

r(s,a) (immediate reward) values; V*(s) values, with γ = 0.9

one optimal policy:

V*(s6) = 100 + 0.9*0 = 100

V*(s5) = 0 + 0.9*100 = 90

V*(s4) = 0 + 0.9*90 = 81

Restated, the task is to learn the optimal policy π*:

  π* = argmax_π V^π(s)   (for all s)

The values above follow the recurrence V*(s_t) = r_t + γ V*(s_{t+1}).
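The same calculation written out in code, applying the recurrence backwards from the goal:

```python
# V*(s_t) = r_t + gamma * V*(s_{t+1}), worked backwards from the goal state.
gamma = 0.9
V_goal = 0.0                      # the goal is absorbing: no further reward
V_s6 = 100 + gamma * V_goal       # 100.0
V_s5 = 0   + gamma * V_s6         # 90.0
V_s4 = 0   + gamma * V_s5         # 81.0
print(V_s6, V_s5, V_s4)           # 100.0 90.0 81.0
```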

Page 14:

The task, revisited

• We might try to have the agent learn the evaluation function V*

• It could then do a one-step lookahead search to choose the best action from any state, using

  π*(s) = argmax_a { r(s, a) + γ V*(δ(s, a)) }

  … yes, if we knew both the transition function δ and the reward function r. In general these are unknown to the agent, so it cannot choose actions this way.

• BUT: there is a way to do it!
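If r and δ were known, the one-step lookahead above could be written as below; this is only a sketch, reusing the lookup-table representation assumed in the earlier sketches, and V_star would itself have to be known or learned.

```python
# One-step lookahead:  pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
# r and delta are dicts keyed by (state, action); V_star maps state -> value.
def greedy_action(s, actions, r, delta, V_star, gamma=0.9):
    return max(actions, key=lambda a: r[(s, a)] + gamma * V_star[delta[(s, a)]])
```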

Page 15:

Q function

• Define a new function very similar to V*:

  Q(s, a) = r(s, a) + γ V*(δ(s, a))

• What difference does this make?
• If the agent learns Q, then it can choose the optimal actions even without knowing δ. Let us see how.

Page 16:

Q(s, a) = r(s, a) + γ V*(δ(s, a))

• Rewrite things using this new definition:

  π*(s) = argmax_a { r(s, a) + γ V*(δ(s, a)) }
        = argmax_a Q(s, a)

  V*(s) = max_a Q(s, a)

  This allows us to rewrite Q recursively:

  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
              = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

• Now, let Q̂ denote the agent's current approximation to Q. Consider the iterative update rule

  Q̂(s, a) := r(s, a) + γ max_{a'} Q̂(δ(s, a), a')

  Under some assumptions (each ⟨s, a⟩ visited infinitely often), this will converge to the true Q.

Page 17:

Q Learning algorithm (in deterministic worlds)

• For each (s, a), initialise the table entry Q̂(s, a) := 0
• Observe the current state s
• Do forever:
  – Select an action a and execute it
  – Receive immediate reward r
  – Observe the new state s'
  – Update the table entry: Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
  – s := s'
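A minimal sketch of this algorithm in Python for a deterministic world. The environment interface (r and δ as dicts, an absorbing goal state that ends an episode, purely random action selection) is my assumption for illustration; it is not prescribed by the slides.

```python
import random
from collections import defaultdict

def q_learning(actions, r, delta, start, goal, gamma=0.9, episodes=1000):
    """Tabular Q-learning in a deterministic world, following the update
    Q(s, a) := r + gamma * max_a' Q(s', a').  Actions are chosen at random,
    which still converges as long as every (s, a) pair keeps being visited."""
    Q = defaultdict(float)                       # Q(s, a) initialised to 0
    for _ in range(episodes):
        s = start
        while s != goal:                         # an episode ends at the goal
            a = random.choice(actions)           # pure exploration
            reward, s_next = r[(s, a)], delta[(s, a)]
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

# Tiny invented world: s1 -> s2 -> G, with reward 100 for entering G.
actions = ["left", "right"]
delta = {("s1", "right"): "s2", ("s1", "left"): "s1",
         ("s2", "right"): "G",  ("s2", "left"): "s1"}
r     = {("s1", "right"): 0,   ("s1", "left"): 0,
         ("s2", "right"): 100, ("s2", "left"): 0}
Q = q_learning(actions, r, delta, start="s1", goal="G")
print(round(Q[("s1", "right")]), round(Q[("s2", "right")]))   # 90 100
```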

Page 18:

Example updating Q

Q̂(s_1, a_right) := r + γ max_{a'} Q̂(s_2, a')
               = 0 + 0.9 × max{63, 81, 100}
               = 90

given the Q̂ values from a previous iteration shown on the arrows.
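The same update as plain arithmetic:

```python
# Q(s1, a_right) := r + gamma * max{63, 81, 100}
gamma = 0.9
print(0 + gamma * max(63, 81, 100))   # 90.0
```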

Page 19:

Sketch of the convergence proof of Q-learning

• Consider the case of a deterministic world, where each (s, a) is visited infinitely often.

• Define a full interval as an interval during which each (s, a) is visited. It can easily be shown that, during any such interval, the absolute value of the largest error in the Q̂ table is reduced by a factor of γ.

• Consequently, since γ < 1, after infinitely many updates the largest error converges to zero.

• Go through the details from [Mitchell, sec. 13.3,6.]

Page 20:

An Observation

• As a consequence of the convergence proof, Q-learning need not train on optimal action sequences in order to converge to the optimal policy. It can learn the Q function (and hence the optimal policy) while training from actions chosen at random as long as the resulting training sequence visits every ordered (state, action) pair infinitely often.

Page 21:

Exploration versus Exploitation

• The Q-learning algorithm doesn't say how we should choose an action
• If we always choose an action that maximises our estimate of Q, we could end up not exploring better alternatives
• To converge on the true Q values we must favour higher estimated Q values, but still have a chance of choosing worse estimated Q values for exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec. 13.3.4]). An action selection function of the following form may be employed, where k > 0:

  P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)}
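A sketch of this selection rule in Python; the particular value of the constant k used below is arbitrary (larger k favours exploitation, k closer to 1 favours exploration).

```python
import random

def select_action(s, actions, Q, k=2.0):
    """Choose an action with probability P(a_i | s) proportional to k ** Q(s, a_i).
    Q is a dict keyed by (state, action); k > 0 controls exploration:
    larger k exploits high Q-estimates, k near 1 chooses almost uniformly."""
    weights = [k ** Q[(s, a)] for a in actions]
    probs = [w / sum(weights) for w in weights]
    return random.choices(actions, weights=probs)[0]
```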

Page 22:

Nondeterministic case

• What if the reward and the state transition are not deterministic? E.g. in Backgammon, learning and playing depend on rolls of dice!
• Then V and Q need to be redefined using their expected values:

  V^π(s_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ] = E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

  Q(s, a) = E[ r(s, a) + γ V*(δ(s, a)) ]

• Similar reasoning and a convergent update iteration will apply
• Will continue next week.
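A rough sketch of what "redefined using expected values" means computationally: the expectation can be approximated by averaging sampled discounted returns. The helper sample_episode below is hypothetical (it would roll the dice, follow the policy, and return one sampled reward sequence); it is not something defined on the slides.

```python
# Monte-Carlo approximation of V^pi(s0) = E[ sum_i gamma**i * r_{t+i} ]
# when rewards and transitions are stochastic.
# `sample_episode(policy, s0)` is a hypothetical helper that plays out one
# episode and returns its list of rewards; it is not defined on the slides.
def estimate_V(policy, sample_episode, s0, gamma=0.9, n_episodes=1000):
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(policy, s0)
        total += sum(gamma ** i * reward for i, reward in enumerate(rewards))
    return total / n_episodes
```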

Page 23:

Summary

• Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance

• The goal of a reinforcement learning program is to maximise the eventual reward

• Q-learning is a form of reinforcement learning that does not require the learner to have prior knowledge of how its actions affect the environment