Transcript

Page 1:

© Jude Shavlik 2006, David Page 2007
CS 760 – Machine Learning (UW-Madison), RL Lecture

Reinforcement Learning (RL)

• Consider an “agent” embedded in an environment

• Task of the agent – Repeat forever:

1) sense world

2) reason

3) choose an action to perform

Page 2:

Definition of RL

• Assume the world (i.e., environment) periodically provides rewards or punishments (“reinforcements”)

• Based on the reinforcements received, learn how to better choose actions

Page 3:

Sequential Decision Problems
(Courtesy of A.G. Barto, April 2000)

• Decisions are made in stages

• The outcome of each decision is not fully predictable but can be observed before the next decision is made

• The objective is to maximize a numerical measure of total reward (or equivalently, to minimize a measure of total cost)

• Decisions cannot be viewed in isolation: need to balance the desire for immediate reward with the possibility of high future reward

Page 4:

Reinforcement Learning vs. Supervised Learning

• How would we use SL to train an agent in an environment?

• Show the action to choose in a sample of world states – “I/O pairs”

• RL requires much less of the teacher

• Must set up a “reward structure”

• Learner “works out the details” – i.e., writes a program to maximize the rewards received

Page 5:

Embedded Learning Systems: Formalization

• S_E = the set of states of the world
  – e.g., an N-dimensional vector (“sensors”)

• A_E = the set of possible actions an agent can perform (“effectors”)

• W = the world

• R = the immediate reward structure

W and R are the environment; they can be probabilistic functions

Page 6:

Embedded Learning Systems (formalization)

W: S_E × A_E → S_E
The world maps a state and an action and produces a new state

R: S_E × A_E → “reals”
Provides rewards (a number) as a function of state and action (as in the textbook). Can equivalently be formalized as a function of the (next) state alone.
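Purely as an illustration (not from the slides), here is a minimal Python sketch of this formalization for a hypothetical three-state world; the state names, transitions, and rewards below are made up:

```python
# Hypothetical toy environment illustrating W: S_E x A_E -> S_E and R as a
# function of the resulting state (one of the two equivalent formalizations above).
STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# W maps (state, action) to the next state (deterministic here).
W = {
    ("s0", "right"): "s1",
    ("s1", "right"): "s2",
    ("s1", "left"): "s0",
    ("s2", "left"): "s1",
}

# R gives the immediate reward for arriving in a state.
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}

def step(state, action):
    """Apply the world function and return (next_state, immediate_reward)."""
    next_state = W.get((state, action), state)  # stay put for undefined moves
    return next_state, R[next_state]
```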

Page 7:

A Graphical View of RL

• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions.

• For now, assume deterministic problems

[Diagram: the agent sends an action to the real world, W; the world returns sensory info plus R, a scalar reward – an indirect teacher.]

Page 8:

Common Confusion

State need not be solely the current sensor readings

• Markov Assumption: the value of a state is independent of the path taken to reach that state

• Can have memory of the past: one can always create a Markovian task by remembering the entire past history

Page 9:

Need for Memory: Simple Example

“out of sight, but not out of mind”

[Diagram: at T=1 the learning agent can see an opponent on the other side of a wall; at T=2 the opponent has moved out of the agent’s sight behind the wall. It seems reasonable to remember the opponent recently seen.]

Page 10:

State vs. Current Sensor Readings

Remember:

state is what is in one’s head (past memories, etc.),

not ONLY what one currently sees/hears/smells/etc.

Page 11:

Policies

The agent needs to learn a policy

  π_E : S_E → A_E

Given a world state (in S_E), which action (in A_E) should be chosen? The policy function π_E answers this.

Remember: the agent’s task is to maximize the total reward received during its lifetime

Page 12:

Policies (cont.)

To construct π_E, we will assign a utility (U) – a number – to each state:

  U(s) = Σ_{t=1}^{∞} γ^{t-1} R(s, π_E, t)

- γ is a positive constant < 1

- R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0

- Note: future rewards are discounted by γ^{t-1}
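As a quick sanity check on this definition (not part of the slides), here is a small Python sketch that computes the discounted sum for a finite reward sequence; the example rewards are made up:

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma^(t-1) * r_t for t = 1..len(rewards)."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Example: rewards 0, 0, 3 with gamma = 2/3 (the discount used in the worked
# example later) give 0 + 0 + (2/3)^2 * 3 = 4/3.
print(discounted_utility([0, 0, 3], 2 / 3))  # 1.333...
```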

Page 13:

The Action-Value Function

We want to choose the “best” action in the current state

So, pick the one that leads to the best next state (and include any immediate reward)

Let

  Q_E(s, a) = R(W(s, a)) + γ U(W(s, a))

where R(W(s, a)) is the immediate reward received for going to state W(s, a), and γ U(W(s, a)) is the future reward from further actions (discounted due to the 1-step delay)

Page 14:

The Action-Value Function (cont.)

If we can accurately learn Q (the action-value function), choosing actions is easy

Choose a, where

  a = argmax_{a' ∈ actions} Q(s, a')
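Illustration only: choosing the greedy action from a Q table stored as a Python dict keyed by (state, action) pairs (the dict representation is my assumption, not something from the slides):

```python
def greedy_action(Q, state, actions):
    """Pick the action with the largest estimated Q(state, action)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```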

Page 15:

Q vs. U Visually

[Diagram: a graph of states connected by action arcs. Each state node carries a utility, U(1) ... U(6); each outgoing arc from state 1 carries an action value, Q(1, i), Q(1, ii), Q(1, iii). Key: nodes are states, arcs are actions.]

• U’s are “stored” on states

• Q’s are “stored” on arcs

Page 16:

Q-Learning (Watkins PhD, 1989)

Let Q_t be our current estimate of the optimal Q

Our current policy is

  π_t(s) = a, such that Q_t(s, a) = max_{b ∈ known actions} Q_t(s, b)

Our current utility-function estimate is

  U_t(s) = Q_t(s, π_t(s))

- hence, the U table is embedded in the Q table and we don’t need to store both
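Continuing the same hypothetical dict representation, both the policy and the utility estimate can be read off the one table – a sketch reusing the greedy_action helper from above:

```python
def policy(Q, state, actions):
    """pi_t(s): the greedy action under the current Q estimate."""
    return greedy_action(Q, state, actions)

def utility(Q, state, actions):
    """U_t(s) = Q_t(s, pi_t(s)) = max_a Q_t(s, a); no separate U table needed."""
    return Q.get((state, policy(Q, state, actions)), 0.0)
```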

Page 17:

Q-Learning (cont.)

Assume we are in state S_t

“Run the program”(1) for a while (n steps)

Determine the actual reward and compare it to the predicted reward

Adjust the prediction to reduce the error

(1) i.e., follow the current policy

Page 18:

How Many Actions Should We Take Before Updating Q?

Why not do so after each action?

• “1-step Q-learning”

• Most common approach

Page 19:

Exploration vs. Exploitation

In order to learn about better alternatives, we can’t always follow the current policy (“exploitation”)

Sometimes, we need to try “random” moves (“exploration”)

Page 20:

Exploration vs. Exploitation (cont.)

Approaches

1) p percent of the time, make a random move; could let

   p = 1 / (# moves made)

2) Prob(picking action A in state S) =

   const^{Q(S, A)} / Σ_{i ∈ actions} const^{Q(S, i)}

Exponentiating gets rid of negative values
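A hedged sketch of both approaches (the decaying schedule p = 1/#moves_made is from the slide; treating const as e and the dict-based Q table are my assumptions), reusing greedy_action from above:

```python
import math
import random

def epsilon_greedy(Q, state, actions, moves_made):
    """Approach 1: with probability p = 1/#moves_made, explore randomly."""
    p = 1.0 / max(1, moves_made)
    if random.random() < p:
        return random.choice(actions)
    return greedy_action(Q, state, actions)

def boltzmann_action(Q, state, actions, const=math.e):
    """Approach 2: Prob(a) proportional to const**Q(state, a)."""
    weights = [const ** Q.get((state, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```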

Page 21:

One-Step Q-Learning Algorithm

0. S ← initial state

1. If random # < P
   then a = random choice
   else a = π_t(S)

2. S_new ← W(S, a)
   R_immed ← R(S_new)        ← act on world and get reward

3. Q(S, a) ← R_immed + γ max_{a'} Q(S_new, a')

4. S ← S_new

• Go to 1
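Putting the steps together, here is a self-contained sketch of the one-step algorithm above; the default γ, the exploration probability, and the number of steps are arbitrary illustrative choices, and α = 1 as in the deterministic example that follows:

```python
import random

def one_step_q_learning(W, R, states, actions, start, gamma=0.9,
                        explore_p=0.1, n_steps=1000):
    """Tabular one-step Q-learning in a deterministic world (alpha = 1)."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # step 0: initialize
    S = start
    for _ in range(n_steps):
        # step 1: explore with probability explore_p, else follow current policy
        if random.random() < explore_p:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(S, act)])
        # step 2: act on the world and get the immediate reward
        S_new = W.get((S, a), S)
        R_immed = R[S_new]
        # step 3: one-step Q update (deterministic world, so no learning rate)
        Q[(S, a)] = R_immed + gamma * max(Q[(S_new, b)] for b in actions)
        # step 4: move on
        S = S_new
    return Q

# e.g., with the toy W, R, STATES, ACTIONS from the earlier formalization sketch:
# Q = one_step_q_learning(W, R, STATES, ACTIONS, start="s0")
```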

Page 22:

A Simple Example
(of Q-learning – with updates after each step, i.e., N = 1)

Repeat: Q_new ← R + γ max Q_next_state
(deterministic world, so α = 1)

Algo: pick state + action

[Diagram: a small deterministic world with states S0 (R = 0), S1 (R = 1), S2 (R = -1), S3 (R = 0), and S4 (R = 3), connected by action arcs. Let γ = 2/3; every arc starts with Q = 0.]

Page 23:

A Simple Example (Step 1)

S0 → S2

Repeat: Q_new ← R + γ max Q_next_state
(deterministic world, so α = 1)

Algo: pick state + action

[Diagram: the arc S0 → S2 is updated to Q = -1 (= R(S2) + γ·0, since all Q values out of S2 are still 0); every other arc still has Q = 0.]

Page 24:

A Simple Example (Step 2)

S2 → S4

Repeat: Q_new ← R + γ max Q_next_state
(deterministic world, so α = 1)

Algo: pick state + action

[Diagram: the arc S2 → S4 is updated to Q = 3 (= R(S4) + γ·0); Q(S0 → S2) is still -1, and the remaining arcs are 0.]

Page 25:

A Simple Example (Step …)

Repeat: Q_new ← R + γ max Q_next_state
(deterministic world, so α = 1)

Algo: pick state + action

[Diagram: two arcs out of S0 now show Q = 1, Q(S2 → S4) is still 3, and the remaining arcs are 0. Revisiting S0 → S2, for instance, gives Q = R(S2) + γ·max Q(S2, ·) = -1 + (2/3)·3 = 1.]
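For reference, the updates shown in Steps 1–2 and the later revisit of S0 → S2 can be reproduced with a few lines of Python; the successor sets passed below are my reading of the diagram, not something stated explicitly in the slides:

```python
GAMMA = 2.0 / 3.0

# Rewards for arriving in each state, as labelled in the diagram.
R = {"S0": 0, "S1": 1, "S2": -1, "S3": 0, "S4": 3}

# Q values indexed by (state, next_state); only the visited arcs matter here.
Q = {}

def update(s, s_next, successors_of_next):
    """One deterministic Q-learning update along the arc s -> s_next."""
    future = max((Q.get((s_next, n), 0.0) for n in successors_of_next), default=0.0)
    Q[(s, s_next)] = R[s_next] + GAMMA * future
    return Q[(s, s_next)]

print(update("S0", "S2", ["S4"]))   # step 1: -1 + (2/3)*0 = -1
print(update("S2", "S4", []))       # step 2:  3 + (2/3)*0 =  3
print(update("S0", "S2", ["S4"]))   # revisiting S0 -> S2: -1 + (2/3)*3 = 1
```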

Page 26:

Q-Learning: Implementation Details

Remember, conceptually we are filling in a huge table:

[Table: columns are states S0, S1, S2, ..., Sn; rows are actions a, b, c, ..., z; each cell holds one entry, e.g., Q(S2, c).]

Tables are a very verbose representation of a function

Page 27:

Q-Learning: Convergence Proof

• Applies to Q tables and deterministic, Markovian worlds. Initialize Q’s to 0 or random finite values.

• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then for all s, a the approx. Q table (Q̂) converges to the true Q table (Q):

  lim_{t→∞} Q̂_t(s, a) = Q(s, a)

Page 28:

Q-Learning Convergence Proof (cont.)

• Consider the max error in the approx. Q table at step t:

  Δ_t ≡ max_{s,a} | Q̂_t(s, a) - Q(s, a) |

• The max is finite since |r| ≤ C, so

  max | Q(s, a) | ≤ Σ_{i=0}^{∞} γ^i C = C / (1 - γ)

• Since this is finite, Δ_0 is finite, i.e., the initial max error is finite

Page 29:

Q-Learning Convergence Proof (cont.)

Let s' be the state that results from doing action a in state s. Consider what happens when we visit s and do a at step t + 1:

  | Q̂_{t+1}(s, a) - Q(s, a) |
    = | (R + γ max_{a'} Q̂_t(s', a')) - (R + γ max_{a''} Q(s', a'')) |

The first term follows from the Q-learning rule (one step) applied in the current state s; the second from the definition of Q at the next state s' (notice the best a in s' might be different).

Page 30:

Q-Learning Convergence Proof (cont.)

  = γ | max_{a'} Q̂_t(s', a') - max_{a''} Q(s', a'') |        (by algebra)

  ≤ γ max_{a'''} | Q̂_t(s', a''') - Q(s', a''') |             (trickiest step; can prove by contradiction, since |max_a f_1(a) - max_{a'} f_2(a')| ≤ max_a |f_1(a) - f_2(a)|)

  ≤ γ max_{s'', a'''} | Q̂_t(s'', a''') - Q(s'', a''') |      (the max at s' ≤ the max over any state)

  = γ Δ_t                                                     (plugging in the definition of Δ_t)
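For readability, here is the whole chain from the last two slides in one display (my consolidation, writing Q̂_t for the estimate and Q for the true table):

```latex
\begin{align*}
|\hat{Q}_{t+1}(s,a) - Q(s,a)|
  &= \bigl|\bigl(R + \gamma \max_{a'} \hat{Q}_t(s',a')\bigr)
         - \bigl(R + \gamma \max_{a''} Q(s',a'')\bigr)\bigr| \\
  &= \gamma \,\bigl|\max_{a'} \hat{Q}_t(s',a') - \max_{a''} Q(s',a'')\bigr| \\
  &\le \gamma \max_{a'''} \bigl|\hat{Q}_t(s',a''') - Q(s',a''')\bigr| \\
  &\le \gamma \max_{s'',a'''} \bigl|\hat{Q}_t(s'',a''') - Q(s'',a''')\bigr|
   \;=\; \gamma \, \Delta_t
\end{align*}
```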

Page 31:

Q-Learning Convergence Proof (cont.)

• Hence, every time after t that we visit an <s, a>, its Q value differs from the correct answer by no more than γ Δ_t

• Let T_0 = t_0 (i.e., the start) and let T_N be the first time since T_{N-1} at which every <s, a> has been visited at least once

• Call the time between T_{N-1} and T_N a complete interval

  Clearly, Δ_{T_N} ≤ γ Δ_{T_{N-1}}

Page 32:

Q-Learning Convergence Proof (concluded)

• That is, every complete interval, Δ_t is reduced by at least a factor of γ

• Since we assumed every <s, a> pair is visited infinitely often, we will have an infinite number of complete intervals

Hence, lim_{t→∞} Δ_t = 0

Page 33:

Representing Q Functions More Compactly

We can use some other function representation (e.g., a neural net) to compactly encode this big table

[Diagram: a neural network whose inputs encode the state S (each input unit encodes a property of the state, e.g., a sensor value) and whose output units give Q(S, a), Q(S, b), ..., Q(S, z) – so the second argument of Q is a constant per output unit. Alternatively, one could have one net for each possible action.]

Page 34:

Q Tables vs. Q Nets

Given: 100 Boolean-valued features and 10 possible actions

Size of Q table:

  10 × 2^100   (2^100 = # of possible states)

Size of Q net (100 HUs):

  100 × 100 + 100 × 10 = 11,000 weights
  (weights between inputs and HUs, plus weights between HUs and outputs)
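To make the weight count concrete, here is a minimal numpy sketch of such a one-hidden-layer Q-net; the sizes (100 inputs, 100 hidden units, 10 outputs) come from the slide, while everything else (random initialization, the tanh nonlinearity) is an illustrative assumption:

```python
import numpy as np

N_FEATURES, N_HIDDEN, N_ACTIONS = 100, 100, 10

# 100*100 input-to-hidden weights plus 100*10 hidden-to-output weights = 11,000.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(N_FEATURES, N_HIDDEN))
W_out = rng.normal(scale=0.1, size=(N_HIDDEN, N_ACTIONS))

def q_values(state_features):
    """Map a 100-dimensional Boolean state vector to 10 Q estimates."""
    hidden = np.tanh(state_features @ W_in)
    return hidden @ W_out

print(W_in.size + W_out.size)          # 11000 weights in total
print(q_values(np.zeros(N_FEATURES)))  # one Q estimate per action
```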

Page 35:

Why Use a Compact Q-Function?

1. The full Q table may not fit in memory for realistic problems

2. Can generalize across states, thereby speeding up convergence

   i.e., one example “fills” many cells in the Q table