• Assume the world (i.e., the environment) periodically provides rewards or punishments ("reinforcements")
• Based on the reinforcements received, learn how to better choose actions
Sequential Decision Problems
Courtesy of A.G. Barto, April 2000
• Decisions are made in stages
• The outcome of each decision is not fully predictable but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: need to balance the desire for immediate reward with the possibility of high future reward
Reinforcement Learning vs. Supervised Learning
• How would we use SL to train an agent in an environment?
• Show the action to choose in a sample of world states – "I/O pairs"
• RL requires much less of the teacher
• Must set up a "reward structure"
• The learner "works out the details"
  – i.e., writes a program to maximize the rewards received
Embedded Learning Systems: Formalization
• S_E = the set of states of the world
  – e.g., an N-dimensional vector
  – "sensors"
• A_E = the set of possible actions an agent can perform
  – "effectors"
• W = the world
• R = the immediate reward structure
W and R are the environment; they can be probabilistic functions
Embedded Learning Systems (formalization)
W: S_E × A_E → S_E
  The world maps a state and an action to a new state
R: S_E × A_E → reals
  Provides a reward (a number) as a function of state and action (as in the textbook). Can equivalently be formalized as a function of the (next) state alone.
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions.
• For now, assume deterministic problems (a toy instance is sketched below)
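A minimal sketch of this formalization, assuming a toy deterministic grid world; the states, actions, and reward values below are invented for illustration and are not part of the slides:

GOAL = (2, 2)
ACTIONS = ["up", "down", "left", "right"]

def W(state, action):
    """World: maps (state, action) to the next state (deterministic here)."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    # Clip to a 3x3 grid so every action keeps us inside the state space S_E.
    return (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))

def R(state, action):
    """Immediate reward: a number as a function of state and action."""
    return 1.0 if W(state, action) == GOAL else 0.0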
State need not be solely the current sensor readings
• Markov Assumption: the value of a state is independent of the path taken to reach that state
• Can have memory of the past: we can always create a Markovian task by remembering the entire past history
The agent needs to learn a policy

π_E : S_E → A_E

Given a world state in S_E, which action in A_E should be chosen? The policy is the function π_E.
Remember: the agent's task is to maximize the total reward received during its lifetime.
To construct π_E, we will assign a utility U (a number) to each state:

U_{π_E}(s) = Σ_{t=1}^{∞} γ^{t−1} R(s, π_E, t)

• γ is a positive constant < 1
• R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
• Note: future rewards are discounted by γ^{t−1}
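As a small numeric sketch of this discounted sum (the reward sequence and γ = 0.9 below are invented for illustration):

def utility(rewards, gamma=0.9):
    """U = sum over t >= 1 of gamma^(t-1) * r_t, for a finite reward sequence."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

print(utility([0.0, 0.0, 1.0]))   # 0.81: a reward of 1 received two steps later is worth 0.9^2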
The Action-Value Function
• We want to choose the "best" action in the current state
• So, pick the one that leads to the best next state (and include any immediate reward)
Let

Q_{π_E}(s, a) ≡ R(W(s, a)) + γ U_{π_E}(W(s, a))

• R(W(s, a)): the immediate reward received for going to state W(s, a)
• γ U_{π_E}(W(s, a)): the future reward from further actions (discounted due to the 1-step delay)
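A sketch of this one-step lookahead, reusing the hypothetical W and R from the earlier grid example and assuming the utilities U are available as a dictionary keyed by state:

def q_value(s, a, U, gamma=0.9):
    s_next = W(s, a)                     # the state reached by doing a in s
    # immediate reward for going to W(s, a), plus discounted future reward from there
    return R(s, a) + gamma * U[s_next]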
The Action-Value Function (cont.)
• If we can accurately learn Q (the action-value function), choosing actions is easy
• Assume we are in state S_t
• "Run the program"(1) for a while (n steps)
• Determine the actual reward and compare it to the predicted reward; adjust the prediction to reduce the error
(1) I.e., follow the current policy
How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
• "1-step Q learning" (sketched below)
• Most common approach
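A minimal tabular sketch of 1-step Q-learning under the deterministic assumption made earlier: after each action, the entry for (s, a) is set to the immediate reward plus the discounted best table value of the next state. The environment hooks (W, R, ACTIONS) are the hypothetical grid ones sketched above; the episode count and length are arbitrary.

import random
from collections import defaultdict

def q_learning(states, episodes=500, gamma=0.9):
    Q = defaultdict(float)                          # Q table, initialized to 0
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(50):                         # bounded episode length
            a = random.choice(ACTIONS)              # explore, so every <s, a> gets visited
            s_next, r = W(s, a), R(s, a)
            # 1-step update; no learning rate is needed when W and R are deterministic
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next
    return Q

# e.g., learn over all cells of the hypothetical 3x3 grid:
# Q_table = q_learning([(x, y) for x in range(3) for y in range(3)])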
Q-Learning: Convergence Proof
• Applies to Q tables and deterministic, Markovian worlds. Initialize the Q's to 0 or to random finite values.
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then ∀ s, a the approximate Q table Q̂(s, a) converges to the true Q table Q(s, a)
Q-Learning Convergence Proof (cont.)
Let s' be the state that results from doing action a in state s. Consider what happens when we visit s and do a at step t + 1:

Q̂_{t+1}(s, a) − Q(s, a) = [R + γ max_{a'} Q̂_t(s', a')] − [R + γ max_{a''} Q(s', a'')]

• s is the current state, s' the next state
• The first bracketed term comes from the Q-learning rule (one step)
• The second comes from the definition of Q (notice the best a in s' might be different)
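The step the slide leaves implicit, filled in here following the standard argument: the R terms cancel, and since |max_x f(x) − max_x g(x)| ≤ max_x |f(x) − g(x)|,

|Q̂_{t+1}(s, a) − Q(s, a)| = γ |max_{a'} Q̂_t(s', a') − max_{a''} Q(s', a'')|
                           ≤ γ max_{a''} |Q̂_t(s', a'') − Q(s', a'')|
                           ≤ γ Δ_t

where Δ_t denotes the largest error in the approximate Q table at time t, i.e. Δ_t = max over all <s, a> of |Q̂_t(s, a) − Q(s, a)|.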
Q-Learning Convergence Proof (cont.)
• Hence, every time after t that we visit an <s, a>, its Q value differs from the correct answer by no more than γΔ_t
• Let T_0 = t_0 (i.e., the start) and let T_N be the first time since T_{N-1} at which every <s, a> has been visited at least once
• Call the time between T_{N-1} and T_N a complete interval
Q-Learning Convergence Proof (concluded)
• That is, over every complete interval, Δ_t is reduced by at least a factor of γ
• Since we assumed every <s, a> pair is visited infinitely often, we will have an infinite number of complete intervals
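Spelling out the conclusion (filled in here, following the standard argument): chaining the per-interval bound gives Δ_{T_N} ≤ γ Δ_{T_{N−1}} ≤ … ≤ γ^N Δ_{T_0}, which goes to 0 as N → ∞ because 0 ≤ γ < 1; hence the approximate Q table converges to the true one.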
Representing Q Functions More Compactly
We can use some other function representation (e.g., a neural net) to compactly encode this big Q table
• The network input is an encoding of the state (S); the second argument (the action) is held constant
• Or we could have one net for each possible action (a rough sketch follows)
• Each input unit encodes a property of the state (e.g., a sensor value)
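A rough sketch of the "one approximator per action" variant mentioned above, using a simple linear model over state features in place of a full neural net; the feature size, learning rate, and target form are illustrative assumptions, not from the slides (ACTIONS is reused from the earlier grid sketch):

import numpy as np

N_FEATURES = 4
weights = {a: np.zeros(N_FEATURES) for a in ACTIONS}   # one model per possible action

def q_hat(features, a):
    """Approximate Q(s, a) from a feature encoding of the state (one input per sensor/property)."""
    return float(weights[a] @ features)

def q_update(features, a, target, lr=0.01):
    """Nudge q_hat(s, a) toward a 1-step target such as r + gamma * max over a' of q_hat(s', a')."""
    weights[a] += lr * (target - q_hat(features, a)) * features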