• Assume the world (i.e., the environment) periodically provides rewards or punishments ("reinforcements")
• Based on the reinforcements received, learn how to better choose actions
Sequential Decision Problems
Courtesy of A.G. Barto, April 2000
• Decisions are made in stages
• The outcome of each decision is not fully predictable but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: need to balance the desire for immediate reward with the possibility of high future reward
Reinforcement Learning vs. Supervised Learning
• How would we use SL to train an agent in an environment?
• Show the action to choose in a sample of world states – "I/O pairs"
• RL requires much less of the teacher
• Must set up a "reward structure"
• The learner "works out the details"
  – i.e., writes a program to maximize the rewards received
Embedded Learning Systems: Formalization
• S_E = the set of states of the world
  – e.g., an N-dimensional vector
  – "sensors"
• A_E = the set of possible actions an agent can perform
  – "effectors"
• W = the world
• R = the immediate reward structure
W and R are the environment; they can be probabilistic functions
Embedded Learning Systems (formalization)
W: S_E × A_E → S_E
  The world maps a state and an action to a new state
R: S_E × A_E → reals
  Provides a reward (a number) as a function of state and action (as in the textbook). Can equivalently be formalized as a function of the (next) state alone.
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions.
• For now, assume deterministic problems (a toy instance is sketched below)
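A minimal sketch of this formalization, assuming a toy deterministic grid world; the states, actions, and reward values below are invented for illustration and are not part of the slides:

GOAL = (2, 2)
ACTIONS = ["up", "down", "left", "right"]

def W(state, action):
    """World: maps (state, action) to the next state (deterministic here)."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    # Clip to a 3x3 grid so every action keeps us inside the state space S_E.
    return (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))

def R(state, action):
    """Immediate reward: a number as a function of state and action."""
    return 1.0 if W(state, action) == GOAL else 0.0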
State need not be solely the current sensor readings
• Markov Assumption: the value of a state is independent of the path taken to reach that state
• Can have memory of the past: we can always create a Markovian task by remembering the entire past history
The agent needs to learn a policy

π_E : S_E → A_E

Given a world state in S_E, which action in A_E should be chosen? The policy is the function π_E.
Remember: the agent's task is to maximize the total reward received during its lifetime.
To construct π_E, we will assign a utility U (a number) to each state:

U_{π_E}(s) = Σ_{t=1}^{∞} γ^{t−1} R(s, π_E, t)

• γ is a positive constant < 1
• R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
• Note: future rewards are discounted by γ^{t−1}
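As a small numeric sketch of this discounted sum (the reward sequence and γ = 0.9 below are invented for illustration):

def utility(rewards, gamma=0.9):
    """U = sum over t >= 1 of gamma^(t-1) * r_t, for a finite reward sequence."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

print(utility([0.0, 0.0, 1.0]))   # 0.81: a reward of 1 received two steps later is worth 0.9^2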
The Action-Value Function
• We want to choose the "best" action in the current state
• So, pick the one that leads to the best next state (and include any immediate reward)
Let

Q_{π_E}(s, a) ≡ R(W(s, a)) + γ U_{π_E}(W(s, a))

• R(W(s, a)): the immediate reward received for going to state W(s, a)
• γ U_{π_E}(W(s, a)): the future reward from further actions (discounted due to the 1-step delay)
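A sketch of this one-step lookahead, reusing the hypothetical W and R from the earlier grid example and assuming the utilities U are available as a dictionary keyed by state:

def q_value(s, a, U, gamma=0.9):
    s_next = W(s, a)                     # the state reached by doing a in s
    # immediate reward for going to W(s, a), plus discounted future reward from there
    return R(s, a) + gamma * U[s_next]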
The Action-Value Function (cont.)
• If we can accurately learn Q (the action-value function), choosing actions is easy
• Assume we are in state S_t
• "Run the program"(1) for a while (n steps)
• Determine the actual reward and compare it to the predicted reward; adjust the prediction to reduce the error
(1) I.e., follow the current policy
How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
• "1-step Q learning" (sketched below)
• Most common approach
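A minimal tabular sketch of 1-step Q-learning under the deterministic assumption made earlier: after each action, the entry for (s, a) is set to the immediate reward plus the discounted best table value of the next state. The environment hooks (W, R, ACTIONS) are the hypothetical grid ones sketched above; the episode count and length are arbitrary.

import random
from collections import defaultdict

def q_learning(states, episodes=500, gamma=0.9):
    Q = defaultdict(float)                          # Q table, initialized to 0
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(50):                         # bounded episode length
            a = random.choice(ACTIONS)              # explore, so every <s, a> gets visited
            s_next, r = W(s, a), R(s, a)
            # 1-step update; no learning rate is needed when W and R are deterministic
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next
    return Q

# e.g., learn over all cells of the hypothetical 3x3 grid:
# Q_table = q_learning([(x, y) for x in range(3) for y in range(3)])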
Q-Learning: Convergence Proof
• Applies to Q tables and deterministic, Markovian worlds. Initialize the Q's to 0 or to random finite values.
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then ∀ s, a the approximate Q table Q̂(s, a) converges to the true Q table Q(s, a)
Q-Learning Convergence Proof (cont.)
Let s' be the state that results from doing action a in state s. Consider what happens when we visit s and do a at step t + 1:

Q̂_{t+1}(s, a) − Q(s, a) = [R + γ max_{a'} Q̂_t(s', a')] − [R + γ max_{a''} Q(s', a'')]

• s is the current state, s' the next state
• The first bracketed term comes from the Q-learning rule (one step)
• The second comes from the definition of Q (notice the best a in s' might be different)
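The step the slide leaves implicit, filled in here following the standard argument: the R terms cancel, and since |max_x f(x) − max_x g(x)| ≤ max_x |f(x) − g(x)|,

|Q̂_{t+1}(s, a) − Q(s, a)| = γ |max_{a'} Q̂_t(s', a') − max_{a''} Q(s', a'')|
                           ≤ γ max_{a''} |Q̂_t(s', a'') − Q(s', a'')|
                           ≤ γ Δ_t

where Δ_t denotes the largest error in the approximate Q table at time t, i.e. Δ_t = max over all <s, a> of |Q̂_t(s, a) − Q(s, a)|.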
Q-Learning Convergence Proof (cont.)
• Hence, every time after t that we visit an <s, a>, its Q value differs from the correct answer by no more than γΔ_t
• Let T_0 = t_0 (i.e., the start) and let T_N be the first time since T_{N-1} at which every <s, a> has been visited at least once
• Call the time between T_{N-1} and T_N a complete interval
Q-Learning Convergence Proof (concluded)
• That is, over every complete interval, Δ_t is reduced by at least a factor of γ
• Since we assumed every <s, a> pair is visited infinitely often, we will have an infinite number of complete intervals
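Spelling out the conclusion (filled in here, following the standard argument): chaining the per-interval bound gives Δ_{T_N} ≤ γ Δ_{T_{N−1}} ≤ … ≤ γ^N Δ_{T_0}, which goes to 0 as N → ∞ because 0 ≤ γ < 1; hence the approximate Q table converges to the true one.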
Representing Q Functions More Compactly
We can use some other function representation (e.g., a neural net) to compactly encode this big Q table
• The network input is an encoding of the state (S); the second argument (the action) is held constant
• Or we could have one net for each possible action (a rough sketch follows)
• Each input unit encodes a property of the state (e.g., a sensor value)
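A rough sketch of the "one approximator per action" variant mentioned above, using a simple linear model over state features in place of a full neural net; the feature size, learning rate, and target form are illustrative assumptions, not from the slides (ACTIONS is reused from the earlier grid sketch):

import numpy as np

N_FEATURES = 4
weights = {a: np.zeros(N_FEATURES) for a in ACTIONS}   # one model per possible action

def q_hat(features, a):
    """Approximate Q(s, a) from a feature encoding of the state (one input per sensor/property)."""
    return float(weights[a] @ features)

def q_update(features, a, target, lr=0.01):
    """Nudge q_hat(s, a) toward a 1-step target such as r + gamma * max over a' of q_hat(s', a')."""
    weights[a] += lr * (target - q_hat(features, a)) * features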