Lecture 4: Model Free Control
Emma Brunskill
CS234 Reinforcement Learning.
Winter 2020
Structure closely follows much of David Silver’s Lecture 5. For additional reading please see SB Sections 5.2-5.4, 6.4, 6.5, 6.7
Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 4: Model Free Control Winter 2020 1 / 58
Refresh Your Knowledge 3. Piazza Poll
Which of the following equations express a TD update?
1 $V(s_t) = r(s_t, a_t) + \gamma \ldots$
4 $V(s_t) = (1 - \alpha) V(s_t) + \alpha \max_a (r(s_t, a) + \gamma V(s_{t+1}))$
5 Not sure
Bootstrapping is when
1 Samples of (s, a, s') transitions are used to approximate the true expectation over next states
2 An estimate of the next state value is used instead of the true next state value
3 Used in Monte-Carlo policy evaluation
4 Not sure
Refresh Your Knowledge 3. Piazza Poll
Which of the following equations express a TD update?
True: $V(s_t) = (1 - \alpha) V(s_t) + \alpha (r(s_t, a_t) + \gamma V(s_{t+1}))$
Bootstrapping is when
An estimate of the next state value is used instead of the true next state value
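The TD update above fits in a few lines of Python. This is a minimal sketch only: the function name `td0_update`, the dictionary value table, and the numbers are illustrative assumptions, not part of the lecture.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to the value table V (a dict) in place."""
    # Bootstrapping: the current estimate V[s_next] stands in for the true value of s_next.
    target = r + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * target
    return V[s]

# Made-up values for two states.
V = {"s1": 0.0, "s2": 4.0}
new_value = td0_update(V, "s1", r=1.0, s_next="s2", alpha=0.5, gamma=0.9)
```

The update moves V(s1) halfway (alpha = 0.5) from its old value toward the bootstrapped target r + gamma * V(s2).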
Table of Contents
1 Generalized Policy Iteration
2 Importance of Exploration
3 Maximization Bias
Class Structure
Last time: Policy evaluation with no knowledge of how the world works (MDP model not given)
This time: Control (making decisions) without a model of how the world works
Next time: Generalization – Value function approximation
Evaluation to Control
Last time: how good is a specific policy?
Given no access to the decision process model parameters
Instead have to estimate from data / experience
Today: how can we learn a good policy?
Recall: Reinforcement Learning Involves
Optimization
Delayed consequences
Exploration
Generalization
Today: Learning to Control Involves
Optimization: Goal is to identify a policy with high expected rewards (similar to Lecture 2 on computing an optimal policy given decision process models)
Delayed consequences: May take many time steps to evaluate whether an earlier decision was good or not
Exploration: Necessary to try different actions to learn what actions can lead to high rewards
Today: Model-free Control
Generalized policy improvement
Importance of exploration
Monte Carlo control
Model-free control with temporal difference (SARSA, Q-learning)
Maximization bias
Model-free Control Examples
Many applications can be modeled as an MDP: Backgammon, Go, Robot locomotion, Helicopter flight, Robocup soccer, Autonomous driving, Customer ad selection, Invasive species management, Patient treatment
For many of these and other problems either:
MDP model is unknown but can be sampled
MDP model is known but it is computationally infeasible to use directly, except through sampling
On and Off-Policy Learning
On-policy learning
Direct experience
Learn to estimate and evaluate a policy from experience obtained from following that policy
Off-policy learning
Learn to estimate and evaluate a policy using experience gathered from following a different policy
Recall Policy Iteration
Initialize policy π
Repeat:
Policy evaluation: compute $V^{\pi}$
Policy improvement: update $\pi$
$\pi'(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s') \right] = \arg\max_a Q^{\pi}(s, a)$
Now want to do the above two steps without access to the true dynamics and reward models
Last lecture introduced methods for model-free policy evaluation
Model Free Policy Iteration
Initialize policy π
Repeat:
Policy evaluation: compute $Q^{\pi}$
Policy improvement: update π
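One reason to evaluate Q rather than V in the model-free setting: the improvement step becomes an argmax over a table and needs neither P nor R. A minimal Python sketch, assuming a tabular Q stored as a dict keyed by (state, action). The names `greedy_improvement`, `states` and `actions`, and the Q values, are illustrative assumptions.

```python
def greedy_improvement(Q, states, actions):
    """Return the greedy policy pi'(s) = argmax_a Q[(s, a)] as a dict."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Toy Q table with made-up values for two states and two actions.
Q = {("s1", "a1"): 0.2, ("s1", "a2"): 0.7,
     ("s2", "a1"): 1.5, ("s2", "a2"): -0.3}
policy = greedy_improvement(Q, states=["s1", "s2"], actions=["a1", "a2"])
```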
MC for On Policy Q Evaluation
Initialize $N(s, a) = 0$, $G(s, a) = 0$, $Q^{\pi}(s, a) = 0$, $\forall s \in S$, $\forall a \in A$
Loop
Using policy $\pi$ sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
What is the new ε-greedy policy if $k = 3$, $\epsilon = 1/k$?
With probability 2/3 choose $\pi(s)$, else choose randomly. As an example, $\pi(s_1) = a_1$ with prob (2/3), else randomly choose an action.
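The ε-greedy rule in this example can be sketched in Python as below. The helper names, the two-action setup (a1, a2) and the Q values are assumptions for illustration. The random branch can also pick the greedy action, so the greedy action's total probability is $1 - \epsilon + \epsilon / |A|$.

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon, rng=random):
    """With probability 1 - epsilon act greedily w.r.t. Q, else pick uniformly at random."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def action_probabilities(Q, s, actions, epsilon):
    """Exact selection probabilities implied by epsilon_greedy_action."""
    greedy = max(actions, key=lambda a: Q[(s, a)])
    uniform = epsilon / len(actions)
    return {a: uniform + (1.0 - epsilon if a == greedy else 0.0) for a in actions}

# k = 3, so epsilon = 1/3; the Q values are made up so that a1 is greedy in s1.
Q = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0}
probs = action_probabilities(Q, "s1", ["a1", "a2"], epsilon=1 / 3)
```

With two actions, a1 is chosen with probability 2/3 + 1/6 = 5/6 and a2 with probability 1/6.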
GLIE Monte-Carlo Control
Theorem
GLIE Monte-Carlo control converges to the optimal state-action value function $Q(s, a) \to Q^*(s, a)$
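A rough sketch of GLIE Monte-Carlo control, under assumptions: the 5-state chain environment (`step`), first-visit returns, the schedule ε = 1/k per episode, and all names are hypothetical choices for illustration, not the lecture's own code.

```python
import random

ACTIONS = (0, 1)  # 0 = left, 1 = right (hypothetical toy environment)
GOAL = 4

def step(s, a):
    """Hypothetical 5-state chain: reaching state 4 pays +1 and ends the episode."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break

def glie_mc_control(num_episodes=300, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    N = {key: 0 for key in Q}
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # GLIE schedule: exploration decays to zero
        episode, s, done = [], 0, False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            r, s_next, done = step(s, a)
            episode.append((s, a, r))
            s = s_next
        # Walk backwards so the value left in the dict is the first-visit return.
        G, first_visit_return = 0.0, {}
        for s, a, r in reversed(episode):
            G = r + gamma * G
            first_visit_return[(s, a)] = G
        for key, G in first_visit_return.items():
            N[key] += 1
            Q[key] += (G - Q[key]) / N[key]  # incremental mean of returns
    return Q

Q = glie_mc_control()
```

In this toy chain every episode ends with the transition (state 3, right), whose return is always 1, so that entry of Q equals 1 after the first episode.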
Model-free Policy Iteration
Initialize policy π
Repeat:
Policy evaluation: compute $Q^{\pi}$
Policy improvement: update $\pi$ given $Q^{\pi}$
What about TD methods?
Model-free Policy Iteration with TD Methods
Use temporal difference methods for policy evaluation step
Initialize policy π
Repeat:
Policy evaluation: compute $Q^{\pi}$ using temporal difference updating with ε-greedy policy
Policy improvement: Same as Monte Carlo policy improvement, set $\pi$ to ε-greedy($Q^{\pi}$)
First consider SARSA, which is an on-policy algorithm.
General Form of SARSA Algorithm
Set initial ε-greedy policy $\pi$, $t = 0$, initial state $s_t = s_0$
Take $a_t \sim \pi(s_t)$ // Sample action from policy
Observe $(r_t, s_{t+1})$
loop
    Take action $a_{t+1} \sim \pi(s_{t+1})$
    Observe $(r_{t+1}, s_{t+2})$
    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
    $\pi(s_t) = \arg\max_a Q(s_t, a)$ w.prob $1 - \epsilon$, else random
    $t = t + 1$
end loop
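The SARSA loop above could be sketched in Python as follows. This is an illustration under assumptions, not the lecture's implementation: the 5-state chain environment (`step`), the episodic structure with a terminal state, the schedule ε = 1/k per episode, and all function names are hypothetical.

```python
import random

ACTIONS = (0, 1)  # 0 = left, 1 = right (hypothetical toy environment)
GOAL = 4

def step(s, a):
    """Hypothetical 5-state chain: reaching state 4 pays +1 and ends the episode."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])  # random tie-break

def sarsa(num_episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # assumed decay schedule
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            # On-policy: the next action is sampled from the same epsilon-greedy policy.
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

Because the policy is ε-greedy with respect to the current Q, updating Q implicitly performs the policy improvement step.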
Worked Example: SARSA for Mars Rover
Initialize $\epsilon = 1/k$, $k = 1$, and $\alpha = 0.5$, $Q(-, a_1) = [1\ 0\ 0\ 0\ 0\ 0\ +10]$, $Q(-, a_2) = [1\ 0\ 0\ 0\ 0\ 0\ +5]$, $\gamma = 1$
Assume starting state is $s_6$ and sample $a_1$
Q-Learning with ε-greedy Exploration
Initialize $Q(s, a)$, $\forall s \in S, a \in A$, $t = 0$, initial state $s_t = s_0$
Set $\pi_b$ to be ε-greedy w.r.t. $Q$
loop
    Take $a_t \sim \pi_b(s_t)$ // Sample action from policy
    Observe $(r_t, s_{t+1})$
    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$
    $\pi(s_t) = \arg\max_a Q(s_t, a)$ w.prob $1 - \epsilon$, else random
    $t = t + 1$
end loop
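A Python sketch of the Q-learning loop above, under assumptions: the 5-state chain environment (`step`), the episodic structure, the fixed ε = 0.1, and all names are hypothetical and chosen only for illustration.

```python
import random

ACTIONS = (0, 1)  # 0 = left, 1 = right (hypothetical toy environment)
GOAL = 4

def step(s, a):
    """Hypothetical 5-state chain: reaching state 4 pays +1 and ends the episode."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

def q_learning(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q, random tie-break.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[s])
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            r, s_next, done = step(s, a)
            # Off-policy target: max over next actions, regardless of what is taken next.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
```

In this deterministic chain the optimal value of moving right from state 0 is gamma cubed (0.729 for gamma = 0.9), which the table approaches. The only change from SARSA is the `max` in the target.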
Worked Example: ε-greedy Q-Learning Mars
Initialize $\epsilon = 1/k$, $k = 1$, and $\alpha = 0.5$, $Q(-, a_1) = [1\ 0\ 0\ 0\ 0\ 0\ +10]$, $Q(-, a_2) = [1\ 0\ 0\ 0\ 0\ 0\ +5]$, $\gamma = 1$
Like in SARSA example, start in $s_6$ and take $a_1$.
$Q(s_6, a_1) = 0 + 0.5 \cdot (0 + \gamma \max_{a'} Q(s_7, a') - 0) = 0.5 \cdot 10 = 5$
Recall that in the SARSA update we saw $Q(s_6, a_1) = 2.5$ because we used the actual action taken at $s_7$ instead of the max
Does how Q is initialized matter (initially? asymptotically?)?
Asymptotically no, under mild conditions, but at the beginning, yes
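The two updates can be checked numerically. This small sketch assumes, as the arithmetic implies, that the reward observed at $s_6$ is 0, that the next state is $s_7$, and that the action sampled at $s_7$ in the SARSA example was $a_2$. The variable names are illustrative.

```python
alpha, gamma = 0.5, 1.0
q_s6_a1 = 0.0                    # Q(s6, a1) before the update
q_s7 = {"a1": 10.0, "a2": 5.0}   # Q(s7, a1) and Q(s7, a2) from the initialization
reward = 0.0                     # assumption: no reward observed on this step

# SARSA bootstraps from the action actually sampled at s7 (assumed here to be a2).
sarsa_q = q_s6_a1 + alpha * (reward + gamma * q_s7["a2"] - q_s6_a1)

# Q-learning bootstraps from the best action at s7.
q_learning_q = q_s6_a1 + alpha * (reward + gamma * max(q_s7.values()) - q_s6_a1)
```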
1 Both SARSA and Q-learning may update their policy after every step
2 If ε = 0 for all time steps, and Q is initialized randomly, a SARSA Q state update will be the same as a Q-learning Q state update
3 Not sure
Q-Learning with ε-greedy Exploration
What conditions are sufficient to ensure that Q-learning with ε-greedy exploration converges to optimal $Q^*$?
Visit all (s, a) pairs infinitely often, and the step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions. Note: the algorithm does not have to be greedy in the limit of infinite exploration (GLIE) to satisfy this (could keep ε large).
What conditions are sufficient to ensure that Q-learning with ε-greedy exploration converges to optimal $\pi^*$?
The algorithm is GLIE, along with the above requirement to ensure the Q value estimates converge to the optimal Q.
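For reference, the step-size conditions named above are usually stated in the following standard form, which the slide itself does not spell out:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

For example, $\alpha_t = 1/t$ satisfies both, while a constant step size satisfies the first but not the second.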
Maximization Bias Proof
Consider single-state MDP ($|S| = 1$) with 2 actions, and both actions have 0-mean random rewards, ($E(r \mid a = a_1) = E(r \mid a = a_2) = 0$).
Then $Q(s, a_1) = Q(s, a_2) = 0 = V(s)$
Assume there are prior samples of taking action $a_1$ and $a_2$
Let $\hat{Q}(s, a_1), \hat{Q}(s, a_2)$ be the finite sample estimate of Q
Use an unbiased estimator for Q: e.g. $\hat{Q}(s, a_1) = \frac{1}{n(s, a_1)} \sum_{i=1}^{n(s, a_1)} r_i(s, a_1)$
Let $\hat{\pi} = \arg\max_a \hat{Q}(s, a)$ be the greedy policy w.r.t. the estimated $\hat{Q}$
Even though each estimate of the state-action values is unbiased, the estimate of $\hat{\pi}$'s value $\hat{V}^{\hat{\pi}}$ can be biased:
$\hat{V}^{\hat{\pi}}(s) = E[\max(\hat{Q}(s, a_1), \hat{Q}(s, a_2))] \geq \max[E[\hat{Q}(s, a_1)], E[\hat{Q}(s, a_2)]] = \max[0, 0] = V^{\pi}$,
where the inequality comes from Jensen's inequality.
Example from Mannor, Simester, Sun and Tsitsiklis. Bias and Variance Approximation in Value Function Estimates. Management Science 2007
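The bias in this single-state example can be seen in a short simulation. The sketch below assumes standard normal rewards and 5 samples per action. Both choices, and the function name, are illustrative assumptions rather than part of the original example.

```python
import random

def mean_of_max(n_samples=5, n_trials=20000, seed=0):
    """Average of max(Q_hat(s, a1), Q_hat(s, a2)) over many independent datasets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Each Q_hat is an unbiased sample mean of zero-mean rewards.
        q_hat_a1 = sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        q_hat_a2 = sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        total += max(q_hat_a1, q_hat_a2)
    return total / n_trials

estimated_value = mean_of_max()
```

Both true Q values are 0, yet the average of the max is clearly positive. For two independent Gaussians with standard deviation $\sigma$ the expected max is $\sigma / \sqrt{\pi}$, roughly 0.25 for this setup.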
Double Q-Learning
The greedy policy w.r.t. estimated Q values can yield a maximization bias during finite-sample learning
Avoid using max of estimates as estimate of max of true values
Instead split samples and use them to create two independent unbiased estimates $Q_1(s_1, a_i)$ and $Q_2(s_1, a_i)$ $\forall a$.
Use one estimate to select max action: $a^* = \arg\max_a Q_1(s_1, a)$
Use other estimate to estimate value of $a^*$: $Q_2(s, a^*)$
Yields unbiased estimate: $E(Q_2(s, a^*)) = Q(s, a^*)$
Why does this yield an unbiased estimate of the max state-action value?
Using independent samples to estimate the value
If acting online, can alternate samples used to update $Q_1$ and $Q_2$, using the other to select the action chosen
Next slides extend to full MDP case (with more than 1 state)
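For the single-state case, the split-sample idea can be simulated directly. The sketch assumes standard normal rewards, two actions with true value 0, and 5 samples per estimate. These settings and the function name are illustrative assumptions.

```python
import random

def double_estimator_mean(n_samples=5, n_trials=20000, seed=0):
    """Average of Q2(s, a*) where a* is selected using the independent estimate Q1."""
    rng = random.Random(seed)

    def sample_mean():
        return sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples

    total = 0.0
    for _ in range(n_trials):
        q1 = [sample_mean(), sample_mean()]  # first batch: one estimate per action
        q2 = [sample_mean(), sample_mean()]  # second, independent batch
        a_star = 0 if q1[0] >= q1[1] else 1  # select the action with Q1
        total += q2[a_star]                  # evaluate it with Q2
    return total / n_trials

double_value = double_estimator_mean()
```

Because Q2 is independent of the selection made with Q1, the average stays near the true value 0, whereas taking the max of a single set of estimates does not.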
Double Q-Learning
Initialize $Q_1(s, a)$ and $Q_2(s, a)$, $\forall s \in S, a \in A$, $t = 0$, initial state $s_t = s_0$
loop
    Select $a_t$ using ε-greedy $\pi(s) = \arg\max_a Q_1(s_t, a) + Q_2(s_t, a)$
    Observe $(r_t, s_{t+1})$
    if (with 0.5 probability) then
        $Q_1(s_t, a_t) \leftarrow Q_1(s_t, a_t) + \alpha (r_t + \gamma Q_2(s_{t+1}, \arg\max_a Q_1(s_{t+1}, a)) - Q_1(s_t, a_t))$
    else
        $Q_2(s_t, a_t) \leftarrow Q_2(s_t, a_t) + \alpha (r_t + \gamma Q_1(s_{t+1}, \arg\max_a Q_2(s_{t+1}, a)) - Q_2(s_t, a_t))$
    end if
    $t = t + 1$
end loop
Compared to Q-learning, how does this change the: memory requirements, computation requirements per step, amount of data required?
Doubles the memory, same computation requirements, data requirements are subtle: might reduce amount of exploration needed due to lower bias
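A Python sketch of double Q-learning, written in the form given in Sutton and Barto, where the table being updated selects the next action and the other table evaluates it. The 5-state chain environment (`step`), the episodic structure, the fixed ε = 0.1 and all names are hypothetical assumptions for illustration.

```python
import random

ACTIONS = (0, 1)  # 0 = left, 1 = right (hypothetical toy environment)
GOAL = 4

def step(s, a):
    """Hypothetical 5-state chain: reaching state 4 pays +1 and ends the episode."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

def double_q_learning(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q1 = [[0.0, 0.0] for _ in range(GOAL + 1)]
    Q2 = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            # Behave epsilon-greedily w.r.t. Q1 + Q2, breaking ties at random.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                sums = [Q1[s][b] + Q2[s][b] for b in ACTIONS]
                a = rng.choice([b for b in ACTIONS if sums[b] == max(sums)])
            r, s_next, done = step(s, a)
            # Fair coin flip decides which table (A) is updated using the other (B).
            A, B = (Q1, Q2) if rng.random() < 0.5 else (Q2, Q1)
            a_star = max(ACTIONS, key=lambda b: A[s_next][b])      # select with A
            target = r if done else r + gamma * B[s_next][a_star]  # evaluate with B
            A[s][a] += alpha * (target - A[s][a])
            s = s_next
    return Q1, Q2

Q1, Q2 = double_q_learning()
```

The sketch stores two tables, which is the doubled memory cost noted above, while each step still performs a single table update.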
Double Q-Learning (Figure 6.7 in Sutton and Barto 2018)
Due to the maximization bias, Q-learning spends much more time selecting suboptimal actions than double Q-learning.
What You Should Know
Be able to implement MC on policy control and SARSA and Q-learning
Compare them according to properties of how quickly they update, (informally) bias and variance, computational cost
Define conditions for these algorithms to converge to the optimal Q and optimal π and give at least one way to guarantee such conditions are met.