Introduction to Reinforcement Learning and Value-based Methods
Zongqing Lu, Peking University
RL China Summer School, 2020/7/27
Content

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Many Faces of Reinforcement Learning

[Diagram: reinforcement learning sits at the intersection of many fields: computer science (machine learning), engineering (optimal control), mathematics (operations research), economics (bounded rationality), psychology (classical conditioning), and neuroscience (reward system).]
Branches of Machine Learning

[Diagram: machine learning branches into supervised learning, unsupervised learning, and reinforcement learning.]
Characteristics of Reinforcement Learning

❖ What makes reinforcement learning different from other machine learning paradigms?
• There is no supervisor, only a reward signal; learning is by trial and error
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non-i.i.d. data)
• The agent's actions affect the subsequent data it receives
The RL Problem

[Diagram: agent-environment interaction loop with action a_t, state s_t, and reward r_{t+1}.]

❖ The agent is to maximize the cumulative reward
❖ All goals and purposes can be described by the maximization of the expected cumulative reward
Inside an RL Agent

❖ An RL agent may include one or more of these components
• Policy: the agent's behavior function
• Value function: how good is a state and/or action
• Model: the agent's representation of the environment
Categorizing RL Agents

❖ Value based
• No policy
• Value function
❖ Policy based
• Policy
• No value function
❖ Actor-critic
• Policy
• Value function
❖ Model free
• Policy and/or value function
• No model
❖ Model based
• Policy and/or value function
• Model
Markov Decision Processes

❖ Markov decision processes formally describe an environment for reinforcement learning
• The environment is fully observable
❖ Almost all RL problems can be formalized as MDPs
• Optimal control primarily deals with continuous MDPs
• Partially observable problems can be converted into MDPs
• Bandits are MDPs with one state
❖ All states in an MDP have the Markov property
• ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, …, S_t]
• The current state encapsulates all the statistics we need to decide the future
MDP

❖ An MDP is a tuple ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩
❖ 𝒮 - a set of states
❖ 𝒜 - a set of actions
❖ 𝒫 - transition probability function
• 𝒫^a_{ss′} = ℙ[S_{t+1} = s′ | S_t = s, A_t = a]
❖ ℛ - reward function
• ℛ^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a]
❖ γ - discounting factor for future reward, γ ∈ [0, 1]
Policy

❖ A policy π is a distribution over actions given states
• π(a|s) = ℙ[A_t = a | S_t = s]
❖ Given an MDP ℳ = ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩ and a policy π
• The state sequence S_1, S_2, … is a Markov process ⟨𝒮, 𝒫^π⟩
• The state and reward sequence S_1, R_2, S_2, … is a Markov reward process ⟨𝒮, 𝒫^π, ℛ^π, γ⟩, where
  𝒫^π_{ss′} = ∑_{a∈𝒜} π(a|s) 𝒫^a_{ss′}
  ℛ^π_s = ∑_{a∈𝒜} π(a|s) ℛ^a_s
Value Function

❖ The state-value function v_π(s) of an MDP is the expected return starting from state s, and then following policy π
  v_π(s) = 𝔼_π[G_t | S_t = s]
         = 𝔼_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s]
         = 𝔼_π[R_{t+1} + γ(R_{t+2} + γR_{t+3} + …) | S_t = s]
         = 𝔼_π[R_{t+1} + γG_{t+1} | S_t = s]
         = 𝔼_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]
❖ The action-value function q_π(s, a) of an MDP is the expected return starting from state s, taking action a, and then following policy π
  q_π(s, a) = 𝔼_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s, A_t = a]
            = 𝔼_π[R_{t+1} + γ 𝔼_{a′∼π} q_π(S_{t+1}, a′) | S_t = s, A_t = a]
Bellman Expectation Equations

[Backup diagrams: v_π(s) at the root state s, branching over actions a to q_π(s, a), then over rewards r and successor states s′ to v_π(s′).]

  v_π(s) = ∑_{a∈𝒜} π(a|s) q_π(s, a)
  q_π(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_π(s′)
  v_π(s) = ∑_{a∈𝒜} π(a|s) (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_π(s′))
  q_π(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} ∑_{a′∈𝒜} π(a′|s′) q_π(s′, a′)
Optimal Value Function

❖ The optimal state-value function v*(s) is the maximum value function over all policies
  v*(s) = max_π v_π(s)
❖ The optimal action-value function q*(s, a) is the maximum action-value function over all policies
  q*(s, a) = max_π q_π(s, a)
Optimal Policy

❖ Define a partial ordering over policies: π ≥ π′ if v_π(s) ≥ v_{π′}(s), ∀s
• There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π′, ∀π′
• All optimal policies achieve the optimal value function, v_{π*}(s) = v*(s)
• All optimal policies achieve the optimal action-value function, q_{π*}(s, a) = q*(s, a)
• If we know q*(s, a), we immediately have the optimal policy
Bellman Optimality Equations

[Backup diagrams: v*(s) at the root state s, maximizing over actions a to q*(s, a), then over rewards r and successor states s′ to v*(s′).]

  v*(s) = max_a q*(s, a)
  q*(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′)
  v*(s) = max_a (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′))
  q*(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} max_{a′} q*(s′, a′)
Solving the Bellman Optimality Equation

❖ If we have complete information of the environment, this turns into a planning problem, solvable by dynamic programming
❖ Unfortunately, in most scenarios we do not know 𝒫^a_{ss′} or ℛ^a_s, so we cannot solve MDPs by directly applying the Bellman equation
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Dynamic Programming

❖ DP assumes full knowledge of the MDP
❖ Planning in an MDP
❖ For prediction:
• Input: MDP ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩ and policy π
• Output: value function v_π
❖ For control:
• Input: MDP ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩
• Output: optimal value function v* and optimal policy π*
Policy Evaluation

❖ Problem: compute the value function for a given policy π
❖ Solution: iteration of the Bellman expectation backup
  v_1 → v_2 → … → v_π
• At each iteration k + 1
• Synchronous update of v_{k+1}(s) from v_k(s′), ∀s ∈ 𝒮, where s′ is a successor state of s
  v_{k+1}(s) = ∑_{a∈𝒜} π(a|s) (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_k(s′))
• In matrix form: v_{k+1} = ℛ^π + γ𝒫^π v_k
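To make the backup concrete, here is a minimal sketch of iterative policy evaluation in Python, assuming a small tabular MDP stored as numpy arrays; the array layout (P, R, pi) is illustrative, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    # P[a, s, s2] = Pr(S' = s2 | S = s, A = a); R[s, a] = expected reward; pi[s, a] = pi(a|s)
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead
        v_new = (pi * q).sum(axis=1)                  # Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```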
Evaluating a Random Policy

❖ Undiscounted episodic MDP (a small gridworld)
❖ Actions leading out of the grid leave the state unchanged
❖ Two terminal states
❖ A uniform random policy
  π(up | ·) = π(down | ·) = π(right | ·) = π(left | ·) = 0.25
Policy Evaluation in Gridworld

[Figure: v_k for the random policy at successive iterations k.]
Policy Improvement

❖ Consider a deterministic policy, a = π(s)
❖ Improve the policy by acting greedily
  π′(s) = argmax_{a∈𝒜} q_π(s, a)
❖ This improves the value from any state s over one step
  q_π(s, π′(s)) = max_{a∈𝒜} q_π(s, a) ≥ q_π(s, π(s)) = v_π(s)
❖ It therefore improves the value function, v_{π′}(s) ≥ v_π(s)
  v_π(s) ≤ q_π(s, π′(s)) = 𝔼_{π′}[R_{t+1} + γv_π(S_{t+1}) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γq_π(S_{t+1}, π′(S_{t+1})) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γR_{t+2} + γ²q_π(S_{t+2}, π′(S_{t+2})) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γR_{t+2} + … | S_t = s] = v_{π′}(s)
Policy Improvement (cont'd)

❖ If improvement stops,
  q_π(s, π′(s)) = max_{a∈𝒜} q_π(s, a) = q_π(s, π(s)) = v_π(s)
❖ Then the Bellman optimality equation has been satisfied
  v_π(s) = max_{a∈𝒜} q_π(s, a)
❖ Therefore v_π(s) = v*(s), ∀s ∈ 𝒮
❖ π is an optimal policy
Policy Improvement in Gridworld

[Figure: V_k for the random policy at successive iterations k, alongside the greedy policy w.r.t. V_k.]
Policy Iteration

❖ π_0 →^E v_{π_0} →^I π_1 →^E v_{π_1} →^I π_2 →^E ⋯ →^I π* →^E v*
❖ Policy evaluation: estimate v_π
• Iterative policy evaluation
❖ Policy improvement: generate π′ ≥ π
• Greedy policy improvement
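A minimal policy-iteration sketch under the same assumptions, reusing the illustrative policy_evaluation from the earlier sketch as the E step:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)            # E: evaluate current policy
        q = R + gamma * np.einsum('ast,t->sa', P, v)      # one-step lookahead
        greedy = np.eye(n_actions)[q.argmax(axis=1)]      # I: act greedily w.r.t. q
        if np.array_equal(greedy, pi):
            return pi, v                                  # a stable policy is optimal
        pi = greedy
```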
Policy Iteration (cont'd)

❖ Does policy evaluation need to converge to v_π?
❖ Or stop after k iterations of policy evaluation?
• k = 3 is sufficient for the gridworld example
❖ Why not update the policy every iteration? i.e., k = 1
• This is value iteration
Value Iteration

❖ Any optimal policy can be subdivided into two components:
• An optimal first action A*
• Followed by an optimal policy from successor state s′
❖ Deterministic value iteration
• If we know the solution to subproblems v*(s′)
• Then the solution v*(s) can be found by one-step lookahead
  v*(s) ← max_{a∈𝒜} (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′))
• Value iteration is applying these updates iteratively
• Intuition: start with final rewards and work backwards
Value Iteration (cont'd)

❖ Problem: find the optimal policy π
❖ Solution: iteration of the Bellman optimality backup
  v_1 → v_2 → ⋯ → v*
❖ At each iteration k + 1, update v_{k+1}(s) from v_k(s′), ∀s ∈ 𝒮
  v_{k+1}(s) = max_{a∈𝒜} (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_k(s′))
❖ Convergence to v*
❖ Unlike policy iteration, there is no explicit policy
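A matching value-iteration sketch with the same illustrative P, R layout; note there is no explicit policy until it is extracted at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead
        v_new = q.max(axis=1)                         # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)            # greedy policy extracted at the end
        v = v_new
```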
Shortest Path Example

[Figure: value iteration on a shortest-path gridworld.]
Generalized Policy Iteration

❖ π_0 →^E v_{π_0} →^I π_1 →^E v_{π_1} →^I π_2 →^E ⋯ →^I π* →^E v*
❖ Policy evaluation: estimate v_π
• Any policy evaluation algorithm
❖ Policy improvement: generate π′ ≥ π
• Any policy improvement algorithm
Questions

❖ Does iterative policy evaluation converge to v_π?
❖ Does policy iteration converge to v*?
❖ Does value iteration converge to v*?
❖ Is the solution unique?
❖ How fast do these algorithms converge?

Yes, all of this can be proved based on contraction mapping theory!
Contraction Mapping

❖ Value function space
• Consider the vector space 𝒱 with |𝒮| dimensions over value functions
• Each point in this space fully specifies a state-value function v
❖ Value function ∞-norm
• The distance between state-value functions u and v is measured by the ∞-norm
  ∥u − v∥_∞ = max_{s∈𝒮} |u(s) − v(s)|
❖ Define the Bellman expectation backup operator 𝒯^π
  𝒯^π(v) = ℛ^π + γ𝒫^π v
❖ The operator is a γ-contraction, i.e., it makes value functions closer by at least γ
  ∥𝒯^π(u) − 𝒯^π(v)∥_∞ = ∥(ℛ^π + γ𝒫^π u) − (ℛ^π + γ𝒫^π v)∥_∞
                       = ∥γ𝒫^π(u − v)∥_∞
                       ≤ ∥γ𝒫^π ∥u − v∥_∞∥_∞
                       ≤ γ∥u − v∥_∞
Contraction Mapping Theorem

❖ For any metric space 𝒱 that is complete under an operator 𝒯(v), where 𝒯 is a γ-contraction
• 𝒯 converges to a unique fixed point
• At a linear convergence rate of γ
Convergence of Policy Evaluation and Policy Iteration

❖ The Bellman expectation operator 𝒯^π has a unique fixed point
❖ v_π is a fixed point of 𝒯^π (by the Bellman expectation equation)
❖ Iterative policy evaluation converges on v_π
❖ Policy iteration converges on v*
Convergence of Value Iteration

❖ Define the Bellman optimality backup operator 𝒯*
  𝒯*(v) = max_{a∈𝒜} (ℛ^a + γ𝒫^a v)
❖ The operator is a γ-contraction (similar to the previous proof)
  ∥𝒯*(u) − 𝒯*(v)∥_∞ ≤ γ∥u − v∥_∞
❖ The Bellman optimality operator 𝒯* has a unique fixed point
❖ v* is a fixed point of 𝒯* (by the Bellman optimality equation)
❖ Value iteration converges on v*
DP Summary

❖ Algorithms are based on the state-value function v_π(s) or v*(s)
❖ Complexity O(mn²) per iteration, for m actions and n states
❖ DP uses full-width backups, and for each backup
• Every successor state and action is considered
• Using knowledge of the MDP transitions and reward function
• For large problems, DP suffers from the curse of dimensionality
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Monte-Carlo Methods

❖ MC methods learn directly from episodes of experience
❖ MC is model-free: no knowledge of MDP transitions/rewards
❖ MC learns from complete episodes: no bootstrapping
❖ MC uses the simplest possible idea: value = mean return
❖ Can only apply MC to episodic MDPs
• All episodes must terminate
Monte-Carlo Policy Evaluation

❖ Goal: learn v_π from episodes of experience under policy π
  S_1, A_1, R_2, …, S_k ∼ π
❖ Recall that the return is the total discounted reward:
  G_t = R_{t+1} + γR_{t+2} + … + γ^{T−1}R_T
❖ Recall that the value function is the expected return:
  v_π(s) = 𝔼_π[G_t | S_t = s]
❖ Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
Monte-Carlo Policy Evaluation (cont'd)

❖ To evaluate state s
❖ The first/every time-step t that state s is visited in an episode:
❖ Increment counter N(s) ← N(s) + 1
❖ Increment total return S(s) ← S(s) + G_t
❖ Value is estimated by the mean return V(s) ← S(s)/N(s)
❖ By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
Incremental Monte-Carlo Updates

❖ Incremental mean
  μ_k = (1/k) ∑_{j=1}^k x_j = (1/k)(x_k + (k − 1)μ_{k−1}) = μ_{k−1} + (1/k)(x_k − μ_{k−1})
❖ Update V(s) incrementally after an episode
❖ For each state S_t with return G_t
  N(S_t) ← N(S_t) + 1
  V(S_t) ← V(S_t) + (1/N(S_t))(G_t − V(S_t))
❖ In non-stationary problems, a constant step size α can be used to track a running mean, i.e., forget old episodes
  V(S_t) ← V(S_t) + α(G_t − V(S_t))
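A minimal sketch of first-visit MC evaluation with the incremental mean update; the sample_episode(pi) helper, which returns one terminating episode as (state, reward) pairs, is a hypothetical stand-in for environment interaction, and states are assumed hashable (tabular):

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, pi, n_episodes=10000, gamma=1.0):
    N = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(pi)        # [(s_0, r_1), (s_1, r_2), ...]
        G, returns = 0.0, []
        for s, r in reversed(episode):      # compute returns backwards in time
            G = r + gamma * G
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):      # forward order, first visit only
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (G - V[s]) / N[s]       # incremental mean update
    return V
```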
Monte-Carlo Policy Iteration

❖ Generalized policy iteration via Monte-Carlo evaluation
• Policy evaluation: Monte-Carlo policy evaluation, V = v_π?
• Policy improvement: greedy policy improvement?
Policy Iteration Using the Action-Value Function

❖ Greedy policy improvement over V(s) requires a model of the MDP
  π′(s) = argmax_{a∈𝒜} (ℛ^a_s + 𝒫^a_{ss′} V(s′))
❖ Greedy policy improvement over Q(s, a) is model-free
  π′(s) = argmax_{a∈𝒜} Q(s, a)
Monte-Carlo Policy Iteration (cont'd)

• Policy evaluation: Monte-Carlo policy evaluation, Q = q_π
• Policy improvement: greedy policy improvement?
ϵ-greedy Exploration

❖ Simplest idea for ensuring continual exploration
❖ All m = |𝒜| actions are tried with non-zero probability
❖ With probability 1 − ϵ choose the greedy action
❖ With probability ϵ choose a random action
  π(a|s) = ϵ/m + 1 − ϵ   if a = a* = argmax_{a′∈𝒜} Q(s, a′)
  π(a|s) = ϵ/m           otherwise
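A minimal sketch of this rule, assuming Q is a numpy array of action values for the current state:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    # With probability epsilon pick uniformly at random, else pick the greedy
    # action; the greedy action thus has total probability eps/m + 1 - eps.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```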
ϵ-greedy Policy Improvement

❖ For any ϵ-greedy policy π, the ϵ-greedy policy π′ with respect to q_π is an improvement, v_{π′} ≥ v_π
  q_π(s, π′(s)) = ∑_{a∈𝒜} π′(a|s) q_π(s, a)
               = ϵ/m ∑_{a∈𝒜} q_π(s, a) + (1 − ϵ) max_{a∈𝒜} q_π(s, a)
               ≥ ϵ/m ∑_{a∈𝒜} q_π(s, a) + (1 − ϵ) ∑_{a∈𝒜} ((π(a|s) − ϵ/m)/(1 − ϵ)) q_π(s, a)
               = ∑_{a∈𝒜} π(a|s) q_π(s, a)
               = v_π(s)
❖ From the policy improvement theorem, v_{π′} ≥ v_π
Monte-Carlo Control

❖ Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ q_π
• Policy improvement: ϵ-greedy policy improvement
Greedy in the Limit with Infinite Exploration

❖ GLIE
• All state-action pairs are explored infinitely often
  lim_{k→∞} N_k(s, a) = ∞
• The policy converges to a greedy policy
  lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈𝒜} Q_k(s, a′))
❖ ϵ-greedy is GLIE if ϵ reduces to zero at ϵ_k = 1/k
GLIE Monte-Carlo Control

❖ Sample the k-th episode using π: {S_1, A_1, R_2, …, S_T} ∼ π
❖ For each state-action pair (S_t, A_t) in the episode
  N(S_t, A_t) ← N(S_t, A_t) + 1
  Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t))(G_t − Q(S_t, A_t))
❖ Improve the policy based on the new action-value function
  ϵ ← 1/k
  π ← ϵ-greedy(Q)
❖ GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q*(s, a)
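A minimal every-visit sketch of this loop, assuming a gym-style env with the classic reset()/step() interface, hashable states, and the epsilon_greedy helper from the earlier sketch:

```python
from collections import defaultdict
import numpy as np

def glie_mc_control(env, n_episodes=50000, gamma=1.0):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                 # GLIE schedule eps_k = 1/k
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], eps)             # eps-greedy w.r.t. current Q
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        G = 0.0
        for s, a, r in reversed(episode):             # every-visit updates
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q
```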
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Temporal-Difference Learning

❖ TD methods learn directly from episodes of experience
❖ TD is model-free: no knowledge of MDP transitions/rewards
❖ TD learns from incomplete episodes: by bootstrapping
❖ TD updates a guess towards a guess
MC and TD Policy Evaluation

❖ Goal: learn v_π online from experience under policy π
❖ Incremental every-visit MC
• V(S_t) is updated toward the actual return G_t
  V(S_t) ← V(S_t) + α(G_t − V(S_t))
❖ Simplest TD learning
• V(S_t) is updated toward the estimated return R_{t+1} + γV(S_{t+1})
  V(S_t) ← V(S_t) + α(R_{t+1} + γV(S_{t+1}) − V(S_t))
• R_{t+1} + γV(S_{t+1}) is called the TD target
• δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t) is called the TD error
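A minimal TD(0) evaluation sketch under the same gym-style assumptions; policy(s) is a hypothetical function sampling an action from π(·|s):

```python
from collections import defaultdict

def td0_evaluation(env, policy, n_episodes=10000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])  # TD target
            V[s] += alpha * (target - V[s])                # update on the TD error
            s = s2
    return V
```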
Advantages and Disadvantages of MC and TD

❖ TD can learn before knowing the final outcome
• TD can learn online after every step
• MC must wait until the end of the episode, when the return is known
❖ TD can learn without the final outcome
• TD works in continuing (non-terminating) environments
• MC only works for episodic (terminating) environments
Bias/Variance Tradeoff

❖ The return G_t = R_{t+1} + γR_{t+2} + … + γ^{T−1}R_T is an unbiased estimate of v_π(S_t)
❖ The true TD target R_{t+1} + γv_π(S_{t+1}) is an unbiased estimate of v_π(S_t)
❖ The TD target R_{t+1} + γV(S_{t+1}) is a biased estimate of v_π(S_t)
❖ The TD target has much lower variance than the return
• The return depends on many actions, transitions, and rewards
• The TD target depends on only one action, transition, and reward
TD Control: SARSA

[Backup diagram: from (S, A), observe reward R and next state S′, then select next action A′.]

❖ Every time-step:
• Policy evaluation: SARSA update, Q ≈ q_π
  Q(S, A) ← Q(S, A) + α(R + γQ(S′, A′) − Q(S, A))
• Policy improvement: ϵ-greedy policy improvement
SARSA Algorithm

[Pseudocode: the SARSA algorithm, as in Sutton & Barto.]
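A minimal tabular SARSA sketch under the same gym-style and epsilon_greedy assumptions; note that the next action A′ is drawn from the same ϵ-greedy policy, which makes it on-policy:

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_episodes=10000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], epsilon)
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q[s2], epsilon)      # on-policy next action A'
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```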
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Off-Policy Learning

❖ Off-policy learning
• Evaluate a target policy π(a|s) to compute v_π(s) or q_π(s, a), while following a behavior policy μ(a|s)
  {S_1, A_1, R_2, …, S_T} ∼ μ
❖ Why is this important?
• Learn from observing humans or other agents
• Reuse experience generated from old policies π_1, π_2, …, π_{t−1}
• Learn about the optimal policy while following an exploratory policy
• Learn about multiple policies while following one policy
Importance Sampling

❖ Estimate the expectation under a different distribution
  𝔼_{X∼P}[f(X)] = ∑ P(X) f(X)
               = ∑ Q(X) (P(X)/Q(X)) f(X)
               = 𝔼_{X∼Q}[(P(X)/Q(X)) f(X)]
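A toy numeric check of this identity, estimating 𝔼_{X∼P}[f(X)] from samples drawn under Q; the distributions here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])            # target distribution P
q = np.array([1/3, 1/3, 1/3])            # behavior distribution Q
f = np.array([1.0, 5.0, 10.0])           # f evaluated at each outcome

x = rng.choice(3, size=100000, p=q)      # sample X ~ Q
estimate = np.mean((p[x] / q[x]) * f[x]) # reweight by P(X)/Q(X)
print(estimate, "vs exact", np.dot(p, f))  # both close to 2.7
```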
Importance Sampling for Off-Policy MC

❖ Use returns generated from μ to evaluate π
❖ Weight return G_t according to the similarity between policies
❖ Multiply importance sampling corrections along the whole episode
  G_t^{π/μ} = (π(A_t|S_t)/μ(A_t|S_t)) (π(A_{t+1}|S_{t+1})/μ(A_{t+1}|S_{t+1})) ⋯ (π(A_T|S_T)/μ(A_T|S_T)) G_t
❖ Update values towards the corrected return
  V(S_t) ← V(S_t) + α(G_t^{π/μ} − V(S_t))
❖ Importance sampling can dramatically increase variance
Importance Sampling for Off-Policy TD

❖ Use TD targets generated from μ to evaluate π
❖ Weight the TD target R + γV(S′) by importance sampling
❖ Only need a single importance correction
  V(S_t) ← V(S_t) + α((π(A_t|S_t)/μ(A_t|S_t))(R_{t+1} + γV(S_{t+1})) − V(S_t))
❖ Much lower variance than Monte-Carlo importance sampling
Q-Learning

❖ Now consider off-policy learning of action-values Q(s, a)
❖ No importance sampling is required
❖ The action is chosen using the behavior policy A_t ∼ μ(·|S_t)
❖ But we consider an alternative successor action A′ ∼ π(·|S_{t+1})
❖ Update Q(S_t, A_t) towards the value of the alternative action
  Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γQ(S_{t+1}, A′) − Q(S_t, A_t))
Off-Policy Control with Q-Learning

❖ The target policy π is greedy w.r.t. Q(s, a)
  π(S_{t+1}) = argmax_{a′} Q(S_{t+1}, a′)
❖ The behavior policy μ is, e.g., ϵ-greedy w.r.t. Q(s, a)
❖ The Q-learning target then simplifies:
  R_{t+1} + γQ(S_{t+1}, A′) = R_{t+1} + γQ(S_{t+1}, argmax_{a′} Q(S_{t+1}, a′))
                            = R_{t+1} + γ max_{a′} Q(S_{t+1}, a′)
Off-Policy Control with Q-Learning (cont'd)

[Backup diagram: from (S, A), observe reward R and next state S′, then back up the max over A′.]

❖ Q-learning converges to the optimal action-value function, Q(s, a) → q*(s, a)
Off-Policy Control with Q-Learning (cont'd)

[Pseudocode: the Q-learning algorithm, as in Sutton & Barto.]
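A minimal tabular Q-learning sketch under the same assumptions; the only change from the SARSA sketch is the max in the target:

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_episodes=10000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)        # behavior policy: eps-greedy
            s2, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * Q[s2].max())  # target policy: greedy
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```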
Cliff Walking Example

[Figure: the cliff-walking gridworld, comparing the paths and online performance of SARSA and Q-learning.]
Summary on DP, MC, and TD

[Backup diagrams: MC backup, TD backup, and DP backup.]
Unified View of Reinforcement Learning

[Figure: unified view of RL methods, organized by sample vs. full backups and shallow vs. deep backups.]
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Reinforcement Learning in Practice

❖ Reinforcement learning can solve large problems, e.g.,
• Backgammon: 10^20 states
• Computer Go: 10^170 states
• Robots: continuous state space
❖ How can we scale up the model-free methods?
Value Function Approximation

❖ So far, we have represented the value function by a lookup table
• Every state s has an entry V(s)
• Every state-action pair (s, a) has an entry Q(s, a)
❖ Problem with large MDPs
• There are too many states and actions to store in memory
• It is too slow to learn the value of each state individually
❖ Solution for large MDPs
• Estimate the value function with function approximation
  v(s, w) ≈ v_π(s)    q(s, a, w) ≈ q_π(s, a)
• Generalize from seen states to unseen states
• Update parameter w using MC or TD learning
Types of Value Function Approximation

[Figure: function approximator architectures: s → v(s, w); (s, a) → q(s, a, w); and s → q(s, a_1, w), …, q(s, a_m, w) with one output per action.]
Control with Value Function Approximation

❖ Policy evaluation: approximate policy evaluation, q(·, ·, w) ≈ q_π
❖ Policy improvement: ϵ-greedy policy improvement
Action-Value Function Approximation

❖ Approximate the action-value function
  q(S, A, w) ≈ q_π(S, A)
❖ Minimize the mean-squared error between the approximate action-value function q(S, A, w) and the true function q_π(S, A)
  J(w) = 𝔼_π[(q_π(S, A) − q(S, A, w))²]
❖ Use stochastic gradient descent to find a local minimum
  Δw = α(q_π(S, A) − q(S, A, w)) ∇_w q(S, A, w)
Action-Value Function Approximation (cont'd)

❖ We do not know q_π(S, A)
❖ We substitute a target for q_π(S, A)
• For MC, the target is the return G_t
  Δw = α(G_t − q(S_t, A_t, w)) ∇_w q(S_t, A_t, w)
• For TD, the target is the TD target R_{t+1} + γq(S_{t+1}, A_{t+1}, w)
  Δw = α(R_{t+1} + γq(S_{t+1}, A_{t+1}, w) − q(S_t, A_t, w)) ∇_w q(S_t, A_t, w)
Online Q-Learning

1. Take action a according to an ϵ-greedy policy and observe (s, a, r, s′)
2. Δw = α(r + γ max_{a′} q(s′, a′, w) − q(s, a, w)) ∇_w q(s, a, w)

- Sequential states are strongly correlated!
- The target value is always changing!
Deep Q-Network

❖ DQN uses experience replay and a target network
• Take an ϵ-greedy action w.r.t. Q(s, a; w)
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Every N time-steps, set w⁻ = w

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
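A compact sketch of this recipe in PyTorch, putting the two stabilizers together (replay buffer, periodically synced target network). It assumes a gym-style env with a flat observation vector and discrete actions; the QNet and train_dqn names and all hyperparameters are illustrative, not the paper's implementation:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

def train_dqn(env, steps=50000, gamma=0.99, eps=0.1, batch=64, sync_every=1000):
    q = QNet(env.observation_space.shape[0], env.action_space.n)
    q_target = QNet(env.observation_space.shape[0], env.action_space.n)
    q_target.load_state_dict(q.state_dict())          # w^- = w
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    buffer = deque(maxlen=100000)                     # replay memory D
    s = env.reset()
    for t in range(steps):
        # epsilon-greedy behavior policy w.r.t. Q(s, a; w)
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = q(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s2, r, done, _ = env.step(a)
        buffer.append((s, a, r, s2, float(done)))     # store transition
        s = env.reset() if done else s2
        if len(buffer) < batch:
            continue
        # sample a random mini-batch; form the target with the frozen network
        S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*random.sample(buffer, batch)))
        with torch.no_grad():
            y = R + gamma * (1 - D) * q_target(S2).max(dim=1).values
        q_sa = q(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)
        opt.zero_grad(); loss.backward(); opt.step()
        if t % sync_every == 0:
            q_target.load_state_dict(q.state_dict())  # periodic target sync
    return q
```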
DQN in Atari

❖ End-to-end learning of Q(s, a) from pixels
❖ Input state s is a stack of raw pixels from the last 4 frames
❖ Output is Q(s, a) for 18 joystick/button positions
❖ Reward is the change in score for that step

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
DQN Results in Atari

[Figure: DQN performance relative to human level across Atari 2600 games.]

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
How much does DQN help?

[Table: Atari scores for Q-learning with and without experience replay and the target network; the combination of both works best.]
How accurate are the Q-values?

❖ Overestimation
• Target value: r + γ max_{a′} Q(s′, a′; w⁻)
• The max is the problem: 𝔼[max(X_1, X_2)] ≥ max(𝔼[X_1], 𝔼[X_2])
• For bootstrapping (learning estimates from estimates), such overestimation can be problematic

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
Double DQN

❖ In DQN, max_{a′} Q(s′, a′; w⁻) = Q(s′, argmax_{a′} Q(s′, a′; w⁻); w⁻)
• Action selected according to Q(·, ·; w⁻)
• Value also from Q(·, ·; w⁻)
❖ Use different networks to choose the action and evaluate its value
• Current network to choose the action
• Target network to evaluate its value
  r + γQ(s′, argmax_{a′} Q(s′, a′; w); w⁻)

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
Double DQN (cont'd)

❖ Double DQN uses experience replay and target networks
• Take action a_t according to the ϵ-greedy policy
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γQ(s′, argmax_{a′} Q(s′, a′; w); w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Soft target update: w⁻ = (1 − τ)w + τw⁻

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
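Relative to DQN, the change is one line in the target computation. A sketch, assuming q and q_target are the QNet modules and R, S2, D the batch tensors from the earlier DQN sketch:

```python
import torch

def double_dqn_target(q, q_target, R, S2, D, gamma=0.99):
    with torch.no_grad():
        best_a = q(S2).argmax(dim=1, keepdim=True)         # select a' with the current network
        q_val = q_target(S2).gather(1, best_a).squeeze(1)  # evaluate it with the target network
        return R + gamma * (1 - D) * q_val
```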
Dueling DQN

❖ Motivation: it is unnecessary to know the exact value of each action at every time-step
❖ Dueling DQN learns the state value without having to learn the effect of each action for each state
  V^π(s),  A^π(s, a) = Q^π(s, a) − V^π(s)
❖ The naive decomposition is "unidentifiable":
  Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
❖ Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a′∈𝒜} A(s, a′; θ, α))
❖ Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|𝒜|) ∑_{a′∈𝒜} A(s, a′; θ, α))

Wang et al., Dueling network architectures for deep reinforcement learning, ICML 2016.
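A minimal dueling head sketch in PyTorch using the mean-subtracted aggregation from the last equation; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())  # shared theta
        self.value = nn.Linear(128, 1)         # V(s; theta, beta)
        self.adv = nn.Linear(128, n_actions)   # A(s, a; theta, alpha)

    def forward(self, x):
        h = self.trunk(x)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=-1, keepdim=True)  # identifiable aggregation
```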
n-Step DQN

❖ n-step target: R_{t+1} + γR_{t+2} + ⋯ + γ^{n−1}R_{t+n} + γ^n max_a Q(S_{t+n}, a; θ)
• Less biased target values when Q-values are inaccurate
• Faster learning, especially early on
• But only actually correct when learning on-policy
❖ How to deal with it?
• Ignore the problem: often works in practice
• Dynamically choose n to get only on-policy data: works well when data is mostly on-policy and the action space is small
• Importance sampling
Distributional DQN

❖ Standard RL models the expected return: Q^π(s, a) = 𝔼[G_t]
  Q*(s, a) = r + γ max_{a′} Q*(s′, a′)
❖ Distributional RL models the distribution of the return Z, via the distributional Bellman equation
  Z(s, a) = r + γZ(s′, a′)

Bellemare et al., A distributional perspective on reinforcement learning, ICML 2017.
Prioritized Experience Replay

❖ Is uniform sampling from the replay buffer an efficient way to learn? Does each sample contribute to the learning equally?
❖ The higher the TD error, the greater the degree of update there is to the neural network weights
❖ Sample experiences according to TD error
• P(i) = p_i^α / ∑_k p_k^α, with (1) p_i = 1/rank(δ_i); (2) p_i = |δ_i| + ϵ
• Biased sampling is corrected with importance-sampling weights, annealed via β
  w_i = ((1/N) · (1/P(i)))^β
• Better than DQN in 41 of 49 Atari 2600 games

Schaul et al., Prioritized experience replay, arXiv:1511.05952, 2015.
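A toy sketch of proportional prioritized sampling with the correction weights above; a real implementation would use a sum-tree so sampling is O(log N):

```python
import numpy as np

def sample_prioritized(td_errors, batch, alpha=0.6, beta=0.4, eps=1e-5,
                       rng=np.random.default_rng()):
    p = (np.abs(td_errors) + eps) ** alpha   # priority p_i = |delta_i| + eps
    P = p / p.sum()                          # sampling distribution P(i)
    idx = rng.choice(len(P), size=batch, p=P)
    w = (len(P) * P[idx]) ** (-beta)         # w_i = (1/N * 1/P(i))^beta
    return idx, w / w.max()                  # normalize weights for stability
```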
Rainbow

[Figure: Rainbow combines DQN with double Q-learning, prioritized replay, dueling networks, multi-step learning, distributional RL, and noisy nets, outperforming each individual variant on Atari.]

Hessel et al., Rainbow: Combining improvements in deep reinforcement learning, AAAI 2018.
Can DQN Work with Continuous Actions?

❖ DQN uses experience replay and a target network
• Take action a_t = argmax_a Q(s, a; w) or act randomly
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Every N time-steps, set w⁻ = w
❖ With continuous actions, how do we perform the max?
DDPG

❖ Learn an approximate maximizer
• max_a Q(s, a; ϕ) = Q(s, argmax_a Q(s, a); ϕ)
• Train another network μ(s; θ) such that μ(s; θ) ≈ argmax_a Q(s, a; ϕ)
• Solve θ = argmax_θ Q(s, μ(s; θ); ϕ)
• By the chain rule: dQ_ϕ/dθ = (dQ_ϕ/da)(da/dθ)
• New target: r + γQ(s′, μ(s′; θ⁻); ϕ⁻)

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
DDPG (cont'd)

❖ DDPG uses experience replay and target networks
• Take action a_t = μ(s_t; θ) + 𝒩_t (exploration noise)
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Compute the TD error δ = r + γQ(s′, μ(s′; θ⁻); ϕ⁻) − Q(s, a; ϕ)
• Critic update: Δϕ ← αδ dQ(s, a; ϕ)/dϕ
• Actor update: Δθ ← β (dQ(s, a; ϕ)/da)(dμ(s; θ)/dθ)
• Soft-update the target parameters ϕ⁻ and θ⁻

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
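A compact sketch of one DDPG update step in PyTorch, assuming the actor mu, a critic Q whose forward takes (state, action), their target copies, optimizers, and a sampled batch S, A, R, S2, D already exist; names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def ddpg_update(mu, Q, mu_target, Q_target, opt_actor, opt_critic,
                S, A, R, S2, D, gamma=0.99, tau=0.005):
    # critic: regress Q(s, a) onto the frozen target r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = R + gamma * (1 - D) * Q_target(S2, mu_target(S2)).squeeze(1)
    critic_loss = F.mse_loss(Q(S, A).squeeze(1), y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # actor: maximize Q(s, mu(s)), i.e., follow dQ/da * dmu/dtheta
    actor_loss = -Q(S, mu(S)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # soft update of the target networks
    for net, tgt in ((mu, mu_target), (Q, Q_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```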
TD3: Twin Delayed DDPG

❖ Clipped double Q-learning against overestimation
• Two Q-functions use a single target
• TD target: y = r + γ min_{i=1,2} Q_{θ′_i}(s′, μ_{ϕ′}(s′))
❖ Delayed policy updates
• Value and policy updates are deeply coupled
• The policy updates less frequently than Q (stable target)
❖ Target policy smoothing
• A deterministic policy can overfit to narrow peaks developed by the Q-function approximator
• a_clip(s′) = μ_{ϕ′}(s′) + clip(𝒩(0, σ), −c, +c)
  y = r + γ min_{i=1,2} Q_{θ′_i}(s′, a_clip(s′))

Fujimoto et al., Addressing function approximation error in actor-critic methods, ICML 2018.
Practical Tips for Deep Q-Networks

❖ DQN takes care to stabilize
• Test on easy tasks first; make sure your implementation is correct
❖ Large replay buffers help improve stability
❖ It takes time; be patient, it might be no better than random for a while
❖ Start with high exploration (ϵ) and gradually reduce it; do the same for the learning rate
❖ Double DQN helps a lot in practice; it is simple and has no downsides
❖ Run multiple random seeds; results are very inconsistent between runs
Convergence of Control Algorithms

Algorithm    | Table Lookup | Linear | Non-linear
MC Control   | ✓            | (✓)    | ✗
SARSA        | ✓            | (✓)    | ✗
Q-learning   | ✓            | ✗      | ✗

(✓) = chatters around near-optimal value function
References

• Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 2018 (some slides are borrowed from it)
• David Silver, Reinforcement Learning Course, UCL (some slides are borrowed from it)
• Sergey Levine, Deep Reinforcement Learning Course, UC Berkeley (some slides are borrowed from it)
• Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
• van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
• Wang et al., Dueling network architectures for deep reinforcement learning, ICML 2016.
• Bellemare et al., A distributional perspective on reinforcement learning, ICML 2017.
• Schaul et al., Prioritized experience replay, arXiv:1511.05952, 2015.
• Hessel et al., Rainbow: Combining improvements in deep reinforcement learning, AAAI 2018.
• Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
• Fujimoto et al., Addressing function approximation error in actor-critic methods, ICML 2018.