Introduction to Reinforcement Learning and Value-based Methods
Zongqing Lu, Peking University
RL China Summer School, 2020/7/27
Content

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Many Faces of Reinforcement Learning

[Diagram: reinforcement learning sits at the intersection of many fields: computer science (machine learning), engineering (optimal control), mathematics (operations research), economics (bounded rationality), psychology (classical conditioning), and neuroscience (reward system).]
Branches of Machine Learning

[Diagram: machine learning branches into supervised learning, unsupervised learning, and reinforcement learning.]
Characteristics of Reinforcement Learning

❖ What makes reinforcement learning different from other machine learning paradigms?
• There is no supervisor, only a reward signal; learning is by trial and error
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non-i.i.d. data)
• The agent's actions affect the subsequent data it receives
The RL Problem

[Diagram: agent-environment interaction loop with action a_t, state s_t, and reward r_{t+1}.]

❖ The agent is to maximize the cumulative reward
❖ All goals and purposes can be described by the maximization of the expected cumulative reward
Inside an RL Agent

❖ An RL agent may include one or more of these components
• Policy: the agent's behavior function
• Value function: how good is a state and/or action
• Model: the agent's representation of the environment
Categorizing RL Agents

❖ Value based
• No policy
• Value function
❖ Policy based
• Policy
• No value function
❖ Actor-critic
• Policy
• Value function
❖ Model free
• Policy and/or value function
• No model
❖ Model based
• Policy and/or value function
• Model
Markov Decision Processes

❖ Markov decision processes formally describe an environment for reinforcement learning
• The environment is fully observable
❖ Almost all RL problems can be formalized as MDPs
• Optimal control primarily deals with continuous MDPs
• Partially observable problems can be converted into MDPs
• Bandits are MDPs with one state
❖ All states in an MDP have the Markov property
• ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, …, S_t]
• The current state encapsulates all the statistics we need to decide the future
MDP

❖ An MDP is a tuple ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩
❖ 𝒮 - a set of states
❖ 𝒜 - a set of actions
❖ 𝒫 - transition probability function
• 𝒫^a_{ss′} = ℙ[S_{t+1} = s′ | S_t = s, A_t = a]
❖ ℛ - reward function
• ℛ^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a]
❖ γ - discounting factor for future reward, γ ∈ [0, 1]
Policy

❖ A policy π is a distribution over actions given states
• π(a|s) = ℙ[A_t = a | S_t = s]
❖ Given an MDP ℳ = ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩ and a policy π
• The state sequence S_1, S_2, … is a Markov process ⟨𝒮, 𝒫^π⟩
• The state and reward sequence S_1, R_2, S_2, … is a Markov reward process ⟨𝒮, 𝒫^π, ℛ^π, γ⟩, where
  𝒫^π_{ss′} = ∑_{a∈𝒜} π(a|s) 𝒫^a_{ss′}
  ℛ^π_s = ∑_{a∈𝒜} π(a|s) ℛ^a_s
Value Function

❖ The state-value function v_π(s) of an MDP is the expected return starting from state s, and then following policy π
  v_π(s) = 𝔼_π[G_t | S_t = s]
         = 𝔼_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s]
         = 𝔼_π[R_{t+1} + γ(R_{t+2} + γR_{t+3} + …) | S_t = s]
         = 𝔼_π[R_{t+1} + γG_{t+1} | S_t = s]
         = 𝔼_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]
❖ The action-value function q_π(s, a) of an MDP is the expected return starting from state s, taking action a, and then following policy π
  q_π(s, a) = 𝔼_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s, A_t = a]
            = 𝔼_π[R_{t+1} + γ 𝔼_{a′∼π} q_π(S_{t+1}, a′) | S_t = s, A_t = a]
Bellman Expectation Equations

[Backup diagrams: v_π(s) at the root state s, branching over actions a to q_π(s, a), then over rewards r and successor states s′ to v_π(s′).]

  v_π(s) = ∑_{a∈𝒜} π(a|s) q_π(s, a)
  q_π(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_π(s′)
  v_π(s) = ∑_{a∈𝒜} π(a|s) (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_π(s′))
  q_π(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} ∑_{a′∈𝒜} π(a′|s′) q_π(s′, a′)
Optimal Value Function

❖ The optimal state-value function v*(s) is the maximum value function over all policies
  v*(s) = max_π v_π(s)
❖ The optimal action-value function q*(s, a) is the maximum action-value function over all policies
  q*(s, a) = max_π q_π(s, a)
Optimal Policy

❖ Define a partial ordering over policies: π ≥ π′ if v_π(s) ≥ v_{π′}(s), ∀s
• There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π′, ∀π′
• All optimal policies achieve the optimal value function, v_{π*}(s) = v*(s)
• All optimal policies achieve the optimal action-value function, q_{π*}(s, a) = q*(s, a)
• If we know q*(s, a), we immediately have the optimal policy
Bellman Optimality Equations

[Backup diagrams: v*(s) at the root state s, maximizing over actions a to q*(s, a), then over rewards r and successor states s′ to v*(s′).]

  v*(s) = max_a q*(s, a)
  q*(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′)
  v*(s) = max_a (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′))
  q*(s, a) = ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} max_{a′} q*(s′, a′)
Solving the Bellman Optimality Equation

❖ If we have complete information of the environment, this turns into a planning problem, solvable by dynamic programming
❖ Unfortunately, in most scenarios we do not know 𝒫^a_{ss′} or ℛ^a_s, so we cannot solve MDPs by directly applying the Bellman equation
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Dynamic Programming

❖ DP assumes full knowledge of the MDP
❖ Planning in an MDP
❖ For prediction:
• Input: MDP ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩ and policy π
• Output: value function v_π
❖ For control:
• Input: MDP ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩
• Output: optimal value function v* and optimal policy π*
Policy Evaluation

❖ Problem: compute the value function for a given policy π
❖ Solution: iteration of the Bellman expectation backup
  v_1 → v_2 → … → v_π
• At each iteration k + 1
• Synchronous update of v_{k+1}(s) from v_k(s′), ∀s ∈ 𝒮, where s′ is a successor state of s
  v_{k+1}(s) = ∑_{a∈𝒜} π(a|s) (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_k(s′))
• In matrix form: v_{k+1} = ℛ^π + γ𝒫^π v_k
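To make the backup concrete, here is a minimal sketch of iterative policy evaluation in Python, assuming a small tabular MDP stored as numpy arrays; the array layout (P, R, pi) is illustrative, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    # P[a, s, s2] = Pr(S' = s2 | S = s, A = a); R[s, a] = expected reward; pi[s, a] = pi(a|s)
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead
        v_new = (pi * q).sum(axis=1)                  # Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```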
Evaluating a Random Policy

❖ Undiscounted episodic MDP (a small gridworld)
❖ Actions leading out of the grid leave the state unchanged
❖ Two terminal states
❖ A uniform random policy
  π(up | ·) = π(down | ·) = π(right | ·) = π(left | ·) = 0.25
Policy Evaluation in Gridworld

[Figure: v_k for the random policy at successive iterations k.]
Policy Improvement

❖ Consider a deterministic policy, a = π(s)
❖ Improve the policy by acting greedily
  π′(s) = argmax_{a∈𝒜} q_π(s, a)
❖ This improves the value from any state s over one step
  q_π(s, π′(s)) = max_{a∈𝒜} q_π(s, a) ≥ q_π(s, π(s)) = v_π(s)
❖ It therefore improves the value function, v_{π′}(s) ≥ v_π(s)
  v_π(s) ≤ q_π(s, π′(s)) = 𝔼_{π′}[R_{t+1} + γv_π(S_{t+1}) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γq_π(S_{t+1}, π′(S_{t+1})) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γR_{t+2} + γ²q_π(S_{t+2}, π′(S_{t+2})) | S_t = s]
         ≤ 𝔼_{π′}[R_{t+1} + γR_{t+2} + … | S_t = s] = v_{π′}(s)
Policy Improvement (cont'd)

❖ If improvement stops,
  q_π(s, π′(s)) = max_{a∈𝒜} q_π(s, a) = q_π(s, π(s)) = v_π(s)
❖ Then the Bellman optimality equation has been satisfied
  v_π(s) = max_{a∈𝒜} q_π(s, a)
❖ Therefore v_π(s) = v*(s), ∀s ∈ 𝒮
❖ π is an optimal policy
Policy Improvement in Gridworld

[Figure: V_k for the random policy at successive iterations k, alongside the greedy policy w.r.t. V_k.]
Policy Iteration

❖ π_0 →^E v_{π_0} →^I π_1 →^E v_{π_1} →^I π_2 →^E ⋯ →^I π* →^E v*
❖ Policy evaluation: estimate v_π
• Iterative policy evaluation
❖ Policy improvement: generate π′ ≥ π
• Greedy policy improvement
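A minimal policy-iteration sketch under the same assumptions, reusing the illustrative policy_evaluation from the earlier sketch as the E step:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)            # E: evaluate current policy
        q = R + gamma * np.einsum('ast,t->sa', P, v)      # one-step lookahead
        greedy = np.eye(n_actions)[q.argmax(axis=1)]      # I: act greedily w.r.t. q
        if np.array_equal(greedy, pi):
            return pi, v                                  # a stable policy is optimal
        pi = greedy
```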
Policy Iteration (cont'd)

❖ Does policy evaluation need to converge to v_π?
❖ Or stop after k iterations of policy evaluation?
• k = 3 is sufficient for the gridworld example
❖ Why not update the policy every iteration? i.e., k = 1
• This is value iteration
Value Iteration

❖ Any optimal policy can be subdivided into two components:
• An optimal first action A*
• Followed by an optimal policy from successor state s′
❖ Deterministic value iteration
• If we know the solution to subproblems v*(s′)
• Then the solution v*(s) can be found by one-step lookahead
  v*(s) ← max_{a∈𝒜} (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v*(s′))
• Value iteration is applying these updates iteratively
• Intuition: start with final rewards and work backwards
Value Iteration (cont'd)

❖ Problem: find the optimal policy π
❖ Solution: iteration of the Bellman optimality backup
  v_1 → v_2 → ⋯ → v*
❖ At each iteration k + 1, update v_{k+1}(s) from v_k(s′), ∀s ∈ 𝒮
  v_{k+1}(s) = max_{a∈𝒜} (ℛ^a_s + γ ∑_{s′∈𝒮} 𝒫^a_{ss′} v_k(s′))
❖ Convergence to v*
❖ Unlike policy iteration, there is no explicit policy
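A matching value-iteration sketch with the same illustrative P, R layout; note there is no explicit policy until it is extracted at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead
        v_new = q.max(axis=1)                         # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)            # greedy policy extracted at the end
        v = v_new
```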
Shortest Path Example

[Figure: value iteration on a shortest-path gridworld.]
Generalized Policy Iteration

❖ π_0 →^E v_{π_0} →^I π_1 →^E v_{π_1} →^I π_2 →^E ⋯ →^I π* →^E v*
❖ Policy evaluation: estimate v_π
• Any policy evaluation algorithm
❖ Policy improvement: generate π′ ≥ π
• Any policy improvement algorithm
Questions

❖ Does iterative policy evaluation converge to v_π?
❖ Does policy iteration converge to v*?
❖ Does value iteration converge to v*?
❖ Is the solution unique?
❖ How fast do these algorithms converge?

Yes, all of this can be proved based on contraction mapping theory!
Contraction Mapping

❖ Value function space
• Consider the vector space 𝒱 with |𝒮| dimensions over value functions
• Each point in this space fully specifies a state-value function v
❖ Value function ∞-norm
• The distance between state-value functions u and v is measured by the ∞-norm
  ∥u − v∥_∞ = max_{s∈𝒮} |u(s) − v(s)|
❖ Define the Bellman expectation backup operator 𝒯^π
  𝒯^π(v) = ℛ^π + γ𝒫^π v
❖ The operator is a γ-contraction, i.e., it makes value functions closer by at least γ
  ∥𝒯^π(u) − 𝒯^π(v)∥_∞ = ∥(ℛ^π + γ𝒫^π u) − (ℛ^π + γ𝒫^π v)∥_∞
                       = ∥γ𝒫^π(u − v)∥_∞
                       ≤ ∥γ𝒫^π ∥u − v∥_∞∥_∞
                       ≤ γ∥u − v∥_∞
Contraction Mapping Theorem

❖ For any metric space 𝒱 that is complete under an operator 𝒯(v), where 𝒯 is a γ-contraction
• 𝒯 converges to a unique fixed point
• At a linear convergence rate of γ
Convergence of Policy Evaluation and Policy Iteration

❖ The Bellman expectation operator 𝒯^π has a unique fixed point
❖ v_π is a fixed point of 𝒯^π (by the Bellman expectation equation)
❖ Iterative policy evaluation converges on v_π
❖ Policy iteration converges on v*
Convergence of Value Iteration

❖ Define the Bellman optimality backup operator 𝒯*
  𝒯*(v) = max_{a∈𝒜} (ℛ^a + γ𝒫^a v)
❖ The operator is a γ-contraction (similar to the previous proof)
  ∥𝒯*(u) − 𝒯*(v)∥_∞ ≤ γ∥u − v∥_∞
❖ The Bellman optimality operator 𝒯* has a unique fixed point
❖ v* is a fixed point of 𝒯* (by the Bellman optimality equation)
❖ Value iteration converges on v*
DP Summary

❖ Algorithms are based on the state-value function v_π(s) or v*(s)
❖ Complexity O(mn²) per iteration, for m actions and n states
❖ DP uses full-width backups, and for each backup
• Every successor state and action is considered
• Using knowledge of the MDP transitions and reward function
• For large problems, DP suffers from the curse of dimensionality
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Monte-Carlo Methods

❖ MC methods learn directly from episodes of experience
❖ MC is model-free: no knowledge of MDP transitions/rewards
❖ MC learns from complete episodes: no bootstrapping
❖ MC uses the simplest possible idea: value = mean return
❖ Can only apply MC to episodic MDPs
• All episodes must terminate
Monte-Carlo Policy Evaluation

❖ Goal: learn v_π from episodes of experience under policy π
  S_1, A_1, R_2, …, S_k ∼ π
❖ Recall that the return is the total discounted reward:
  G_t = R_{t+1} + γR_{t+2} + … + γ^{T−1}R_T
❖ Recall that the value function is the expected return:
  v_π(s) = 𝔼_π[G_t | S_t = s]
❖ Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
Monte-Carlo Policy Evaluation (cont'd)

❖ To evaluate state s
❖ The first/every time-step t that state s is visited in an episode:
❖ Increment counter N(s) ← N(s) + 1
❖ Increment total return S(s) ← S(s) + G_t
❖ Value is estimated by the mean return V(s) ← S(s)/N(s)
❖ By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
Incremental Monte-Carlo Updates

❖ Incremental mean
  μ_k = (1/k) ∑_{j=1}^k x_j = (1/k)(x_k + (k − 1)μ_{k−1}) = μ_{k−1} + (1/k)(x_k − μ_{k−1})
❖ Update V(s) incrementally after an episode
❖ For each state S_t with return G_t
  N(S_t) ← N(S_t) + 1
  V(S_t) ← V(S_t) + (1/N(S_t))(G_t − V(S_t))
❖ In non-stationary problems, a constant step size α can be used to track a running mean, i.e., forget old episodes
  V(S_t) ← V(S_t) + α(G_t − V(S_t))
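A minimal sketch of first-visit MC evaluation with the incremental mean update; the sample_episode(pi) helper, which returns one terminating episode as (state, reward) pairs, is a hypothetical stand-in for environment interaction, and states are assumed hashable (tabular):

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, pi, n_episodes=10000, gamma=1.0):
    N = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(pi)        # [(s_0, r_1), (s_1, r_2), ...]
        G, returns = 0.0, []
        for s, r in reversed(episode):      # compute returns backwards in time
            G = r + gamma * G
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):      # forward order, first visit only
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (G - V[s]) / N[s]       # incremental mean update
    return V
```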
Monte-Carlo Policy Iteration

❖ Generalized policy iteration via Monte-Carlo evaluation
• Policy evaluation: Monte-Carlo policy evaluation, V = v_π?
• Policy improvement: greedy policy improvement?
Policy Iteration Using the Action-Value Function

❖ Greedy policy improvement over V(s) requires a model of the MDP
  π′(s) = argmax_{a∈𝒜} (ℛ^a_s + 𝒫^a_{ss′} V(s′))
❖ Greedy policy improvement over Q(s, a) is model-free
  π′(s) = argmax_{a∈𝒜} Q(s, a)
Monte-Carlo Policy Iteration (cont'd)

• Policy evaluation: Monte-Carlo policy evaluation, Q = q_π
• Policy improvement: greedy policy improvement?
ϵ-greedy Exploration

❖ Simplest idea for ensuring continual exploration
❖ All m = |𝒜| actions are tried with non-zero probability
❖ With probability 1 − ϵ choose the greedy action
❖ With probability ϵ choose a random action
  π(a|s) = ϵ/m + 1 − ϵ   if a = a* = argmax_{a′∈𝒜} Q(s, a′)
  π(a|s) = ϵ/m           otherwise
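A minimal sketch of this rule, assuming Q is a numpy array of action values for the current state:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    # With probability epsilon pick uniformly at random, else pick the greedy
    # action; the greedy action thus has total probability eps/m + 1 - eps.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```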
ϵ-greedy Policy Improvement

❖ For any ϵ-greedy policy π, the ϵ-greedy policy π′ with respect to q_π is an improvement, v_{π′} ≥ v_π
  q_π(s, π′(s)) = ∑_{a∈𝒜} π′(a|s) q_π(s, a)
               = ϵ/m ∑_{a∈𝒜} q_π(s, a) + (1 − ϵ) max_{a∈𝒜} q_π(s, a)
               ≥ ϵ/m ∑_{a∈𝒜} q_π(s, a) + (1 − ϵ) ∑_{a∈𝒜} ((π(a|s) − ϵ/m)/(1 − ϵ)) q_π(s, a)
               = ∑_{a∈𝒜} π(a|s) q_π(s, a)
               = v_π(s)
❖ From the policy improvement theorem, v_{π′} ≥ v_π
Monte-Carlo Control

❖ Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ q_π
• Policy improvement: ϵ-greedy policy improvement
Greedy in the Limit with Infinite Exploration

❖ GLIE
• All state-action pairs are explored infinitely often
  lim_{k→∞} N_k(s, a) = ∞
• The policy converges to a greedy policy
  lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈𝒜} Q_k(s, a′))
❖ ϵ-greedy is GLIE if ϵ reduces to zero at ϵ_k = 1/k
GLIE Monte-Carlo Control

❖ Sample the k-th episode using π: {S_1, A_1, R_2, …, S_T} ∼ π
❖ For each state-action pair (S_t, A_t) in the episode
  N(S_t, A_t) ← N(S_t, A_t) + 1
  Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t))(G_t − Q(S_t, A_t))
❖ Improve the policy based on the new action-value function
  ϵ ← 1/k
  π ← ϵ-greedy(Q)
❖ GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q*(s, a)
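A minimal every-visit sketch of this loop, assuming a gym-style env with the classic reset()/step() interface, hashable states, and the epsilon_greedy helper from the earlier sketch:

```python
from collections import defaultdict
import numpy as np

def glie_mc_control(env, n_episodes=50000, gamma=1.0):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                 # GLIE schedule eps_k = 1/k
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], eps)             # eps-greedy w.r.t. current Q
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        G = 0.0
        for s, a, r in reversed(episode):             # every-visit updates
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q
```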
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Temporal-Difference Learning

❖ TD methods learn directly from episodes of experience
❖ TD is model-free: no knowledge of MDP transitions/rewards
❖ TD learns from incomplete episodes: by bootstrapping
❖ TD updates a guess towards a guess
MC and TD Policy Evaluation

❖ Goal: learn v_π online from experience under policy π
❖ Incremental every-visit MC
• V(S_t) is updated toward the actual return G_t
  V(S_t) ← V(S_t) + α(G_t − V(S_t))
❖ Simplest TD learning
• V(S_t) is updated toward the estimated return R_{t+1} + γV(S_{t+1})
  V(S_t) ← V(S_t) + α(R_{t+1} + γV(S_{t+1}) − V(S_t))
• R_{t+1} + γV(S_{t+1}) is called the TD target
• δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t) is called the TD error
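A minimal TD(0) evaluation sketch under the same gym-style assumptions; policy(s) is a hypothetical function sampling an action from π(·|s):

```python
from collections import defaultdict

def td0_evaluation(env, policy, n_episodes=10000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])  # TD target
            V[s] += alpha * (target - V[s])                # update on the TD error
            s = s2
    return V
```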
Advantages and Disadvantages of MC and TD

❖ TD can learn before knowing the final outcome
• TD can learn online after every step
• MC must wait until the end of the episode, when the return is known
❖ TD can learn without the final outcome
• TD works in continuing (non-terminating) environments
• MC only works for episodic (terminating) environments
Bias/Variance Tradeoff

❖ The return G_t = R_{t+1} + γR_{t+2} + … + γ^{T−1}R_T is an unbiased estimate of v_π(S_t)
❖ The true TD target R_{t+1} + γv_π(S_{t+1}) is an unbiased estimate of v_π(S_t)
❖ The TD target R_{t+1} + γV(S_{t+1}) is a biased estimate of v_π(S_t)
❖ The TD target has much lower variance than the return
• The return depends on many actions, transitions, and rewards
• The TD target depends on only one action, transition, and reward
TD Control: SARSA

[Backup diagram: from (S, A), observe reward R and next state S′, then select next action A′.]

❖ Every time-step:
• Policy evaluation: SARSA update, Q ≈ q_π
  Q(S, A) ← Q(S, A) + α(R + γQ(S′, A′) − Q(S, A))
• Policy improvement: ϵ-greedy policy improvement
SARSA Algorithm

[Pseudocode: the SARSA algorithm, as in Sutton & Barto.]
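A minimal tabular SARSA sketch under the same gym-style and epsilon_greedy assumptions; note that the next action A′ is drawn from the same ϵ-greedy policy, which makes it on-policy:

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_episodes=10000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], epsilon)
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q[s2], epsilon)      # on-policy next action A'
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```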
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Off-Policy Learning

❖ Off-policy learning
• Evaluate a target policy π(a|s) to compute v_π(s) or q_π(s, a), while following a behavior policy μ(a|s)
  {S_1, A_1, R_2, …, S_T} ∼ μ
❖ Why is this important?
• Learn from observing humans or other agents
• Reuse experience generated from old policies π_1, π_2, …, π_{t−1}
• Learn about the optimal policy while following an exploratory policy
• Learn about multiple policies while following one policy
Importance Sampling

❖ Estimate the expectation under a different distribution
  𝔼_{X∼P}[f(X)] = ∑ P(X) f(X)
               = ∑ Q(X) (P(X)/Q(X)) f(X)
               = 𝔼_{X∼Q}[(P(X)/Q(X)) f(X)]
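A toy numeric check of this identity, estimating 𝔼_{X∼P}[f(X)] from samples drawn under Q; the distributions here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])            # target distribution P
q = np.array([1/3, 1/3, 1/3])            # behavior distribution Q
f = np.array([1.0, 5.0, 10.0])           # f evaluated at each outcome

x = rng.choice(3, size=100000, p=q)      # sample X ~ Q
estimate = np.mean((p[x] / q[x]) * f[x]) # reweight by P(X)/Q(X)
print(estimate, "vs exact", np.dot(p, f))  # both close to 2.7
```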
Importance Sampling for Off-Policy MC

❖ Use returns generated from μ to evaluate π
❖ Weight return G_t according to the similarity between policies
❖ Multiply importance sampling corrections along the whole episode
  G_t^{π/μ} = (π(A_t|S_t)/μ(A_t|S_t)) (π(A_{t+1}|S_{t+1})/μ(A_{t+1}|S_{t+1})) ⋯ (π(A_T|S_T)/μ(A_T|S_T)) G_t
❖ Update values towards the corrected return
  V(S_t) ← V(S_t) + α(G_t^{π/μ} − V(S_t))
❖ Importance sampling can dramatically increase variance
Importance Sampling for Off-Policy TD

❖ Use TD targets generated from μ to evaluate π
❖ Weight the TD target R + γV(S′) by importance sampling
❖ Only need a single importance correction
  V(S_t) ← V(S_t) + α((π(A_t|S_t)/μ(A_t|S_t))(R_{t+1} + γV(S_{t+1})) − V(S_t))
❖ Much lower variance than Monte-Carlo importance sampling
Q-Learning

❖ Now consider off-policy learning of action-values Q(s, a)
❖ No importance sampling is required
❖ The action is chosen using the behavior policy A_t ∼ μ(·|S_t)
❖ But we consider an alternative successor action A′ ∼ π(·|S_{t+1})
❖ Update Q(S_t, A_t) towards the value of the alternative action
  Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γQ(S_{t+1}, A′) − Q(S_t, A_t))
Off-Policy Control with Q-Learning

❖ The target policy π is greedy w.r.t. Q(s, a)
  π(S_{t+1}) = argmax_{a′} Q(S_{t+1}, a′)
❖ The behavior policy μ is, e.g., ϵ-greedy w.r.t. Q(s, a)
❖ The Q-learning target then simplifies:
  R_{t+1} + γQ(S_{t+1}, A′) = R_{t+1} + γQ(S_{t+1}, argmax_{a′} Q(S_{t+1}, a′))
                            = R_{t+1} + γ max_{a′} Q(S_{t+1}, a′)
Off-Policy Control with Q-Learning (cont'd)

[Backup diagram: from (S, A), observe reward R and next state S′, then back up the max over A′.]

❖ Q-learning converges to the optimal action-value function, Q(s, a) → q*(s, a)
Off-Policy Control with Q-Learning (cont'd)

[Pseudocode: the Q-learning algorithm, as in Sutton & Barto.]
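A minimal tabular Q-learning sketch under the same assumptions; the only change from the SARSA sketch is the max in the target:

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_episodes=10000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)        # behavior policy: eps-greedy
            s2, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * Q[s2].max())  # target policy: greedy
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```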
Cliff Walking Example

[Figure: the cliff-walking gridworld, comparing the paths and online performance of SARSA and Q-learning.]
Summary on DP, MC, and TD

[Backup diagrams: MC backup, TD backup, and DP backup.]
Unified View of Reinforcement Learning

[Figure: unified view of RL methods, organized by sample vs. full backups and shallow vs. deep backups.]
Outline

❖ Introduction to Reinforcement Learning
• About RL
• RL problem
• Inside an RL agent
• Markov Decision Processes
❖ Value-based Methods
• Dynamic Programming
• Monte Carlo
• TD Learning
• Off-policy Learning
• DQN and its variants
Reinforcement Learning in Practice

❖ Reinforcement learning can solve large problems, e.g.,
• Backgammon: 10^20 states
• Computer Go: 10^170 states
• Robots: continuous state space
❖ How can we scale up the model-free methods?
Value Function Approximation

❖ So far, we have represented the value function by a lookup table
• Every state s has an entry V(s)
• Every state-action pair (s, a) has an entry Q(s, a)
❖ Problem with large MDPs
• There are too many states and actions to store in memory
• It is too slow to learn the value of each state individually
❖ Solution for large MDPs
• Estimate the value function with function approximation
  v(s, w) ≈ v_π(s)    q(s, a, w) ≈ q_π(s, a)
• Generalize from seen states to unseen states
• Update parameter w using MC or TD learning
Types of Value Function Approximation

[Figure: function approximator architectures: s → v(s, w); (s, a) → q(s, a, w); and s → q(s, a_1, w), …, q(s, a_m, w) with one output per action.]
Control with Value Function Approximation

❖ Policy evaluation: approximate policy evaluation, q(·, ·, w) ≈ q_π
❖ Policy improvement: ϵ-greedy policy improvement
Action-Value Function Approximation

❖ Approximate the action-value function
  q(S, A, w) ≈ q_π(S, A)
❖ Minimize the mean-squared error between the approximate action-value function q(S, A, w) and the true function q_π(S, A)
  J(w) = 𝔼_π[(q_π(S, A) − q(S, A, w))²]
❖ Use stochastic gradient descent to find a local minimum
  Δw = α(q_π(S, A) − q(S, A, w)) ∇_w q(S, A, w)
Action-Value Function Approximation (cont'd)

❖ We do not know q_π(S, A)
❖ We substitute a target for q_π(S, A)
• For MC, the target is the return G_t
  Δw = α(G_t − q(S_t, A_t, w)) ∇_w q(S_t, A_t, w)
• For TD, the target is the TD target R_{t+1} + γq(S_{t+1}, A_{t+1}, w)
  Δw = α(R_{t+1} + γq(S_{t+1}, A_{t+1}, w) − q(S_t, A_t, w)) ∇_w q(S_t, A_t, w)
Online Q-Learning

1. Take action a according to an ϵ-greedy policy and observe (s, a, r, s′)
2. Δw = α(r + γ max_{a′} q(s′, a′, w) − q(s, a, w)) ∇_w q(s, a, w)

- Sequential states are strongly correlated!
- The target value is always changing!
Deep Q-Network

❖ DQN uses experience replay and a target network
• Take an ϵ-greedy action w.r.t. Q(s, a; w)
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Every N time-steps, set w⁻ = w

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
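A compact sketch of this recipe in PyTorch, putting the two stabilizers together (replay buffer, periodically synced target network). It assumes a gym-style env with a flat observation vector and discrete actions; the QNet and train_dqn names and all hyperparameters are illustrative, not the paper's implementation:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

def train_dqn(env, steps=50000, gamma=0.99, eps=0.1, batch=64, sync_every=1000):
    q = QNet(env.observation_space.shape[0], env.action_space.n)
    q_target = QNet(env.observation_space.shape[0], env.action_space.n)
    q_target.load_state_dict(q.state_dict())          # w^- = w
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    buffer = deque(maxlen=100000)                     # replay memory D
    s = env.reset()
    for t in range(steps):
        # epsilon-greedy behavior policy w.r.t. Q(s, a; w)
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = q(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s2, r, done, _ = env.step(a)
        buffer.append((s, a, r, s2, float(done)))     # store transition
        s = env.reset() if done else s2
        if len(buffer) < batch:
            continue
        # sample a random mini-batch; form the target with the frozen network
        S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*random.sample(buffer, batch)))
        with torch.no_grad():
            y = R + gamma * (1 - D) * q_target(S2).max(dim=1).values
        q_sa = q(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)
        opt.zero_grad(); loss.backward(); opt.step()
        if t % sync_every == 0:
            q_target.load_state_dict(q.state_dict())  # periodic target sync
    return q
```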
DQN in Atari

❖ End-to-end learning of Q(s, a) from pixels
❖ Input state s is a stack of raw pixels from the last 4 frames
❖ Output is Q(s, a) for 18 joystick/button positions
❖ Reward is the change in score for that step

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
DQN Results in Atari

[Figure: DQN performance relative to human level across Atari 2600 games.]

Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
How much does DQN help?

[Table: Atari scores for Q-learning with and without experience replay and the target network; the combination of both works best.]
How accurate are the Q-values?

❖ Overestimation
• Target value: r + γ max_{a′} Q(s′, a′; w⁻)
• The max is the problem: 𝔼[max(X_1, X_2)] ≥ max(𝔼[X_1], 𝔼[X_2])
• For bootstrapping (learning estimates from estimates), such overestimation can be problematic

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
Double DQN

❖ In DQN, max_{a′} Q(s′, a′; w⁻) = Q(s′, argmax_{a′} Q(s′, a′; w⁻); w⁻)
• Action selected according to Q(·, ·; w⁻)
• Value also from Q(·, ·; w⁻)
❖ Use different networks to choose the action and evaluate its value
• Current network to choose the action
• Target network to evaluate its value
  r + γQ(s′, argmax_{a′} Q(s′, a′; w); w⁻)

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
Double DQN (cont'd)

❖ Double DQN uses experience replay and target networks
• Take action a_t according to the ϵ-greedy policy
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γQ(s′, argmax_{a′} Q(s′, a′; w); w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Soft target update: w⁻ = (1 − τ)w + τw⁻

van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
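Relative to DQN, the change is one line in the target computation. A sketch, assuming q and q_target are the QNet modules and R, S2, D the batch tensors from the earlier DQN sketch:

```python
import torch

def double_dqn_target(q, q_target, R, S2, D, gamma=0.99):
    with torch.no_grad():
        best_a = q(S2).argmax(dim=1, keepdim=True)         # select a' with the current network
        q_val = q_target(S2).gather(1, best_a).squeeze(1)  # evaluate it with the target network
        return R + gamma * (1 - D) * q_val
```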
Dueling DQN

❖ Motivation: it is unnecessary to know the exact value of each action at every time-step
❖ Dueling DQN learns the state value without having to learn the effect of each action for each state
  V^π(s),  A^π(s, a) = Q^π(s, a) − V^π(s)
❖ The naive decomposition is "unidentifiable":
  Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
❖ Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a′∈𝒜} A(s, a′; θ, α))
❖ Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|𝒜|) ∑_{a′∈𝒜} A(s, a′; θ, α))

Wang et al., Dueling network architectures for deep reinforcement learning, ICML 2016.
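A minimal dueling head sketch in PyTorch using the mean-subtracted aggregation from the last equation; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())  # shared theta
        self.value = nn.Linear(128, 1)         # V(s; theta, beta)
        self.adv = nn.Linear(128, n_actions)   # A(s, a; theta, alpha)

    def forward(self, x):
        h = self.trunk(x)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=-1, keepdim=True)  # identifiable aggregation
```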
n-Step DQN

❖ n-step target: R_{t+1} + γR_{t+2} + ⋯ + γ^{n−1}R_{t+n} + γ^n max_a Q(S_{t+n}, a; θ)
• Less biased target values when Q-values are inaccurate
• Faster learning, especially early on
• But only actually correct when learning on-policy
❖ How to deal with it?
• Ignore the problem: often works in practice
• Dynamically choose n to get only on-policy data: works well when data is mostly on-policy and the action space is small
• Importance sampling
Distributional DQN

❖ Standard RL models the expected return: Q^π(s, a) = 𝔼[G_t]
  Q*(s, a) = r + γ max_{a′} Q*(s′, a′)
❖ Distributional RL models the distribution of the return Z, via the distributional Bellman equation
  Z(s, a) = r + γZ(s′, a′)

Bellemare et al., A distributional perspective on reinforcement learning, ICML 2017.
Prioritized Experience Replay

❖ Is uniform sampling from the replay buffer an efficient way to learn? Does each sample contribute to the learning equally?
❖ The higher the TD error, the greater the degree of update there is to the neural network weights
❖ Sample experiences according to TD error
• P(i) = p_i^α / ∑_k p_k^α, with (1) p_i = 1/rank(δ_i); (2) p_i = |δ_i| + ϵ
• Biased sampling is corrected with importance-sampling weights, annealed via β
  w_i = ((1/N) · (1/P(i)))^β
• Better than DQN in 41 of 49 Atari 2600 games

Schaul et al., Prioritized experience replay, arXiv:1511.05952, 2015.
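A toy sketch of proportional prioritized sampling with the correction weights above; a real implementation would use a sum-tree so sampling is O(log N):

```python
import numpy as np

def sample_prioritized(td_errors, batch, alpha=0.6, beta=0.4, eps=1e-5,
                       rng=np.random.default_rng()):
    p = (np.abs(td_errors) + eps) ** alpha   # priority p_i = |delta_i| + eps
    P = p / p.sum()                          # sampling distribution P(i)
    idx = rng.choice(len(P), size=batch, p=P)
    w = (len(P) * P[idx]) ** (-beta)         # w_i = (1/N * 1/P(i))^beta
    return idx, w / w.max()                  # normalize weights for stability
```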
Rainbow

[Figure: Rainbow combines DQN with double Q-learning, prioritized replay, dueling networks, multi-step learning, distributional RL, and noisy nets, outperforming each individual variant on Atari.]

Hessel et al., Rainbow: Combining improvements in deep reinforcement learning, AAAI 2018.
Can DQN Work with Continuous Actions?

❖ DQN uses experience replay and a target network
• Take action a_t = argmax_a Q(s, a; w) or act randomly
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Optimize MSE between the Q-network and the Q-learning targets
  ℒ(w) = 𝔼_{s,a,r,s′∼𝒟}[(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; w))²]
• Using a variant of SGD
• Every N time-steps, set w⁻ = w
❖ With continuous actions, how do we perform the max?
DDPG

❖ Learn an approximate maximizer
• max_a Q(s, a; ϕ) = Q(s, argmax_a Q(s, a); ϕ)
• Train another network μ(s; θ) such that μ(s; θ) ≈ argmax_a Q(s, a; ϕ)
• Solve θ = argmax_θ Q(s, μ(s; θ); ϕ)
• By the chain rule: dQ_ϕ/dθ = (dQ_ϕ/da)(da/dθ)
• New target: r + γQ(s′, μ(s′; θ⁻); ϕ⁻)

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
DDPG (cont'd)

❖ DDPG uses experience replay and target networks
• Take action a_t = μ(s_t; θ) + 𝒩_t (exploration noise)
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory 𝒟
• Sample a random mini-batch of transitions (s, a, r, s′) from 𝒟
• Compute the TD error δ = r + γQ(s′, μ(s′; θ⁻); ϕ⁻) − Q(s, a; ϕ)
• Critic update: Δϕ ← αδ dQ(s, a; ϕ)/dϕ
• Actor update: Δθ ← β (dQ(s, a; ϕ)/da)(dμ(s; θ)/dθ)
• Soft-update the target parameters ϕ⁻ and θ⁻

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
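A compact sketch of one DDPG update step in PyTorch, assuming the actor mu, a critic Q whose forward takes (state, action), their target copies, optimizers, and a sampled batch S, A, R, S2, D already exist; names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def ddpg_update(mu, Q, mu_target, Q_target, opt_actor, opt_critic,
                S, A, R, S2, D, gamma=0.99, tau=0.005):
    # critic: regress Q(s, a) onto the frozen target r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = R + gamma * (1 - D) * Q_target(S2, mu_target(S2)).squeeze(1)
    critic_loss = F.mse_loss(Q(S, A).squeeze(1), y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # actor: maximize Q(s, mu(s)), i.e., follow dQ/da * dmu/dtheta
    actor_loss = -Q(S, mu(S)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # soft update of the target networks
    for net, tgt in ((mu, mu_target), (Q, Q_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```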
TD3: Twin Delayed DDPG

❖ Clipped double Q-learning against overestimation
• Two Q-functions use a single target
• TD target: y = r + γ min_{i=1,2} Q_{θ′_i}(s′, μ_{ϕ′}(s′))
❖ Delayed policy updates
• Value and policy updates are deeply coupled
• The policy updates less frequently than Q (stable target)
❖ Target policy smoothing
• A deterministic policy can overfit to narrow peaks developed by the Q-function approximator
• a_clip(s′) = μ_{ϕ′}(s′) + clip(𝒩(0, σ), −c, +c)
  y = r + γ min_{i=1,2} Q_{θ′_i}(s′, a_clip(s′))

Fujimoto et al., Addressing function approximation error in actor-critic methods, ICML 2018.
Practical Tips for Deep Q-Networks

❖ DQN takes care to stabilize
• Test on easy tasks first; make sure your implementation is correct
❖ Large replay buffers help improve stability
❖ It takes time; be patient, it might be no better than random for a while
❖ Start with high exploration (ϵ) and gradually reduce it; do the same for the learning rate
❖ Double DQN helps a lot in practice; it is simple and has no downsides
❖ Run multiple random seeds; results are very inconsistent between runs
Convergence of Control Algorithms

Algorithm    | Table Lookup | Linear | Non-linear
MC Control   | ✓            | (✓)    | ✗
SARSA        | ✓            | (✓)    | ✗
Q-learning   | ✓            | ✗      | ✗

(✓) = chatters around near-optimal value function
References

• Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 2018 (some slides are borrowed from it)
• David Silver, Reinforcement Learning Course, UCL (some slides are borrowed from it)
• Sergey Levine, Deep Reinforcement Learning Course, UC Berkeley (some slides are borrowed from it)
• Mnih et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
• van Hasselt et al., Deep reinforcement learning with double Q-learning, AAAI 2016.
• Wang et al., Dueling network architectures for deep reinforcement learning, ICML 2016.
• Bellemare et al., A distributional perspective on reinforcement learning, ICML 2017.
• Schaul et al., Prioritized experience replay, arXiv:1511.05952, 2015.
• Hessel et al., Rainbow: Combining improvements in deep reinforcement learning, AAAI 2018.
• Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016.
• Fujimoto et al., Addressing function approximation error in actor-critic methods, ICML 2018.