Reinforcement Learning
Lecture 1: Markov Decision Processes
Emilie Kaufmann (CRIStAL)
Ecole Centrale de Lille, 2019/2020
Outline of the class
• Lecture 1. Markov Decision Processes (MDP), a formalization for reinforcement learning problem(s)
• Lecture 2. One state, several actions: solving multi-armed bandits. UCB algorithms, Thompson Sampling
• Lecture 3. Solving an MDP with known parameters. Dynamic Programming, Value/Policy Iteration
• Lecture 4. First Reinforcement Learning algorithms. TD Learning, Q-Learning
• Lecture 5. Approximate Dynamic Programming
• Lecture 6. Deep Reinforcement Learning (O. Pietquin)
• Lecture 7. Policy Gradient Methods (O. Pietquin)
• Lecture 8. Bandit tools for RL. Bandit-based exploration, Monte-Carlo Tree Search methods
1 Markov Decision Processes
2 Examples
3 Objectives: Policies and Values
4 Trying Policies
5 What is Reinforcement Learning?
Markov Decision Process
A Markov Decision Process (MDP) models a situation in which repeated decisions (= choices of actions) are made. The MDP provides models for the consequences of each decision:
• in terms of reward
• in terms of the evolution of the system's state

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all previous observations),
• gets a reward r_t ∈ ℝ depending on its choice,
• transits to a new state s_{t+1} depending on its choice.
Markov Decision Process

An MDP is parameterized by a tuple (S, A, R, P) where
• S is the state space
• A is the action space (sometimes an action set A_s for each s ∈ S)
• R = (ν(s, a))_{(s,a) ∈ S×A}, where ν(s, a) ∈ Δ(ℝ) is the reward distribution of the state-action pair (s, a)
• P = (p(·|s, a))_{(s,a) ∈ S×A}, where p(·|s, a) ∈ Δ(S) is the transition kernel associated to the state-action pair (s, a)

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all previous observations),
• gets a reward r_t ∼ ν(s_t, a_t),
• transits to a new state s_{t+1} ∼ p(·|s_t, a_t).

[Bellman 1957, Howard 1960, Blackwell 70s...]

Goal (made more precise later): select actions so as to maximize some notion of expected cumulated reward.

Mean reward of action a in state s: r(s, a) = E_{R∼ν(s,a)}[R]

• The tabular case: finite state and action spaces,
S = {1, ..., S}, A = {1, ..., A}.
For every s, s′ ∈ S and a ∈ A, p(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a).
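In the tabular case, the whole tuple (S, A, R, P) fits into two arrays. Below is a minimal sketch of sampling transitions from such an MDP (assuming numpy; the array names, the toy sizes and the Bernoulli reward model are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 3, 2  # number of states and actions (toy sizes)

# P[s, a, s'] = p(s' | s, a): transition kernel, each P[s, a] is a distribution over S
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

# r(s, a): mean reward of each state-action pair, here taken in [0, 1]
R_mean = rng.random((S, A))

def step(s, a):
    """Sample one MDP transition: r_t ~ nu(s, a), s_{t+1} ~ p(.|s, a).
    Rewards are drawn as Bernoulli(r(s, a)) for illustration."""
    r = float(rng.random() < R_mean[s, a])
    s_next = rng.choice(S, p=P[s, a])
    return r, s_next

r, s_next = step(0, 1)
```

Everything a tabular RL algorithm later needs (r(s, a) and p(·|s, a)) is read off these two arrays.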
Markovian Dynamics
• Reminder: Markov chain

Definition
A Markov chain on a discrete space X is a stochastic process (X_t)_{t∈ℕ} that satisfies the Markov property:
P(X_t = x_t | X_{t−1} = x_{t−1}, ..., X_0 = x_0) = P(X_t = x_t | X_{t−1} = x_{t−1})
for all t ∈ ℕ and (x_0, ..., x_t) ∈ X^{t+1}. It is homogeneous if
P(X_t = y | X_{t−1} = x) = P(X_{t−1} = y | X_{t−2} = x).

• A homogeneous Markov chain is characterized by its transition probabilities p(y|x) = P(X_t = y | X_{t−1} = x) and its initial state.
• If X is continuous, this definition can be extended by means of a transition kernel such that p(·|x) ∈ Δ(X) for all x ∈ X.
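A short simulation sketch of a homogeneous chain on a finite space (assuming numpy; the transition matrix values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# p[x, y] = P(X_t = y | X_{t-1} = x): each row is a probability distribution
p = np.array([[0.9, 0.1, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])

def simulate(x0, T):
    """Sample a trajectory (X_0, ..., X_T) of the homogeneous Markov chain."""
    traj = [x0]
    for _ in range(T):
        traj.append(int(rng.choice(len(p), p=p[traj[-1]])))
    return traj

traj = simulate(0, 10)
```

The chain is fully determined by the matrix `p` and the initial state, exactly as in the reminder above.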
[Figure – An example of a 4-state Markov chain on the states {play, sleep, cry, eat}, with transition probabilities labeling the arrows]
Markovian Dynamics
• Back to Markov Decision Processes

In an MDP, the sequence of successive states / actions / rewards
s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t
satisfies an extension of the Markov property:
P(s_t = s, r_{t−1} = r | s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}) = P(s_t = s, r_{t−1} = r | s_{t−1}, a_{t−1})
(discrete action and reward)
Illustration of an MDP

[Figure: a 3-state, 2-action MDP diagram] (credit: Ronan Fruit)

• S = {s_0, s_1, s_2}
• A = {a_0, a_1}
• the mean rewards in state s_1: r(s_1, a_0) = 0 and r(s_1, a_1) = r_max
• the transition probabilities when performing action a_1 in state s_0 are p(s_1|s_0, a_1) = 0.1 and p(s_2|s_0, a_1) = 0.9
2 Examples
The Centrale Student Dilemma
credit: Remi Munos, Alessandro Lazaric
Tetris
• State: current board and next blocks to add
• Action: orientation + position of the dropped block
• Reward: increment in the score / number of lines
• Transition: new board + randomness in the new block
⇒ difficulty: large state space!
The RiverSwim MDP
Two actions available in each state, drawn as → (solid) and ⇢ (dashed):

[Figure: the RiverSwim MDP with states s_1, ..., s_N. The dashed action always moves one state to the left; the solid action moves right only with some probability and may stay or slip back. Reward r = 0.05 in s_1 and r = 1 in s_N.] (credit: Sadegh Talebi)

⇒ difficulty: delayed, sparse reward
Grid worlds
• State: position of the robot
• Actions: ←, ↑, →, ↓
• Transitions: (quasi) deterministic
• Rewards: depend on the behavior to incentivise (positive or negative rewards on some states / −1 for each step before a goal...)
Retail Store Management (1/2)
You own a bike store. During week t, the (random) demand is D_t units. On Monday morning you may choose to order a_t additional units: they are delivered immediately, before the shop opens.

For each week:
• Maintenance cost: h per unit left in your stock
• Ordering cost: c per unit ordered + fixed cost c_0 if an order is placed
• Sales profit: p per unit sold

Constraints:
• your warehouse has a maximal capacity of M bikes (any additional bike gets stolen)
• you cannot sell bikes that you don't have in stock
Retail Store Management (2/2)
• State: number of bikes in stock on Sunday. State space: S = {0, ..., M}
• Action: number of bikes ordered at the beginning of the week. Action space: A = {0, ..., M}
• Reward = balance of the week: if you order a_t bikes,
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M)
• Transition: you end the week with s_{t+1} = max(0, min(M, s_t + a_t) − D_t) bikes

⇒ Markov Decision Process: what are r(s, a) and p(·|s, a)?
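The question "r(s, a)? p(·|s, a)?" has a closed-form answer once the demand distribution is fixed. The sketch below assumes a Poisson demand (an illustrative choice, not stated in the slides) and computes both quantities by summing over demand values (assuming numpy and scipy; all constants are made up):

```python
import numpy as np
from scipy.stats import poisson

M, lam = 10, 4.0                  # warehouse capacity, mean weekly demand (assumed Poisson)
c0, c, h, p_profit = 2.0, 1.0, 0.1, 3.0

def transition_and_reward(s, a):
    """Return (p(.|s, a), r(s, a)) for the bike store, marginalizing over the demand D_t."""
    stock = min(M, s + a)                       # bikes available once the order is delivered
    probs = np.zeros(M + 1)
    r = -c0 * (a > 0) - c * a - h * s
    for d in range(stock + 1):                  # demand d = 0 .. stock-1, plus the tail d >= stock
        pd = poisson.pmf(d, lam) if d < stock else poisson.sf(stock - 1, lam)
        probs[max(0, stock - d)] += pd          # s_{t+1} = max(0, stock - D_t)
        r += pd * p_profit * min(d, stock)      # expected sales profit
    return probs, r
```

With these in hand, the store is a fully specified tabular MDP on S = {0, ..., M}.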
3 Objectives: Policies and Values
RL objective (informal)
Learn / Act according to a
Good Policy
in a potentially unknown MDP
Policies

Definition
A (Markovian) policy is a sequence π = (π_t)_{t∈ℕ} of mappings
π_t : S → Δ(A),
where Δ(A) is the set of probability distributions over the action space.

⇒ An agent acting under policy π selects at round t the action a_t ∼ π_t(s_t)

• Remark: one could also consider history-dependent policies π_t : H_t → Δ(A), where the next action is chosen based on
h_t = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t)
A policy may be
• Deterministic: π_t : S → A, or Stochastic: π_t : S → Δ(A)
• Stationary: π = (π, π, π, ...), or Non-stationary: π = (π_0, π_1, π_2, ...)

• Terminology: policy = strategy = decision rule = control
Policies

Under a stationary (deterministic) policy π : S → A, the random process (s_t)_{t∈ℕ} is a Markov chain, with transition probabilities
P_π(s_{t+1} = s′ | s_t = s) = P(s_{t+1} = s′ | s_t = s, a_t = π(s)) = p(s′|s, π(s))
(can be extended to stochastic policies and continuous spaces)

⇒ An MDP is sometimes referred to as a controlled Markov chain
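This can be checked numerically: a stationary deterministic policy just selects, for each state s, the row p(·|s, π(s)) of the transition kernel. A sketch (assuming numpy; the kernel here is random toy data):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A = 4, 3

# p(s' | s, a): random toy transition kernel
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

# a stationary deterministic policy pi: S -> A
pi = rng.integers(A, size=S)

# Transition matrix of the induced Markov chain (s_t): P_pi[s, s'] = p(s' | s, pi(s))
P_pi = P[np.arange(S), pi]
```

Each row of `P_pi` is a probability distribution over next states, so (s_t) is indeed a Markov chain.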
What is a good policy?

It is a policy that yields a large value in each state, where the value is always some notion of cumulative reward.

① Finite horizon. Given a known horizon T ∈ ℕ*,

V^π(s) = E_π[ Σ_{t=0}^{T−1} r_t + r_T | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? In the presence of a natural notion of duration of an episode (e.g. maximal number of steps in a game)
② Infinite time horizon with a discount parameter. Given a known discount parameter γ ∈ (0, 1),

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? To put more weight on short-term reward / when there is a natural notion of discount
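For a fixed policy, the discounted value can be estimated by averaging truncated rollouts; since γ^t decays geometrically, truncating at horizon H biases the estimate by at most γ^H/(1−γ) times the maximal reward. A sketch on a toy tabular MDP, compared against the exact value V^π = (I − γ P_π)^{−1} r_π (assuming numpy; all quantities are illustrative, and rewards are taken deterministic for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.9

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # p(s' | s, a)
R = rng.random((S, A))              # deterministic rewards r(s, a) in [0, 1]
pi = rng.integers(A, size=S)        # stationary deterministic policy

def mc_value(s0, n_rollouts=300, H=100):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_t | s_0 = s0]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(H):
            a = pi[s]
            ret += disc * R[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_rollouts

# Exact value for comparison: V^pi = (I - gamma P_pi)^{-1} r_pi
P_pi = P[np.arange(S), pi]
r_pi = R[np.arange(S), pi]
V_exact = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

The linear-system formula anticipates Lecture 3; the Monte Carlo estimate only needs the ability to simulate transitions.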
③ Infinite time horizon with a terminal state. Let τ be the random time at which a terminal state is first reached.

V^π(s) = E_π[ Σ_{t=0}^{τ} r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? For tasks that have a natural notion of terminal state (e.g. achieve some goal)
④ Infinite time horizon with average reward.

V^π(s) = lim_{T→∞} E_π[ (1/T) Σ_{t=0}^{T−1} r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? The system should be controlled for a very long time
⇒ slightly harder to work with (not mentioned much in this class)

cf. [Puterman, Markov Decision Processes, 1994]
Optimal policy

Given a value function (①, ②, or ④), one can define the following.

Definition. The optimal value in state s is given by V*(s) = max_π V^π(s).

Definition. An optimal policy π* satisfies π* ∈ argmax_π V^π, that is,
∀s ∈ S, π* ∈ argmax_π V^π(s), or equivalently V* = V^{π*}.
Properties:
• there exists an optimal policy! (i.e. a policy maximizing the value in all states)
• there exists an optimal policy that is deterministic
• ... and even stationary (except with a finite horizon ①)
4 Trying Policies
Example: Cart-Pole

Task: maintain the pole as long as possible in a quasi-vertical position, by applying some force on the cart towards the left or right.

Introductory notebook
Back to Retail Store Management
• State: number of bikes in stock on Sunday. State space: S = {0, ..., M}
• Action: number of bikes ordered at the beginning of the week. Action space: A = {0, ..., M}
• Reward = balance of the week: if you order a_t bikes,
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M)
• Transition: you end the week with s_{t+1} = max(0, min(M, s_t + a_t) − D_t) bikes

Goal: from an initial stock s, maximize the sum of discounted rewards

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
Possible policies
• Uniform policy (on reasonable orders): π(s) ∼ U({0, ..., M − s})
• Constant policy: always order m_0 bikes (capped by the remaining capacity):
π(s) = min(m_0, M − s)
• Threshold policy: whenever there are at most m_1 bikes in stock, refill up to m_2 bikes; otherwise, do not order:
π(s) = 1(s ≤ m_1)(m_2 − s)
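The three policies can be written down directly and plugged into a simulation of the stock dynamics s_{t+1} = max(0, min(M, s_t + a_t) − D_t). A sketch (assuming numpy and a Poisson demand, which is my modeling choice, not stated in the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
M, lam = 10, 4.0                     # capacity, mean weekly demand (assumed Poisson)
m0, m1, m2 = 3, 4, 10

def uniform_policy(s):               # uniform over reasonable orders {0, ..., M - s}
    return int(rng.integers(0, M - s + 1))

def constant_policy(s):              # always order m0 bikes, capped by capacity
    return min(m0, M - s)

def threshold_policy(s):             # refill up to m2 when stock drops to m1 or below
    return (m2 - s) if s <= m1 else 0

def simulate(policy, s0=0, weeks=100):
    """Return the stock trajectory (s_0, ..., s_weeks) under the given policy."""
    traj = [s0]
    for _ in range(weeks):
        s = traj[-1]
        a = policy(s)
        d = rng.poisson(lam)
        traj.append(max(0, min(M, s + a) - d))
    return traj

traj = simulate(threshold_policy)
```

Plotting `traj` reproduces the sawtooth behavior seen in the notebook for the threshold policy.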
Simulations
Notebook
Figure – Evolution of the stock s_t over 100 weeks under a threshold policy (m_1 = 4, m_2 = 10)
5 What is Reinforcement Learning?
Questions

In a known Markov Decision Process,
• can we compute an optimal policy? (based on the explicit knowledge of r(s, a) and p(·|s, a))
• ... even with very large (or infinite) state and/or action spaces? (e.g. based on a simulator for transitions)

Beyond:
• Can we learn a good policy in an unknown MDP, only by selecting actions and performing transitions?
• ... and can we do it while maximizing reward?

Broad goal of Reinforcement Learning
Learning an optimal policy in an unknown (or very large) MDP, by acting (= choosing actions) and observing transitions.
Reinforcement Learning

During learning, we increment a database of observed transitions
D_t = {(s_0, a_0, r_0, s_1), (s_1, a_1, r_1, s_2), ..., (s_{t−1}, a_{t−1}, r_{t−1}, s_t)},
which is used to
⇒ select the next action to perform, a_t:
r_t ∼ ν(s_t, a_t), s_{t+1} ∼ p(·|s_t, a_t), D_{t+1} = D_t ∪ {(s_t, a_t, r_t, s_{t+1})}
⇒ possibly output a guess π_t for π*

Possible goal: Policy Estimation
Make sure that the learnt policy π_t is eventually a good policy
• sample complexity result: if t ≥ ..., then |V^{π_t}(s) − V*(s)| ≤ ε.

Possible goal: Rewards Maximization
Maximize the rewards accumulated during learning: the value of the learning policy should be close to the optimal value,
E[ Σ_{s=0}^{t} γ^s r_s ]

⇒ Exploration/Exploitation trade-off
Let’s get started
... with one-state MDPs, a.k.a. multi-armed bandits
(next class)