Reinforcement Learning
Lecture 1: Markov Decision Processes
Emilie Kaufmann (CRIStAL)
Ecole Centrale de Lille, 2019/2020
Outline of the class
• Lecture 1. Markov Decision Processes (MDP), a formalization for reinforcement learning problem(s)
• Lecture 2. One state, several actions: solving multi-armed bandits. UCB algorithms, Thompson Sampling
• Lecture 3. Solving an MDP with known parameters. Dynamic Programming, Value/Policy Iteration
• Lecture 4. First Reinforcement Learning algorithms. TD Learning, Q-Learning
• Lecture 5. Approximate Dynamic Programming
• Lecture 6. Deep Reinforcement Learning (O. Pietquin)
• Lecture 7. Policy Gradient Methods (O. Pietquin)
• Lecture 8. Bandit tools for RL. Bandit-based exploration, Monte-Carlo Tree Search methods
1 Markov Decision Processes
2 Examples
3 Objectives: Policies and Values
4 Trying Policies
5 What is Reinforcement Learning?
Markov Decision Process
A Markov Decision Process (MDP) models a situation in which repeated decisions (= choices of actions) are made. The MDP provides models for the consequences of each decision:
• in terms of reward
• in terms of the evolution of the system's state

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all previous observations),
• gets a reward r_t ∈ ℝ depending on its choice,
• transits to a new state s_{t+1} depending on its choice.
Markov Decision Process

An MDP is parameterized by a tuple (S, A, R, P) where
• S is the state space
• A is the action space (sometimes an action set A_s for each s ∈ S)
• R = (ν(s, a))_{(s,a) ∈ S×A}, where ν(s, a) ∈ Δ(ℝ) is the reward distribution of the state-action pair (s, a)
• P = (p(·|s, a))_{(s,a) ∈ S×A}, where p(·|s, a) ∈ Δ(S) is the transition kernel associated to the state-action pair (s, a)

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all previous observations),
• gets a reward r_t ∼ ν(s_t, a_t),
• transits to a new state s_{t+1} ∼ p(·|s_t, a_t).

[Bellman 1957, Howard 1960, Blackwell 70s...]

Goal (made more precise later): select actions so as to maximize some notion of expected cumulated reward.

Mean reward of action a in state s: r(s, a) = E_{R∼ν(s,a)}[R]

• The tabular case: finite state and action spaces,
S = {1, ..., S}, A = {1, ..., A}.
For every s, s′ ∈ S and a ∈ A, p(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a).
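In the tabular case, the whole tuple (S, A, R, P) fits into two arrays. Below is a minimal sketch of sampling transitions from such an MDP (assuming numpy; the array names, the toy sizes and the Bernoulli reward model are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 3, 2  # number of states and actions (toy sizes)

# P[s, a, s'] = p(s' | s, a): transition kernel, each P[s, a] is a distribution over S
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

# r(s, a): mean reward of each state-action pair, here taken in [0, 1]
R_mean = rng.random((S, A))

def step(s, a):
    """Sample one MDP transition: r_t ~ nu(s, a), s_{t+1} ~ p(.|s, a).
    Rewards are drawn as Bernoulli(r(s, a)) for illustration."""
    r = float(rng.random() < R_mean[s, a])
    s_next = rng.choice(S, p=P[s, a])
    return r, s_next

r, s_next = step(0, 1)
```

Everything a tabular RL algorithm later needs (r(s, a) and p(·|s, a)) is read off these two arrays.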
Markovian Dynamics
• Reminder: Markov chain

Definition
A Markov chain on a discrete space X is a stochastic process (X_t)_{t∈ℕ} that satisfies the Markov property:
P(X_t = x_t | X_{t−1} = x_{t−1}, ..., X_0 = x_0) = P(X_t = x_t | X_{t−1} = x_{t−1})
for all t ∈ ℕ and (x_0, ..., x_t) ∈ X^{t+1}. It is homogeneous if
P(X_t = y | X_{t−1} = x) = P(X_{t−1} = y | X_{t−2} = x).

• A homogeneous Markov chain is characterized by its transition probabilities p(y|x) = P(X_t = y | X_{t−1} = x) and its initial state.
• If X is continuous, this definition can be extended by means of a transition kernel such that p(·|x) ∈ Δ(X) for all x ∈ X.
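A short simulation sketch of a homogeneous chain on a finite space (assuming numpy; the transition matrix values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# p[x, y] = P(X_t = y | X_{t-1} = x): each row is a probability distribution
p = np.array([[0.9, 0.1, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])

def simulate(x0, T):
    """Sample a trajectory (X_0, ..., X_T) of the homogeneous Markov chain."""
    traj = [x0]
    for _ in range(T):
        traj.append(int(rng.choice(len(p), p=p[traj[-1]])))
    return traj

traj = simulate(0, 10)
```

The chain is fully determined by the matrix `p` and the initial state, exactly as in the reminder above.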
[Figure – An example of a 4-state Markov chain on the states {play, sleep, cry, eat}, with transition probabilities labeling the arrows]
Markovian Dynamics
• Back to Markov Decision Processes

In an MDP, the sequence of successive states / actions / rewards
s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t
satisfies an extension of the Markov property:
P(s_t = s, r_{t−1} = r | s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}) = P(s_t = s, r_{t−1} = r | s_{t−1}, a_{t−1})
(discrete action and reward)
Illustration of an MDP

[Figure: a 3-state, 2-action MDP diagram] (credit: Ronan Fruit)

• S = {s_0, s_1, s_2}
• A = {a_0, a_1}
• the mean rewards in state s_1: r(s_1, a_0) = 0 and r(s_1, a_1) = r_max
• the transition probabilities when performing action a_1 in state s_0 are p(s_1|s_0, a_1) = 0.1 and p(s_2|s_0, a_1) = 0.9
2 Examples
The Centrale Student Dilemma
credit: Remi Munos, Alessandro Lazaric
Tetris
• State: current board and next blocks to add
• Action: orientation + position of the dropped block
• Reward: increment in the score / number of lines
• Transition: new board + randomness in the new block
⇒ difficulty: large state space!
The RiverSwim MDP
Two actions available in each state, drawn as → (solid) and ⇢ (dashed):

[Figure: the RiverSwim MDP with states s_1, ..., s_N. The dashed action always moves one state to the left; the solid action moves right only with some probability and may stay or slip back. Reward r = 0.05 in s_1 and r = 1 in s_N.] (credit: Sadegh Talebi)

⇒ difficulty: delayed, sparse reward
Grid worlds
• State: position of the robot
• Actions: ←, ↑, →, ↓
• Transitions: (quasi) deterministic
• Rewards: depend on the behavior to incentivise (positive or negative rewards on some states / −1 for each step before a goal...)
Retail Store Management (1/2)
You own a bike store. During week t, the (random) demand is D_t units. On Monday morning you may choose to order a_t additional units: they are delivered immediately, before the shop opens.

For each week:
• Maintenance cost: h per unit left in your stock
• Ordering cost: c per unit ordered + fixed cost c_0 if an order is placed
• Sales profit: p per unit sold

Constraints:
• your warehouse has a maximal capacity of M bikes (any additional bike gets stolen)
• you cannot sell bikes that you don't have in stock
Retail Store Management (2/2)
• State: number of bikes in stock on Sunday. State space: S = {0, ..., M}
• Action: number of bikes ordered at the beginning of the week. Action space: A = {0, ..., M}
• Reward = balance of the week: if you order a_t bikes,
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M)
• Transition: you end the week with s_{t+1} = max(0, min(M, s_t + a_t) − D_t) bikes

⇒ Markov Decision Process: what are r(s, a) and p(·|s, a)?
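The question "r(s, a)? p(·|s, a)?" has a closed-form answer once the demand distribution is fixed. The sketch below assumes a Poisson demand (an illustrative choice, not stated in the slides) and computes both quantities by summing over demand values (assuming numpy and scipy; all constants are made up):

```python
import numpy as np
from scipy.stats import poisson

M, lam = 10, 4.0                  # warehouse capacity, mean weekly demand (assumed Poisson)
c0, c, h, p_profit = 2.0, 1.0, 0.1, 3.0

def transition_and_reward(s, a):
    """Return (p(.|s, a), r(s, a)) for the bike store, marginalizing over the demand D_t."""
    stock = min(M, s + a)                       # bikes available once the order is delivered
    probs = np.zeros(M + 1)
    r = -c0 * (a > 0) - c * a - h * s
    for d in range(stock + 1):                  # demand d = 0 .. stock-1, plus the tail d >= stock
        pd = poisson.pmf(d, lam) if d < stock else poisson.sf(stock - 1, lam)
        probs[max(0, stock - d)] += pd          # s_{t+1} = max(0, stock - D_t)
        r += pd * p_profit * min(d, stock)      # expected sales profit
    return probs, r
```

With these in hand, the store is a fully specified tabular MDP on S = {0, ..., M}.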
3 Objectives: Policies and Values
RL objective (informal)
Learn / Act according to a
Good Policy
in a potentially unknown MDP
Policies

Definition
A (Markovian) policy is a sequence π = (π_t)_{t∈ℕ} of mappings
π_t : S → Δ(A),
where Δ(A) is the set of probability distributions over the action space.

⇒ An agent acting under policy π selects at round t the action a_t ∼ π_t(s_t)

• Remark: one could also consider history-dependent policies π_t : H_t → Δ(A), where the next action is chosen based on
h_t = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t)
A policy may be
• Deterministic: π_t : S → A, or Stochastic: π_t : S → Δ(A)
• Stationary: π = (π, π, π, ...), or Non-stationary: π = (π_0, π_1, π_2, ...)

• Terminology: policy = strategy = decision rule = control
Policies

Under a stationary (deterministic) policy π : S → A, the random process (s_t)_{t∈ℕ} is a Markov chain, with transition probabilities
P_π(s_{t+1} = s′ | s_t = s) = P(s_{t+1} = s′ | s_t = s, a_t = π(s)) = p(s′|s, π(s))
(can be extended to stochastic policies and continuous spaces)

⇒ An MDP is sometimes referred to as a controlled Markov chain
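This can be checked numerically: a stationary deterministic policy just selects, for each state s, the row p(·|s, π(s)) of the transition kernel. A sketch (assuming numpy; the kernel here is random toy data):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A = 4, 3

# p(s' | s, a): random toy transition kernel
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

# a stationary deterministic policy pi: S -> A
pi = rng.integers(A, size=S)

# Transition matrix of the induced Markov chain (s_t): P_pi[s, s'] = p(s' | s, pi(s))
P_pi = P[np.arange(S), pi]
```

Each row of `P_pi` is a probability distribution over next states, so (s_t) is indeed a Markov chain.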
What is a good policy?

It is a policy that yields a large value in each state, where the value is always some notion of cumulative reward.

① Finite horizon. Given a known horizon T ∈ ℕ*,

V^π(s) = E_π[ Σ_{t=0}^{T−1} r_t + r_T | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? In the presence of a natural notion of duration of an episode (e.g. maximal number of steps in a game)
② Infinite time horizon with a discount parameter. Given a known discount parameter γ ∈ (0, 1),

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? To put more weight on short-term reward / when there is a natural notion of discount
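For a fixed policy, the discounted value can be estimated by averaging truncated rollouts; since γ^t decays geometrically, truncating at horizon H biases the estimate by at most γ^H/(1−γ) times the maximal reward. A sketch on a toy tabular MDP, compared against the exact value V^π = (I − γ P_π)^{−1} r_π (assuming numpy; all quantities are illustrative, and rewards are taken deterministic for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.9

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # p(s' | s, a)
R = rng.random((S, A))              # deterministic rewards r(s, a) in [0, 1]
pi = rng.integers(A, size=S)        # stationary deterministic policy

def mc_value(s0, n_rollouts=300, H=100):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_t | s_0 = s0]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(H):
            a = pi[s]
            ret += disc * R[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_rollouts

# Exact value for comparison: V^pi = (I - gamma P_pi)^{-1} r_pi
P_pi = P[np.arange(S), pi]
r_pi = R[np.arange(S), pi]
V_exact = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

The linear-system formula anticipates Lecture 3; the Monte Carlo estimate only needs the ability to simulate transitions.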
③ Infinite time horizon with a terminal state. Let τ be the random time at which a terminal state is first reached.

V^π(s) = E_π[ Σ_{t=0}^{τ} r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? For tasks that have a natural notion of terminal state (e.g. achieve some goal)
④ Infinite time horizon with average reward.

V^π(s) = lim_{T→∞} E_π[ (1/T) Σ_{t=0}^{T−1} r_t | s_0 = s ]

with a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

⇒ When? The system should be controlled for a very long time
⇒ slightly harder to work with (not mentioned much in this class)

cf. [Puterman, Markov Decision Processes, 1994]
Optimal policy

Given a value function (①, ②, or ④), one can define the following.

Definition. The optimal value in state s is given by V*(s) = max_π V^π(s).

Definition. An optimal policy π* satisfies π* ∈ argmax_π V^π, that is,
∀s ∈ S, π* ∈ argmax_π V^π(s), or equivalently V* = V^{π*}.
Properties:
• there exists an optimal policy! (i.e. a policy maximizing the value in all states)
• there exists an optimal policy that is deterministic
• ... and even stationary (except with a finite horizon ①)
4 Trying Policies
Example: Cart-Pole

Task: maintain the pole as long as possible in a quasi-vertical position, by applying some force on the cart towards the left or right.

Introductory notebook
Back to Retail Store Management
• State: number of bikes in stock on Sunday. State space: S = {0, ..., M}
• Action: number of bikes ordered at the beginning of the week. Action space: A = {0, ..., M}
• Reward = balance of the week: if you order a_t bikes,
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M)
• Transition: you end the week with s_{t+1} = max(0, min(M, s_t + a_t) − D_t) bikes

Goal: from an initial stock s, maximize the sum of discounted rewards

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
Possible policies
• Uniform policy (on reasonable orders): π(s) ∼ U({0, ..., M − s})
• Constant policy: always order m_0 bikes (capped by the remaining capacity):
π(s) = min(m_0, M − s)
• Threshold policy: whenever there are at most m_1 bikes in stock, refill up to m_2 bikes; otherwise, do not order:
π(s) = 1(s ≤ m_1)(m_2 − s)
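The three policies can be written down directly and plugged into a simulation of the stock dynamics s_{t+1} = max(0, min(M, s_t + a_t) − D_t). A sketch (assuming numpy and a Poisson demand, which is my modeling choice, not stated in the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
M, lam = 10, 4.0                     # capacity, mean weekly demand (assumed Poisson)
m0, m1, m2 = 3, 4, 10

def uniform_policy(s):               # uniform over reasonable orders {0, ..., M - s}
    return int(rng.integers(0, M - s + 1))

def constant_policy(s):              # always order m0 bikes, capped by capacity
    return min(m0, M - s)

def threshold_policy(s):             # refill up to m2 when stock drops to m1 or below
    return (m2 - s) if s <= m1 else 0

def simulate(policy, s0=0, weeks=100):
    """Return the stock trajectory (s_0, ..., s_weeks) under the given policy."""
    traj = [s0]
    for _ in range(weeks):
        s = traj[-1]
        a = policy(s)
        d = rng.poisson(lam)
        traj.append(max(0, min(M, s + a) - d))
    return traj

traj = simulate(threshold_policy)
```

Plotting `traj` reproduces the sawtooth behavior seen in the notebook for the threshold policy.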
Simulations
Notebook
Figure – Evolution of the stock s_t over 100 weeks under a threshold policy (m_1 = 4, m_2 = 10)
5 What is Reinforcement Learning?
Questions

In a known Markov Decision Process,
• can we compute an optimal policy? (based on the explicit knowledge of r(s, a) and p(·|s, a))
• ... even with very large (or infinite) state and/or action spaces? (e.g. based on a simulator for transitions)

Beyond:
• Can we learn a good policy in an unknown MDP, only by selecting actions and performing transitions?
• ... and can we do it while maximizing reward?

Broad goal of Reinforcement Learning
Learning an optimal policy in an unknown (or very large) MDP, by acting (= choosing actions) and observing transitions.
Reinforcement Learning

During learning, we increment a database of observed transitions
D_t = {(s_0, a_0, r_0, s_1), (s_1, a_1, r_1, s_2), ..., (s_{t−1}, a_{t−1}, r_{t−1}, s_t)},
which is used to
⇒ select the next action to perform, a_t:
r_t ∼ ν(s_t, a_t), s_{t+1} ∼ p(·|s_t, a_t), D_{t+1} = D_t ∪ {(s_t, a_t, r_t, s_{t+1})}
⇒ possibly output a guess π_t for π*

Possible goal: Policy Estimation
Make sure that the learnt policy π_t is eventually a good policy
• sample complexity result: if t ≥ ..., then |V^{π_t}(s) − V*(s)| ≤ ε.

Possible goal: Rewards Maximization
Maximize the rewards accumulated during learning: the value of the learning policy should be close to the optimal value,
E[ Σ_{s=0}^{t} γ^s r_s ]

⇒ Exploration/Exploitation trade-off
Let’s get started
... with one-state MDPs, a.k.a. multi-armed bandits
(next class)