Reinforcement Learning

Hanxiao Liu
Carnegie Mellon University
[email protected]
September 20, 2016

Based on David Silver's lectures on RL.
Outline
Introduction
Markov Decision Process
Model-Free Prediction
Model-Free Control
Function Approximation
Examples
1. Helicopter Control
2. Atari Games
3. Learning Simple Algorithms
Basic Setups
History: Agent’s experience
H_t = O_1, R_1, A_1, O_2, R_2, A_2, \ldots, A_{t-1}, O_t, R_t    (1)
State: A summary of the history
S_t = f(H_t)    (2)
Markov Property
\Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \ldots, S_t]    (3)
Basic Setups
Key components of an RL agent
- Policy: the agent's behavior
a = \pi(s)    (4)
\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]    (5)
- Value function: a prediction of future reward
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]    (6)
- Model: the agent's representation of the environment.
- We will primarily focus on model-free RL.
Basic Setups
Fundamental sequential decision making problems:
- Planning: the environment is fully observed (its model is known).
- RL: the environment is initially unknown.
- Exploration and exploitation.
Tasks
- Prediction: evaluate v_\pi(s) for a given \pi.
- Control: find the optimal value function v_*(s) by improving \pi.
Markov Decision Process
Markov Decision Process
An MDP M is defined by the tuple (S, A, P, R, \gamma)
P^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]    (7)
R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]    (8)
- Here we assume the environment is fully observed: P and R are known.
- For any fixed policy \pi, (S, P^\pi) defines a Markov process, where
P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}    (9)
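To make the notation concrete, here is a minimal array-based sketch of a tabular MDP and the policy-induced transition matrix of equation (9). The two-state MDP and the policy below are made up purely for illustration; only the array conventions (P indexed by action, state, next state; R by state and action) matter.

```python
import numpy as np

# A made-up 2-state, 2-action MDP, purely for illustration.
# P[a, s, s2] = Pr[S_{t+1} = s2 | S_t = s, A_t = a]
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],    # action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])   # action 1
# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# A fixed stochastic policy: pi[s, a] = pi(a | s)
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# Equation (9): transition matrix of the policy-induced Markov chain,
# P_pi[s, s'] = sum_a pi(a | s) * P[a, s, s']
P_pi = np.einsum('sa,ast->st', pi, P)
print(P_pi)  # each row still sums to 1
```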
Value Functions
Goal: refine \pi to maximize the expected future return
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots    (10)
Value functions
- State-value function
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]    (11)
- Action-value function
q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]    (12)
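As a quick worked example of equation (10), the return can be computed backwards through a finite reward sequence; the rewards below are made up.

```python
# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # R_1, ..., R_4 of a made-up 4-step episode

returns, G = [], 0.0
for r in reversed(rewards):      # recursion G_t = R_{t+1} + gamma * G_{t+1}
    G = r + gamma * G
    returns.append(G)
returns.reverse()
print(returns)                   # [6.265, 5.85, 6.5, 5.0]
```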
Bellman Expectation Equation
Q: How can we evaluate a given policy \pi, i.e., obtain v_\pi?
Value = immediate reward + discounted successor value
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]    (13)
q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]    (14)
v_\pi(s) = \sum_a \pi(a \mid s) \underbrace{\Big[ R^a_s + \gamma \sum_{s'} P^a_{ss'} v_\pi(s') \Big]}_{q_\pi(s, a)}    (15)
\Longrightarrow v_\pi can be obtained by solving a linear system.
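Because equation (15) is linear in v_π, a small tabular MDP can be evaluated exactly with a single linear solve. A minimal sketch, assuming the array conventions from the earlier example and a made-up random MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Made-up random MDP, for illustration only.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] sums to 1
R = rng.normal(size=(n_states, n_actions))                         # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # uniform random policy

# Policy-induced reward vector and transition matrix.
R_pi = (pi * R).sum(axis=1)              # R^pi_s  = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)    # P^pi_ss' = sum_a pi(a|s) P^a_ss'

# Equation (15) in matrix form: v = R^pi + gamma P^pi v  =>  (I - gamma P^pi) v = R^pi
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(v_pi)
```

For large state spaces the cubic cost of this solve becomes impractical, which is one motivation for the iterative and sampling-based methods that follow.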
Bellman Optimality Equation
Q: How do we know whether \pi is already optimal?
Recall for any fixed π:
v_\pi(s) = \sum_a \pi(a \mid s) \Big[ R^a_s + \gamma \sum_{s'} P^a_{ss'} v_\pi(s') \Big]    (16)
For the optimal \pi^*:
v_{\pi^*}(s) = \max_a \Big[ R^a_s + \gamma \sum_{s'} P^a_{ss'} v_{\pi^*}(s') \Big]    (17)
No closed-form solution, but iterative solvers are available.
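One such iterative solver is value iteration, which repeatedly applies the Bellman optimality backup of equation (17) until the values stop changing. A sketch under the same made-up array conventions (P[a, s, s'], R[s, a]):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_ss' v(s') ] until convergence."""
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q[s, a], the bracketed backup
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # v* and a greedy (optimal) policy
        v = v_new

# Toy random MDP, for illustration only.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(2, 4))   # 2 actions, 4 states
R = rng.normal(size=(4, 2))
v_star, greedy_policy = value_iteration(P, R)
```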
Control
Q: How to improve π?
Policy iteration
(a) Policy evaluation: compute v_\pi(s) given \pi.
(b) Policy improvement: obtain an improved policy \pi' = greedy(v_\pi)
- v_\pi(s) \Longrightarrow q_\pi(s, a)
- \pi'(s) = \arg\max_a q_\pi(s, a)
- Theorem: \pi' \geq \pi.
(A code sketch of this loop follows below.)
Alternative approach: Value iteration
(a) Obtain v_*(s) by solving the Bellman optimality equation.
(b) \pi^* is implied by v_*.
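A sketch of the policy-iteration loop above, alternating exact evaluation (the linear solve of equation (15)) with greedy improvement; the toy MDP is again randomly generated for illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial deterministic policy
    while True:
        # (a) Policy evaluation: solve (I - gamma P^pi) v = R^pi exactly.
        P_pi = P[policy, np.arange(n_states)]         # row s is P[policy[s], s, :]
        R_pi = R[np.arange(n_states), policy]         # R[s, policy[s]]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # (b) Policy improvement: act greedily with respect to q_pi(s, a).
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):        # greedy policy is stable => optimal
            return policy, v
        policy = new_policy

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4), size=(2, 4))            # made-up 2-action, 4-state MDP
R = rng.normal(size=(4, 2))
optimal_policy, v_star = policy_iteration(P, R)
```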
Model-Free Prediction
So far we’ve been assuming M is fully observed.
- We are informed about P and R, so estimating v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] is easy.
Can v_\pi(s) still be estimated if M is only partially observed?
- Taking the full expectation is no longer possible.
- However, we can sample from the environment.
Sampling Approaches
Monte Carlo (MC)
1. Estimate G_t by sampling from M following \pi.
- We are sampling from a Markov (reward) process.
2. v_\pi(S_t) \leftarrow v_\pi(S_t) + \alpha (G_t - v_\pi(S_t))
Temporal Difference (TD)
V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))    (18)
Advantages of TD
- Lower variance.
- More efficient (exploits the Markov property).
- Handles incomplete sequences.
Both tabular updates are sketched in code below.
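The two tabular prediction updates, sketched side by side. The environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and the behavior policy are illustrative assumptions, not something defined on the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])            # equation (18)
            s = s_next
    return V

def mc_prediction(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    """Every-visit Monte Carlo: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode, s, done = [], env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            episode.append((s, r))
            s = s_next
        G = 0.0
        for s, r in reversed(episode):                 # G_t = R_{t+1} + gamma * G_{t+1}
            G = r + gamma * G
            V[s] += alpha * (G - V[s])
    return V
```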
A Unified View
(Figure: see http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)
Model-Free Control
Policy Iteration
1. Policy evaluation: compute q_\pi(s, a) given \pi.
- using TD or MC.
2. Policy improvement: \pi' = \epsilon-greedy(q_\pi) (a SARSA-style sketch follows below)
- \pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = \arg\max_{a'} q(s, a') \\ \epsilon/m & \text{otherwise} \end{cases} \quad where m = |A|
- Acting purely greedily is a bad idea: no exploration.
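A sketch of this on-policy control loop as SARSA with an ε-greedy policy; the environment interface (reset/step returning (next_state, reward, done)) is again an assumed convention.

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """With probability eps explore uniformly, otherwise pick argmax_a Q(s, a)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([Q[(s, a)] for a in range(n_actions)]))

def sarsa(env, n_actions, gamma=0.99, alpha=0.1, eps=0.1, n_episodes=5000, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # TD update on the action-value, evaluating the policy we are actually following.
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```

Decaying ε over episodes recovers a greedy policy in the limit, which (under the usual conditions) is how such schemes approach optimal behavior.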
Large-scale RL
Some real-world problems:
- Go: 10^{170} states
- Robot control: continuous (infinite) action space
Storing v(s) and/or q(s, a) as a table becomes intractable.
Solution: Function approximation
\hat{v}(s, w) \approx v_\pi(s)    (19)
\hat{q}(s, a, w) \approx q_\pi(s, a)    (20)
Ideally \hat{v} and \hat{q} are both differentiable and expressive
- e.g., deep neural networks
Optimization
Recall the tabular TD update:
\Delta V(S_t) = \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))    (21)
With function approximation:
\Delta w = \alpha (R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)) \nabla_w \hat{v}(S_t, w)    (22)
\approx supervised learning using stochastic gradient descent
- Experience replay (in DQN): cache and reuse historical training examples (transitions) to refine w.
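A minimal sketch of the semi-gradient TD(0) update of equation (22), using linear features so that ∇_w v̂(s, w) is just the feature vector φ(s). The feature map and environment interface are assumptions made for illustration.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      gamma=0.99, alpha=0.01, n_episodes=1000):
    """Linear v_hat(s, w) = w . phi(s); equation (22): w <- w + alpha * td_error * grad_w v_hat."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_s = w @ features(s)
            v_next = 0.0 if done else w @ features(s_next)
            td_error = r + gamma * v_next - v_s
            w += alpha * td_error * features(s)   # grad_w v_hat(s, w) = phi(s) for a linear v_hat
            s = s_next
    return w
```

A DQN-style variant would additionally store (s, a, r, s') transitions in a replay buffer and update w from randomly sampled mini-batches rather than from the latest transition only.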
Policy-based RL
Alternatively, we can parameterize \pi directly, instead of v_\pi (or q_\pi):
\pi_\theta(s, a) = \Pr[a \mid s, \theta]    (23)
Then optimize \theta w.r.t. some objective J(\theta).
Advantage: no need to carry out \arg\max_a q(s, a)
- More efficient for high-dimensional or continuous A.
Policy Gradient
Consider a one-step MDP starting from s ∼ d(s)
J(\theta) := \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a) R^a_s    (24)
\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) R^a_s    (25)
= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, r]    (26)
This gives a stochastic estimate of the policy gradient without knowing the environment.
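A sketch of the score-function estimator in equation (26) for a one-step MDP with a tabular softmax policy. The state-sampling routine and the reward function are placeholders standing in for d(s) and R^a_s.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_policy_gradient(sample_state, reward_fn, n_states, n_actions,
                             lr=0.1, n_iters=5000, seed=0):
    """theta[s, a] parameterizes pi_theta(a | s) = softmax(theta[s])[a]."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        s = sample_state(rng)                 # s ~ d(s)
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)    # a ~ pi_theta(. | s)
        r = reward_fn(s, a, rng)              # observed one-step reward
        # grad_theta log pi(a | s) for a softmax: one-hot(a) - probs (only row s is touched).
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += lr * r * grad_log_pi      # stochastic estimate of equation (26)
    return theta
```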
Actor-Critic Models
Policy gradient in more general settings:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, q_{\pi_\theta}(s, a)]    (27)
q_{\pi_\theta}(s, a) can be approximated by \hat{q}(s, a, w).
- The critic updates w.
- The actor updates \theta in the direction suggested by the critic.
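A rough one-step actor-critic sketch: a tabular softmax actor updated along ∇_θ log π_θ(a|s), and a tabular state-value critic learned by TD. Using the TD error in place of q_πθ(s, a) in equation (27) is a common variance-reduction choice rather than something stated on the slide; the environment interface is again an assumption.

```python
import numpy as np
from collections import defaultdict

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic(env, n_states, n_actions, gamma=0.99,
                          alpha_w=0.1, alpha_theta=0.01, n_episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))   # actor: pi_theta(a | s) = softmax(theta[s])[a]
    V = defaultdict(float)                    # critic: tabular state values
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            probs = softmax(theta[s])
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            # Critic: TD(0) update of V, producing the TD error as a learning signal.
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha_w * td_error
            # Actor: move theta along grad log pi(a|s), scaled by the critic's TD error.
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            theta[s] += alpha_theta * td_error * grad_log_pi
            s = s_next
    return theta, V
```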
Actor-Critic Models
Similar ideas are used in AlphaGo, which learns both a policy network and a value network.
The End
Other interesting topics
- Convergence.
- Exploration vs. exploitation.
- Credit assignment.
- Off-policy RL.