Topics in Reinforcement Learning: Lessons from AlphaZero for
(Sub)Optimal Control and Discrete Optimization
Arizona State University, Course CSE 691, Spring 2022
Links to Class Notes, Videolectures, and Slides at http://web.mit.edu/dimitrib/www/RLbook.html
Dimitri P. Bertsekas, [email protected]
Lecture 2: Stochastic Finite and Infinite Horizon DP
Bertsekas Reinforcement Learning 1 / 29
Outline
1 Finite Horizon Deterministic Problem - Approximation in Value Space
2 Stochastic DP Algorithm
3 Linear Quadratic Problems - An Important Favorable Special Case
4 Infinite Horizon - An Overview of Theory and Algorithms
Review - Finite Horizon Deterministic Problem
[Figure: a deterministic finite-horizon trajectory x0, . . . , xk, xk+1, . . . , xN; at stage k the control uk is applied at cost gk(xk, uk), with stage k shown together with the future stages.]
System
    xk+1 = fk(xk, uk), k = 0, 1, . . . , N − 1
where xk: state, uk: control chosen from some set Uk(xk)
Arbitrary state and control spaces
Cost function:
    gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk)
For given initial state x0, minimize over control sequences {u0, . . . , uN−1}
    J(x0; u0, . . . , uN−1) = gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk)
Optimal cost function
    J∗(x0) = min over uk ∈ Uk(xk), k = 0, . . . , N − 1, of J(x0; u0, . . . , uN−1)
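The formulation above can be made concrete with a tiny example. The following sketch is our own toy instance (the quadratic costs, the control set {0, 1}, and the names f, g, gN are illustrative choices, not from the lecture); it computes J∗(x0) by brute-force enumeration of all control sequences, which is exactly the minimization that DP will later organize efficiently.

```python
from itertools import product

# Toy deterministic finite-horizon problem (illustrative choices, not from the lecture)
N = 3                                   # horizon
U = (0, 1)                              # Uk(xk): same control set at every state
f = lambda x, u: x + u                  # system: x_{k+1} = f_k(x_k, u_k)
g = lambda x, u: (x - 2) ** 2 + u       # stage cost g_k(x_k, u_k)
gN = lambda x: (x - 2) ** 2             # terminal cost g_N(x_N)

def J(x0, controls):
    """Cost J(x0; u0, ..., u_{N-1}) of one control sequence."""
    x, total = x0, 0
    for u in controls:
        total += g(x, u)
        x = f(x, u)
    return total + gN(x)

# Optimal cost by brute force: minimize over all |U|^N control sequences
best = min(product(U, repeat=N), key=lambda seq: J(0, seq))
print(best, J(0, best))                 # optimal sequence and J*(x0)
```

Brute force costs |U|^N evaluations, which is why the DP algorithm on the next slide matters.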
Review - DP Algorithm for Deterministic Problems
Go backward to compute the optimal costs J∗k(xk) of the xk-tail subproblems
(off-line training - involves lots of computation). Start with
    J∗N(xN) = gN(xN), for all xN,
and for k = 0, . . . , N − 1, let
    J∗k(xk) = min over uk ∈ Uk(xk) of [gk(xk, uk) + J∗k+1(fk(xk, uk))], for all xk.
Then the optimal cost J∗(x0) is obtained at the last step: J∗0(x0) = J∗(x0).
Go forward to construct an optimal control sequence {u∗0, . . . , u∗N−1} (on-line play)
Start with
    u∗0 ∈ arg min over u0 ∈ U0(x0) of [g0(x0, u0) + J∗1(f0(x0, u0))],  x∗1 = f0(x0, u∗0).
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
    u∗k ∈ arg min over uk ∈ Uk(x∗k) of [gk(x∗k, uk) + J∗k+1(fk(x∗k, uk))],  x∗k+1 = fk(x∗k, u∗k).
Q-Factors for Deterministic Problems
An alternative (and equivalent) form of the DP algorithm
Generates the optimal Q-factors, defined for all (xk, uk) and k by
    Q∗k(xk, uk) = gk(xk, uk) + J∗k+1(fk(xk, uk))
The optimal cost function J∗k can be recovered from the optimal Q-factors Q∗k:
    J∗k(xk) = min over uk ∈ Uk(xk) of Q∗k(xk, uk)
The DP algorithm can be written in terms of Q-factors:
    Q∗k(xk, uk) = gk(xk, uk) + min over uk+1 ∈ Uk+1(fk(xk, uk)) of Q∗k+1(fk(xk, uk), uk+1)
Exact and approximate forms of this and other related algorithms form an important class of RL methods known as Q-learning.
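A minimal sketch of the Q-factor form of the recursion, again on our own toy instance (not from the lecture): the recursion is run entirely on Q-factors, and J∗0(x0) is recovered at the end as a minimization over u.

```python
# Toy problem data (our own illustrative example)
N = 3
U = (0, 1)
f = lambda x, u: x + u
g = lambda x, u: (x - 2) ** 2 + u
gN = lambda x: (x - 2) ** 2
states = range(N + 1)

# Q*_{N-1}(x,u) = g_{N-1}(x,u) + g_N(f(x,u)), then go backward:
# Q*_k(x,u) = g_k(x,u) + min_{u'} Q*_{k+1}(f(x,u), u')
Q = [dict() for _ in range(N)]
for x in states:
    for u in U:
        Q[N - 1][(x, u)] = g(x, u) + gN(f(x, u))
for k in reversed(range(N - 1)):
    for x in states:
        for u in U:
            nxt = f(x, u)
            Q[k][(x, u)] = g(x, u) + min(
                Q[k + 1].get((nxt, u2), float("inf")) for u2 in U
            )

# Recover J*_0(x0) as a minimization over Q-factors
print(min(Q[0][(0, u)] for u in U))
```

Note that the recursion never touches J directly; this is what makes the Q-factor form amenable to model-free, sample-based (Q-learning) variants.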
Approximation in Value Space
We replace J∗k with an approximation Jk during on-line play
Start with
    u0 ∈ arg min over u0 ∈ U0(x0) of [g0(x0, u0) + J1(f0(x0, u0))], and set x1 = f0(x0, u0)
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
    uk ∈ arg min over uk ∈ Uk(xk) of [gk(xk, uk) + Jk+1(fk(xk, uk))],  xk+1 = fk(xk, uk)
How do we compute Jk+1(xk+1)? This is one of the principal issues in RL
Off-line problem approximation: Use as Jk+1 the optimal cost function of a simpler problem, computed off-line by exact DP
On-line approximate optimization, e.g., solve on-line a shorter-horizon problem by multistep lookahead minimization and a simple terminal cost (often done in MPC)
Parametric cost approximation: Obtain Jk+1(xk+1) from a parametric class of functions J(xk+1, r), where r is a parameter, e.g., trained using data and a neural network
Rollout with a heuristic: We will focus on this for the moment.
Rollout for Finite-State Deterministic Problems
[Figure: rollout for a finite-state deterministic problem, illustrated with a parking-spaces example over stages 1, . . . , N. From the current state xk, each candidate control uk, u′k, u′′k leads to a next state xk+1, x′k+1, x′′k+1, from which a base heuristic estimates the cost of the remaining stages ("Heuristic Cost of the Future"); the resulting Q-factors are compared at the current state xk.]
[Figure: approximation in value space with lookahead, illustrated on a traveling salesman instance (initial city, current partial tour, next cities/next states, nearest neighbor heuristic). Variants shown include multistep lookahead, certainty equivalence, Monte Carlo tree search over a lookahead tree, and truncated rollout with a terminal cost approximation.]
Sec. 1.2 Deterministic Dynamic Programming 25
Figure 1.2.9 Schematic illustration of rollout with one-step lookahead for a deterministic problem. At state xk, for every pair (xk, uk), uk ∈ Uk(xk), the base heuristic generates an approximate Q-factor
    Qk(xk, uk) = gk(xk, uk) + Hk+1(fk(xk, uk)),
and selects the control µk(xk) with minimal Q-factor.
From each next state xk+1 = fk(xk, uk), the base heuristic generates a sequence of controls and states, and the corresponding cost
    Hk+1(xk+1) = gk+1(xk+1, uk+1) + · · · + gN−1(xN−1, uN−1) + gN(xN).
The rollout algorithm then applies the control that minimizes over uk ∈ Uk(xk) the tail cost expression for stages k to N:
    gk(xk, uk) + Hk+1(xk+1).
Equivalently, and more succinctly, the rollout algorithm applies at state xk the control µk(xk) given by the minimization
    µk(xk) ∈ arg min over uk ∈ Uk(xk) of Qk(xk, uk),   (1.14)
where Qk(xk, uk) is the approximate Q-factor defined by
    Qk(xk, uk) = gk(xk, uk) + Hk+1(fk(xk, uk));   (1.15)
see Fig. 1.2.9. Rollout defines a suboptimal policy π = {µ0, . . . , µN−1}, referred to as the rollout policy, where for each xk and k, µk(xk) is the control produced by the Q-factor minimization (1.14).
Note that the rollout algorithm requires running the base heuristic a number of times that is bounded by Nn, where n is an upper bound on the number of control choices available at each state. Thus if n is small relative to N, it requires computation equal to a small multiple of N times the computation time for a single application of the base heuristic. Similarly, if n is bounded by a polynomial in N, the ratio of the rollout algorithm's computation time to the base heuristic's computation time is a polynomial in N.
Cost approximation by running a heuristic from states of interest
We generate a single system trajectory {x0, x1, . . . , xN} by on-line play
Upon reaching xk, we compute for all uk ∈ Uk(xk) the corresponding next states xk+1 = fk(xk, uk)
From each of the next states xk+1 we run the heuristic and compute the heuristic cost Hk+1(xk+1)
We apply the uk that minimizes over uk ∈ Uk(xk) the (heuristic) Q-factor
gk(xk, uk) + Hk+1(xk+1)
We generate the next state xk+1 = fk(xk, uk) and repeat
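The on-line play loop above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the problem interface (`f`, `g`, `g_N`, `controls`) and the base heuristic's control rule `pick` are hypothetical stand-ins for a concrete problem.

```python
# Rollout with one-step lookahead for a deterministic finite-horizon problem.
# Hypothetical interface: f(k, x, u) -> next state, g(k, x, u) -> stage cost,
# g_N(x) -> terminal cost, controls(k, x) -> feasible controls at (k, x),
# pick(k, x) -> the base heuristic's control choice at (k, x).

def heuristic_cost(k, x, N, f, g, g_N, pick):
    """H_k(x): cost of the base-heuristic trajectory from state x at stage k."""
    total = 0.0
    for j in range(k, N):
        u = pick(j, x)
        total += g(j, x, u)
        x = f(j, x, u)
    return total + g_N(x)

def rollout_trajectory(x0, N, f, g, g_N, controls, pick):
    """On-line play: at each xk apply the uk minimizing gk(xk,uk) + H_{k+1}(f(xk,uk))."""
    x, traj, total = x0, [x0], 0.0
    for k in range(N):
        u_best = min(
            controls(k, x),
            key=lambda u: g(k, x, u)
                + heuristic_cost(k + 1, f(k, x, u), N, f, g, g_N, pick),
        )
        total += g(k, x, u_best)
        x = f(k, x, u_best)
        traj.append(x)
    return traj, total
```

By the cost improvement property of rollout, the returned cost is no worse than the base heuristic's cost from x0, and the computation per stage is a small multiple of one heuristic run per control choice.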
Page 8
Traveling Salesman Example
[Slide graphic: traveling salesman tour tree from origin node s to artificial terminal node t — partial tours A, AB, AC, AD, ABC, ABD, ACB, ACD, ADB, ADC and complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB; overlaid shortest-path residue: "Is di + aij < dj? Is di + aij < UPPER?", OPEN list INSERT/REMOVE.]
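Rollout on the slide's 4-city instance can be sketched as follows, with the nearest neighbor heuristic as base heuristic. The intercity cost matrix below is a hypothetical illustration (the lecture's matrix is in the figure, not the text); it is chosen so that the greedy nearest neighbor tour is suboptimal and rollout improves on it.

```python
# Rollout for a 4-city traveling salesman problem (cities A, B, C, D),
# using the nearest neighbor heuristic as the base heuristic.

CITIES = "ABCD"
COST = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3,   # hypothetical symmetric
        ("B", "C"): 1, ("B", "D"): 4, ("C", "D"): 10}  # intercity travel costs

def c(i, j):
    # symmetric travel cost between cities i and j
    return COST[(i, j)] if (i, j) in COST else COST[(j, i)]

def tour_cost(tour):
    return sum(c(a, b) for a, b in zip(tour, tour[1:]))

def nearest_neighbor_completion(partial):
    """Base heuristic: greedily extend the partial tour, then return to the start."""
    tour = list(partial)
    while len(tour) < len(CITIES):
        tour.append(min((x for x in CITIES if x not in tour),
                        key=lambda x: c(tour[-1], x)))
    return tour + [tour[0]]

def rollout_tour(start="A"):
    """One-step lookahead: pick the next city whose heuristic completion is cheapest."""
    tour = [start]
    while len(tour) < len(CITIES):
        tour.append(min((x for x in CITIES if x not in tour),
                        key=lambda x: tour_cost(nearest_neighbor_completion(tour + [x]))))
    return tour + [tour[0]]
```

With these costs the nearest neighbor tour from A is A-B-C-D-A (cost 15), while rollout finds A-C-B-D-A (cost 10), illustrating the cost improvement property.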
Matrix of Intercity Travel Costs
[Slide graphic: feature-based parametric cost approximation — feature vector F(i) = (F1(i), . . . , Fs(i)) of state i, architecture Jµ(F(i), r) ≈ Jµ(i) with scalar weights r = (r1, . . . , rs); if Jµ(F(i), r) = Σℓ=1..s Fℓ(i) rℓ, it is a linear feature-based architecture. Features may be formed by a neural network or other scheme, possibly including "handcrafted" features; an aggregate problem is formulated and "solved" to generate an "improved" policy µ. AlphaZero remarks: position "value", move "probabilities"; the same algorithm learned multiple games (Go, Shogi) and plays much better than all chess programs.]
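The linear feature-based architecture mentioned above can be sketched in a couple of lines; the feature values and weights in the usage below are illustrative assumptions, not from the lecture.

```python
# Linear feature-based architecture: J(F(i), r) = sum over l of F_l(i) * r_l,
# where F(i) = (F_1(i), ..., F_s(i)) is the feature vector of state i and
# r = (r_1, ..., r_s) are scalar weights.

def linear_cost_approximation(features, weights):
    """Evaluate J(F(i), r) = sum_l F_l(i) * r_l for one state's feature vector."""
    assert len(features) == len(weights)
    return sum(f * w for f, w in zip(features, weights))
```

Because the approximation is linear in the weight vector r, fitting r to sampled cost values reduces to a linear least squares (regression) problem.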
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
A AB AC AD ABC ABD ACB ACD ADB ADC
ABCD ABDC ACBD ACDB ADBC ADCB
Origin Node s Artificial Terminal Node t
(Is the path s ! i ! j
have a chance to be part of a shorter s ! j path?)
(Does the path s ! i ! j shorter than the current s ! j path?)
Length = 0 Dead-End Position Solution
Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j
YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?
Open List INSERT REMOVE
w1, w2, . . . xN t m xk xk+1 5.5
Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence
Decoder Decoded Sequence
s x0 x1 x2 xN�1 xN t m xk xk+1 5.5
w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk
xk+1 5.5
Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence
Decoder Decoded Sequence
1
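The label correcting tests in the figure above can be sketched as follows. This is a minimal sketch: the adjacency-dict graph representation and the FIFO (deque) discipline for the OPEN list are illustrative assumptions, not part of the slides.

```python
from collections import deque

def label_correcting(graph, s, t):
    """Label correcting shortest-path method from s to t.

    graph: dict mapping node -> list of (successor, arc_length) pairs.
    """
    INF = float("inf")
    d = {s: 0.0}                 # labels di (tentative shortest distances)
    upper = INF                  # UPPER: length of best s -> t path found so far
    open_list = deque([s])       # OPEN list of nodes to examine
    while open_list:
        i = open_list.popleft()  # REMOVE a node from OPEN
        for j, a_ij in graph.get(i, []):
            # Is di + aij < dj?  Is di + aij < UPPER?
            if d[i] + a_ij < min(d.get(j, INF), upper):
                d[j] = d[i] + a_ij          # set dj = di + aij
                if j == t:
                    upper = d[j]            # tighten UPPER
                else:
                    open_list.append(j)     # INSERT j into OPEN
    return upper

graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 6)], "b": [("t", 1)]}
print(label_correcting(graph, "s", "t"))  # prints 3.0 (path s -> a -> b -> t)
```

With a FIFO OPEN list this is the Bellman–Ford flavor of the method; other queue disciplines (LIFO, best-first as in Dijkstra) fit the same template.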
[Figure: a two-state stochastic shortest path example with initial state s, states 1 and 2, and terminal state t. Bellman equations: J(1) = min{c, a + J(2)}, J(2) = b + J(1). Annotations: proper and improper policies µ; a control u with Prob. u / Prob. 1 − u and costs 1 and 1 − √u; the set Wp+ = {J ∈ J | J+ ≤ J}, with J(xk) → 0 for all p-stable π from x0 ∈ X, π ∈ Pp,x0, from within which value iteration converges; fixed-point iterates f(x; θk), xk+1 = F(xk), converging to x∗ = F(x∗).]
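The two-state Bellman equations above, J(1) = min{c, a + J(2)} and J(2) = b + J(1), can be solved by value iteration. A minimal sketch; the numerical values a = b = 1, c = 3 are illustrative assumptions:

```python
def value_iteration(a, b, c, iters=100):
    """Value iteration for the two-state example with Bellman equations
        J(1) = min{c, a + J(2)},   J(2) = b + J(1).
    State 1 can terminate at cost c or move to state 2 at cost a;
    state 2 moves back to state 1 at cost b."""
    J1, J2 = 0.0, 0.0
    for _ in range(iters):
        J1, J2 = min(c, a + J2), b + J1   # simultaneous (Jacobi) update
    return J1, J2

# With a + b > 0, cycling 1 -> 2 -> 1 accumulates cost, so terminating at
# cost c is optimal from state 1: J(1) = c and J(2) = b + c.
J1, J2 = value_iteration(a=1.0, b=1.0, c=3.0)
print(J1, J2)   # prints 3.0 4.0
```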
[Figure: feature-based parametric architectures and aggregation. A feature map produces F(i) = (F1(i), . . . , Fs(i)), the vector of features of state i, and the approximation J̃µ(F(i), r) ≈ Jµ(i), where r is a vector of weights. If J̃µ(F(i), r) = Σℓ Fℓ(i)rℓ (ℓ = 1, . . . , s), it is a linear feature-based architecture (r1, . . . , rs: scalar weights). Aggregation panel: form the aggregate states I1, . . . , Iq using a neural network or other scheme, possibly including "handcrafted" features; choose the aggregation and disaggregation probabilities; formulate the aggregate problem with aggregate costs r∗ℓ and generate an "improved" policy µ̃ by "solving" it. Rollout panel: base heuristic, corrected J̃, costs J, J∗. AlphaZero annotations: position "value" and move "probabilities"; the same algorithm learned multiple games (Go, Shogi) and plays much better than all chess programs. Convergence annotations: with cost g(xk, uk) ≥ 0, VI converges to Jp from within Wp (functions J ≥ Jp with J(xk) → 0 for all p-stable π) and to J+ from within W+ = {J | J ≥ J+, J(t) = 0}.]
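The linear feature-based architecture J̃µ(F(i), r) = Σℓ Fℓ(i)rℓ can be sketched directly. The particular feature map and weight values below are hypothetical, chosen only to illustrate the form of the architecture:

```python
def feature_map(i):
    """Hypothetical feature vector F(i) = (F1(i), ..., Fs(i)) of state i:
    a constant, a linear, and a quadratic feature."""
    return (1.0, float(i), float(i) ** 2)

def J_tilde(i, r):
    """Linear feature-based architecture: sum_l F_l(i) * r_l."""
    return sum(f_l * r_l for f_l, r_l in zip(feature_map(i), r))

r = (2.0, 0.0, 0.5)     # hypothetical trained scalar weights r1, ..., rs
print(J_tilde(3, r))    # prints 6.5  (= 2 + 0*3 + 0.5*9)
```

In practice r is trained (e.g., by least squares regression on sampled costs Jµ(i) of a policy µ); the architecture is linear in r but can be nonlinear in i through the features.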
[Figure: rollout and approximation in value space. One-step lookahead minimization minu∈U(x) Σy pxy(u)(g(x, u, y) + αJ̃(y)), with approximation of E{·}, simplified minimization, and Q-factor approximation. The rollout policy µ̃ is obtained from the base policy µ through Tµ̃Jµ = TJµ (the linearized Bellman equation at Jµ yields the rollout policy µ̃), and the cost of the rollout policy µ̃ is no larger than the cost of the base policy µ. Deterministic rollout panel: partial solutions (u0, . . . , uk) extended by uk+1 ∈ Uk+1 toward complete solutions (u0, . . . , uN−1); base heuristic costs Hk(xk) and Hk+1(xk+1); trajectories Tk, Tk+1; monotonicity property under sequential improvement, Cost of R0 ≥ · · · ≥ Cost of Rk ≥ Cost of Rk+1 ≥ · · · ≥ Cost of RN. Side annotations: multiagent Q-factor minimization, value and policy networks, an expert that ranks complete solutions (protein-folding example with current partial folding, moving obstacle, and complete foldings), changing system, cost, and constraint parameters, and the linear quadratic cost ratio Jµ1(x)/Jµ0(x) = K1/K0.]
[Figure residue: Bellman equation on the space of quadratic functions J(x) = Kx². Value iteration: Kk+1 = F(Kk), i.e., Jk+1(x) = Kk+1x² = F(Kk)x². Riccati equation: K = F(K), with solution K* (optimal policy). The tangent line at K corresponds to a one-step lookahead policy, so a lookahead/rollout step is also a Newton step; the region of stability is also the region of convergence of Newton's method for the Riccati equation. Other labels: stable policies, unstable policy, cost of rollout policy µ̃ vs. cost of base policy µ; sequential improvement holds in the region where TJ ≤ J.]
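On quadratic functions J(x) = Kx² the Bellman operator reduces to the scalar Riccati operator F(K) = αrK/(r + αb²K) + 1, with value iteration Kk+1 = F(Kk) and the Riccati equation K = F(K). A minimal numeric sketch; the parameter values α, b, r below are illustrative assumptions:

```python
# Value iteration for a scalar linear-quadratic problem via the Riccati
# operator F(K) = alpha*r*K / (r + alpha*b^2*K) + 1 on the quadratic
# coefficient K of J(x) = K x^2.  Parameter values are illustrative.
import math

alpha, b, r = 0.9, 2.0, 1.0

def F(K):
    return alpha * r * K / (r + alpha * b**2 * K) + 1.0

# Positive root of the Riccati equation K = F(K); multiplying through by
# (r + alpha*b^2*K) gives alpha*b^2*K^2 + (r - alpha*r - alpha*b^2)*K - r = 0.
c = r - alpha * r - alpha * b**2
K_star = (-c + math.sqrt(c**2 + 4 * alpha * b**2 * r)) / (2 * alpha * b**2)

# Value iteration K_{k+1} = F(K_k) converges to the fixed point K*.
K = 0.0
for _ in range(200):
    K = F(K)

print(K, K_star)
```

The iteration contracts very fast here because F is nearly flat at the fixed point, which is the Newton-step behavior the figure alludes to.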
[Figure residue: parking example (parking spaces with costs c(0), …, c(N − 1), garage cost c(N); stages 1, …, N) and stochastic-shortest-path examples. Recoverable labels: system xk+1 = fk(xk, uk, wk); belief state pk, controller µk, control uk = µk(pk); initial state x0, terminal state t, destination. Two-state example: J(1) = min{c, a + J(2)}, J(2) = b + J(1). Other labels: Prob. u, Prob. 1 − u, Cost 1, Cost 1 − √u; J(xk) → 0 for all p-stable π from x0 with x0 ∈ X and π ∈ Pp,x0; W+p = {J ∈ J | J+ ≤ J}; fixed point x* = F(x*); proper policy µ vs. improper policy µ′.]
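The two-state Bellman equations in the figure, J(1) = min{c, a + J(2)} and J(2) = b + J(1), can be solved by value iteration. The numeric values of a, b, c below are illustrative assumptions:

```python
# Value iteration for the two-state example
#     J(1) = min{ c, a + J(2) },   J(2) = b + J(1).
# With a + b = 1 > 0, each 1 -> 2 -> 1 cycle adds positive cost, so
# stopping immediately at state 1 (cost c) is optimal.
a, b, c = 0.5, 0.5, 1.0        # illustrative costs, not from the slides

J1, J2 = 0.0, 0.0
for _ in range(50):
    J1, J2 = min(c, a + J2), b + J1   # simultaneous (Jacobi) sweep

print(J1, J2)
```

The iterates settle after a few sweeps on the fixed point of the pair of equations.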
[Figure residue: traveling-salesman rollout (initial city, current partial tour, next cities, nearest neighbor heuristic) and approximation in value space. Recoverable content: DP recursion Jk+1(xk+1) = min_{uk+1∈Uk+1(xk+1)} E{ gk+1(xk+1, uk+1, wk+1) + Jk+2(fk+1(xk+1, uk+1, wk+1)) }; one-step lookahead min_{uk} E{ gk(xk, uk, wk) + J̃k+1(xk+1) } with approximate min, approximate E{·} (certainty equivalence, Monte Carlo tree search), and approximate cost-to-go J̃k+1 (parametric approximation, neural nets, discretization); ℓ-step lookahead with cost function approximation J̃k+ℓ; truncated rollout with terminal cost approximation; rollout and model predictive control; tail subproblem with terminal cost gN(xN); optimal control sequence {u*0, …, u*k, …, u*N−1}.]
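The DP recursion above can be sketched as a backward pass over a small finite problem. The states, controls, dynamics, costs, and disturbance distribution below are made-up illustrative assumptions:

```python
# Backward DP for a finite-horizon stochastic problem:
#     J_N(x) = g_N(x),
#     J_k(x) = min_{u in U(x)} E_w{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }.
# Toy instance: 3 states, 2 controls, a fair binary disturbance.

N = 3                                  # horizon
states = [0, 1, 2]
controls = {x: [0, 1] for x in states}
disturbances = [(0, 0.5), (1, 0.5)]    # pairs (w, probability)

def f(x, u, w):                        # dynamics, clipped to the state space
    return min(max(x + u - w, 0), 2)

def g(x, u, w):                        # stage cost (w-independent here)
    return x**2 + u

def gN(x):                             # terminal cost
    return 2 * x

J = {x: gN(x) for x in states}         # J_N
policy = []                            # mu_0, ..., mu_{N-1}
for k in reversed(range(N)):
    Jk, mu_k = {}, {}
    for x in states:
        best_u, best_cost = None, float("inf")
        for u in controls[x]:
            cost = sum(p * (g(x, u, w) + J[f(x, u, w)])
                       for w, p in disturbances)
            if cost < best_cost:
                best_u, best_cost = u, cost
        Jk[x], mu_k[x] = best_cost, best_u
    J, policy = Jk, [mu_k] + policy

print(J)   # optimal cost-to-go at stage 0
```

The pass produces both the optimal cost-to-go J0 and an optimal policy µk for every stage, exactly the two objects the recursion defines.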
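The nearest neighbor heuristic named in the traveling-salesman figure can be sketched as follows. The distance matrix is an illustrative assumption; in rollout this heuristic would serve as the base heuristic that completes a partial tour:

```python
# Nearest neighbor heuristic for a symmetric TSP instance: from the
# current city, always travel to the closest not-yet-visited city, then
# close the tour back to the start.  Distances are made-up numbers.

D = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 3],
     [10, 4, 3, 0]]

def nearest_neighbor_tour(start):
    unvisited = set(range(len(D))) - {start}
    tour, cost, city = [start], 0, start
    while unvisited:
        nxt = min(unvisited, key=lambda j: D[city][j])
        cost += D[city][nxt]
        unvisited.remove(nxt)
        tour.append(nxt)
        city = nxt
    cost += D[city][start]      # return to the initial city
    return tour + [start], cost

print(nearest_neighbor_tour(0))
```

On larger instances the heuristic is generally suboptimal, which is what leaves room for rollout to improve on it.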
[Figure residue: aggregation framework (representative states x on a coarse grid, footprint sets Sx, aggregate problem with enlarged state space, critic/actor approximate PI, simulator generating samples xs_k, us_k, xs_{k+1}, gs_k). Recoverable equations: aggregate dynamics p̂xy(u) = Σ_{j=1}^n pxj(u) φjy and aggregate cost ĝ(x, u) = Σ_{j=1}^n pxj(u) g(x, u, j); Bellman minimization min_{u∈U(i)} Σ_{j=1}^n pij(u)(g(i, u, j) + αJ̃(j)) with J̃(j) = Σ_{y∈A} φjy r*y; disaggregation probabilities dxi = 0 for i ∉ Ix; aggregation probabilities φjy = 0 or 1 (each j connects to a single aggregate state); feature-based approximation J̃(i, v) = r′φ(i, v); improved policy µ̃(i) ∈ arg min_{u∈U(i)} Q̃µ(i, u, r); Q-factor evaluation of the current policy µ; error width (ε + 2αδ)/(1 − α).]
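The aggregation equations can be sketched numerically for hard aggregation, where φjy ∈ {0, 1} and each original state belongs to a single aggregate state. The 4-state chain, membership map, and representative states below are illustrative assumptions:

```python
# Forming an aggregate problem via hard aggregation:
#     p_hat[x][y] = sum_j p[x][j] * phi[j][y]
#     g_hat[x]    = sum_j p[x][j] * g(x, j)
# with a single control for brevity.  All numbers are made up.

n, n_agg = 4, 2
membership = [0, 0, 1, 1]      # original state j -> aggregate state y
phi = [[1 if membership[j] == y else 0 for y in range(n_agg)]
       for j in range(n)]

p = [[0.5, 0.5, 0.0, 0.0],     # transition probabilities p[i][j]
     [0.2, 0.3, 0.5, 0.0],
     [0.0, 0.0, 0.4, 0.6],
     [0.1, 0.0, 0.0, 0.9]]

def g(i, j):
    return float(i != j)       # unit cost for leaving the current state

rep = [0, 2]                   # representative original state per aggregate state

p_hat = [[sum(p[rep[x]][j] * phi[j][y] for j in range(n))
          for y in range(n_agg)]
         for x in range(n_agg)]
g_hat = [sum(p[rep[x]][j] * g(rep[x], j) for j in range(n))
         for x in range(n_agg)]

print(p_hat, g_hat)
```

The aggregate transition matrix again has rows summing to one, so the aggregate problem is a legitimate (smaller) DP problem that can be solved exactly and mapped back through φ.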
Bertsekas Reinforcement Learning 9 / 29
Page 9
Stochastic DP Problems - Perfect State Observation (We Know xk )
[Figure residue: labels for the stochastic DP figure (permanent trajectory Pk, tentative trajectory Tk; control uk, cost gk(xk, uk); states xk, xk+1, xN). The remaining labels repeat earlier figure residue (feature extraction, aggregation, candidate (m + 1)- and (m + 2)-solutions, scoring function V(i), cost function Jµ(i), TD(0)/TD(1) approximations V0(i), V1(i)).]
Permanent trajectory P k Tentative trajectory T k
Control uk Cost gk(xk, uk) xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π(T
(λ)µ (Φr)
)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
∑nj=1 pij(u)
(g(i, u, j) + J(j)
)Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN−1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
(F (i)
)
R1 R2 R3 Rℓ Rq−1 Rq r∗q−1 r∗
3 Cost Jµ
(F (i)
)
I1 I2 I3 Iℓ Iq−1 Iq r∗2 r∗
3 Cost Jµ
(F (i)
)
Aggregate States Scoring Function V (i) J∗(i) 0 n n − 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
1
Permanent trajectory P k Tentative trajectory T k
Control uk Cost gk(xk, uk) xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π(T
(λ)µ (Φr)
)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
∑nj=1 pij(u)
(g(i, u, j) + J(j)
)Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN−1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
(F (i)
)
R1 R2 R3 Rℓ Rq−1 Rq r∗q−1 r∗
3 Cost Jµ
(F (i)
)
I1 I2 I3 Iℓ Iq−1 Iq r∗2 r∗
3 Cost Jµ
(F (i)
)
Aggregate States Scoring Function V (i) J∗(i) 0 n n − 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
1
Permanent trajectory P k Tentative trajectory T k
Stage k Future Stages
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π(T
(λ)µ (Φr)
)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
∑nj=1 pij(u)
(g(i, u, j) + J(j)
)Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN−1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
(F (i)
)
R1 R2 R3 Rℓ Rq−1 Rq r∗q−1 r∗
3 Cost Jµ
(F (i)
)
I1 I2 I3 Iℓ Iq−1 Iq r∗2 r∗
3 Cost Jµ
(F (i)
)
Aggregate States Scoring Function V (i) J∗(i) 0 n n − 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
1
Iteration Index k PI index k Jµk J⇤ 0 1 2 . . . Error Zone Width (✏ + 2↵�)/(1 � ↵)2
Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)
Random Transition xk+1 = fk(xk, uk, wk) Random cost gk(xk, uk, wk)
Control v (j, v) Cost = 0 State-Control Pairs Transitions under policy µ Evaluate Cost Function
Variable Length Rollout Selective Depth Rollout Policy µ Adaptive Simulation Terminal Cost Function
Limited Rollout Selective Depth Adaptive Simulation Policy µ Approximation J
u Qk(xk, u) Qk(xk, u) uk uk Qk(xk, u) � Qk(xk, u)
x0 xk x1k+1 x2
k+1 x3k+1 x4
k+1 States xN Base Heuristic ik States ik+1 States ik+2
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
1
Iteration Index k PI index k Jµk J⇤ 0 1 2 . . . Error Zone Width (✏ + 2↵�)/(1 � ↵)2
Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)
Random Transition xk+1 = fk(xk, uk, wk) Random cost gk(xk, uk, wk)
Control v (j, v) Cost = 0 State-Control Pairs Transitions under policy µ Evaluate Cost Function
Variable Length Rollout Selective Depth Rollout Policy µ Adaptive Simulation Terminal Cost Function
Limited Rollout Selective Depth Adaptive Simulation Policy µ Approximation J
u Qk(xk, u) Qk(xk, u) uk uk Qk(xk, u) � Qk(xk, u)
x0 xk x1k+1 x2
k+1 x3k+1 x4
k+1 States xN Base Heuristic ik States ik+1 States ik+2
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
1
Iteration Index k PI index k Jµk J⇤ 0 1 2 . . . Error Zone Width (✏ + 2↵�)/(1 � ↵)2
Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)
Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk)
Control v (j, v) Cost = 0 State-Control Pairs Transitions under policy µ Evaluate Cost Function
Variable Length Rollout Selective Depth Rollout Policy µ Adaptive Simulation Terminal Cost Function
Limited Rollout Selective Depth Adaptive Simulation Policy µ Approximation J
u Qk(xk, u) Qk(xk, u) uk uk Qk(xk, u) � Qk(xk, u)
x0 xk x1k+1 x2
k+1 x3k+1 x4
k+1 States xN Base Heuristic ik States ik+1 States ik+2
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
1
Iteration Index k PI index k Jµk J⇤ 0 1 2 . . . Error Zone Width (✏ + 2↵�)/(1 � ↵)2
Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)
Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk)
Control v (j, v) Cost = 0 State-Control Pairs Transitions under policy µ Evaluate Cost Function
Variable Length Rollout Selective Depth Rollout Policy µ Adaptive Simulation Terminal Cost Function
Limited Rollout Selective Depth Adaptive Simulation Policy µ Approximation J
u Qk(xk, u) Qk(xk, u) uk uk Qk(xk, u) � Qk(xk, u)
x0 xk x1k+1 x2
k+1 x3k+1 x4
k+1 States xN Base Heuristic ik States ik+1 States ik+2
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
1
System: x_{k+1} = f_k(x_k, u_k, w_k), with random "disturbance" w_k (e.g., physical noise, market uncertainties, demand for inventory, unpredictable breakdowns, etc.)

Cost function: E{ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) }

Policies π = {µ_0, …, µ_{N−1}}, where µ_k is a "closed-loop control law" or "feedback policy": a function of x_k, i.e., a "lookup table" for the control u_k = µ_k(x_k) to apply at x_k.

An important point: using feedback (i.e., choosing controls with knowledge of the state) is beneficial in view of the stochastic nature of the problem.

For a given initial state x_0, minimize over all π = {µ_0, …, µ_{N−1}} the cost

J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(x_k), w_k) }

Optimal cost function: J*(x_0) = min_π J_π(x_0). Optimal policy π*: one attaining J_{π*}(x_0) = J*(x_0).
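As an illustration of the cost of a fixed policy, the formula for J_π(x_0) above can be estimated by Monte Carlo simulation. The inventory-style system, stage cost, and the "order up to 3" policy below are hypothetical choices made for this sketch, not from the slides.

```python
import random

N = 10   # horizon length (illustrative)

def f(x, u, w):
    # inventory dynamics: stock + order - demand, floored at 0
    return max(0, x + u - w)

def g(x, u, w):
    # stage cost: ordering cost plus distance of next stock from a target of 3
    return u + abs(x + u - w - 3)

def gN(x):
    # terminal cost (taken to be zero here)
    return 0.0

def mu(k, x):
    # a fixed feedback policy: order up to level 3
    return max(0, 3 - x)

def J_pi(x0, n_samples=20000, seed=0):
    # Monte Carlo estimate of J_pi(x0) = E{ gN(xN) + sum_k g_k(x_k, mu_k(x_k), w_k) }
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(k, x)
            w = rng.randint(0, 4)     # random demand, uniform on {0,...,4}
            cost += g(x, u, w)
            x = f(x, u, w)
        total += cost + gN(x)
    return total / n_samples

print(J_pi(5))   # estimated cost of this policy from x0 = 5
```

Note that the simulation evaluates one fixed policy π; finding the optimal policy is the job of the DP algorithm on the next slide.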
The Stochastic DP Algorithm

Produces the optimal costs J*_k(x_k) of the tail subproblems that start at x_k.

Start with J*_N(x_N) = g_N(x_N), and for k = N−1, …, 0, let

J*_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + J*_{k+1}( f_k(x_k, u_k, w_k) ) },  for all x_k.

The optimal cost J*(x_0) is obtained at the last step: J*_0(x_0) = J*(x_0).

The optimal policy component µ*_k can be constructed simultaneously with J*_k, and consists of the minimizing u*_k = µ*_k(x_k) above.

Alternative on-line implementation of the optimal policy, given J*_1, …, J*_{N−1}:

Sequentially, going forward, for k = 0, 1, …, N−1, observe x_k and apply

u*_k ∈ arg min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + J*_{k+1}( f_k(x_k, u_k, w_k) ) }.

Issues: Need to know J*_{k+1}, compute the expectation for each u_k, minimize over all u_k.

Approximation in value space: Use J̃_k in place of J*_k; approximate E{·} and min_{u_k}.
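The backward recursion above can be sketched directly for a small finite problem. The inventory-style system, stage cost, and discretization below are illustrative assumptions; the recursion itself follows the DP algorithm on this slide.

```python
# Backward stochastic DP on a small discretized problem:
# J*_N = gN, then J*_k(x) = min_u E_w{ g(x,u,w) + J*_{k+1}(f(x,u,w)) }.
N = 10
STATES = range(21)                       # truncated stock levels 0..20
CONTROLS = range(6)                      # order quantities 0..5
DEMANDS = [(w, 0.2) for w in range(5)]   # demand w with probability 0.2 each

def f(x, u, w): return min(20, max(0, x + u - w))
def g(x, u, w): return u + abs(x + u - w - 3)
def gN(x): return 0.0

J = {x: gN(x) for x in STATES}           # J*_N
policy = [dict() for _ in range(N)]      # policy[k][x] = mu*_k(x)
for k in reversed(range(N)):             # k = N-1, ..., 0
    Jk = {}
    for x in STATES:
        best_u, best = None, float("inf")
        for u in CONTROLS:
            # Q-factor: expected stage cost plus optimal cost-to-go
            q = sum(p * (g(x, u, w) + J[f(x, u, w)]) for w, p in DEMANDS)
            if q < best:
                best, best_u = q, u
        Jk[x], policy[k][x] = best, best_u
    J = Jk                               # J now holds J*_k

print(J[5])                              # J*_0(5), the optimal cost from x0 = 5
```

The nested loops make the two computational issues on this slide concrete: the state and control spaces must be enumerated, and the expectation must be computed for every (x, u) pair.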
A Very Favorable Case: Linear-Quadratic Problems

An example of a linear-quadratic problem: keep the car velocity constant (like oversimplified cruise control): x_{k+1} = x_k + b u_k + w_k

Here x_k = v_k − v is the deviation of the vehicle's velocity v_k at time k from the desired level v, and b is given

u_k is unconstrained; w_k has zero mean and variance σ²

Cost over N stages: q x_N² + Σ_{k=0}^{N−1} (q x_k² + r u_k²), where q > 0 and r > 0 are given

Consider a more general problem where the system is x_{k+1} = a x_k + b u_k + w_k

The DP algorithm starts with J*_N(x_N) = q x_N², and generates J*_k according to

J*_k(x_k) = min_{u_k} E_{w_k}{ q x_k² + r u_k² + J*_{k+1}(a x_k + b u_k + w_k) },  k = N−1, …, 0

The DP algorithm can be carried out in closed form to yield J*_k(x_k) = K_k x_k² + const and µ*_k(x_k) = L_k x_k: K_k and L_k can be explicitly computed

The solution does not depend on the distribution of w_k as long as it has zero mean: certainty equivalence (a common approximation idea for other problems)
Derivation - DP Algorithm Starting from Terminal Cost J*_N(x) = q x²

J*_{N−1}(x_{N−1}) = min_{u_{N−1}} E{ q x_{N−1}² + r u_{N−1}² + J*_N(a x_{N−1} + b u_{N−1} + w_{N−1}) }
= min_{u_{N−1}} E{ q x_{N−1}² + r u_{N−1}² + q (a x_{N−1} + b u_{N−1} + w_{N−1})² }
= min_{u_{N−1}} [ q x_{N−1}² + r u_{N−1}² + q (a x_{N−1} + b u_{N−1})² + 2q E{w_{N−1}} (a x_{N−1} + b u_{N−1}) + q E{w_{N−1}²} ]   (with E{w_{N−1}} = 0, E{w_{N−1}²} = σ²)
= q x_{N−1}² + min_{u_{N−1}} [ r u_{N−1}² + q (a x_{N−1} + b u_{N−1})² ] + q σ²

Minimize by setting the derivative to zero: 0 = 2 r u_{N−1} + 2 q b (a x_{N−1} + b u_{N−1}), to obtain

µ*_{N−1}(x_{N−1}) = L_{N−1} x_{N−1},  with L_{N−1} = − a b q / (r + b² q),

and by substitution, J*_{N−1}(x_{N−1}) = P_{N−1} x_{N−1}² + q σ², where P_{N−1} = a² r q / (r + b² q) + q.

Similarly, going backwards, we obtain for all k:

J*_k(x_k) = P_k x_k² + σ² Σ_{m=k}^{N−1} P_{m+1},  µ*_k(x_k) = L_k x_k,

P_k = a² r P_{k+1} / (r + b² P_{k+1}) + q,  L_k = − a b P_{k+1} / (r + b² P_{k+1})
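The backward recursions for P_k and L_k above can be computed in a few lines; the numerical values of a, b, q, r, σ² below are arbitrary illustrative choices.

```python
# Riccati-style backward recursion for the scalar linear-quadratic problem
#   x_{k+1} = a x_k + b u_k + w_k,  cost  q x_N^2 + sum_k (q x_k^2 + r u_k^2):
#   P_N = q,
#   P_k = a^2 r P_{k+1} / (r + b^2 P_{k+1}) + q,
#   L_k = -a b P_{k+1} / (r + b^2 P_{k+1}).
a, b, q, r, sigma2, N = 1.0, 0.5, 1.0, 0.25, 0.04, 20   # illustrative values

P = [0.0] * (N + 1)
L = [0.0] * N
P[N] = q
for k in reversed(range(N)):
    P[k] = a * a * r * P[k + 1] / (r + b * b * P[k + 1]) + q
    L[k] = -a * b * P[k + 1] / (r + b * b * P[k + 1])

# Optimal cost from x0:  J*_0(x0) = P_0 x0^2 + sigma2 * (P_1 + ... + P_N)
x0 = 2.0
J0 = P[0] * x0 ** 2 + sigma2 * sum(P[1:])
print(P[0], L[0], J0)
```

Observe that the noise variance sigma2 enters only the additive constant in J*_0, not the gains L_k, in line with the certainty equivalence remark.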
Linear-Quadratic Problems in General

Observations and generalizations:

The solution does not depend on the distribution of w_k, only on its mean (which is 0), i.e., we have certainty equivalence

Generalization to multidimensional problems, nonzero-mean disturbances, etc.

Generalization to infinite horizon

Generalization to problems where the state is observed partially through linear measurements: the optimal policy involves an extended form of certainty equivalence,

u_k = L_k E{x_k | measurements},

where E{x_k | measurements} is provided by an estimator (e.g., a Kalman filter)

Linear systems and quadratic cost are a starting point for other lines of investigation and approximation:
  Problems with safety/state constraints [Model Predictive Control (MPC)]
  Problems with control constraints (MPC)
  Unknown or changing system parameters (adaptive control)
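Certainty equivalence can be checked numerically: minimizing the stage-(N−1) expected cost by brute force over a control grid gives the same minimizer for two different zero-mean disturbance distributions, matching the closed-form u* = −abq x/(r + b²q) from the derivation. The specific numbers below are illustrative.

```python
# Certainty equivalence, numerically: the minimizer of
#   E{ q x^2 + r u^2 + q (a x + b u + w)^2 }
# depends on the distribution of w only through its mean.
a, b, q, r, x = 1.0, 0.5, 1.0, 0.25, 2.0     # illustrative values

def minimizer(w_dist):
    # w_dist: list of (value, probability) pairs with zero mean
    us = [i * 1e-3 for i in range(-4000, 4001)]   # control grid on [-4, 4]
    def ecost(u):
        return sum(p * (q*x*x + r*u*u + q*(a*x + b*u + w)**2) for w, p in w_dist)
    return min(us, key=ecost)

small_noise = [(-1.0, 0.25), (0.0, 0.5), (1.0, 0.25)]   # zero mean, variance 0.5
large_noise = [(-3.0, 0.5), (3.0, 0.5)]                 # zero mean, variance 9
u_closed = -a * b * q * x / (r + b * b * q)

print(minimizer(small_noise), minimizer(large_noise), u_closed)
```

Both grid minimizers agree (up to the grid spacing) with the closed-form control, even though the two noise distributions have very different variances.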
Approximation in Value Space - The Three Approximations

At state x_k, ℓ-step lookahead minimization:

min_{u_k, µ_{k+1}, …, µ_{k+ℓ−1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, µ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

The first ℓ steps are treated explicitly; the "future" is summarized by the cost approximation J̃_{k+ℓ}.

The three approximations:

DP minimization (could be simplified): in the one-step lookahead case, min_{u_k} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(x_{k+1}) }

Approximation of E{·}: replace E{·} with nominal values (certainty equivalent control), or use limited simulation (Monte Carlo tree search)

Computation of J̃_{k+ℓ} (could be approximate): simple choices, parametric approximation, problem approximation, rollout

[Figure residue: selective-depth lookahead tree, neural-network cost approximation r′φ(x, v) with linear and sigmoidal layers, feature extraction (material balance, mobility, safety) for position evaluation.]
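A minimal sketch of the one-step lookahead case, with an approximation J̃ in place of J*_{k+1}; the inventory-style system and the particular J̃ are hypothetical choices for illustration.

```python
# One-step lookahead with an approximation Jtilde in place of J*_{k+1}:
#   u_k in argmin_u  E_w{ g(x, u, w) + Jtilde(f(x, u, w)) }.
DEMANDS = [(w, 0.2) for w in range(5)]   # demand distribution (illustrative)

def f(x, u, w): return min(20, max(0, x + u - w))
def g(x, u, w): return u + abs(x + u - w - 3)

def Jtilde(x):
    # hypothetical cost-to-go approximation: penalize distance from target stock
    return 2.0 * abs(x - 3)

def lookahead_control(x, controls=range(6)):
    def q_value(u):
        # approximate Q-factor of (x, u)
        return sum(p * (g(x, u, w) + Jtilde(f(x, u, w))) for w, p in DEMANDS)
    return min(controls, key=q_value)

print([lookahead_control(x) for x in range(6)])
```

All three approximations are visible here: the minimization is over a small control set, the expectation is over an explicit demand distribution (either could itself be approximated), and the cost-to-go is the hand-crafted J̃ rather than J*.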
Infinite horizon (discounted) version:

min_{u ∈ U(x)} Σ_{y=1}^{n} p_{xy}(u) ( g(x, u, y) + α J̃(y) )

with the same three approximations: simplified minimization, expected value approximation, and cost-to-go approximation J̃.
Important variants: use multistep lookahead, replace E{·} by limited simulation (e.g., a "certainty equivalent" of w_k), multiagent rollout (for multicomponent control problems)

An example: truncated rollout with base policy and terminal cost approximation (however obtained, e.g., by off-line training)
[Figure: truncated horizon rollout — run the base policy over a truncated horizon, then apply a terminal cost approximation J̃.]
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization xk
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization xk Possible States xk xk+m+1
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization xk Possible States xk xk+m+1
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1
Termination State Constraint Set X X = X X Multiagent
Current Partial Folding
Current Partial Folding Moving Obstacle
Complete Folding Corresponding to Open
Expert
Rollout with Base Policy m-Step
Approximation of E{·}: Approximate minimization:
minu∈U(x)
n∑
y=1
pxy(u)(g(x, u, y) + αJ(y)
)
x1k, u1
k u2k x2
k dk τ
Q-factor approximation
u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)
)∈ C
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u1 x2 u2 x3
x0 u∗0 x∗
1 u∗1 x∗
2 u∗2 x∗
3 u0 x1 u1 x1
High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice
Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5
Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial
Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2
One-Step or Multistep Lookahead for stages Possible Terminal Cost
Approximation in Policy Space Heuristic Cost Approximation for
for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)
Approximate Q-Factor Q(x, u) At x Approximation J
minu∈U(x)
Ew
{g(x, u, w) + αJ
(f(x, u, w)
)}
Truncated Rollout Policy µ m Steps
1
Sec. 1.3 Stochastic Dynamic Programming 27
Similar to the deterministic case, Q-learning involves the calculation of either the optimal Q-factors (1.16) or approximations Qk(xk, uk). The approximate Q-factors may be obtained using approximation in value space schemes, and can be used to obtain approximately optimal policies through the Q-factor minimization

µk(xk) ∈ arg min uk∈Uk(xk) Qk(xk, uk). (1.17)

In Chapter 4, we will discuss the use of neural networks in such approximations.
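As a concrete sketch of the minimization (1.17) over a finite control set (the control set, the quadratic Q-factor, and all names below are hypothetical illustrations, not from the text):

```python
def q_factor_policy(x, controls, q_approx):
    """Return a control minimizing an approximate Q-factor at state x,
    i.e., the minimization in Eq. (1.17) over a finite control set.

    q_approx(x, u) can be any approximation of the optimal Q-factor,
    e.g., the output of a trained neural network (cf. Chapter 4)."""
    return min(controls, key=lambda u: q_approx(x, u))

# Hypothetical quadratic Q-factor for a scalar example: q(x, u) = (x + u)^2
q = lambda x, u: x**2 + u**2 + 2.0 * x * u
u_star = q_factor_policy(1.0, [-1.0, -0.5, 0.0, 0.5, 1.0], q)  # u_star = -1.0
```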
Cost Versus Q-factor Approximations - Robustness and On-Line Replanning
We have seen that it is possible to implement approximation in value space by using cost function approximations [cf. Eq. (1.15)] or by using Q-factor approximations [cf. Eq. (1.17)], so the question arises which one to use in a given practical situation. One important consideration is the ease of obtaining suitable cost or Q-factor approximations. This depends largely on the problem and also on the availability of data on which the approximations can be based. However, there are some other major considerations.
In particular, the cost function approximation scheme

µk(xk) ∈ arg min uk∈Uk(xk) E{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },
has an important disadvantage: the expected value above needs to be computed on-line for all uk ∈ Uk(xk), and this may involve substantial computation. On the other hand, it also has an important advantage in situations where the system function fk, the cost per stage gk, or the control constraint set Uk(xk) can change as the system is operating. We will discuss in more detail how this situation can arise in practice later in this chapter. Assuming that the new values of fk, gk, or Uk(xk) become known to the controller, on-line replanning may be used, as discussed earlier for deterministic problems. This may substantially improve the robustness of the approximation in value space scheme.
By comparison, the Q-factor function approximation scheme (1.17) does not allow for on-line replanning. On the other hand, for problems where there is no need for on-line replanning, the Q-factor approximation scheme does not require the on-line computation of expected values, and may allow for a much faster on-line computation of the minimizing control µk(xk) via Eq. (1.17).
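The trade-off can be made concrete with a small sketch (all functions and names below are hypothetical stand-ins): the cost approximation scheme pays for an on-line expectation but lets fk and gk be swapped out at run time, while the Q-factor scheme is a cheap lookup with no replanning.

```python
def lookahead_control(x, controls, f, g, J_next, w_samples):
    """One-step lookahead with the expectation over w estimated from samples.
    Because f and g are evaluated on-line, they can be replaced on the fly
    (on-line replanning), at the price of on-line computation for every u."""
    def q(u):
        return sum(g(x, u, w) + J_next(f(x, u, w)) for w in w_samples) / len(w_samples)
    return min(controls, key=q)

def q_table_control(x, controls, q_table):
    """Q-factor scheme (1.17): a fast lookup (here a dict standing in for a
    trained approximator), but tied to the model used to build the Q-factors,
    so it does not support on-line replanning."""
    return min(controls, key=lambda u: q_table[(x, u)])
```

If the model changes, `lookahead_control` adapts by passing in the new f and g; `q_table_control` would require recomputing the Q-factors off-line.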
1.3.2 Infinite Horizon Problems - An Overview
We will now provide an outline of infinite horizon stochastic DP with an emphasis on its aspects that relate to our RL/approximation methods.
Bertsekas Reinforcement Learning 17 / 29
Page 15
Let’s Take a 15-min Working Break: Catch your Breath, Collect your Questions, and Consider the Following Challenge Puzzle
A chess match puzzle
A chess player plays a two-game match against a chess computer program.
A win counts for 1, a draw counts for 1/2, and a loss counts for 0, for both player and computer.
“Sudden death" games are played if the score is tied at 1-1 after the two games.The chess player can choose to play each game in one of two possible styles:
I Bold play (wins with probability pw < 1/2 and loses with probability 1 − pw ) orI Timid play (draws with probability pd < 1 and loses with probability 1 − pd ).
The style for the 2nd game is chosen after seeing the outcome of the 1st game.
Note that the player plays worse than the computer (on average), regardless of the chosen style of play, and must play bold in at least one game to have any chance to win the match.
Speculate on the optimal policy of the player.
Is it possible for the player to have a better than 50-50 chance to win the match,even though the computer is the better player?
Bertsekas Reinforcement Learning 18 / 29
Page 16
Answer: Depending on pw and pd , Player’s Win Prob. May be > 1/2
[Figure residue from the answer slide. Recoverable content: the DP tree of the two-game match, with 1st-game and 2nd-game branches for timid play (probabilities pd, 1 − pd) and bold play (pw, 1 − pw), and scores 0 − 0, 1 − 0, 0 − 1, 2 − 0, 1.5 − 0.5, 1 − 1, 0.5 − 1.5, 0 − 2; also a plot of the match win probability against pw, marking the sudden-death case and the optimal style choice.]
The optimal policy: Play bold in the 1st game. Then play bold again if the 1st game is lost, and timid if the 1st game is won (see the full DP solution in Bertsekas’ DP textbook, Vol. I, Chapter 1; available from Google Books).
Example: For pw = 0.45 and pd = 0.9, the optimal style-of-play policy gives a match win probability of roughly 0.53 (a simple DP calculation that you can try).
Intuition: The player can use feedback, while the computer cannot.
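The “simple DP calculation” mentioned above fits in a few lines. This is a sketch under the slide’s assumptions; the one added step is the sudden-death value: timid play can never produce a win there, so bold play is optimal in sudden death and its value is pw.

```python
def match_win_prob(pw, pd):
    """Match win probability under the optimal style-of-play policy,
    computed by DP over the player's score lead after each game.

    pw: win probability of bold play (loses otherwise)
    pd: draw probability of timid play (loses otherwise)"""
    # Sudden death at 1-1: timid never wins, so bold is optimal, value pw.
    v_sd = pw

    def terminal(lead):              # value of a score lead after game 2
        if lead > 0:
            return 1.0
        return v_sd if lead == 0 else 0.0

    def game2(lead):                 # optimal value before game 2
        bold = pw * terminal(lead + 1) + (1 - pw) * terminal(lead - 1)
        timid = pd * terminal(lead) + (1 - pd) * terminal(lead - 1)
        return max(bold, timid)

    bold = pw * game2(1) + (1 - pw) * game2(-1)
    timid = pd * game2(0) + (1 - pd) * game2(-1)
    return max(bold, timid)

print(match_win_prob(0.45, 0.9))     # ≈ 0.536625 > 1/2
```

Inspecting the two `max` comparisons recovers the stated policy: after a 1st-game win, timid dominates (pd · 1 + (1 − pd) · pw > pw + (1 − pw) · pw here); after a loss, only bold can save the match.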
Bertsekas Reinforcement Learning 19 / 29
Page 17
Infinite Horizon Problems
Infinite number of stages, and stationary system and cost
System xk+1 = f(xk, uk, wk) with state, control, and random disturbance.
Policies π = {µ0, µ1, . . .} with µk(x) ∈ U(x) for all x and k.
Cost of stage k: α^k g(xk, µk(xk), wk).
Cost of a policy π = {µ0, µ1, . . .}: The limit as N → ∞ of the N-stage costs
Jπ(x0) = lim_{N→∞} E_{wk}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }
0 < α ≤ 1 is the discount factor. If α < 1 the problem is called discounted.
Optimal cost function J∗(x0) = minπ Jπ(x0).
Problems with α = 1 typically include a special cost-free termination state t. The objective is to reach (or approach) t at minimum expected cost.
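The definition of Jπ as a limit of N-stage costs can be illustrated numerically. The sketch below estimates the N-stage discounted cost of a fixed policy on a hypothetical two-state chain (the dynamics, costs, and policy are invented for illustration); as N grows the estimates stabilize, since the neglected tail is bounded by α^N max|g|/(1 − α).

```python
import random

# Monte Carlo estimate of the N-stage discounted cost of a fixed policy on a
# hypothetical two-state chain (dynamics/costs invented for illustration).
alpha = 0.9

def step(x):
    # Transition and stage cost under the fixed policy (assumed data):
    # from state 0: stay w.p. 0.9, cost 1; from state 1: stay w.p. 0.8, cost 2.
    if x == 0:
        return (0 if random.random() < 0.9 else 1), 1.0
    return (1 if random.random() < 0.8 else 0), 2.0

def n_stage_cost(x0, N):
    x, total = x0, 0.0
    for k in range(N):
        x, g = step(x)
        total += alpha**k * g
    return total

random.seed(0)
for N in (10, 50, 200):
    est = sum(n_stage_cost(0, N) for _ in range(2000)) / 2000
    print(N, round(est, 2))  # the estimates stabilize as N grows
```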
Bertsekas Reinforcement Learning 21 / 29
Page 18
Infinite Horizon Problems - The Three Theorems
Intuition: N-stage opt. costs → infinite horizon opt. cost
Apply DP; let V_{N−k}(x) be the optimal cost-to-go starting at x with k stages to go:
V_{N−k}(x) = min_{u∈U(x)} E_w{ α^{N−k} g(x, u, w) + V_{N−k+1}(f(x, u, w)) },  V_N(x) ≡ 0
Define J_k(x) = V_{N−k}(x)/α^{N−k}, i.e., reverse the time index and divide by α^{N−k}:
J_k(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J_{k−1}(f(x, u, w)) },  J_0(x) ≡ 0   (DP)
J_N(x) is equal to V_0(x), the N-stage optimal cost starting from x
So for any k, J_k(x) = k-stage optimal cost starting from x. Intuitively:
J∗(x) = lim_{k→∞} J_k(x), for all x
J∗ satisfies Bellman’s equation: Take the limit in Eq. (DP)
J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J∗(f(x, u, w)) }, for all x
Optimality condition: Let µ∗(x) attain the min in the Bellman equation for all x
The policy {µ∗, µ∗, . . .} is optimal. (This type of policy is called stationary.)
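For a finite-state, finite-control problem, the recursion (DP) and its limit can be checked directly. A minimal sketch on a hypothetical 2-state, 2-control discounted problem (all transition probabilities and costs are invented for illustration):

```python
import numpy as np

# Value iteration on a hypothetical 2-state, 2-control discounted problem
# (transition matrices P[u] and stage costs g[u] invented for illustration).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.7, 0.3]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}
alpha = 0.9

J = np.zeros(2)          # J0 ≡ 0
for _ in range(1000):    # Jk(x) = min_u [ g(x,u) + alpha * sum_y P(y|x,u) Jk-1(y) ]
    J = np.min([g[u] + alpha * P[u] @ J for u in (0, 1)], axis=0)

# The limit satisfies Bellman's equation: one more VI step changes nothing
TJ = np.min([g[u] + alpha * P[u] @ J for u in (0, 1)], axis=0)
print(np.max(np.abs(TJ - J)))  # essentially 0
```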
Bertsekas Reinforcement Learning 22 / 29
Page 19
Infinite Horizon Problems - Algorithms
Value iteration (VI): Generates finite horizon opt. cost function sequence {Jk}
J_k(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J_{k−1}(f(x, u, w)) },  J_0 is “arbitrary"
Policy Iteration (PI): Generates sequences of policies {µk} and their cost functions {Jµk}; µ0 is “arbitrary"
The typical iteration starts with a policy µ and generates a new policy µ̄ in two steps:
Policy evaluation step, which computes the cost function Jµ of the (base) policy µ
Policy improvement step, which computes the improved (rollout) policy µ̄ using the one-step lookahead minimization
µ̄(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + α Jµ(f(x, u, w)) }
There are several options for policy evaluation to compute Jµ:
Solve Bellman’s equation for µ [Jµ(x) = E{g(x, µ(x), w) + α Jµ(f(x, µ(x), w))}] by using VI or another method (it is linear in Jµ)
Use simulation (on-line Monte Carlo, Temporal Difference (TD) methods)
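As an illustration, exact PI on a hypothetical 2-state, 2-control discounted problem (all data invented): policy evaluation solves the linear Bellman equation for µ, and policy improvement performs the one-step lookahead minimization.

```python
import numpy as np

# Exact policy iteration on a hypothetical 2-state, 2-control discounted
# problem (transition matrices P[u] and stage costs g[u] invented).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.7, 0.3]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}
alpha, n = 0.9, 2

mu = np.array([0, 0])                    # "arbitrary" initial policy
while True:
    # Policy evaluation: solve the linear system Jmu = g_mu + alpha * P_mu Jmu
    Pmu = np.array([P[mu[x]][x] for x in range(n)])
    gmu = np.array([g[mu[x]][x] for x in range(n)])
    Jmu = np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)
    # Policy improvement: one-step lookahead minimization using Jmu
    Q = np.array([g[u] + alpha * P[u] @ Jmu for u in (0, 1)])
    new_mu = np.argmin(Q, axis=0)
    if np.array_equal(new_mu, mu):       # mu already attains the min: stop
        break
    mu = new_mu

print(mu, Jmu)
```

On a finite problem exact PI terminates after finitely many iterations, with Jmu equal to the optimal cost J∗.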
Bertsekas Reinforcement Learning 23 / 29
Page 20
Exact and Approximate Policy Iteration
[Figures: exact PI alternates base policy evaluation and one-step lookahead policy improvement (rollout policy µ̄ from base policy µ); approximate PI uses a value network for approximate policy evaluation (Jµ in place of J∗ in the Bellman equation) and a policy network, a data-trained classifier on state-control pairs, for approximate policy improvement.]
Important facts (to be discussed later):
PI yields in the limit an optimal policy
PI is faster than VI; it can be viewed as Newton’s method for solving Bellman’s Eq.
PI can be implemented approximately, with a value and (perhaps) a policy network
Bertsekas Reinforcement Learning 24 / 29
Page 21
A More Abstract Notational View
Bellman’s equation, VI, and PI can be written using Bellman operatorsRecall Bellman’s equation
J∗(x) = minu∈U(x)
Ew
{g(x , u,w) + αJ∗
(f (x , u,w)
)}, for all x
It can be written as a fixed point equation: J∗(x) = (TJ∗)(x), where T is the Bellmanoperator that transforms a function J(·) into a function (TJ)(·)
(TJ)(x) = minu∈U(x)
Ew
{g(x , u,w) + αJ
(f (x , u,w)
)}, for all x
Shorthand theory using Bellman operators:VI is the fixed point iteration Jk+1 = TJk
There is a Bellman operator Tµ for any policy µ and corresponding Bellman Eq.Jµ(x) = (TµJµ)(x) = E{g(x , µ(x),w) + αJµ(f (x , µ(x),w))}PI is written compactly as Jµk = Tµk Jµk (policy evaluation) and Tµk+1 Jµk = TJµk
(policy improvement)
The abstract view is very useful for theoretical analysis, intuition, and visualization
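As a concrete illustration (not from the slides), the operators T and Tµ can be coded directly for a small hypothetical two-state discounted MDP; VI is then literal fixed point iteration with T, and PI alternates exact policy evaluation with a greedy improvement step. All transition probabilities and costs below are made up for illustration.

```python
import numpy as np

# A small hypothetical discounted MDP with n = 2 states and two controls.
# P[u] is the transition matrix under control u; g[u] is the stage-cost vector.
n, alpha = 2, 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 0.5])}

def T(J):
    """Bellman operator: (TJ)(x) = min_u E{ g(x,u,w) + alpha * J(f(x,u,w)) }."""
    return np.min([g[u] + alpha * P[u] @ J for u in P], axis=0)

def T_mu(J, mu):
    """Bellman operator for a fixed policy mu (an array with one control per state)."""
    return np.array([g[mu[x]][x] + alpha * P[mu[x]][x] @ J for x in range(n)])

def greedy(J):
    """Policy attaining the minimum in TJ (the policy improvement step)."""
    return np.argmin([g[u] + alpha * P[u] @ J for u in P], axis=0)

# VI is the fixed point iteration J_{k+1} = T J_k; it converges to J* = T J*.
J = np.zeros(n)
for _ in range(500):
    J = T(J)

# PI: policy evaluation solves the linear system J_mu = T_mu J_mu exactly,
# then policy improvement takes the greedy policy with respect to J_mu.
mu = np.zeros(n, dtype=int)
for _ in range(10):
    Pm = np.array([P[mu[x]][x] for x in range(n)])
    gm = np.array([g[mu[x]][x] for x in range(n)])
    J_mu = np.linalg.solve(np.eye(n) - alpha * Pm, gm)   # J_mu = T_mu J_mu
    mu = greedy(J_mu)                                    # T_{mu_{k+1}} J_mu = T J_mu
```

Both loops converge to the same fixed point of T, which is why the operator notation is convenient: Bellman's equation, VI, and PI are all statements about T and Tµ.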
Bertsekas Reinforcement Learning 25 / 29
Page 22
Deterministic Linear Quadratic Problem - Infinite Horizon, Undiscounted
Linear system xk+1 = a xk + b uk; quadratic cost per stage g(x, u) = qx2 + ru2

Bellman equation: J(x) = minu{ qx2 + ru2 + J(ax + bu) }

Finite horizon results (quadratic optimal cost, linear optimal policy) suggest:
J∗(x) = K∗x2, where K∗ is some positive scalar
The optimal policy has the form µ∗(x) = L∗x, where L∗ is some scalar

To characterize K∗ and L∗, we plug J(x) = Kx2 into the Bellman equation:

    Kx2 = minu{ qx2 + ru2 + K(ax + bu)2 } = · · · = F(K)x2

where F(K) = a2rK/(r + b2K) + q, with the minimizing u equal to −(abK/(r + b2K)) x

Thus the Bellman equation is solved by J∗(x) = K∗x2, with K∗ being a solution of the Riccati equation

    K∗ = F(K∗) = a2rK∗/(r + b2K∗) + q

and the optimal policy is linear:

    µ∗(x) = L∗x,  with L∗ = −abK∗/(r + b2K∗)

Bertsekas Reinforcement Learning 26 / 29
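For this scalar problem, the Riccati equation can be solved numerically by simple fixed point iteration, which is exactly VI restricted to quadratic functions J(x) = Kx2. A minimal sketch with hypothetical coefficients, chosen so the open-loop system is unstable (a > 1):

```python
# Fixed point iteration for the scalar Riccati equation K = F(K).
# This is VI on the space of quadratic functions J(x) = K x^2.
a, b, q, r = 1.2, 1.0, 1.0, 0.5        # hypothetical coefficients, a > 1 (unstable)

def F(K):
    return a**2 * r * K / (r + b**2 * K) + q

K = 0.0
for _ in range(200):
    K = F(K)                            # K_{k+1} = F(K_k) converges to K*

L = -a * b * K / (r + b**2 * K)         # optimal linear policy mu*(x) = L x
```

For these coefficients the iteration converges to K∗ ≈ 1.544, and the closed-loop coefficient a + bL has magnitude below 1, i.e., the optimal policy stabilizes the system.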
Page 23
Graphical Solution of Riccati Equation
[Figure: graphical solution of the Riccati equation K = F(K), where F(K) = a2rK/(r + b2K) + q. The graph of F intersects the 45-degree line at the fixed point K∗, and the VI iterates Kk+1 = F(Kk) move along the graph toward K∗. A linear policy u = Lx corresponds to the linearized map FL(K) = (a + bL)2K + q + rL2, whose fixed point is the cost coefficient of that policy; stable policies have slope (a + bL)2 below 1 (stability region), unstable policies do not.]
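The quantities in this figure can be checked numerically (this sketch is not from the slides; the coefficients are hypothetical): a stable linear policy u = Lx has cost KL x2, where KL is the fixed point of the linearized map FL(K) = (a + bL)2K + q + rL2, and one-step lookahead using that cost yields an improved (rollout) policy.

```python
# Policy evaluation and rollout improvement for the scalar LQ problem.
# A stable linear policy u = Lx has cost J_L(x) = K_L x^2, where K_L is the
# fixed point of the linearized map F_L(K) = (a + bL)^2 K + q + r L^2.
a, b, q, r = 1.2, 1.0, 1.0, 0.5            # hypothetical coefficients

def policy_cost(L):
    c = (a + b * L) ** 2
    assert c < 1, "only stable policies have finite cost"
    return (q + r * L ** 2) / (1 - c)      # solves K = F_L(K), a linear equation

def rollout(L):
    """One-step lookahead using the cost coefficient of the base policy L."""
    K = policy_cost(L)
    return -a * b * K / (r + b ** 2 * K)   # minimizer of qx^2 + ru^2 + K(ax+bu)^2

L0 = -0.5                                   # a stabilizing base policy: |a + b*L0| < 1
L1 = rollout(L0)                            # improved (rollout) policy
```

Here the cost coefficient drops from about 2.21 for the base policy to about 1.55 for the rollout policy, close to the optimal K∗ ≈ 1.54: the cost improvement property of rollout.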
Bertsekas Reinforcement Learning 27 / 29
Page 24
Visualization of VI
[Figure: visualization of VI on the space of quadratic functions J(x) = Kx2, where Bellman's equation becomes the Riccati equation K = F(K). Value iteration generates Kk+1 = F(Kk), i.e., Jk+1(x) = F(Kk)x2, converging to K∗ within the region of stability, which is also the region of convergence of Newton's method for the Riccati equation. The cost of a rollout policy improves on the cost of the base policy.]
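This picture can also be probed numerically (a sketch with hypothetical coefficients, not from the slides): the VI iterates Kk+1 = F(Kk) converge linearly to K∗, while PI, interpreted in the next lecture as repeated Newton steps on the Riccati equation, converges in far fewer iterations. PI is started from a K large enough that the first improved policy is stable.

```python
# VI vs. PI for the scalar Riccati equation K = F(K), hypothetical coefficients.
a, b, q, r = 1.2, 1.0, 1.0, 0.5

def F(K):
    return a**2 * r * K / (r + b**2 * K) + q          # Bellman eq. on quadratics

def pi_step(K):
    """One PI step: improve the policy using K x^2, then evaluate it exactly."""
    L = -a * b * K / (r + b**2 * K)                   # improved linear policy
    return (q + r * L**2) / (1 - (a + b * L)**2)      # its exact cost coefficient

def iterate(step, K0, tol=1e-10, max_iter=1000):
    """Apply step until the update is below tol; return the limit and the count."""
    K, count = K0, 0
    while abs(step(K) - K) > tol and count < max_iter:
        K, count = step(K), count + 1
    return K, count

# Start from K0 = 1.0, for which the first improved policy is already stable.
K_vi, n_vi = iterate(F, 1.0)
K_pi, n_pi = iterate(pi_step, 1.0)
```

Both iterations find the same K∗, but PI needs only a handful of steps, reflecting the quadratic convergence of Newton's method versus the linear convergence of VI.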
Bertsekas Reinforcement Learning 28 / 29
Page 25
About the Next Lecture
Linear quadratic problems and Newton step interpretations
Approximation in value space as a Newton step for solving the Riccati equation
Rollout as a Newton step starting from the cost of the base policy
Policy Iteration as repeated Newton steps
Problem formulations and reformulations
How do we formulate DP models for practical problems?
Problems involving a terminal state (stochastic shortest path problems)
Problem reformulation by state augmentation (dealing with delays, correlations, forecasts, etc.)
Problems involving imperfect state observation (POMDP)
Multiagent problems - Nonclassical information patterns
Systems with unknown or changing parameters - Adaptive control
PLEASE READ SECTIONS 1.5 and 1.6 OF THE CLASS NOTES (AMAP)
1ST HOMEWORK (DUE IN ONE WEEK): Exercise 1.1 of the Class Notes
Bertsekas Reinforcement Learning 29 / 29