4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes

Part II: Stage decision problems
Recap
Optimal control formulation
• Dynamic model & cost function (transition diagram for discrete optimization problems).
• Computing an optimal policy vs computing an optimal path.

Dynamic programming algorithm
• Allows us to compute policies (to deal with uncertainty).
• Equivalent way to write it: the DP equation.
• Stochastic dynamic programming: computes a policy that minimizes an expected cost.

Alternative algorithms
• To compute optimal paths, alternative algorithms (e.g. Dijkstra's) may be more efficient.

Partial information
• When there is only partial information about the state, rely on the Bayes filter.
Goals of part II

Introduce optimal control concepts for stage decision problems.

|                              | Discrete optimization problems           | Stage decision problems                  |
|------------------------------|------------------------------------------|------------------------------------------|
| Formulation                  | Transition diagram                       | Dynamic system & additive cost function  |
| DP algorithm & stochastic DP | Graphical DP algorithm & DP equation     | DP equation                              |
| Alternative algorithms       | Dijkstra's algorithm                     | Static optimization                      |
| Partial information          | Bayes filter                             | Kalman filter                            |
| Application focus            | Operational research & computer science  | Digital control                          |
Outline
• Dynamic programming for stage decision problems
• Linear quadratic regulator
Stage decision problems

Dynamic model
$$x_{k+1} = f_k(x_k, u_k), \quad k \in \{0, \dots, h-1\}$$

Cost function
$$\sum_{k=0}^{h-1} g_k(x_k, u_k) + g_h(x_h)$$

• State and input live in arbitrary spaces: $x_k \in X_k$, $u_k \in U_k(x_k)$.
• If these spaces are discrete this is a discrete optimization problem.
• Typically $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ for every $k \in \{0, \dots, h-1\}$.
• Goals: find an optimal path and find an optimal policy.
Optimal path

• Given an initial condition $x_0$, a path $\{(x_0, u_0), \dots, (x_{h-1}, u_{h-1})\}$ is a set of decisions $\{u_0, u_1, \dots, u_{h-1}\}$ such that $u_k \in U_k(x_k)$ and the equations of the dynamic model are satisfied, i.e., $x_1 = f_0(x_0, u_0)$, $x_2 = f_1(x_1, u_1)$, ..., $x_h = f_{h-1}(x_{h-1}, u_{h-1})$.
• A path is said to be optimal if there does not exist another path with a strictly smaller cost.

[Figure: stages $0, 1, \dots, h$; at stage $k$ the state $x_k \in X_k$ incurs stage cost $g_k(x_k, u_k)$, and the terminal state $x_h \in X_h$ incurs terminal cost $g_h(x_h)$.]
Optimal policy

Policy
A policy is a set of functions $\pi = \{\mu_0, \dots, \mu_{h-1}\}$, $\mu_k : X_k \to U_k$.

Optimal policy
A policy is said to be optimal if for every state $x_\ell$ at every stage $\ell \in \{0, \dots, h-1\}$, $\mu_\ell(x_\ell)$ is the first action of an optimal path for the tail subproblem which considers only stages $\{\ell, \ell+1, \dots, h\}$ with initial condition $x_\ell$ and cost
$$\sum_{k=\ell}^{h-1} g_k(x_k, u_k) + g_h(x_h).$$
Dynamic programming algorithm

Start with $J_h(x_h) = g_h(x_h)$ for every $x_h \in X_h$ and, for each decision stage, starting from the last and moving backwards, $k \in \{h-1, h-2, \dots, 0\}$, compute $J_k$ and $\mu_k$ from

$$J_k(x_k) = \min_{u_k \in U_k(x_k)} g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) \quad \text{(DP equation)}$$

and $\mu_k(x_k) = \bar{u}_k$, where $\bar{u}_k$ is the minimizer in the DP equation, i.e.,

$$J_k(x_k) = g_k(x_k, \mu_k(x_k)) + J_{k+1}(f_k(x_k, \mu_k(x_k))).$$

Then $\{\mu_0, \dots, \mu_{h-1}\}$ is an optimal policy.

Theorem
The policy obtained with the DP algorithm is an optimal policy (proof in the appendix).
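Not part of the original slides: a minimal Matlab sketch of this backward recursion for the special case of finite state and input sets, where $J_k$ and $\mu_k$ can be stored as tables. The function name and the handle-based interface (fidx, gcost, gh) are our own illustrative assumptions, and the dynamics are taken time-invariant for brevity.

```matlab
% Minimal sketch of backward DP for FINITE state/input sets (illustration,
% not code from the slides). States and inputs are indexed 1..nx and 1..nu;
% fidx(i,j) returns the next-state index, gcost(k,i,j) the stage cost at
% stage k, and gh(i) the terminal cost.
function [J, mu] = dp_backward(fidx, gcost, gh, nx, nu, h)
    J  = inf(nx, h+1);            % column k+1 holds J_k (stage k = 0..h)
    mu = zeros(nx, h);            % column k+1 holds mu_k (stage k = 0..h-1)
    J(:, h+1) = gh(:);            % terminal condition J_h = g_h
    for k = h-1:-1:0              % backwards over stages
        for i = 1:nx
            for j = 1:nu
                c = gcost(k, i, j) + J(fidx(i, j), k+2);   % DP equation
                if c < J(i, k+1)
                    J(i, k+1) = c;  mu(i, k+1) = j;        % keep minimizer
                end
            end
        end
    end
end
```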
Simple integrator example

Consider the following simple example of a stage decision problem.

Dynamic model
$$x_{k+1} = x_k + u_k, \quad k \in \{0, 1\}$$

Cost
$$\sum_{k=0}^{1} \left( x_k^2 + u_k^2 \right) + g_2(x_2)$$

Terminal cost
• Quadratic: $g_2(x_2) = x_2^2$
• Non-quadratic: $g_2(x_2) = e^{x_2}$
Step 1: Quadratic terminal cost

With $x_2 = x_1 + u_1$,
$$J_1(x_1) = \min_{u_1} g_1(x_1, u_1) + g_2(x_2) = \min_{u_1} x_1^2 + u_1^2 + x_2^2 = \min_{u_1} \underbrace{2(x_1^2 + u_1^2 + x_1 u_1)}_{Q_{x_1}(u_1)}$$
where $Q_{x_1}(u_1)$ is a quadratic function of $u_1$.

[Plot: $Q_{x_1}(u_1)$ as a function of $u_1$, a convex parabola.]

How to compute the minimum? Differentiate and equate to zero to find the minimizer:
$$\frac{d}{du_1} Q_{x_1}(u_1) = 0 \iff 2(2u_1 + x_1) = 0 \iff u_1 = -\tfrac{1}{2} x_1$$

Replacing in $Q_{x_1}(u_1)$ we obtain the cost-to-go $J_1(x_1) = \tfrac{3}{2} x_1^2$.
Step 2: Quadratic terminal cost

With $x_1 = x_0 + u_0$,
$$J_0(x_0) = \min_{u_0} g_0(x_0, u_0) + J_1(x_1) = \min_{u_0} x_0^2 + u_0^2 + \tfrac{3}{2} x_1^2 = \min_{u_0} \tfrac{5}{2} x_0^2 + 3 u_0 x_0 + \tfrac{5}{2} u_0^2$$

Differentiating and equating to zero, we obtain a function belonging to the optimal policy
$$u_0 = -\tfrac{3}{5} x_0$$

which leads to the cost-to-go
$$J_0(x_0) = \tfrac{8}{5} x_0^2$$
Optimal policy and optimal path

Optimal policy
$$u_0 = -\tfrac{3}{5} x_0, \quad u_1 = -\tfrac{1}{2} x_1$$

Optimal path for $x_0 = 1$, computed by using $x_{k+1} = x_k + u_k$, $k \in \{0, 1\}$:
$$x_0 = 1, \quad u_0 = -\tfrac{3}{5}, \quad x_1 = \tfrac{2}{5}, \quad u_1 = -\tfrac{1}{5}, \quad x_2 = \tfrac{1}{5}$$

Optimal cost
$$J_0(1) = \tfrac{8}{5}$$
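As a sanity check (our addition, not in the original slides), the hand computation can be verified by brute force: grid over $u_0$ (the $10^{-4}$ resolution is an arbitrary choice), with the inner stage solved in closed form ($u_1 = -x_1/2$, so $J_1(x_1) = 1.5\,x_1^2$).

```matlab
% Brute-force check of the two-stage integrator example (illustrative).
x0 = 1;  bestJ = inf;  bestu0 = NaN;
for u0 = -2:1e-4:2
    x1 = x0 + u0;
    J = x0^2 + u0^2 + 1.5*x1^2;    % stage 0 cost plus cost-to-go J_1(x_1)
    if J < bestJ
        bestJ = J;  bestu0 = u0;
    end
end
fprintf('u0* = %.4f (expect -0.6), J0 = %.4f (expect 1.6)\n', bestu0, bestJ)
```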
Non-quadratic terminal cost

Let us try to apply the dynamic programming algorithm considering a non-quadratic terminal cost $g_2(x_2) = e^{x_2}$.

Step 1: with $x_2 = x_1 + u_1$,
$$J_1(x_1) = \min_{u_1} g_1(x_1, u_1) + g_2(x_2) = \min_{u_1} x_1^2 + u_1^2 + e^{x_2}$$

Differentiating and equating to zero, we obtain
$$2u_1 + e^{x_1 + u_1} = 0$$

We get stuck:
• this equation implicitly determines $u_1$ from $x_1$, but there is no explicit closed form;
• this implies that it is not easy to determine $u_1 = \mu_1(x_1)$ and move to step 2.
Discussion

Linear dynamic models, quadratic cost
• For these problems, we can explicitly obtain the optimal policy, as shown next.

Non-linear dynamic models and/or non-quadratic cost
• It is very hard to apply DP and hence obtain optimal policies.
• This leads to approximation techniques such as discretization.
• Another class of approximation techniques will be addressed in the next lectures.
Outline
• Dynamic programming for stage decision problems
• Linear quadratic regulator
Linear quadratic regulator

Given
Dynamic model
$$x_{k+1} = A_k x_k + B_k u_k, \quad k \in \{0, \dots, h-1\}$$
Cost function
$$\sum_{k=0}^{h-1} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + x_h^\top Q_h x_h$$

Find
Optimal policy $u_k = \mu_k(x_k)$, $k \in \{0, \dots, h-1\}$.

• This is the finite-horizon linear quadratic optimal control problem in discrete time.
• The solution when $h$ approaches infinity and the matrices in the dynamic model and cost function are time-invariant is the linear quadratic regulator.
Remarks
• The linear quadratic regulator is one of the celebrated results in control theory and one of the main achievements of optimal control.
• Assumptions: $Q_k$, $R_k$ are symmetric, and
$$Q_h \succeq 0, \quad \begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \succeq 0, \quad R_k > 0.$$
• Model and cost are often time-invariant, i.e., $x_{k+1} = A x_k + B u_k$ and
$$\sum_{k=0}^{h-1} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + x_h^\top Q_h x_h.$$
• The cost function can result from a continuous-time problem.
• However, in general the cost is specified in discrete time and used as a tuning knob to obtain desired specifications (e.g. overshoot).
• We focus on the stabilization problem, i.e., driving the state to zero.
Dynamic programming algorithm

Step 1: with $x_h = A_{h-1} x_{h-1} + B_{h-1} u_{h-1}$,

$$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} \begin{bmatrix} x_{h-1}^\top & u_{h-1}^\top \end{bmatrix} \begin{bmatrix} Q_{h-1} & S_{h-1} \\ S_{h-1}^\top & R_{h-1} \end{bmatrix} \begin{bmatrix} x_{h-1} \\ u_{h-1} \end{bmatrix} + \underbrace{J_h(x_h)}_{x_h^\top Q_h x_h \text{ (terminal cost)}}$$

where
$$J_h(A_{h-1} x_{h-1} + B_{h-1} u_{h-1}) = (A_{h-1} x_{h-1} + B_{h-1} u_{h-1})^\top Q_h (A_{h-1} x_{h-1} + B_{h-1} u_{h-1})$$
is a quadratic function of $u_{h-1}$. Then

$$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} x_{h-1}^\top \big( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \big) x_{h-1} + 2 u_{h-1}^\top \big( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \big) x_{h-1} + u_{h-1}^\top \big( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \big) u_{h-1}$$
Minimizing a quadratic function in $\mathbb{R}^n$

$$\min_{u \in \mathbb{R}^n} J(u), \quad J(u) = u^\top X u + 2 u^\top y + z, \quad X > 0$$

Unique minimizer
$$\nabla J(u) = 0 \iff 2 X u + 2 y = 0 \iff u = -X^{-1} y$$

Minimum
$$J(-X^{-1} y) = y^\top X^{-1} X X^{-1} y - 2 y^\top X^{-1} y + z = z - y^\top X^{-1} y$$

[Plot: surface of a convex quadratic $J(u)$ for $u \in \mathbb{R}^2$.]
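A quick numerical sanity check of the closed-form minimizer (our addition; the particular $X$, $y$, $z$ below are arbitrary):

```matlab
% Verify u* = -X\y and J(u*) = z - y'*inv(X)*y on a random X > 0.
rng(1);
M = randn(3);  X = M*M' + 3*eye(3);        % positive definite by construction
y = randn(3, 1);  z = 2;
ustar = -X\y;                              % closed-form minimizer
Jstar = ustar'*X*ustar + 2*ustar'*y + z;
disp([Jstar, z - y'*(X\y)])                % the two values coincide
```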
Dynamic programming

Step 1: identifying $X$, $y$, and $z$ in

$$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} \underbrace{x_{h-1}^\top \big( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \big) x_{h-1}}_{z} + 2 u_{h-1}^\top \underbrace{\big( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \big) x_{h-1}}_{y} + u_{h-1}^\top \underbrace{\big( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \big)}_{X} u_{h-1}$$

Policy
$$u_{h-1} = -X^{-1} y = -\big( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \big)^{-1} \big( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \big) x_{h-1}$$

Cost-to-go
$$J_{h-1}(x_{h-1}) = z - y^\top X^{-1} y = x_{h-1}^\top \big( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \big) x_{h-1} - x_{h-1}^\top \big( S_{h-1} + A_{h-1}^\top Q_h B_{h-1} \big) \big( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \big)^{-1} \big( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \big) x_{h-1}$$
Dynamic programming

Step 2: with $x_{h-1} = A_{h-2} x_{h-2} + B_{h-2} u_{h-2}$,

$$J_{h-2}(x_{h-2}) = \min_{u_{h-2}} \begin{bmatrix} x_{h-2}^\top & u_{h-2}^\top \end{bmatrix} \begin{bmatrix} Q_{h-2} & S_{h-2} \\ S_{h-2}^\top & R_{h-2} \end{bmatrix} \begin{bmatrix} x_{h-2} \\ u_{h-2} \end{bmatrix} + \underbrace{J_{h-1}(x_{h-1})}_{x_{h-1}^\top P_{h-1} x_{h-1}}$$

Since the cost-to-go is quadratic (as the terminal cost), we can apply the same reasoning and obtain $J_{h-2}(x_{h-2}) = x_{h-2}^\top P_{h-2} x_{h-2}$ and $u_{h-2} = K_{h-2} x_{h-2}$, where

$$P_{h-2} = A_{h-2}^\top P_{h-1} A_{h-2} + Q_{h-2} - \big( S_{h-2} + A_{h-2}^\top P_{h-1} B_{h-2} \big) \big( B_{h-2}^\top P_{h-1} B_{h-2} + R_{h-2} \big)^{-1} \big( S_{h-2}^\top + B_{h-2}^\top P_{h-1} A_{h-2} \big)$$

$$K_{h-2} = -\big( B_{h-2}^\top P_{h-1} B_{h-2} + R_{h-2} \big)^{-1} \big( S_{h-2}^\top + B_{h-2}^\top P_{h-1} A_{h-2} \big)$$
Dynamic programming

Step $h-k$: with $x_{k+1} = A_k x_k + B_k u_k$,

$$J_k(x_k) = \min_{u_k} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + \underbrace{J_{k+1}(x_{k+1})}_{x_{k+1}^\top P_{k+1} x_{k+1}}$$

which yields $J_k(x_k) = x_k^\top P_k x_k$ and $u_k = K_k x_k$, where

$$K_k = -\big( B_k^\top P_{k+1} B_k + R_k \big)^{-1} \big( S_k^\top + B_k^\top P_{k+1} A_k \big)$$

and $P_k$ satisfies the Riccati equation

$$P_k = A_k^\top P_{k+1} A_k + Q_k - \big( S_k + A_k^\top P_{k+1} B_k \big) \big( B_k^\top P_{k+1} B_k + R_k \big)^{-1} \big( S_k^\top + B_k^\top P_{k+1} A_k \big)$$

Thus, simply iterate these equations for $k \in \{h-1, \dots, 1, 0\}$, starting with $P_h = Q_h$, to obtain the optimal policy $u_k = K_k x_k$. The optimal cost for a given initial condition is $J_0(x_0) = x_0^\top P_0 x_0$.
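These equations translate directly into a short backward loop. The following Matlab helper is our own sketch (the function name is ours; the matrices are assumed time-invariant for brevity):

```matlab
% Sketch of the backward Riccati iteration for the finite-horizon problem;
% K{k} is the gain applied at stage k-1 in Matlab's 1-based indexing.
function [K, P] = lqr_finite_horizon(A, B, Q, R, S, Qh, h)
    P = Qh;                                   % P_h = Q_h
    K = cell(h, 1);
    for k = h:-1:1                            % stages h-1 down to 0
        K{k} = -(B'*P*B + R) \ (S' + B'*P*A); % feedback gain K_k
        P = A'*P*A + Q + (S + A'*P*B)*K{k};   % Riccati update for P_k
    end
end
```

With the double integrator data of the next slides (A = [1 0.2; 0 1], B = [0.02; 0.2], Q = I, R = 1, S = 0, Qh = 10*eye(2), h = 5), K{5} and the final P should reproduce $K_4$ and $P_0$ shown there.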
Example: double integrator

Consider a double integrator: a mass $m$ at position $y$ driven by a force $F$, with $u(t) = F(t)/m$, so that $\ddot{y}(t) = u(t)$.

Continuous-time model
$$\frac{d}{dt} \begin{bmatrix} y(t) \\ v(t) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y(t) \\ v(t) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u(t), \quad x(t) = \begin{bmatrix} y(t) & v(t) \end{bmatrix}^\top$$

Discretization (sampling period $\tau$)
$$x_{k+1} = e^{\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \tau} x_k + \int_0^\tau e^{\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} r} dr \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k = \begin{bmatrix} 1 & \tau \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \frac{\tau^2}{2} \\ \tau \end{bmatrix} u_k$$
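As a quick numerical check (our addition), the closed-form discretization can be recovered at once from the matrix exponential of the augmented matrix $\big[\begin{smallmatrix} A_c & B_c \\ 0 & 0 \end{smallmatrix}\big]\tau$, the standard Van Loan trick, with no toolbox needed:

```matlab
% Check A = [1 tau; 0 1] and B = [tau^2/2; tau] via an augmented expm.
tau = 0.2;
M = expm([0 1 0; 0 0 1; 0 0 0]*tau);   % exp of [[Ac Bc]; [0 0]] * tau
A = M(1:2, 1:2)                        % should equal [1 tau; 0 1]
B = M(1:2, 3)                          % should equal [tau^2/2; tau]
```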
Example: double integrator

Qualitative goal: drive the mass to position zero quickly but with reasonable actuation values.

Dynamic model ($\tau = 0.2$)
$$x_{k+1} = \begin{bmatrix} 1 & \tau \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \frac{\tau^2}{2} \\ \tau \end{bmatrix} u_k$$

To achieve this goal, let us start with the cost function
$$\sum_{k=0}^{h-1} \left( x_k^\top Q x_k + u_k^\top R u_k \right) + x_h^\top Q_h x_h$$
with
$$Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad R = 1, \quad Q_h = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}, \quad h = 5$$
and then tune these parameters to improve the results.
Dynamic programming

Iterate the following equations for $k \in \{4, 3, 2, 1, 0\}$, starting from $P_5 = Q_5 = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}$, to obtain the optimal policy:

$$K_k = -\big( B_k^\top P_{k+1} B_k + R_k \big)^{-1} \big( S_k^\top + B_k^\top P_{k+1} A_k \big)$$
$$P_k = A_k^\top P_{k+1} A_k + Q_k - \big( S_k + A_k^\top P_{k+1} B_k \big) \big( B_k^\top P_{k+1} B_k + R_k \big)^{-1} \big( S_k^\top + B_k^\top P_{k+1} A_k \big)$$

First iteration
$$K_4 = -\left( \begin{bmatrix} 0.02 & 0.2 \end{bmatrix} \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 0.02 \\ 0.2 \end{bmatrix} + 1 \right)^{-1} \begin{bmatrix} 0.02 & 0.2 \end{bmatrix} \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -0.1425 & -1.4530 \end{bmatrix}$$

$$P_4 = \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix}^\top \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix}^\top \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 0.02 \\ 0.2 \end{bmatrix} (-K_4) = \begin{bmatrix} 10.9715 & 1.7094 \\ 1.7094 & 8.4359 \end{bmatrix}$$
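The first iteration above is a one-liner to reproduce numerically (our addition):

```matlab
% One backward Riccati step reproducing K_4 and P_4 above (S = 0).
A = [1 0.2; 0 1];  B = [0.02; 0.2];  Q = eye(2);  R = 1;
P5 = 10*eye(2);
K4 = -(B'*P5*B + R) \ (B'*P5*A)       % [-0.1425 -1.4530]
P4 = A'*P5*A + Q + (A'*P5*B)*K4       % [10.9715 1.7094; 1.7094 8.4359]
```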
Dynamic programming

Next iterations
$$P_3 = \begin{bmatrix} 11.739 & 3.144 \\ 3.144 & 8.0782 \end{bmatrix}, \quad P_2 = \begin{bmatrix} 12.188 & 4.311 \\ 4.311 & 8.2725 \end{bmatrix}, \quad P_1 = \begin{bmatrix} 12.295 & 5.165 \\ 5.165 & 8.675 \end{bmatrix}, \quad P_0 = \begin{bmatrix} 12.121 & 5.702 \\ 5.702 & 9.085 \end{bmatrix}$$
$$K_3 = \begin{bmatrix} -0.414 & -1.353 \end{bmatrix}, \quad K_2 = \begin{bmatrix} -0.638 & -1.368 \end{bmatrix}, \quad K_1 = \begin{bmatrix} -0.807 & -1.432 \end{bmatrix}, \quad K_0 = \begin{bmatrix} -0.918 & -1.503 \end{bmatrix}$$

Optimal policy: $u_k = K_k x_k$, $k \in \{0, 1, \dots, 4\}$.

Optimal path for initial condition $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$ (iterate $u_k = K_k x_k$, $x_{k+1} = A x_k + B u_k$):
$$(x_0, u_0) = \left( \begin{bmatrix} 1 & 0 \end{bmatrix}^\top, -0.918 \right), \quad (x_1, u_1) = \left( \begin{bmatrix} 0.982 & -0.184 \end{bmatrix}^\top, -0.529 \right), \quad (x_2, u_2) = \left( \begin{bmatrix} 0.934 & -0.289 \end{bmatrix}^\top, -0.200 \right),$$
$$(x_3, u_3) = \left( \begin{bmatrix} 0.8724 & -0.330 \end{bmatrix}^\top, 0.085 \right), \quad (x_4, u_4) = \left( \begin{bmatrix} 0.8082 & -0.313 \end{bmatrix}^\top, 0.339 \right), \quad x_5 = \begin{bmatrix} 0.7525 & -0.2448 \end{bmatrix}^\top$$
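The optimal path is obtained by a forward simulation under the time-varying gains; a short sketch (our addition, with the gains copied from above at the displayed precision):

```matlab
% Forward simulation of the time-varying optimal policy u_k = K_k x_k.
A = [1 0.2; 0 1];  B = [0.02; 0.2];
K = {[-0.918 -1.503], [-0.807 -1.432], [-0.638 -1.368], ...
     [-0.414 -1.353], [-0.1425 -1.4530]};
x = [1; 0];
for k = 1:5
    u = K{k}*x;                % u_{k-1} = K_{k-1} x_{k-1}
    x = A*x + B*u;             % state update
end
disp(x')                       % approx. [0.7525 -0.2448], as on the slide
```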
Plots and tuning

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $t \in [0, 1]$ under the optimal policy.]

Transient responses are still far from the qualitative specifications.

Guidelines to tune the cost
• By increasing the terminal cost, one expects the response to get closer to the desired final position.
• The same is expected by penalizing the position error more heavily than the velocity error.
• Decreasing the penalty on the control action will allow more control authority to reach the origin.
Increasing terminal cost

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $Q_h = 100 I$ and $Q_h = 1000 I$.]

Final position error is improved by increasing the terminal cost.
Changing state cost

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $Q_h = 100 I$ with $Q = \begin{bmatrix} 10 & 0 \\ 0 & 1 \end{bmatrix}$ and with $Q = \begin{bmatrix} 100 & 0 \\ 0 & 1 \end{bmatrix}$.]

Increasing the position cost leads to a smaller position error and a larger velocity.
Changing control cost

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $Q_h = 100 I$, $Q = \begin{bmatrix} 10 & 0 \\ 0 & 1 \end{bmatrix}$, with $R = 0.1$ and with $R = 0.01$.]

Decreasing the control penalty leads to fast responses, but large actuation!
Cheap control

As $R \to 0$ we obtain deadbeat control: the zero state is achieved in 2 steps.

[Plots: $y(t)$, $v(t)$, and $u(t)$ in the cheap-control limit.]

Can we then always drive a mass to zero in two sampling periods?
• No, because this typically requires very large, unfeasible actuations.
• Actuators have limitations which were not incorporated in our linear model.
• In this LQR framework the solution is to increasingly penalize the control input until actuation constraints are met. More on this point later.
Increasing the horizon

Let us increase the horizon considering $Q = Q_h = I$, $R = 1$, $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$.

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $h = 10$ and $h = 30$.]

$h = 10$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 8.347$
$h = 30$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 9.1881$
Increasing the horizon

[Plots: $y(t)$, $v(t)$, and $u(t)$ for $h = 50$ and $h = 100$.]

$h = 50$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 9.1890$
$h = 100$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 9.1890$

The cost converges to a constant as the time horizon increases.
Discussion
• Since the cost is positive definite, if the horizon is large the optimal input should drive the state to zero to stop paying cost:
$$\sum_{k=0}^{h_0} g_k(x_k, u_k) + \underbrace{\sum_{k=h_0+1}^{h-1} g_k(x_k, u_k) + g_h(x_h)}_{\approx\, 0 \text{ since } x_k \approx 0 \text{ and } u_k = K_k x_k}$$
• This explains why the cost converges as the horizon increases.
• This reasoning is valid for every initial condition. Thus, if $x_0^\top P_0 x_0$ converges as $h \to \infty$, then $P_0$ converges, where $P_0$ results from the recursion, for $k \in \{h-1, h-2, \dots, 0\}$,
$$P_k = A^\top P_{k+1} A + Q - \big( S + A^\top P_{k+1} B \big) \big( B^\top P_{k+1} B + R \big)^{-1} \big( S^\top + B^\top P_{k+1} A \big)$$
• Note that we are now considering time-invariant $A_k, B_k, Q_k, R_k, S_k$.
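To make the convergence concrete (our addition), one can iterate the recursion numerically for the double integrator; the recovered cost matches the $h = 100$ value reported above.

```matlab
% Iterating the Riccati recursion to convergence (Q = Qh = I, R = 1, S = 0).
tau = 0.2;  A = [1 tau; 0 1];  B = [tau^2/2; tau];
Q = eye(2);  R = 1;  P = Q;                 % start from P_h = Q_h
for i = 1:200                               % enough iterations to converge
    P = A'*P*A + Q - (A'*P*B)*((B'*P*B + R)\(B'*P*A));
end
x0 = [1; 0];
disp(x0'*P*x0)                              % approx. 9.1890
```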
Discussion
• Let $P$ denote the limit of the recursion
$$P_k = A^\top P_{k+1} A + Q - \big( S + A^\top P_{k+1} B \big) \big( B^\top P_{k+1} B + R \big)^{-1} \big( S^\top + B^\top P_{k+1} A \big)$$
then
$$P = A^\top P A + Q - \big( S + A^\top P B \big) \big( B^\top P B + R \big)^{-1} \big( S^\top + B^\top P A \big)$$
• Moreover, $K_k \to K$ with
$$K = -\big( B^\top P B + R \big)^{-1} \big( S^\top + B^\top P A \big)$$
• For the double integrator example with $Q = Q_h = I$, $R = 1$, writing
$$P_k = \begin{bmatrix} p_{1,k} & p_{2,k} \\ p_{2,k} & p_{3,k} \end{bmatrix}, \quad K_k = \begin{bmatrix} K_{1,k} & K_{2,k} \end{bmatrix}$$
the entries of $P_k$ and $K_k$ converge as the distance to the horizon grows.

[Plots: entries $p_{1,k}, p_{2,k}, p_{3,k}$ of $P_k$ and $K_{1,k}, K_{2,k}$ of $K_k$ versus $k$, converging to constant values.]
Infinite-horizon LQR

Proposition (special case of [Bertsekas, Sec. 4, Proposition 4.4.1])
Suppose that $(A, B)$ is controllable and
$$\begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} > 0.$$
The optimal policy for the stage decision problem with an infinite number of stages, dynamic model
$$x_{k+1} = A x_k + B u_k$$
and cost function
$$\sum_{k=0}^{\infty} x_k^\top Q x_k + 2 x_k^\top S u_k + u_k^\top R u_k$$
is given by $u_k = K x_k$, where
$$K = -\big( B^\top P B + R \big)^{-1} \big( S^\top + B^\top P A \big)$$
and $P$ is the unique positive definite solution to the algebraic Riccati equation
$$P = A^\top P A + Q - \big( S + A^\top P B \big) \big( B^\top P B + R \big)^{-1} \big( S^\top + B^\top P A \big)$$

Furthermore, the closed loop $x_{k+1} = (A + BK) x_k$ is exponentially stable.
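In Matlab, the infinite-horizon gain is available directly through dlqr (Control System Toolbox); a minimal sketch for the double integrator (our addition):

```matlab
% Infinite-horizon LQR via dlqr. Note the sign convention: dlqr returns
% Kd such that u = -Kd*x, hence the flip to match u_k = K*x_k.
tau = 0.2;  A = [1 tau; 0 1];  B = [tau^2/2; tau];
Q = eye(2);  R = 1;  S = zeros(2, 1);
[Kd, P] = dlqr(A, B, Q, R, S);             % P solves the algebraic Riccati eq.
K = -Kd;
disp(abs(eig(A + B*K))')                   % closed-loop poles inside unit circle
```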
Discussion
• For simplicity, throughout the discussion, we assume $S = 0$.
• As mentioned in [Bertsekas, Sec. 4.1], we can relax the assumptions that $(A, B)$ is controllable and $Q$ is positive definite.
• In fact, if $(A, B)$ is controllable, $R$ is positive definite, $Q = N N^\top$ for a full-rank $N \in \mathbb{R}^{n \times r}$, $r \le n$ ($Q$ not necessarily positive definite if $r < n$), and $(A, N)$ is observable, then the previous theorem still holds.
• Moreover, if we further relax the assumptions to: $R$ is positive definite, $(A, B)$ is stabilizable, $Q = N N^\top$ with $N$ full rank, and $(A, N)$ is detectable, then the theorem still holds except that $P$ is not necessarily positive definite.
• Actually, according to 'Linear Optimal Control', B. D. O. Anderson, J. B. Moore, Sec. 14.1, we just need to ensure that $B^\top Q B + R$ is positive definite and $(A, N)$ is observable to guarantee stability of the closed loop.
• Therefore, we can for instance pick $R = 0$ and $Q$ positive definite and the closed loop is stable (this will not be the case for continuous-time optimal control problems).
Inverted pendulum

[Figure: cart of mass $M$ at position $x$ driven by force $u$, carrying an inverted pendulum of mass $m$, inertia $I$, and length $\ell$, with angle $\theta$ from the upright vertical.]

Linearized model (see [1, p. 32])
$$(I + m\ell^2)\ddot{\theta} - mg\ell\theta = m\ell\ddot{x}$$
$$(M + m)\ddot{x} + b\dot{x} - m\ell\ddot{\theta} = u$$

State space
$$\frac{d}{dt} \begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -\frac{(I + m\ell^2) b}{q} & \frac{m^2 g \ell^2}{q} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -\frac{m \ell b}{q} & \frac{m g \ell (M + m)}{q} & 0 \end{bmatrix} \begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{I + m\ell^2}{q} \\ 0 \\ \frac{m\ell}{q} \end{bmatrix} u(t)$$
where
$$q = (I + m\ell^2)(M + m) - m^2\ell^2$$

[1] Feedback Control of Dynamic Systems, Franklin, Powell, Emami-Naeini
Matlab implementation
clear all, close all, clc % definition of the continuous-time modelm = 0.2;M = 1;b = 0.05;I = 0.01;g = 9.8;l = 0.5;p = (I+m*l^2)*(M+m)-m^2*l^2;Ac = [0 1 0 0; 0 -(I+m*l^2)*b/p (m^2*g*l^2)/p 0; 0 0 0 1; 0 -(m*l*b)/p m*g*l*(M+m)/p 0];Bc = [ 0; (I+m*l^2)/p; 0; m*l/p];
% discretization n = 4;tau = 0.1;sysd = c2d(ss(Ac,Bc,zeros(1,n),0),tau);A = sysd.a; B = sysd.b; % LQR controlQ = diag([1 1 1 1]);S = zeros(4,1);R = 1;K = dlqr(A,B,Q,R,S); K = -K; % simulationkend = 10/tau;x0 = [1 0 0 0]';x(:,1) = x0;for k=1:kend u(:,k) = K*x(:,k); x(:,k+1) = A*x(:,k)+B*u(:,k);end plot((1:kend)*tau,u), figure, plot((1:kend)*tau,x(3,1:end-1)),figure, plot((1:kend)*tau,x(1,1:end-1)),
Model definition Controller synthesis
37
Time responses

$Q = I$, $S = 0$, $R = 1$, $\tau = 0.1$

[Plots: cart position $x$, pendulum angle $\theta$, and input $u$ versus $t \in [0, 10]$.]
Tuning the parameters

Want faster convergence? Reduce the penalty on the control input to increase control authority: $R = 0.01$.
Want to reduce the angle amplitude? Increase the penalty on the angle state: $Q = \mathrm{diag}([1\ 1\ 100\ 1])$.

[Plots: $x$, $\theta$, and $u$ versus $t \in [0, 10]$ for $R = 0.01$ and for $Q = \mathrm{diag}([1\ 1\ 100\ 1])$.]
Concluding remarks

To summarise:
• Stage decision problems are extensions of discrete optimization problems for which state and input spaces can be arbitrary.
• In practice it may be hard to obtain expressions for the costs-to-go.
• When the cost is quadratic and the system is linear, we obtain a framework for state feedback control design for any linear plant.

After this lecture, you should be able to:
• Apply DP to stage decision problems.
• Solve finite-horizon optimal control problems in discrete time with a quadratic cost and a linear model by iteratively solving Riccati equations.
• Obtain the linear quadratic regulator using the algebraic Riccati equation for infinite-horizon problems.
Appendix A: Proof of optimality of dynamic programming

Theorem
The policy obtained with the DP algorithm is an optimal policy.

Proof
We shall prove by induction that the policy $\pi_k := \{\mu_k, \dots, \mu_{h-1}\}$ obtained by the DP algorithm is an optimal policy for the subproblem from stage $k$ to stage $h$, and that $J_k(x_k)$ is the cost of the optimal path starting at $x_k$.
• Step I: Prove this for $k = h-1$.
• Step II: Assume that the induction hypothesis holds for a given $k+1$ and prove it for $k$.
Step I
• By construction, $\pi_{h-1} = \{\mu_{h-1}\}$ is an optimal policy, as $\mu_{h-1}(x_{h-1})$ is the first decision of the optimal path from stage $h-1$ to stage $h$, since
$$\min_{u_{h-1} \in U_{h-1}(x_{h-1})} g_{h-1}(x_{h-1}, u_{h-1}) + J_h(f_{h-1}(x_{h-1}, u_{h-1})) = g_{h-1}(x_{h-1}, \mu_{h-1}(x_{h-1})) + J_h(f_{h-1}(x_{h-1}, \mu_{h-1}(x_{h-1}))).$$
• It is also clear that
$$J_{h-1}(x_{h-1}) = \min_{u_{h-1} \in U_{h-1}(x_{h-1})} g_{h-1}(x_{h-1}, u_{h-1}) + J_h(f_{h-1}(x_{h-1}, u_{h-1}))$$
is the optimal cost for the subproblem with initial condition $x_{h-1}$ at stage $h-1$.
Step II
• Assume now that $\pi_{k+1} := \{\mu_{k+1}, \dots, \mu_{h-1}\}$ is an optimal policy and $J_{k+1}(x_{k+1})$ is the cost of the optimal path which starts at initial state $x_{k+1}$. We shall prove by contradiction that $\pi_k := \{\mu_k, \dots, \mu_{h-1}\}$ is an optimal policy and $J_k(x_k)$ is the cost of an optimal path which starts at initial state $x_k$.
• Argument by contradiction: if $\pi_k$ is not optimal, then there must exist a state $x_k$ such that $\mu_k(x_k)$ is not the first action of the optimal path from stage $k$ to stage $h$, denoted by
$$\gamma = \{(x_k, u_k), (x_{k+1}, u_{k+1}), \dots, x_h\}, \quad u_k \neq \mu_k(x_k).$$
• Since we are assuming that $\pi_{k+1} := \{\mu_{k+1}, \dots, \mu_{h-1}\}$ is an optimal policy, we must have
$$u_\ell = \mu_\ell(x_\ell) \quad \text{for every } \ell \in \{k+1, \dots, h-1\}.$$
Step II (continued)
• The cost of such a path is
$$J_\gamma = \sum_{\ell=k}^{h-1} g_\ell(x_\ell, u_\ell) + g_h(x_h) = g_k(x_k, u_k) + \sum_{\ell=k+1}^{h-1} g_\ell(x_\ell, \mu_\ell(x_\ell)) + g_h(x_h) = g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)).$$
• However, the cost of the path which has $\mu_k(x_k)$ as the first decision is less than or equal (contradiction):
$$J_k(x_k) = \min_{u_k} g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) = g_k(x_k, \mu_k(x_k)) + J_{k+1}(f_k(x_k, \mu_k(x_k))) \le g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) = J_\gamma. \qquad \blacksquare$$