Bertsekas, Dynamic Programming and Optimal Control: Solutions Vol. 2 (7/22/2019)
Solutions Vol. II, Chapter 1
1.5
(a) We have
Σ_{j=1}^n p̃_ij(u) = Σ_{j=1}^n (p_ij(u) + m_j)/(1 + Σ_{k=1}^n m_k) = [ Σ_{j=1}^n p_ij(u) + Σ_{j=1}^n m_j ] · 1/(1 + Σ_{k=1}^n m_k) = 1.
Therefore, the p̃_ij(u) are transition probabilities.
(b) We have for the modified problem
J̃(i) = min_{u∈U(i)} [ g(i, u) + α(1 + Σ_{j=1}^n m_j) Σ_{j=1}^n (p_ij(u) + m_j)/(1 + Σ_{k=1}^n m_k) · J̃(j) ]
= min_{u∈U(i)} [ g(i, u) + α Σ_{j=1}^n p_ij(u)J̃(j) ] + α Σ_{k=1}^n m_k J̃(k).
So
J̃(i) − (α Σ_{k=1}^n m_k J̃(k))/(1 − α) = min_{u∈U(i)} [ g(i, u) + α Σ_{j=1}^n p_ij(u)J̃(j) + α Σ_{k=1}^n m_k (1 − 1/(1 − α)) J̃(k) ]
= min_{u∈U(i)} [ g(i, u) + α Σ_{j=1}^n p_ij(u) ( J̃(j) − (α Σ_{k=1}^n m_k J̃(k))/(1 − α) ) ].
Thus the function J̃(i) − (α Σ_{k=1}^n m_k J̃(k))/(1 − α) satisfies Bellman's equation for the original problem, so by uniqueness of its solution,
J̃(i) − (α Σ_{k=1}^n m_k J̃(k))/(1 − α) = J*(i), ∀i.
Q.E.D.
1.7
We show that for any bounded function J : S → ℝ, we have
J ≤ T(J) ⇒ T(J) ≤ F(J), (1)
J ≥ T(J) ⇒ T(J) ≥ F(J). (2)
For any μ, define
F_μ(J)(i) = [ g(i, μ(i)) + α Σ_{j≠i} p_ij(μ(i))J(j) ] / (1 − α p_ii(μ(i)))
and note that
F_μ(J)(i) = [ T_μ(J)(i) − α p_ii(μ(i))J(i) ] / (1 − α p_ii(μ(i))). (3)
Fix ε > 0. If J ≤ T(J), let μ be such that F_μ(J) ≤ F(J) + εe. Then, using Eq. (3),
F(J)(i) + ε ≥ F_μ(J)(i) = [ T_μ(J)(i) − α p_ii(μ(i))J(i) ] / (1 − α p_ii(μ(i))) ≥ [ T_μ(J)(i) − α p_ii(μ(i))T_μ(J)(i) ] / (1 − α p_ii(μ(i))) = T_μ(J)(i) ≥ T(J)(i),
where we used J ≤ T(J) ≤ T_μ(J). Since ε > 0 is arbitrary, we obtain F(J)(i) ≥ T(J)(i). Similarly, if J ≥ T(J), let μ be such that T_μ(J) ≤ T(J) + εe; then J ≥ T(J) ≥ T_μ(J) − εe, and using Eq. (3),
F(J)(i) ≤ F_μ(J)(i) = [ T_μ(J)(i) − α p_ii(μ(i))J(i) ] / (1 − α p_ii(μ(i))) ≤ [ T_μ(J)(i) − α p_ii(μ(i))(T_μ(J)(i) − ε) ] / (1 − α p_ii(μ(i))) = T_μ(J)(i) + α p_ii(μ(i))ε/(1 − α p_ii(μ(i))) ≤ T(J)(i) + ε/(1 − α).
Since ε > 0 is arbitrary, we obtain F(J)(i) ≤ T(J)(i).
From (1) and (2) we see that F and T have the same fixed points, so J* is the unique fixed point of F. Using the definition of F, it can be seen that for any scalar r > 0 we have
F(J + re) ≤ F(J) + αre, F(J) − αre ≤ F(J − re). (4)
Furthermore, F is monotone, that is,
J ≤ J′ ⇒ F(J) ≤ F(J′). (5)
For any bounded function J, let r > 0 be such that
J* − re ≤ J ≤ J* + re.
Applying F repeatedly to this relation and using Eqs. (4) and (5), we obtain
F^k(J) − α^k re ≤ J* ≤ F^k(J) + α^k re.
Therefore F^k(J) converges to J*. From Eqs. (1), (2), and (5) we see that
J ≤ T(J) ⇒ T^k(J) ≤ F^k(J) ≤ J*,
J ≥ T(J) ⇒ T^k(J) ≥ F^k(J) ≥ J*.
These relations demonstrate the faster convergence of F relative to T.
As a final result (not explicitly required in the problem statement), we show that for any two bounded functions J : S → ℝ, J′ : S → ℝ, we have
max_j |F(J)(j) − F(J′)(j)| ≤ α max_j |J(j) − J′(j)|, (6)
so F is a contraction mapping with modulus α. Indeed, we have
F(J)(i) = min_{u∈U(i)} [ g(i, u) + α Σ_{j≠i} p_ij(u)J(j) ] / (1 − α p_ii(u))
= min_{u∈U(i)} { [ g(i, u) + α Σ_{j≠i} p_ij(u)J′(j) ] / (1 − α p_ii(u)) + α Σ_{j≠i} p_ij(u)[J(j) − J′(j)] / (1 − α p_ii(u)) }
≤ F(J′)(i) + α max_j |J(j) − J′(j)|, ∀i,
where we have used the fact that
α Σ_{j≠i} p_ij(u)/(1 − α p_ii(u)) = α(1 − p_ii(u))/(1 − α p_ii(u)) ≤ α.
Thus we have
F(J)(i) − F(J′)(i) ≤ α max_j |J(j) − J′(j)|, ∀i.
The roles of J and J′ may be reversed, so we can also obtain
F(J′)(i) − F(J)(i) ≤ α max_j |J(j) − J′(j)|, ∀i.
Combining the last two inequalities, we see that
|F(J)(i) − F(J′)(i)| ≤ α max_j |J(j) − J′(j)|, ∀i.
By taking the maximum over i, Eq. (6) follows.
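As a numerical sanity check (not part of the original solution), the faster convergence of F over T can be observed on a small randomly generated discounted problem. All problem data below (sizes, discount α = 0.9, random costs) are our own assumptions; `T_map` and `F_map` implement the two operators of this exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9                      # states, controls, discount (assumed)
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)            # transition probabilities p_ij(u)
g = rng.random((m, n))                       # one-stage costs g(i, u)

def T_map(J):
    # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return np.min(g + alpha * P @ J, axis=0)

def F_map(J):
    # (FJ)(i) = min_u [ g(i,u) + alpha * sum_{j != i} p_ij(u) J(j) ] / (1 - alpha p_ii(u))
    pii = np.einsum('uii->ui', P)
    return np.min((g + alpha * (P @ J - pii * J)) / (1.0 - alpha * pii), axis=0)

Jstar = np.zeros(n)
for _ in range(2000):                        # accurate fixed point J*
    Jstar = T_map(Jstar)

JT = np.zeros(n)
JF = np.zeros(n)                             # J0 = 0 satisfies J0 <= T(J0) here (g >= 0)
for _ in range(50):
    JT, JF = T_map(JT), F_map(JF)

errT = np.max(np.abs(JT - Jstar))
errF = np.max(np.abs(JF - Jstar))
```

Since J₀ = 0 ≤ T(J₀), the relation T^k(J) ≤ F^k(J) ≤ J* predicts errF ≤ errT, which the run confirms.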
1.9
(a) Since J, J′ ∈ B(S), i.e., they are real-valued bounded functions on S, the infimum and supremum of their difference are finite. We shall denote
m = min_{x∈S} [ J(x) − J′(x) ]
and
M = max_{x∈S} [ J(x) − J′(x) ].
Thus
m ≤ J(x) − J′(x) ≤ M, ∀x ∈ S,
or
J′(x) + m ≤ J(x) ≤ J′(x) + M, ∀x ∈ S.
Now we apply the mapping T to these inequalities. By property (1) we know that T preserves the inequalities. Thus
T(J′ + me)(x) ≤ T(J)(x) ≤ T(J′ + Me)(x), ∀x ∈ S.
By property (2) we know that
T(J)(x) + min[α₁r, α₂r] ≤ T(J + re)(x) ≤ T(J)(x) + max[α₁r, α₂r].
If we replace r by m or M, we get the inequalities
T(J′)(x) + min[α₁m, α₂m] ≤ T(J′ + me)(x) ≤ T(J′)(x) + max[α₁m, α₂m]
and
T(J′)(x) + min[α₁M, α₂M] ≤ T(J′ + Me)(x) ≤ T(J′)(x) + max[α₁M, α₂M].
Thus
T(J′)(x) + min[α₁m, α₂m] ≤ T(J)(x) ≤ T(J′)(x) + max[α₁M, α₂M],
so that
|T(J)(x) − T(J′)(x)| ≤ max[α₁|M|, α₂|M|, α₁|m|, α₂|m|].
We also have
max[α₁|M|, α₂|M|, α₁|m|, α₂|m|] ≤ α₂ max[|M|, |m|] ≤ α₂ sup_{x∈S} |J(x) − J′(x)|.
Thus
|T(J)(x) − T(J′)(x)| ≤ α₂ max_{x∈S} |J(x) − J′(x)|,
from which
max_{x∈S} |T(J)(x) − T(J′)(x)| ≤ α₂ max_{x∈S} |J(x) − J′(x)|.
Thus T is a contraction mapping, since we know by the statement of the problem that 0 ≤ α₁ ≤ α₂ < 1.
Since the set B(S) of bounded real-valued functions is a complete linear space, we conclude that the contraction mapping T has a unique fixed point J*, and lim_{k→∞} T^k(J)(x) = J*(x).
(b) We shall first prove the lower bounds on J*(x); the upper bounds follow by a similar argument. Since J, T(J) ∈ B(S), there exists a scalar c (|c| < ∞) such that
J(x) + c ≤ T(J)(x), ∀x ∈ S. (1)
We apply T to both sides of (1); since T preserves inequalities (by property (1)), we have, applying the relation of property (2),
J(x) + c + min[α₁c, α₂c] ≤ T(J)(x) + min[α₁c, α₂c] ≤ T(J + ce)(x) ≤ T²(J)(x). (2)
Similarly, if we apply T again we get
J(x) + min_{i∈{1,2}} [ c + αᵢc + αᵢ²c ] ≤ T(J)(x) + min[α₁c + α₁²c, α₂c + α₂²c] ≤ T²(J)(x) + min[α₁²c, α₂²c] ≤ T(T(J) + min[α₁c, α₂c]e)(x) ≤ T³(J)(x).
Thus by induction we conclude
J(x) + min[ Σ_{m=0}^k α₁^m c, Σ_{m=0}^k α₂^m c ] ≤ T(J)(x) + min[ Σ_{m=1}^k α₁^m c, Σ_{m=1}^k α₂^m c ] ≤ ··· ≤ T^k(J)(x) + min[α₁^k c, α₂^k c] ≤ T^{k+1}(J)(x). (3)
By taking the limit as k → ∞, and noting that the quantities in the minimizations are monotone in k and either all nonnegative or all nonpositive, we conclude that
J(x) + min[ c/(1 − α₁), c/(1 − α₂) ] ≤ T(J)(x) + min[ α₁c/(1 − α₁), α₂c/(1 − α₂) ] ≤ ··· ≤ T^k(J)(x) + min[ α₁^k c/(1 − α₁), α₂^k c/(1 − α₂) ] ≤ T^{k+1}(J)(x) + min[ α₁^{k+1}c/(1 − α₁), α₂^{k+1}c/(1 − α₂) ] ≤ J*(x). (4)
Finally, we note from (3) that
min[α₁^k c, α₂^k c] ≤ T^{k+1}(J)(x) − T^k(J)(x), ∀x ∈ S.
Thus
min[α₁^k c, α₂^k c] ≤ inf_{x∈S} ( T^{k+1}(J)(x) − T^k(J)(x) ).
Let b_{k+1} = inf_{x∈S} ( T^{k+1}(J)(x) − T^k(J)(x) ), so that min[α₁^k c, α₂^k c] ≤ b_{k+1}. From this relation we infer that
min[ α₁^{k+1}c/(1 − α₁), α₂^{k+1}c/(1 − α₂) ] ≤ min[ α₁b_{k+1}/(1 − α₁), α₂b_{k+1}/(1 − α₂) ] = c_{k+1}.
Therefore
T^k(J)(x) + min[ α₁^k c/(1 − α₁), α₂^k c/(1 − α₂) ] ≤ T^{k+1}(J)(x) + c_{k+1}.
This relationship gives, for k = 1,
T(J)(x) + min[ α₁c/(1 − α₁), α₂c/(1 − α₂) ] ≤ T²(J)(x) + c₂.
Now let
c = inf_{x∈S} ( T(J)(x) − J(x) ).
Then the above inequalities still hold, and from the definition of c₁ we have
c₁ = min[ α₁c/(1 − α₁), α₂c/(1 − α₂) ].
Therefore
T(J)(x) + c₁ ≤ T²(J)(x) + c₂,
and T(J)(x) + c₁ ≤ J*(x) from Eq. (4). Similarly, let J₁(x) = T(J)(x), and let
b₂ = inf_{x∈S} ( T²(J)(x) − T(J)(x) ) = inf_{x∈S} ( T(J₁)(x) − J₁(x) ).
If we proceed as before, we get
J₁(x) + min[ b₂/(1 − α₁), b₂/(1 − α₂) ] ≤ T(J₁)(x) + min[ α₁b₂/(1 − α₁), α₂b₂/(1 − α₂) ] ≤ T²(J₁)(x) + min[ α₁²b₂/(1 − α₁), α₂²b₂/(1 − α₂) ] ≤ J*(x).
Then
min[α₁b₂, α₂b₂] ≤ inf_{x∈S} [ T²(J₁)(x) − T(J₁)(x) ] = inf_{x∈S} [ T³(J)(x) − T²(J)(x) ] = b₃.
Thus
min[ α₁²b₂/(1 − α₁), α₂²b₂/(1 − α₂) ] ≤ min[ α₁b₃/(1 − α₁), α₂b₃/(1 − α₂) ],
so
T(J₁)(x) + min[ α₁b₂/(1 − α₁), α₂b₂/(1 − α₂) ] ≤ T²(J₁)(x) + min[ α₁b₃/(1 − α₁), α₂b₃/(1 − α₂) ],
or
T²(J)(x) + c₂ ≤ T³(J)(x) + c₃,
and
T²(J)(x) + c₂ ≤ J*(x).
Proceeding similarly, the result is proved.
The reverse (upper-bound) inequalities can be proved by a similar argument.
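For the special case α₁ = α₂ = α (the standard discounted model), the bounds of part (b) reduce to the familiar monotone error bounds of value iteration. The sketch below (our own construction, with assumed random data) checks at every iteration that T(J) + (α/(1−α)) min_x [T(J)(x) − J(x)] ≤ J* ≤ T(J) + (α/(1−α)) max_x [T(J)(x) − J(x)].

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 4, 3, 0.8
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

def T(J):
    return np.min(g + alpha * P @ J, axis=0)

Jstar = np.zeros(n)
for _ in range(3000):                 # reference fixed point
    Jstar = T(Jstar)

J = np.zeros(n)
bounds_valid = True
for _ in range(30):
    TJ = T(J)
    lo = TJ + alpha / (1 - alpha) * np.min(TJ - J)   # lower bound on J*
    hi = TJ + alpha / (1 - alpha) * np.max(TJ - J)   # upper bound on J*
    bounds_valid = bounds_valid and np.all(lo <= Jstar + 1e-9) and np.all(Jstar <= hi + 1e-9)
    J = TJ
```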
(c) Let us first consider the state x = 1:
F(J)(1) = min_{u∈U(1)} [ g(1, u) + α Σ_{j=1}^n p_1j J(j) ].
Thus
F(J + re)(1) = min_{u∈U(1)} [ g(1, u) + α Σ_{j=1}^n p_1j (J + re)(j) ] = min_{u∈U(1)} [ g(1, u) + α Σ_{j=1}^n p_1j J(j) + αr ] = F(J)(1) + αr,
so
[ F(J + re)(1) − F(J)(1) ] / r = α. (1)
Since 0 ≤ α ≤ 1, we conclude that αⁿ ≤ α. Thus
αⁿ ≤ [ F(J + re)(1) − F(J)(1) ] / r = α.
For the state x = 2 we proceed similarly, and we get
F(J)(2) = min_{u∈U(2)} [ g(2, u) + α p_21 F(J)(1) + α Σ_{j=2}^n p_2j J(j) ]
and
F(J + re)(2) = min_{u∈U(2)} [ g(2, u) + α p_21 F(J + re)(1) + α Σ_{j=2}^n p_2j (J + re)(j) ] = min_{u∈U(2)} [ g(2, u) + α p_21 F(J)(1) + α²r p_21 + α Σ_{j=2}^n p_2j J(j) + αr Σ_{j=2}^n p_2j ],
where, for the last equality, we used relation (1). Thus we conclude
F(J + re)(2) = F(J)(2) + α²r p_21 + αr(1 − p_21),
which yields
[ F(J + re)(2) − F(J)(2) ] / r = α²p_21 + α(1 − p_21). (2)
The right-hand side of Eq. (2) is a convex combination of α² and α, so 0 < α² ≤ α²p_21 + α(1 − p_21) ≤ α.
Claim:
αˣ ≤ [ F(J + re)(x) − F(J)(x) ] / r ≤ α for every state x.
Proof: We shall employ an inductive argument. Obviously the result holds for x = 1, 2. Let us assume that it holds for all x ≤ i. We shall prove it for x = i + 1. We have
F(J)(i + 1) = min_{u∈U(i+1)} [ g(i + 1, u) + α Σ_{j=1}^i p_{i+1,j} F(J)(j) + α Σ_{j=i+1}^n p_{i+1,j} J(j) ],
F(J + re)(i + 1) = min_{u∈U(i+1)} [ g(i + 1, u) + α Σ_{j=1}^i p_{i+1,j} F(J + re)(j) + α Σ_{j=i+1}^n p_{i+1,j} (J + re)(j) ].
We know that αʲr ≤ F(J + re)(j) − F(J)(j) ≤ αr for j ≤ i; thus
F(J)(i + 1) + αr Σ_{j=1}^i αʲ p_{i+1,j} + αr(1 − p̄) ≤ F(J + re)(i + 1) ≤ F(J)(i + 1) + α²r p̄ + αr(1 − p̄),
where
p̄ = Σ_{j=1}^i p_{i+1,j}.
Obviously
Σ_{j=1}^i αʲ p_{i+1,j} ≥ αⁱ Σ_{j=1}^i p_{i+1,j} = αⁱ p̄.
Thus
α^{i+1} p̄ + α(1 − p̄) ≤ [ F(J + re)(i + 1) − F(J)(i + 1) ] / r ≤ α² p̄ + α(1 − p̄).
Since 0 < α^{i+1} ≤ α^{i+1} p̄ + α(1 − p̄) and α² p̄ + α(1 − p̄) ≤ α, the claim follows for x = i + 1, completing the induction.
For property (2) we note that
T_μ(J + re)(x) = g_μ(x) + α M_μ(J + re)(x) = g_μ(x) + α M_μ J(x) + αr M_μ e(x) = T_μ(J)(x) + αr M_μ e(x).
We have
α₁ ≤ α M_μ e(x) ≤ α₂,
so that
[ T_μ(J + re)(x) − T_μ(J)(x) ] / r = α M_μ e(x)
and
α₁ ≤ [ T_μ(J + re)(x) − T_μ(J)(x) ] / r ≤ α₂.
Thus property (2) also holds, provided α₂ < 1.
1.10
(a) If there is a unique μ such that T_μ(J) = T(J), then there exists an ε > 0 such that for all δ ∈ ℝⁿ with max_i |δ(i)| ≤ ε we have
F(J + δ) = T(J + δ) − (J + δ) = g_μ + αP_μ(J + δ) − (J + δ) = g_μ + (αP_μ − I)(J + δ).
It follows that F is linear around J and its Jacobian is αP_μ − I.
(b) We first note that the equation defining Newton's method is the first-order Taylor series expansion of F around J_k. If μ_k is the unique μ such that T_μ(J_k) = T(J_k), then F is linear near J_k and coincides with its first-order Taylor series expansion around J_k. Therefore the vector J_{k+1} obtained by the Newton iteration satisfies
F(J_{k+1}) = 0,
or
T_{μ_k}(J_{k+1}) = J_{k+1}.
This equation yields J_{k+1} = J_{μ_k}, so the next policy μ_{k+1} is obtained as
μ_{k+1} = arg min_μ T_μ(J_{μ_k}).
This is precisely the policy iteration of the algorithm.
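The Newton/policy-iteration correspondence can be sketched on a small example (all data below are our own assumptions): each iteration solves the linear system F(J) = g_μ + (αP_μ − I)J = 0 for the current greedy policy μ, i.e., T_μ(J) = J.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, alpha = 4, 3, 0.9
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

def greedy(J):
    # mu attaining the minimum in T(J)
    return np.argmin(g + alpha * P @ J, axis=0)

def newton_step(mu):
    # solve F(J) = g_mu + (alpha P_mu - I) J = 0, i.e. T_mu(J) = J
    Pmu = P[mu, np.arange(n), :]
    gmu = g[mu, np.arange(n)]
    return np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)

J = np.zeros(n)
for _ in range(20):                 # policy iteration = Newton's method on F
    J = newton_step(greedy(J))

bellman_residual = np.max(np.abs(np.min(g + alpha * P @ J, axis=0) - J))
```

At convergence J solves F(J) = 0, i.e., the Bellman equation J = T(J).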
1.12
For simplicity, we consider the case where U(i) consists of a single control; the calculations are very similar in the more general case. Here M̃ and g̃ are defined by
M̃_ij = λδ_ij + (1 − λ)(M_ij − λ m_i δ_ij)/(1 − λ m_i), g̃_i = (1 − λ)g_i/(1 − λ m_i).
We first compute the row sums of M̃. Applying the definition of M̃_ij and using Σ_{j=1}^n M_ij = α, we have
Σ_{j=1}^n M̃_ij = Σ_{j=1}^n λδ_ij + (1 − λ)( Σ_{j=1}^n M_ij − λ m_i )/(1 − λ m_i)
= λ + (1 − λ)(α − λ m_i)/(1 − λ m_i)
= 1 − (1 − λ)(1 − α)/(1 − λ m_i) < 1.
Let J₁, …, J_n satisfy
J_i = g_i + Σ_{j=1}^n M_ij J_j. (1)
We substitute J into the new equation
J_i = g̃_i + Σ_{j=1}^n M̃_ij J_j
and manipulate the equation until we reach a relation that holds trivially:
J_i = (1 − λ)g_i/(1 − λ m_i) + λ Σ_{j=1}^n δ_ij J_j + (1 − λ)/(1 − λ m_i) Σ_{j=1}^n (M_ij − λ m_i δ_ij)J_j
= (1 − λ)g_i/(1 − λ m_i) + λJ_i + (1 − λ)/(1 − λ m_i) Σ_{j=1}^n M_ij J_j − λ m_i (1 − λ)/(1 − λ m_i) J_i
= J_i + (1 − λ)/(1 − λ m_i) [ g_i + Σ_{j=1}^n M_ij J_j − J_i ].
This relation follows trivially from Eq. (1) above. Thus J is a solution of
J_i = g̃_i + Σ_{j=1}^n M̃_ij J_j.
1.17
The form of Bellman's equation for the tax problem is
J(x) = min_i [ Σ_{j≠i} c_j(x_j) + α E_{w_i}{ J(x_1, …, f_i(x_i, w_i), …, x_n) } ].
Let J′(x) = −J(x). Then
J′(x) = max_i [ −Σ_{j=1}^n c_j(x_j) + c_i(x_i) + α E_{w_i}{ J′(x_1, …, f_i(x_i, w_i), …, x_n) } ].
Let Ĵ(x) = (1 − α)J′(x) + Σ_{j=1}^n c_j(x_j). By substitution we obtain
Ĵ(x) = max_i [ −(1 − α) Σ_{j=1}^n c_j(x_j) + (1 − α)c_i(x_i) + α E_{w_i}{ (1 − α)J′(x′) } ] + Σ_{j=1}^n c_j(x_j)
= max_i [ c_i(x_i) − α E_{w_i}{ c_i(f_i(x_i, w_i)) } + α E_{w_i}{ Ĵ(x′) } ],
where x′ = (x_1, …, f_i(x_i, w_i), …, x_n). Thus Ĵ satisfies Bellman's equation of a multi-armed bandit problem with
R_i(x_i) = c_i(x_i) − α E_{w_i}{ c_i(f_i(x_i, w_i)) }.
1.18
Bellman's equation for the restart problem is
J(x) = max[ R(x₀) + α E{J(f(x₀, w))}, R(x) + α E{J(f(x, w))} ]. (A)
Now, consider the one-armed bandit problem with reward R(x):
J(x, M) = max{ M, R(x) + α E[J(f(x, w), M)] }. (B)
We have
J(x₀, M) = R(x₀) + α E[J(f(x₀, w), M)] > M
if M < m(x₀), and J(x₀, M) = M if M ≥ m(x₀). This implies that
R(x₀) + α E[J(f(x₀, w), m(x₀))] = m(x₀).
Therefore the forms of both Bellman equations (A) and (B) are the same when M = m(x₀).
Solutions Vol. II, Chapter 2
2.1
(a) (i) First, we need to define a state space for the problem. The obvious choice for a state variable
is our location. However, this does not encapsulate all of the necessary information. We also need to
include the value of c if it is known. Thus, let the state space consist of the following 2m + 2 states: {S, S₁, …, S_m, I₁, …, I_m, D}, where S is associated with being at the starting point with no information, S_i and I_i are associated with being at S and I, respectively, and knowing that c = c_i, and D is the termination state.
At state S, there are two possible controls: go directly to D (direct) or go to an intermediate
point (indirect). If control direct is selected, we go to state D with probability 1, and the cost is
g(S, direct, D) = a. If control indirect is selected, we go to state Ii with probability pi, and the cost is
g(S, indirect, Ii) = b.
At state S_i, for i ∈ {1, …, m}, we have the same controls as at state S. Again, if control direct is selected, we go to state D with probability 1, and the cost is g(S_i, direct, D) = a. If, on the other hand, control indirect is selected, we go to state I_i with probability 1, and the cost is g(S_i, indirect, I_i) = b.
At state I_i, for i ∈ {1, …, m}, there are also two possible controls: go back to the start (start) or go to the destination (dest). If control start is selected, we go to state S_i with probability 1, and the cost is g(I_i, start, S_i) = b. If control dest is selected, we go to state D with probability 1, and the cost is g(I_i, dest, D) = c_i.
We have thus formulated the problem as a stochastic shortest path problem. Bellman's equation for this problem is
J*(S) = min[ a, b + Σ_{i=1}^m p_i J*(I_i) ]
J*(S_i) = min[ a, b + J*(I_i) ]
J*(I_i) = min[ c_i, b + J*(S_i) ].
We assume that b > 0. Then Assumptions 5.1 and 5.2 hold, since all improper policies have infinite cost. As a result, if μ*(I_i) = start, then μ*(S_i) = direct. If μ*(I_i) ≠ start, then we never reach state S_i, and so it does not matter what the control is in this case. Thus J*(S_i) = a and μ*(S_i) = direct. From this, it is easy to derive the optimal costs and controls for the other states:
J*(I_i) = min[ c_i, b + a ],
μ*(I_i) = dest if c_i < b + a, and start otherwise.
J*(S) = min[ a, b + Σ_{i=1}^m p_i min(c_i, b + a) ],
μ*(S) = direct if a < b + Σ_{i=1}^m p_i min(c_i, b + a), and indirect otherwise.
For the numerical case given, we see that a < b + Σ_{i=1}^m p_i min(c_i, b + a), since a = 2 and b + Σ_{i=1}^m p_i min(c_i, b + a) = 2.5. Hence μ*(S) = direct. We need not consider the other states, since they will never be reached.
(ii) In this case, every time we are at the starting location, our available information is the same. We
thus no longer need the states S_i from part (i). Our state space for this part is then {S, I₁, …, I_m, D}.
At state S, the possible controls are {direct, indirect}. If control direct is selected, we go to state D with probability 1, and the cost is g(S, direct, D) = a. If control indirect is selected, we go to state I_i with probability p_i, and the cost is g(S, indirect, I_i) = b [same as in part (i)].
At state I_i, for i ∈ {1, …, m}, the possible controls are {start, dest}. If control start is selected, we go to state S with probability 1, and the cost is g(I_i, start, S) = b. If control dest is selected, we go to state D with probability 1, and the cost is g(I_i, dest, D) = c_i.
Bellman's equation for this stochastic shortest path problem is
J*(S) = min[ a, b + Σ_{i=1}^m p_i J*(I_i) ]
J*(I_i) = min[ c_i, b + J*(S) ].
The optimal policy can be described by
μ*(S) = direct if a < b + Σ_{i=1}^m p_i J*(I_i), and indirect otherwise;
μ*(I_i) = dest if c_i < b + J*(S), and start otherwise.
We will solve the problem for the numerical case by guessing an optimal policy and then showing that the resulting cost J satisfies J = T(J). Since J* is the unique solution to this equation, our policy is then optimal. So let us guess the policy
μ(S) = direct, μ(I₁) = dest, μ(I₂) = start.
Then
J(S) = a = 2, J(I₁) = c₁ = 0, J(I₂) = b + J(S) = 1 + 2 = 3.
From Bellmans equation, we have
J(S) = min(2, 1 + 0.5(3 + 0)) = 2
J(I₁) = min(0, 1 + 2) = 0
J(I₂) = min(5, 1 + 2) = 3.
Thus, our policy is optimal.
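The verification J = T(J) above can be replayed in a few lines, with the numbers of the quoted numerical case (a = 2, b = 1, p₁ = p₂ = 0.5, c₁ = 0, c₂ = 5):

```python
a, b = 2.0, 1.0
p = [0.5, 0.5]
c = [0.0, 5.0]

# costs of the guessed policy: mu(S) = direct, mu(I1) = dest, mu(I2) = start
J_S = a                    # J(S) = 2
J_I = [c[0], b + J_S]      # J(I1) = c1 = 0, J(I2) = b + J(S) = 3

# apply the Bellman operator once and check that the same values come back
TJ_S = min(a, b + sum(pi * Ji for pi, Ji in zip(p, J_I)))
TJ_I = [min(ci, b + J_S) for ci in c]
```

Here TJ_S = min(2, 2.5) = 2 and TJ_I = [0, 3], confirming J = T(J).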
(b) The state space for this problem is the same as for part (a)(ii): {S, I₁, …, I_m, D}.
At state S, the possible controls are {direct, indirect}. If control direct is selected, we go to state D with probability 1, and the cost is g(S, direct, D) = a. If control indirect is selected, we go to state I_i with probability p_i, and the cost is g(S, indirect, I_i) = b [same as in parts (a)(i) and (ii)].
At state I_i, for i ∈ {1, …, m}, we have an additional option of waiting, so the possible controls are {start, dest, wait}. If control start is selected, we go to state S with probability 1, and the cost is g(I_i, start, S) = b. If control dest is selected, we go to state D with probability 1, and the cost is g(I_i, dest, D) = c_i. If control wait is selected, we go to state I_j with probability p_j, and the cost is g(I_i, wait, I_j) = d.
Bellman's equation is
J*(S) = min[ a, b + Σ_{i=1}^m p_i J*(I_i) ]
J*(I_i) = min[ c_i, b + J*(S), d + Σ_{j=1}^m p_j J*(I_j) ].
We can describe the optimal policy as follows:
μ*(S) = direct if a < b + Σ_{i=1}^m p_i J*(I_i), and indirect otherwise.
If direct is selected, we do not need to consider the other states (other than D), since they will never be reached. If indirect is selected, then, defining k = min(2b, d), we see that
μ*(I_i) = dest if c_i < k + Σ_{j=1}^m p_j J*(I_j),
μ*(I_i) = start if c_i > k + Σ_{j=1}^m p_j J*(I_j) and 2b < d,
μ*(I_i) = wait if c_i > k + Σ_{j=1}^m p_j J*(I_j) and 2b > d.
2.2
Lets define the following states:
H: Last flip outcome was heads
T: Last flip outcome was tails
C: Caught (this is the termination state)
(a) We can formulate this problem as a stochastic shortest path problem with state C being the termination state. There are four possible policies: π₁ = {always flip the fair coin}, π₂ = {always flip the two-headed coin}, π₃ = {flip the fair coin if the last outcome was heads / flip the two-headed coin if the last outcome was tails}, and π₄ = {flip the fair coin if the last outcome was tails / flip the two-headed coin if the last outcome was heads}. The only way to reach the termination state is to be caught cheating. Under all policies except π₁, this is inevitable. Thus π₁ is an improper policy, and π₂, π₃, and π₄ are proper policies.
(b) Let J_{π₁}(H) and J_{π₁}(T) be the rewards of policy π₁ when the starting state is H and T, respectively. The expected benefit starting from state T up to the first return to T (always using the fair coin) is
(1/2)( 1 + 1/2 + 1/2² + ··· ) − m/2 = (1/2)(2 − m).
Therefore
J_{π₁}(T) = +∞ if m < 2, and J_{π₁}(T) = −∞ if m > 2.
Also we have
J_{π₁}(H) = (1/2)(1 + J_{π₁}(H)) + (1/2)J_{π₁}(T),
so
J_{π₁}(H) = 1 + J_{π₁}(T).
It follows that if m > 2, then π₁ results in an infinitely negative reward for any initial state.
(c,d) The expected one-stage rewards at each stage are:
Play fair in state H: 1/2
Cheat in state H: 1 − p
Play fair in state T: (1 − m)/2
Cheat in state T: 0
We show that any policy that cheats at H at some stage cannot be optimal. As a result, we can eliminate cheating from the control constraint set of state H.
Indeed, suppose we are at state H at some stage, and consider a policy π which cheats at the first stage and then follows the optimal policy π* from the second stage on. Consider also a policy π′ which plays fair at the first stage, then follows π* from the second stage on if the outcome of the first stage is H, and cheats at the second stage and follows π* from the third stage on if the outcome of the first stage is T. We have
J_π(H) = (1 − p)[1 + J*(H)],
J_{π′}(H) = (1/2)(1 + J*(H)) + (1/2)(1 − p)[1 + J*(H)]
= (1/2)[1 + J*(H)] + (1/2)J_π(H)
≥ (1/2) + J_π(H),
where the inequality follows from the fact that J*(H) ≥ J_π(H), since π* is optimal. Therefore the reward of policy π can be improved by at least 1/2 by switching to policy π′, and therefore π cannot be optimal.
We now need only consider policies in which the gambler plays fair at state H: π₁ and π₃. Under π₁, we saw from part (b) that the expected rewards are
J_{π₁}(T) = +∞ if m < 2, −∞ if m > 2,
and
J_{π₁}(H) = +∞ if m < 2, −∞ if m > 2.
Under π₃, we have
J_{π₃}(T) = (1 − p)J_{π₃}(H),
J_{π₃}(H) = (1/2)[1 + J_{π₃}(H)] + (1/2)J_{π₃}(T).
Solving these two equations yields
J_{π₃}(T) = (1 − p)/p, J_{π₃}(H) = 1/p.
Thus if m > 2, it is optimal to cheat if the last flip was tails and play fair otherwise, and if m < 2 it is optimal to always play fair.
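The two linear equations for π₃ can be checked numerically for a sample value of p (p = 0.3 is our own choice):

```python
import numpy as np

p = 0.3
# unknowns [J(T), J(H)]:
#   J(T) - (1 - p) J(H) = 0
#   -0.5 J(T) + 0.5 J(H) = 0.5      (rearranged from J(H) = 0.5(1 + J(H)) + 0.5 J(T))
A = np.array([[1.0, -(1.0 - p)],
              [-0.5, 0.5]])
b = np.array([0.0, 0.5])
J_T, J_H = np.linalg.solve(A, b)
```

The solve recovers J(T) = (1 − p)/p and J(H) = 1/p, as claimed.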
2.7
(a) Let i be any state in S_m. Then
J*(i) = min_{u∈U(i)} E{ g(i, u, j) + J*(j) }
= min_{u∈U(i)} [ Σ_{j∈S_m} p_ij(u)[g(i, u, j) + J*(j)] + Σ_{j∈S_{m−1}∪···∪S_1∪{t}} p_ij(u)[g(i, u, j) + J*(j)] ]
= min_{u∈U(i)} [ Σ_{j∈S_m} p_ij(u)[g(i, u, j) + J*(j)] + (1 − Σ_{j∈S_m} p_ij(u)) · ( Σ_{j∈S_{m−1}∪···∪S_1∪{t}} p_ij(u)[g(i, u, j) + J*(j)] ) / (1 − Σ_{j∈S_m} p_ij(u)) ].
In the above equation, we can think of the union of S_{m−1}, …, S_1, and {t} as an aggregate termination state t_m associated with S_m. The probability of a transition from i ∈ S_m to t_m (under u) is given by
p_{i t_m}(u) = 1 − Σ_{j∈S_m} p_ij(u).
The corresponding cost of a transition from i ∈ S_m to t_m (under u) is given by
ĝ(i, u, t_m) = ( Σ_{j∈S_{m−1}∪···∪S_1∪{t}} p_ij(u)[g(i, u, j) + J*(j)] ) / p_{i t_m}(u).
Thus, for i ∈ S_m, Bellman's equation can be written as
J*(i) = min_{u∈U(i)} [ Σ_{j∈S_m} p_ij(u)[g(i, u, j) + J*(j)] + p_{i t_m}(u)[ĝ(i, u, t_m) + 0] ].
Note that, with respect to S_m, the termination state t_m is both absorbing and of zero cost. Let t_m and ĝ(i, u, t_m) be similarly constructed for m = 1, …, M.
The original stochastic shortest path problem can be solved as M stochastic shortest path subproblems. To see how, start with evaluating J*(i) for i ∈ S₁ (where t₁ = {t}). With the values of J*(i), for i ∈ S₁, in hand, the ĝ cost terms for the S₂ problem can be computed. The solution of the original problem continues in this manner, as the solution of M stochastic shortest path problems in succession.
(b) Suppose that in the finite horizon problem there are n states. Define a new state space S_new and sets S_m as follows:
S_new = { (k, i) | k ∈ {0, 1, …, M − 1} and i ∈ {1, 2, …, n} },
S_m = { (k, i) | k = M − m and i ∈ {1, 2, …, n} },
for m = 1, 2, …, M. (Note that the S_m do not overlap.) By associating S_m with the state space of the original finite-horizon problem at stage k = M − m, we see that if i_k ∈ S_m, then i_{k+1} ∈ S_{m−1} under all policies. By augmenting a termination state t which is absorbing and of zero cost, we see that the original finite-horizon problem can be cast as a stochastic shortest path problem with the special structure indicated in the problem statement.
2.8
Let J* be the optimal cost of the original problem and Ĵ* the optimal cost of the modified problem. Then we have
J*(i) = min_u Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J*(j) ),
and
Ĵ*(i) = min_u Σ_{j=1, j≠i}^n [ p_ij(u)/(1 − p_ii(u)) ] ( g(i, u, j) + g(i, u, i)p_ii(u)/(1 − p_ii(u)) + Ĵ*(j) ).
For each i, let μ(i) be a control such that
J*(i) = Σ_{j=1}^n p_ij(μ(i)) ( g(i, μ(i), j) + J*(j) ).
Then
J*(i) = Σ_{j=1, j≠i}^n p_ij(μ(i)) ( g(i, μ(i), j) + J*(j) ) + p_ii(μ(i)) ( g(i, μ(i), i) + J*(i) ).
By collecting the terms involving J*(i) and then dividing by 1 − p_ii(μ(i)),
J*(i) = [ 1/(1 − p_ii(μ(i))) ] [ Σ_{j=1, j≠i}^n p_ij(μ(i)) ( g(i, μ(i), j) + J*(j) ) + p_ii(μ(i)) g(i, μ(i), i) ].
Since Σ_{j=1, j≠i}^n p_ij(μ(i))/(1 − p_ii(μ(i))) = 1, we have
J*(i) = Σ_{j=1, j≠i}^n [ p_ij(μ(i))/(1 − p_ii(μ(i))) ] ( g(i, μ(i), j) + J*(j) + p_ii(μ(i)) g(i, μ(i), i)/(1 − p_ii(μ(i))) ).
Therefore J*(i) is the cost of the stationary policy {μ, μ, …} in the modified problem. Thus
Ĵ*(i) ≤ J*(i), ∀i.
Similarly, for each i, let μ̂(i) be a control such that
Ĵ*(i) = Σ_{j=1, j≠i}^n [ p_ij(μ̂(i))/(1 − p_ii(μ̂(i))) ] ( g(i, μ̂(i), j) + g(i, μ̂(i), i)p_ii(μ̂(i))/(1 − p_ii(μ̂(i))) + Ĵ*(j) ).
Then, using a reverse argument from before, we see that Ĵ*(i) is the cost of the stationary policy {μ̂, μ̂, …} in the original problem. Thus
J*(i) ≤ Ĵ*(i), ∀i.
Combining the two results, we have J*(i) = Ĵ*(i), and thus the two problems have the same optimal costs.
If p_ii(u) = 1 for some i ≠ t, we can eliminate u from U(i) without increasing J*(i) or any other optimal cost J*(j), j ≠ i. If that were not so, every optimal stationary policy would have to use u at state i and would therefore be improper, which is a contradiction.
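The equality of optimal costs can be checked numerically. The sketch below (all data are our own assumptions; state n plays the role of the termination state t) runs value iteration for both the original Bellman operator and the self-transition-free operator of this exercise, and compares the limits.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
idx = np.arange(n)
P = rng.random((m, n, n + 1))
P[:, :, n] += 0.5                    # substantial termination probability: every policy proper
P /= P.sum(axis=2, keepdims=True)
G = rng.random((m, n, n + 1))        # transition costs g(i, u, j); state n is t

def T(J):                            # original problem
    Jext = np.append(J, 0.0)
    return np.min(np.sum(P * (G + Jext), axis=2), axis=0)

def T_mod(J):                        # self-transitions removed, as in the exercise
    Jext = np.append(J, 0.0)
    pii = P[:, idx, idx]             # p_ii(u)
    gii = G[:, idx, idx]             # g(i, u, i)
    S = np.sum(P * (G + Jext), axis=2) - pii * (gii + J)   # sum over j != i
    return np.min((S + pii * gii) / (1.0 - pii), axis=0)

J1 = np.zeros(n)
J2 = np.zeros(n)
for _ in range(600):
    J1, J2 = T(J1), T_mod(J2)
```

Both iterations converge to the same optimal cost vector, as the exercise asserts.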
2.17
Consider a modified stochastic shortest path problem where the state space is denoted by S̃, the control space by Ũ, the transition costs by g̃, and the transition probabilities by p̃. Let the state space be S̃ = S_S ∪ S_SU, where
S_S = {1, …, n, t}, with each i ∈ S_S corresponding to i ∈ S ∪ {t};
S_SU = { (i, u) | i ∈ S, u ∈ U(i) }, with each (i, u) ∈ S_SU corresponding to i ∈ S and u ∈ U(i).
For i, j ∈ S_S and u ∈ U(i), we define Ũ(i) = U(i), g̃(i, u, j) = g(i, u, j), and p̃_ij(u) = p_ij(u). For (i, u) ∈ S_SU and j ∈ S_S, the only possible control is ū = u (i.e., Ũ(i, u) = {u}), and we have g̃((i, u), u, j) = g(i, u, j) and p̃_{(i,u)j}(u) = p_ij(u).
Since trajectories originating from a state i ∈ S_S are equivalent to trajectories in the original problem, the optimal cost-to-go value for state i in the modified problem is J*(i), the optimal cost-to-go value from the original problem. Let us denote the optimal cost-to-go value for (i, u) ∈ S_SU by J̃*(i, u). Then J*(i) and J̃*(i, u) solve uniquely Bellman's equation of the modified problem, which is
J*(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J*(j) ), (1)
J̃*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J*(j) ). (2)
The Q-factors for the original problem are defined as
Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J*(j) ),
so from Eq. (2), we have
Q(i, u) = J̃*(i, u), ∀(i, u). (3)
Also, from Eqs. (1) and (2), we have
J*(i) = min_{u∈U(i)} J̃*(i, u), ∀i. (4)
Thus from Eqs. (1)-(4), we obtain
Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{u′∈U(j)} Q(j, u′) ). (5)
There remains to show that there is no other solution to Eq. (5). Indeed, if Q̃(i, u) were such that
Q̃(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{u′∈U(j)} Q̃(j, u′) ), ∀(i, u), (6)
then by defining
J̃(i) = min_{u∈U(i)} Q̃(i, u), (7)
we obtain from Eq. (6)
Q̃(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J̃(j) ), ∀(i, u). (8)
By combining Eqs. (7) and (8), we have
J̃(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J̃(j) ), ∀i. (9)
Thus J̃(i) and Q̃(i, u) satisfy Bellman's equations (1)-(2) for the modified problem. Since this Bellman equation is solved uniquely by J*(i) and J̃*(i, u), we see that
Q̃(i, u) = J̃*(i, u) = Q(i, u), ∀(i, u).
Thus the Q-factors Q(i, u) solve Eq. (5) uniquely.
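Eq. (5) can be solved by fixed-point iteration on the Q-factors. The sketch below uses assumed random discounted data rather than an SSP (a simplification of ours), and checks that min_u Q(i, u) agrees with the J* produced by ordinary value iteration, consistent with Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, alpha = 4, 3, 0.9
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))              # g(i, u, j), stored as (u, i, j)

# Q-factor iteration: Q(i,u) <- sum_j p_ij(u) [ g(i,u,j) + alpha * min_u' Q(j,u') ]
Q = np.zeros((m, n))
for _ in range(600):
    Q = np.sum(P * (g + alpha * np.min(Q, axis=0)), axis=2)

# ordinary value iteration for comparison
J = np.zeros(n)
for _ in range(600):
    J = np.min(np.sum(P * (g + alpha * J), axis=2), axis=0)
```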
Solutions Vol. II, Chapter 3
3.4
By using the relation T(J) ≤ J + εe and the monotonicity of T, we obtain
T²(J) ≤ T(J) + αεe ≤ J + εe + αεe.
Proceeding similarly, we obtain
T^k(J) ≤ T(J) + α Σ_{i=0}^{k−2} αⁱ εe ≤ J + Σ_{i=0}^{k−1} αⁱ εe,
and by taking the limit as k → ∞, the desired result J* ≤ J + (ε/(1 − α))e follows.
3.5
Under Assumption P, we have by Prop. 1.2(a) that J* ≤ J′. Let r > 0 be such that
J′ ≤ J* + re.
Then, applying T^k to this inequality, we have
J′ = T^k(J′) ≤ T^k(J*) + α^k re = J* + α^k re.
Taking the limit as k → ∞, we obtain J′ ≤ J*, which, combined with the earlier shown relation J* ≤ J′, yields J′ = J*. Under Assumption N, the proof is analogous, using Prop. 1.2(b).
3.8
From the proof of Proposition 1.1, we know that for any sequence {εᵢ} with εᵢ > 0 there exists a policy π such that
J_π(x) ≤ J*(x) + Σ_{i=0}^∞ αⁱεᵢ.
Let
εᵢ = ε/(2^{i+1}αⁱ) > 0.
Then
J_π(x) ≤ J*(x) + Σ_{i=0}^∞ ε/2^{i+1} = J*(x) + ε, ∀x ∈ S.
where
μ*_i(x) = −(R_i + B_i′K_{i+1}B_i)⁻¹B_i′K_{i+1}A_i x,
μ*_{p−1}(x) = −(R_{p−1} + B_{p−1}′K₀B_{p−1})⁻¹B_{p−1}′K₀A_{p−1} x,
and K₀, …, K_{p−1} satisfy the coupled set of p algebraic Riccati equations
K_i = A_i′[ K_{i+1} − K_{i+1}B_i(R_i + B_i′K_{i+1}B_i)⁻¹B_i′K_{i+1} ]A_i + Q_i, i = 0, …, p − 2,
K_{p−1} = A_{p−1}′[ K₀ − K₀B_{p−1}(R_{p−1} + B_{p−1}′K₀B_{p−1})⁻¹B_{p−1}′K₀ ]A_{p−1} + Q_{p−1}.
3.14
The formulation of the problem falls under Assumption P for periodic policies; moreover, the problem is discounted. Since the w_k are independent with zero mean, the optimality equation for the equivalent stationary problem reduces to the following system of equations:
J(x₀, 0) = min_{u₀∈U(x₀)} E_{w₀}{ x₀′Q₀x₀ + u₀(x₀)′R₀u₀(x₀) + αJ(A₀x₀ + B₀u₀ + w₀, 1) }
J(x₁, 1) = min_{u₁∈U(x₁)} E_{w₁}{ x₁′Q₁x₁ + u₁(x₁)′R₁u₁(x₁) + αJ(A₁x₁ + B₁u₁ + w₁, 2) }
···
J(x_{p−1}, p−1) = min_{u_{p−1}∈U(x_{p−1})} E_{w_{p−1}}{ x_{p−1}′Q_{p−1}x_{p−1} + u_{p−1}(x_{p−1})′R_{p−1}u_{p−1}(x_{p−1}) + αJ(A_{p−1}x_{p−1} + B_{p−1}u_{p−1} + w_{p−1}, 0) } (1)
From the analysis of periodic problems in Section 7.8 of Ch. 7, we see that there exists a periodic policy
{ μ₀*, μ₁*, …, μ*_{p−1}, μ₀*, μ₁*, …, μ*_{p−1}, … }
which is optimal. In order to obtain the solution, we argue as follows. Let us assume that the solution is of the same form as the one for the general quadratic problem; in particular, assume that
J(x, i) = x′K_i x + c_i,
where c_i is a constant and K_i is positive definite. This is justified by applying the successive approximation method and observing that the sets
U_k(x_i, λ, i) = { u_i ∈ ℝ^m | x′Qx + u_i′Ru_i + (Ax + Bu_i)′K^k_{i+1}(Ax + Bu_i) ≤ λ }
are compact. The latter claim can be seen from the fact that R > 0 and K^k_{i+1} ≥ 0. Then, by Proposition 7.7, lim_{k→∞} J_k(x_i, i) = J(x_i, i), and the form of the solution obtained from successive approximation is as described above.
In particular, we have for 0 ≤ i ≤ p − 1:
J(x, i) = min_{u_i∈U(x_i)} E_{w_i}{ x′Q_i x + u_i(x)′R_i u_i(x) + αJ(A_i x + B_i u_i + w_i, i + 1) }
= min_{u_i∈U(x_i)} E_{w_i}{ x′Q_i x + u_i′R_i u_i + α[(A_i x + B_i u_i + w_i)′K_{i+1}(A_i x + B_i u_i + w_i) + c_{i+1}] }
= min_{u_i∈U(x_i)} { x′(Q_i + αA_i′K_{i+1}A_i)x + u_i′(R_i + αB_i′K_{i+1}B_i)u_i + 2αx′A_i′K_{i+1}B_i u_i + αE{w_i′K_{i+1}w_i} + αc_{i+1} },
where we have taken into consideration the fact that E{w_i} = 0. Minimizing the above quantity gives
u_i* = −α(R_i + αB_i′K_{i+1}B_i)⁻¹B_i′K_{i+1}A_i x. (2)
Thus
J(x, i) = x′[ Q_i + A_i′( αK_{i+1} − α²K_{i+1}B_i(R_i + αB_i′K_{i+1}B_i)⁻¹B_i′K_{i+1} )A_i ]x + c_i = x′K_i x + c_i,
where c_i = αE_{w_i}{w_i′K_{i+1}w_i} + αc_{i+1} and
K_i = Q_i + A_i′( αK_{i+1} − α²K_{i+1}B_i(R_i + αB_i′K_{i+1}B_i)⁻¹B_i′K_{i+1} )A_i.
Now, for this solution to be consistent we must have K_p = K₀. This leads to the following system of equations:
K₀ = Q₀ + A₀′( αK₁ − α²K₁B₀(R₀ + αB₀′K₁B₀)⁻¹B₀′K₁ )A₀
···
K_i = Q_i + A_i′( αK_{i+1} − α²K_{i+1}B_i(R_i + αB_i′K_{i+1}B_i)⁻¹B_i′K_{i+1} )A_i
···
K_{p−1} = Q_{p−1} + A_{p−1}′( αK₀ − α²K₀B_{p−1}(R_{p−1} + αB_{p−1}′K₀B_{p−1})⁻¹B_{p−1}′K₀ )A_{p−1} (3)
This system of equations has a positive definite solution, since (from the description of the problem) the system is controllable, i.e., there exists a sequence of controls {u₀, …, u_r} such that x_{r+1} = 0. Thus the result follows.
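The coupled system (3) can be solved by iterating the p Riccati maps until the consistency condition K_p = K₀ holds. A scalar sketch with p = 2 follows; all numbers, and the choice of a Jacobi-style iteration, are assumptions of this sketch.

```python
alpha = 0.9
A = [1.1, 0.8]
B = [1.0, 0.5]
Q = [1.0, 2.0]
R = [1.0, 1.0]   # assumed scalar data for the two stages

def riccati(K, i):
    # K_i = Q_i + A_i( alpha*K - alpha^2 K B_i (R_i + alpha B_i K B_i)^{-1} B_i K ) A_i, scalar case
    return Q[i] + A[i] * (alpha * K
                          - alpha**2 * K * B[i] * B[i] * K / (R[i] + alpha * B[i] * K * B[i])) * A[i]

K0, K1 = 0.0, 0.0
for _ in range(1000):
    K0, K1 = riccati(K1, 0), riccati(K0, 1)

residual = max(abs(K0 - riccati(K1, 0)), abs(K1 - riccati(K0, 1)))
```

At convergence both coupled equations hold simultaneously, which is exactly the condition K_p = K₀ of the text.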
3.16
(a) Consider the stationary policy {μ₀, μ₀, …}, where μ₀(x) = L₀x. We have
J₀(x) = 0,
T_{μ₀}(J₀)(x) = x′Qx + x′L₀′RL₀x,
T²_{μ₀}(J₀)(x) = x′Qx + x′L₀′RL₀x + αE{ (Ax + BL₀x + w)′Q(Ax + BL₀x + w) } = x′M₁x + constant,
where M₁ = Q + L₀′RL₀ + α(A + BL₀)′Q(A + BL₀), and
T³_{μ₀}(J₀)(x) = x′Qx + x′L₀′RL₀x + αE{ (Ax + BL₀x + w)′M₁(Ax + BL₀x + w) } = x′M₂x + constant.
Continuing similarly, we get
M_{k+1} = Q + L₀′RL₀ + α(A + BL₀)′M_k(A + BL₀).
Using a very similar analysis as in Section 8.2, we get
M_k → K₀,
where
K₀ = Q + L₀′RL₀ + α(A + BL₀)′K₀(A + BL₀).
(b)
J_{μ₁}(x) = lim_{N→∞} E_{w_k, k=0,…,N−1} { Σ_{k=0}^{N−1} α^k ( x_k′Qx_k + μ₁(x_k)′Rμ₁(x_k) ) } = lim_{N→∞} T^N_{μ₁}(J₀)(x).
Proceeding as in the proof of the validity of policy iteration (Section 7.3, Chapter 7), we have
T_{μ₁}(J_{μ₀}) = T(J_{μ₀}),
J_{μ₀}(x) = x′K₀x + constant = T_{μ₀}(J_{μ₀})(x) ≥ T_{μ₁}(J_{μ₀})(x).
Hence, we obtain
J_{μ₀}(x) ≥ T_{μ₁}(J_{μ₀})(x) ≥ ··· ≥ T^k_{μ₁}(J_{μ₀})(x) ≥ ···,
implying
J_{μ₀}(x) ≥ lim_{k→∞} T^k_{μ₁}(J_{μ₀})(x) = J_{μ₁}(x).
(c) As in part (b), we show that
J_{μ_k}(x) = x′K_k x + constant ≤ J_{μ_{k−1}}(x).
Now, since
0 ≤ x′K_k x ≤ x′K_{k−1}x, ∀x,
we have
K_k → K.
The form of K is
K = α(A + BL)′K(A + BL) + Q + L′RL,
L = −α(αB′KB + R)⁻¹B′KA.
To show that K is indeed the optimal cost matrix, we have to show that it satisfies
K = A′[ αK − α²KB(αB′KB + R)⁻¹B′K ]A + Q = α(A′KA + A′KBL) + Q.
Let us expand the formula for K, using the formula for L:
K = α(A′KA + A′KBL + L′B′KA + L′B′KBL) + Q + L′RL.
Since L′(R + αB′KB)L = −αL′B′KA, substituting, we get
K = α(A′KA + A′KBL + L′B′KA) + Q − αL′B′KA = α(A′KA + A′KBL) + Q.
Thus K is the optimal cost matrix.
A second approach: (a) We know that
J_{μ₀}(x) = lim_{n→∞} T^n_{μ₀}(J₀)(x).
Following the analysis of Section 8.1, we have
J₀(x) = 0,
T_{μ₀}(J₀)(x) = E{ x′Qx + μ₀(x)′Rμ₀(x) } = x′Qx + μ₀(x)′Rμ₀(x) = x′(Q + L₀′RL₀)x,
T²_{μ₀}(J₀)(x) = E{ x′Qx + μ₀(x)′Rμ₀(x) + α(Ax + Bμ₀(x) + w)′Q(Ax + Bμ₀(x) + w) }
= x′( Q + L₀′RL₀ + α(A + BL₀)′Q(A + BL₀) )x + αE{w′Qw}.
Define
K₀⁰ = Q,
K₀^{k+1} = Q + L₀′RL₀ + α(A + BL₀)′K₀^k(A + BL₀).
Then
T^{k+1}_{μ₀}(J₀)(x) = x′K₀^{k+1}x + Σ_{m=0}^{k−1} α^{k−m} E{w′K₀^m w}.
The convergence of K₀^{k+1} follows from the analysis of Section 4.1. Thus
J_{μ₀}(x) = x′K₀x + (α/(1 − α)) E{w′K₀w}
(as in Section 8.1), which proves the required relation.
(b) Let μ₁(x) be the solution of
min_u { u′Ru + α(Ax + Bu)′K₀(Ax + Bu) },
which yields
u₁ = −α(R + αB′K₀B)⁻¹B′K₀Ax = L₁x.
Thus
L₁ = −α(R + αB′K₀B)⁻¹B′K₀A = −M⁻¹Λ,
where M = R + αB′K₀B and Λ = αB′K₀A. Let us consider the cost associated with μ₁ if we ignore the disturbances w:
J_{μ₁}(x₀) = Σ_{k=0}^∞ α^k ( x_k′Qx_k + μ₁(x_k)′Rμ₁(x_k) ) = Σ_{k=0}^∞ α^k x_k′(Q + L₁′RL₁)x_k.
However, we know that
x_{k+1} = (A + BL₁)^{k+1}x₀ + Σ_{m=1}^{k+1} (A + BL₁)^{k+1−m} w_m.
Thus, if we ignore the disturbances w, we get
J_{μ₁}(x₀) = x₀′ [ Σ_{k=0}^∞ α^k ((A + BL₁)′)^k (Q + L₁′RL₁)(A + BL₁)^k ] x₀.
Let us call
K₁ = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k (Q + L₁′RL₁)(A + BL₁)^k. (1)
We know that
K₀ − α(A + BL₀)′K₀(A + BL₀) − L₀′RL₀ = Q.
Substituting in (1), we have
K₁ = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k ( K₀ − α(A + BL₁)′K₀(A + BL₁) )(A + BL₁)^k
+ Σ_{k=0}^∞ α^k ((A + BL₁)′)^k [ α(A + BL₁)′K₀(A + BL₁) − α(A + BL₀)′K₀(A + BL₀) + L₁′RL₁ − L₀′RL₀ ](A + BL₁)^k.
However, we know that
K₀ = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k ( K₀ − α(A + BL₁)′K₀(A + BL₁) )(A + BL₁)^k.
Thus we conclude that
K₁ − K₀ = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k Δ (A + BL₁)^k,
where
Δ = α(A + BL₁)′K₀(A + BL₁) − α(A + BL₀)′K₀(A + BL₀) + L₁′RL₁ − L₀′RL₀.
We manipulate the above equation further and obtain
Δ = L₁′ML₁ − L₀′ML₀ + Λ′(L₁ − L₀) + (L₁ − L₀)′Λ
= −(L₀ − L₁)′M(L₀ − L₁) − (Λ + ML₁)′(L₀ − L₁) − (L₀ − L₁)′(Λ + ML₁).
However, by the definition of L₁ it is seen that
Λ + ML₁ = 0.
Thus
Δ = −(L₀ − L₁)′M(L₀ − L₁).
Since M > 0, we conclude that
K₀ − K₁ = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k (L₀ − L₁)′M(L₀ − L₁)(A + BL₁)^k ≥ 0.
Similarly, the optimal solution for the case where there are no disturbances satisfies the equation
K* = Q + L*′RL* + α(A + BL*)′K*(A + BL*),
with L* = −α(R + αB′K*B)⁻¹B′K*A. If we follow the same steps as above, with M* = R + αB′K*B, we obtain
K₁ − K* = Σ_{k=0}^∞ α^k ((A + BL₁)′)^k (L₁ − L*)′M*(L₁ − L*)(A + BL₁)^k ≥ 0.
Thus K* ≤ K₁ ≤ K₀. Since K₁ is bounded, we conclude that A + BL₁ is stable (otherwise K₁ → ∞). Thus the sum converges, and K₁ is the solution of
K₁ = α(A + BL₁)′K₁(A + BL₁) + Q + L₁′RL₁.
Now, returning to the case with the disturbances w, we conclude, as in case (a), that
J_{μ₁}(x) = x′K₁x + (α/(1 − α)) E{w′K₁w}.
Since K₁ ≤ K₀, we conclude that J_{μ₁}(x) ≤ J_{μ₀}(x), which proves the result.
(c) The policy iteration is defined as follows. Let
L_k = −α(R + αB′K_{k−1}B)⁻¹B′K_{k−1}A.
Then μ_k(x) = L_k x and
J_{μ_k}(x) = x′K_k x + (α/(1 − α)) E{w′K_k w},
where K_k is obtained as the solution of
K_k = α(A + BL_k)′K_k(A + BL_k) + Q + L_k′RL_k.
If we follow the steps of (b), we can prove that
K* ≤ ··· ≤ K_k ≤ ··· ≤ K₁ ≤ K₀. (2)
Thus, by the theorem on monotone convergence of positive operators (Kantorovich and Akilov, Functional Analysis in Normed Spaces, p. 189), we conclude that
K̄ = lim_{k→∞} K_k
exists. Then, if we take the limit on both sides of the equation defining K_k, we have
K̄ = α(A + BL̄)′K̄(A + BL̄) + Q + L̄′RL̄,
with
L̄ = −α(R + αB′K̄B)⁻¹B′K̄A.
However, according to Section 4.1, K* is the unique solution of the above equation. Thus K̄ = K*, and the result follows.
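Part (c) can be illustrated in the scalar case: policy evaluation solves K = Q + L²R + α(A + BL)²K, and policy improvement recomputes L from K. The K_k are monotonically nonincreasing and converge to the Riccati solution. All numbers below are assumptions of this sketch.

```python
alpha, A, B, Q, R = 0.9, 1.2, 1.0, 1.0, 0.5      # assumed scalar data

def evaluate(L):
    acl2 = alpha * (A + B * L) ** 2
    assert acl2 < 1.0                             # the policy must be (discounted-)stable
    return (Q + R * L * L) / (1.0 - acl2)         # K_k solving the linear evaluation equation

def improve(K):
    return -alpha * B * K * A / (R + alpha * B * B * K)

L = -0.5                                          # initial stable policy: 0.9 * 0.49 < 1
Ks = []
for _ in range(30):
    K = evaluate(L)
    Ks.append(K)
    L = improve(K)

K_final = Ks[-1]
riccati_residual = abs(K_final - (Q + alpha * K_final * A * A * R / (R + alpha * B * B * K_final)))
monotone = all(Ks[k + 1] <= Ks[k] + 1e-9 for k in range(len(Ks) - 1))
```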
Solutions Vol. II, Chapter 4
4.4
(a) We have
T^{k+1}h⁰ = T(T^k h⁰) = T( h^k_i + (T^k h⁰)(i)e ) = T(h^k_i) + (T^k h⁰)(i)e.
The ith component of this equation yields
(T^{k+1}h⁰)(i) = (T h^k_i)(i) + (T^k h⁰)(i).
Subtracting these two relations, we obtain
T^{k+1}h⁰ − (T^{k+1}h⁰)(i)e = T h^k_i − (T h^k_i)(i)e,
from which
h^{k+1}_i = T h^k_i − (T h^k_i)(i)e.
Similarly, we have
T^{k+1}h⁰ = T(T^k h⁰) = T( h̄^k + (1/n) Σ_i (T^k h⁰)(i) e ) = T(h̄^k) + (1/n) Σ_i (T^k h⁰)(i) e.
From this equation, we obtain
(1/n) Σ_i (T^{k+1}h⁰)(i) = (1/n) Σ_i (T h̄^k)(i) + (1/n) Σ_i (T^k h⁰)(i).
By subtracting the last two relations, we obtain
h̄^{k+1} = T h̄^k − (1/n) Σ_i (T h̄^k)(i) e.
The proof for ĥ^k is similar.
(b) We have
h̄^k = T^k h⁰ − (1/n)( Σ_i (T^k h⁰)(i) )e = (1/n) Σ_{i=1}^n h^k_i.
So, since each h^k_i converges, the same is true for h̄^k. Also,
ĥ^k = T^k h⁰ − min_i (T^k h⁰)(i) e
and
ĥ^k(j) = (T^k h⁰)(j) − min_i (T^k h⁰)(i) = max_i [ (T^k h⁰)(j) − (T^k h⁰)(i) ] = max_i h^k_i(j).
Since each h^k_i converges, the same is true for ĥ^k.
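Relative value iteration of the kind analyzed above can be sketched numerically (a construction of ours with assumed unichain, aperiodic data; the component that is fixed is i = 1, here index 0):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 3
P = rng.random((m, n, n)) + 0.1        # strictly positive rows: unichain and aperiodic
P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

def T(h):
    return np.min(g + P @ h, axis=0)

h = np.zeros(n)
for _ in range(3000):                  # relative value iteration: h <- T(h) - (Th)(0) e
    Th = T(h)
    h = Th - Th[0]

lam = T(h)[0]                          # optimal average cost estimate, since h(0) = 0
resid = np.max(np.abs(T(h) - (lam + h)))
```

At convergence the pair (lam, h) satisfies the average-cost optimality equation λe + h = T(h).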
4.8
Bellman's equation for the auxiliary (1 − β)-discounted problem is as follows:
J(i) = min_{u∈U(i)} [ g(i, u) + (1 − β) Σ_j p̃_ij(u)J(j) ]. (1)
Using the definition of p̃_ij(u), we obtain
Σ_j p̃_ij(u)J(j) = Σ_{j≠t} (1 − β)⁻¹p_ij(u)J(j) + (1 − β)⁻¹(p_it(u) − β)J(t),
or
Σ_j p̃_ij(u)J(j) = Σ_j (1 − β)⁻¹p_ij(u)J(j) − (1 − β)⁻¹βJ(t).
This together with (1) leads to
J(i) = min_{u∈U(i)} [ g(i, u) + Σ_j p_ij(u)J(j) − βJ(t) ],
or, equivalently,
βJ(t) + J(i) = min_{u∈U(i)} [ g(i, u) + Σ_j p_ij(u)J(j) ]. (2)
Returning to the problem of minimizing the average cost per stage, we notice that we have to solve the equation
λ + h(i) = min_{u∈U(i)} [ g(i, u) + Σ_j p_ij(u)h(j) ]. (3)
Using (2), it follows that (3) is satisfied with λ = βJ(t) and h(i) = J(i) for all i. Thus, by Proposition 2.1, we conclude that βJ(t) is the optimal average cost and J(i) is a corresponding differential cost at state i.
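The identity λ = βJ(t) can be verified numerically: construct p̃ from p as above (this requires p_it(u) ≥ β for all i, u), solve the (1 − β)-discounted problem, and check the average-cost equation (3). All data below are assumptions of this sketch, with t taken to be state 0.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, beta, t = 4, 2, 0.3, 0
Pr = rng.random((m, n, n))
Pr /= Pr.sum(axis=2, keepdims=True)
P = 0.5 * Pr
P[:, :, t] += 0.5                       # guarantees p_it(u) >= 0.5 > beta
g = rng.random((m, n))

Pt = P / (1.0 - beta)                   # auxiliary probabilities p~_ij(u), j != t
Pt[:, :, t] = (P[:, :, t] - beta) / (1.0 - beta)

J = np.zeros(n)
for _ in range(200):                    # value iteration for the (1-beta)-discounted problem
    J = np.min(g + (1.0 - beta) * Pt @ J, axis=0)

lam = beta * J[t]                       # claimed optimal average cost
resid = np.max(np.abs(np.min(g + P @ J, axis=0) - (lam + J)))
```

The residual of Eq. (3) with λ = βJ(t) and h = J is zero up to the value-iteration tolerance, matching the derivation.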