OR II
GSLM 52800
Discounted Problem

the value of $1 in period n+1 is only $\alpha$, $0 < \alpha < 1$, of its value in period n

$v_i^n(R)$ = total discounted cost of starting at state $i$ and adopting policy $R$ for periods 1 to $n$:

$$v_i^n(R) = C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1}(R), \qquad v_i^1(R) = C_{ik}$$

for a fixed policy $R$, $\lim_{n \to \infty} v_i^n(R) = v_i(R)$, i.e.,

$$v_i(R) = C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad i = 0, 1, \dots, M$$
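To make the limit concrete, here is a toy sketch (hypothetical two-state data, not from the slides) showing the finite-horizon values $v_i^n(R)$ of a fixed policy converging to the fixed point $v_i(R)$:

```python
# Toy sketch (hypothetical 2-state data, not from the slides) showing
# v_i^n(R) for a fixed policy converging to the fixed point v_i(R).
import numpy as np

alpha = 0.9
C = np.array([100.0, 300.0])            # C_ik under the fixed policy
P = np.array([[0.5, 0.5],               # p_ij(k) under the fixed policy
              [0.2, 0.8]])

v = C.copy()                            # v^1 = C_ik
for n in range(200):
    v = C + alpha * P @ v               # v^n = C_ik + alpha * P v^{n-1}

# the fixed point solves (I - alpha*P) v = C
v_exact = np.linalg.solve(np.eye(2) - alpha * P, C)
print(v, v_exact)                       # the two agree: lim v^n = v
```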
Evaluating the Expected Value of a Fixed Policy

$\alpha = 0.9$; evaluate the policy that is optimal for the long-term average cost: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3

transition matrix of this policy:

$$P = \begin{pmatrix} 0 & 7/8 & 1/16 & 1/16 \\ 0 & 3/4 & 1/8 & 1/8 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

value-determination equations:

$$\begin{aligned}
v_0(R) &= 0.9\left[\tfrac{7}{8} v_1(R) + \tfrac{1}{16} v_2(R) + \tfrac{1}{16} v_3(R)\right] \\
v_1(R) &= 1000 + 0.9\left[\tfrac{3}{4} v_1(R) + \tfrac{1}{8} v_2(R) + \tfrac{1}{8} v_3(R)\right] \\
v_2(R) &= 4000 + 0.9\, v_1(R) \\
v_3(R) &= 6000 + 0.9\, v_0(R)
\end{aligned}$$

solution: $v_0(R) = 14{,}949$, $v_1(R) = 16{,}262$, $v_2(R) = 18{,}636$, $v_3(R) = 19{,}454$
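The four equations are linear in the four unknowns, so the evaluation step is a single linear solve. A minimal sketch in Python/NumPy (not part of the slides; the costs and transition matrix are the slide's data):

```python
# Verify the value-determination step: solve (I - alpha*P) v = C for the
# fixed policy "do nothing at 0 and 1, overhaul at 2, replace at 3".
import numpy as np

alpha = 0.9
P = np.array([[0, 7/8, 1/16, 1/16],   # state 0: do nothing
              [0, 3/4, 1/8,  1/8 ],   # state 1: do nothing
              [0, 1,   0,    0   ],   # state 2: overhaul -> state 1
              [1, 0,   0,    0   ]])  # state 3: replace  -> state 0
C = np.array([0.0, 1000.0, 4000.0, 6000.0])

v = np.linalg.solve(np.eye(4) - alpha * P, C)
print(v)   # ~ (14948.6, 16261.6, 18635.5, 19453.7),
           # i.e. the slide's rounded 14,949 / 16,262 / 18,636 / 19,454
```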
Policy Improvement

finding an improvement over a given policy: the procedure is similar to that of the MDP for long-term average cost
Policy Improvement

1. Value Determination: fix policy R; solve
$$v_i(R) = C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad \text{for } i = 0, 1, \dots, M$$
2. Policy Improvement: for each state i, find the action k attaining the minimum in
$$\min_{k = 1, 2, \dots, K} \left[ C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) \right]$$
3. Form a new policy from the actions found in step 2. Stop if this policy is the same as R; else go to step 1 (a code sketch of steps 1-3 follows)
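A minimal Python sketch of steps 1-3 (the function name `policy_iteration` and the dict encoding of $C_{ik}$ and $p_{ij}(k)$ are illustrative choices, not from the slides):

```python
# Minimal policy iteration for the discounted problem.
import numpy as np

def policy_iteration(C, P, alpha):
    """C[i][k] = cost, P[i][k] = transition row, for feasible k at state i."""
    M = len(C)
    policy = {i: next(iter(C[i])) for i in range(M)}   # any feasible start
    while True:
        # 1. value determination: solve (I - alpha * P_R) v = C_R
        P_R = np.array([P[i][policy[i]] for i in range(M)])
        C_R = np.array([C[i][policy[i]] for i in range(M)], dtype=float)
        v = np.linalg.solve(np.eye(M) - alpha * P_R, C_R)
        # 2. improvement: argmin_k { C_ik + alpha * sum_j p_ij(k) v_j }
        new = {i: min(C[i], key=lambda k, i=i: C[i][k] + alpha * np.dot(P[i][k], v))
               for i in range(M)}
        # 3. stop when the policy repeats, else re-evaluate
        if new == policy:
            return policy, v
        policy = new
```

Fed with the machine example of these slides (states 0-3; decisions 1 = do nothing, 2 = overhaul, 3 = replace), it stops at the policy evaluated above together with its $v_i(R)$ values.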
Policy Improvement

it can be proven that
$$v_i(R_{n+1}) \le v_i(R_n) \quad \text{for all } i, n,$$
and that the algorithm stops in a finite number of iterations
Example

Iteration 1: Policy Improvement

nothing can be done at state 0, and the machine must be replaced at state 3

possible decisions at
state 1: decision 1 (do nothing, $1000); decision 3 (replace, $6000)
state 2: decision 1 (do nothing, $3000); decision 2 (overhaul, $4000); decision 3 (replace, $6000)

transition probabilities (do nothing at states 0 to 2; replace at state 3):

$$\begin{pmatrix} 0 & 7/8 & 1/16 & 1/16 \\ 0 & 3/4 & 1/8 & 1/8 \\ 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
Example

Iteration 1: Policy Improvement

State 1: $C_{1k} + 0.9\left[p_{10}(k)(14949) + p_{11}(k)(16262) + p_{12}(k)(18636) + p_{13}(k)(19454)\right]$
State 2: $C_{2k} + 0.9\left[p_{20}(k)(14949) + p_{21}(k)(16262) + p_{22}(k)(18636) + p_{23}(k)(19454)\right]$

(using $v_0(R) = 14{,}949$, $v_1(R) = 16{,}262$, $v_2(R) = 18{,}636$, $v_3(R) = 19{,}454$)

State 1:
Decision | C_{1k} | p_{10}(k) | p_{11}(k) | p_{12}(k) | p_{13}(k) | E(value)
1 | 1000 | 0 | 3/4 | 1/8 | 1/8 | 16,262 ← minimum
3 | 6000 | 1 | 0 | 0 | 0 | 19,454

State 2:
Decision | C_{2k} | p_{20}(k) | p_{21}(k) | p_{22}(k) | p_{23}(k) | E(value)
1 | 3000 | 0 | 0 | 1/2 | 1/2 | 20,140
2 | 4000 | 0 | 1 | 0 | 0 | 18,636 ← minimum
3 | 6000 | 1 | 0 | 0 | 0 | 19,454
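A sketch reproducing the two E(value) columns (costs, transition rows, and $v(R)$ values taken from the slides; rounded to whole dollars):

```python
# Recompute E(value) = C_ik + alpha * sum_j p_ij(k) v_j(R) per decision.
import numpy as np

alpha = 0.9
v = np.array([14949.0, 16262.0, 18636.0, 19454.0])
candidates = {                       # state -> {decision: (cost, p_i.(k))}
    1: {1: (1000, [0, 3/4, 1/8, 1/8]), 3: (6000, [1, 0, 0, 0])},
    2: {1: (3000, [0, 0, 1/2, 1/2]), 2: (4000, [0, 1, 0, 0]),
        3: (6000, [1, 0, 0, 0])},
}
for i, options in candidates.items():
    for k, (cost, row) in options.items():
        ev = cost + alpha * np.dot(row, v)
        print(f"state {i}, decision {k}: {round(float(ev))}")
# state 1: 16262, 19454            -> minimum at k = 1 (do nothing)
# state 2: 20140, 18636, 19454    -> minimum at k = 2 (overhaul)
```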
Example

new policy: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3

no change in policy, i.e., the policy is optimal
Linear Programming Approach

$y_{ik}$ = discounted expected time of being in state $i$ and adopting decision $k$

$\beta_j$ = initial probability of being at state $j$

the expected total discounted cost depends on $\{\beta_j\}$, though the minimizing policy does not
Linear Programming Approach

choose $\beta_j$ such that $\sum_{j=0}^{M} \beta_j = 1$, $\beta_j \ge 0$, $j = 0, 1, \dots, M$

solve

$$\begin{aligned}
\min\ & Z = \sum_{i=0}^{M} \sum_{k=1}^{K} C_{ik}\, y_{ik} \\
\text{s.t.}\ & \sum_{k=1}^{K} y_{jk} - \alpha \sum_{i=0}^{M} \sum_{k=1}^{K} y_{ik}\, p_{ij}(k) = \beta_j, \quad j = 0, 1, \dots, M \\
& y_{ik} \ge 0, \quad i = 0, 1, \dots, M;\ k = 1, 2, \dots, K
\end{aligned}$$

$$D_{ik} = P(\text{decision } k \mid \text{state } i) = \frac{y_{ik}}{\sum_{k=1}^{K} y_{ik}}$$
Linear Programming Approach

take $\beta_j = 1/4$:

$$\begin{aligned}
\min\ & Z = 1000 y_{11} + 6000 y_{13} + 3000 y_{21} + 4000 y_{22} + 6000 y_{23} + 6000 y_{33} \\
\text{s.t.}\ & y_{01} - 0.9\,(y_{13} + y_{23} + y_{33}) = \tfrac{1}{4} \\
& y_{11} + y_{13} - 0.9\left(\tfrac{7}{8} y_{01} + \tfrac{3}{4} y_{11} + y_{22}\right) = \tfrac{1}{4} \\
& y_{21} + y_{22} + y_{23} - 0.9\left(\tfrac{1}{16} y_{01} + \tfrac{1}{8} y_{11} + \tfrac{1}{2} y_{21}\right) = \tfrac{1}{4} \\
& y_{33} - 0.9\left(\tfrac{1}{16} y_{01} + \tfrac{1}{8} y_{11} + \tfrac{1}{2} y_{21}\right) = \tfrac{1}{4} \\
& \text{all } y_{ik} \ge 0
\end{aligned}$$

solution: $y_{01} = 1.21$, $(y_{11}, y_{13}) = (6.656, 0)$, $(y_{21}, y_{22}, y_{23}) = (0, 1.067, 0)$, $y_{33} = 1.067$

hence $D_{01} = 1$, $(D_{11}, D_{13}) = (1, 0)$, $(D_{21}, D_{22}, D_{23}) = (0, 1, 0)$, $D_{33} = 1$: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3
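As a check, the LP can be solved with scipy.optimize.linprog; a sketch assuming the variable order $y = (y_{01}, y_{11}, y_{13}, y_{21}, y_{22}, y_{23}, y_{33})$, with the four rows of A_eq spelling out the four constraints above:

```python
# Solve the discounted-MDP LP of the slides with scipy.
import numpy as np
from scipy.optimize import linprog

c = [0, 1000, 6000, 3000, 4000, 6000, 6000]        # objective: C_ik
A_eq = [
    [1,         0,           -0.9, 0,          0,    -0.9, -0.9],  # j = 0
    [-0.9*7/8,  1 - 0.9*3/4,  1,   0,         -0.9,   0,    0  ],  # j = 1
    [-0.9/16,  -0.9/8,        0,   1 - 0.9/2,  1,     1,    0  ],  # j = 2
    [-0.9/16,  -0.9/8,        0,  -0.9/2,      0,     0,    1  ],  # j = 3
]
b_eq = [0.25] * 4                                   # beta_j = 1/4
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(np.round(res.x, 3))
# ~ [1.21, 6.656, 0, 0, 1.067, 0, 1.067]: y_ik > 0 only for
# (0,1), (1,1), (2,2), (3,3), so D_ik = 1 there -- the slide's policy
```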
Successive Approximation

the policy is defined by the argument minimum of the recursive equations; stop when the policy converges

$$v_i^1 = \min_k \{ C_{ik} \}, \quad i = 0, 1, \dots, M$$

$$v_i^n = \min_k \left\{ C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1} \right\}, \quad i = 0, 1, \dots, M$$
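A sketch of this recursion on the machine example (the (cost, transition row) data per feasible decision are taken from the slides; values are rounded each iteration, as on the slides):

```python
# Successive approximation (value iteration) on the machine example.
import numpy as np

alpha = 0.9
A = {                                  # state -> {decision: (C_ik, p_i.(k))}
    0: {1: (0,    [0, 7/8, 1/16, 1/16])},
    1: {1: (1000, [0, 3/4, 1/8, 1/8]), 3: (6000, [1, 0, 0, 0])},
    2: {1: (3000, [0, 0, 1/2, 1/2]), 2: (4000, [0, 1, 0, 0]),
        3: (6000, [1, 0, 0, 0])},
    3: {3: (6000, [1, 0, 0, 0])},
}

v = np.zeros(4)                        # v^0 = 0, so v^1 = min_k C_ik
for n in range(1, 4):                  # the slides' iterations 1, 2, 3
    q = {i: {k: c + alpha * np.dot(p, v) for k, (c, p) in A[i].items()}
         for i in A}
    policy = {i: min(q[i], key=q[i].get) for i in A}
    v = np.round([min(q[i].values()) for i in A]).astype(int)  # round as the slides do
    print(n, v, policy)
# n=1: [0 1000 3000 6000]; n=2: [1294 2688 4900 6000];
# n=3: [2730 4041 6419 7165]; the policy of n=2 repeats at n=3 -> converged
```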
Successive Approximation

Iteration 1:

$$\begin{aligned}
v_0^1 &= C_{01} = 0 \\
v_1^1 &= \min\{C_{11}, C_{13}\} = C_{11} = 1000 \\
v_2^1 &= \min\{C_{21}, C_{22}, C_{23}\} = C_{21} = 3000 \\
v_3^1 &= C_{33} = 6000
\end{aligned}$$
Successive Approximation

Iteration 2: apply $v_i^n = \min_k \{ C_{ik} + \alpha \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1} \}$ with $v_0^1 = 0$, $v_1^1 = 1000$, $v_2^1 = 3000$, $v_3^1 = 6000$

transition probabilities under "do nothing":

State | 0 | 1 | 2 | 3
0 | 0 | 7/8 | 1/16 | 1/16
1 | 0 | 3/4 | 1/8 | 1/8
2 | 0 | 0 | 1/2 | 1/2
3 | 0 | 0 | 0 | 1

$$\begin{aligned}
v_0^2 &= C_{01} + 0.9\left[\tfrac{7}{8}(1000) + \tfrac{1}{16}(3000) + \tfrac{1}{16}(6000)\right] = 1294 \\
v_1^2 &= \min\left\{1000 + 0.9\left[\tfrac{3}{4}(1000) + \tfrac{1}{8}(3000) + \tfrac{1}{8}(6000)\right],\ 6000 + 0.9(0)\right\} = 2688 \\
v_2^2 &= \min\left\{3000 + 0.9\left[\tfrac{1}{2}(3000) + \tfrac{1}{2}(6000)\right],\ 4000 + 0.9(1000),\ 6000 + 0.9(0)\right\} = 4900 \\
v_3^2 &= 6000 + 0.9(0) = 6000
\end{aligned}$$

policy: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3 (no change from the optimal policy found earlier)
Successive Approximation

Iteration 3: same recursion and transition probabilities, now with $v_0^2 = 1294$, $v_1^2 = 2688$, $v_2^2 = 4900$, $v_3^2 = 6000$

$$\begin{aligned}
v_0^3 &= C_{01} + 0.9\left[\tfrac{7}{8}(2688) + \tfrac{1}{16}(4900) + \tfrac{1}{16}(6000)\right] = 2730 \\
v_1^3 &= \min\left\{1000 + 0.9\left[\tfrac{3}{4}(2688) + \tfrac{1}{8}(4900) + \tfrac{1}{8}(6000)\right],\ 6000 + 0.9(1294)\right\} = 4041 \\
v_2^3 &= \min\left\{3000 + 0.9\left[\tfrac{1}{2}(4900) + \tfrac{1}{2}(6000)\right],\ 4000 + 0.9(2688),\ 6000 + 0.9(1294)\right\} = 6419 \\
v_3^3 &= 6000 + 0.9(1294) = 7165
\end{aligned}$$

policy converged: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3