Page 1:

Markov Decision Processes: Infinite Horizon Problems

Alan Fern *

* Based in part on slides by Craig Boutilier and Daniel Weld

Page 2:

What is a solution to an MDP?

MDP Planning Problem:

Input: an MDP (S, A, R, T)   (one concrete array representation is sketched at the end of this page)
Output: a policy that achieves an "optimal value"

- This depends on how we define the value of a policy

- There are several choices, and the solution algorithms depend on the choice

- We will consider two common choices:
  - Finite-Horizon Value
  - Infinite-Horizon Discounted Value
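
Since the rest of the lecture manipulates the (S, A, R, T) tuple directly, here is a minimal sketch, assuming a state-based reward R(s) and numpy arrays, of one way to represent a small MDP. The two-state, two-action numbers are illustrative only and not from the slides; later sketches reuse this toy MDP.

```python
import numpy as np

# A tiny illustrative MDP with |S| = 2 states and |A| = 2 actions.
# R[s]        : reward received in state s (state-based reward, as in these slides)
# T[s, a, s'] : probability of reaching s' when taking action a in state s
R = np.array([0.0, 1.0])

T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # state 0: rows are the distributions for actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # state 1
])

beta = 0.9  # discount factor, 0 <= beta < 1 (used from the next slide onward)

# Sanity check: every (s, a) pair must define a probability distribution over s'.
assert np.allclose(T.sum(axis=2), 1.0)
```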

Page 3:

Discounted Infinite Horizon MDPs

- Defining value as total reward is problematic with infinite horizons (r1 + r2 + r3 + r4 + ...)
  - many or all policies have infinite expected reward
  - some MDPs are ok (e.g., zero-cost absorbing states)

- "Trick": introduce a discount factor 0 ≤ β < 1
  - future rewards are discounted by β per time step

- Note:

  $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \beta^t r_t \,\middle|\, \pi, s\right]$

- Motivation: economic? probability of death? convenience?

Bounded Value:

$V^\pi(s) \le E\left[\sum_{t=0}^{\infty} \beta^t R_{\max}\right] = \frac{R_{\max}}{1 - \beta}$
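
To spell out why the value is bounded: every per-step reward is at most $R_{\max}$, so the discounted sum is dominated by a geometric series,

$\sum_{t=0}^{\infty} \beta^t R_{\max} = R_{\max} \sum_{t=0}^{\infty} \beta^t = \frac{R_{\max}}{1 - \beta} < \infty \quad \text{whenever } 0 \le \beta < 1.$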

Page 4:

Page 5:

Notes: Discounted Infinite Horizon

- Optimal policies are guaranteed to exist (Howard, 1960)
  - i.e., there is a policy that maximizes value at each state

- Furthermore, there is always an optimal stationary policy
  - Intuition: why would we change the action at s at a later time, when there is always forever ahead?

- We define $V^*(s)$ to be the optimal value function
  - That is, $V^*(s) = V^\pi(s)$ for some optimal stationary policy π

Page 6:

Computational Problems

- Policy Evaluation
  - Given a policy π and an MDP, compute Vπ

- Policy Optimization
  - Given an MDP, compute an optimal policy π* and V*
  - We'll cover two algorithms for doing this: value iteration and policy iteration

Page 7:

Policy Evaluation

- Value equation for a fixed policy π

- The equation can be derived from the original definition of infinite-horizon discounted value

$V^\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s') \, V^\pi(s')$

(immediate reward + discounted expected value of following the policy in the future)

Page 8:

Policy Evaluation

- Value equation for a fixed policy π

- How can we compute the value function for a fixed policy?
  - we are given R, T, and π, and want to find Vπ(s) for each state s
  - this is a linear system with n variables and n constraints
    - Variables: the values of the states, V(s1), ..., V(sn)
    - Constraints: one value equation (above) per state
  - Use linear algebra to solve for V (e.g., matrix inverse)

$V^\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s') \, V^\pi(s')$

Page 9:

Page 10:

Policy Evaluation via Matrix Inverse

Vπ and R are n-dimensional column vectors (one element for each state)

T is an n×n matrix such that $T(i, j) = T(s_i, \pi(s_i), s_j)$

$V^\pi = R + \beta T V^\pi$

$(I - \beta T) V^\pi = R$

$V^\pi = (I - \beta T)^{-1} R$
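
As a concrete sketch of the matrix form: the toy MDP, the fixed policy, and all variable names below are my own illustration, and np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable but equivalent to $(I - \beta T)^{-1} R$.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions (illustration only).
R = np.array([0.0, 1.0])                      # R[s]
T = np.array([[[0.9, 0.1], [0.2, 0.8]],       # T[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9
policy = np.array([1, 0])                     # pi(s) for each state s

n = len(R)
# Build T_pi, the n x n matrix with T_pi[i, j] = T(s_i, pi(s_i), s_j).
T_pi = T[np.arange(n), policy]                # shape (n, n)

# Solve (I - beta * T_pi) V = R instead of explicitly inverting the matrix.
V_pi = np.linalg.solve(np.eye(n) - beta * T_pi, R)
print(V_pi)
```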

Page 11:

Computing an Optimal Value Function

- Bellman equation for the optimal value function

- Bellman proved this is always true for an optimal value function

$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V^*(s')$

(immediate reward + discounted expected value of the best action, assuming we get optimal value in the future)

Page 12:

Computing an Optimal Value Function

- Bellman equation for the optimal value function

- How can we solve this equation for V*?
  - The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation

- Idea: let's pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration
  - Adjust the Bellman backup to take discounting into account

$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V^*(s')$

Page 13:

Bellman Backups (Revisited)

[Figure: Bellman backup diagram. From state s, each action (a1, a2) leads stochastically to successor states s1–s4 (transition probabilities 0.7/0.3 and 0.4/0.6), whose values are given by V_k. Compute the expectation over successors for each action, then take the max over actions to obtain V_{k+1}(s).]

$V_{k+1}(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V_k(s')$

Page 14:

Value Iteration

- Can compute the optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but include the discount term)

- Will it converge to the optimal value function as k gets large?
  - Yes

- Why?

$V_0(s) = 0$    ;; could also initialize to R(s)

$V_{k+1}(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V_k(s')$

$\lim_{k \to \infty} V_k = V^*$
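
A compact numpy sketch of this iteration, reusing the toy MDP from the earlier sketches; the fixed iteration count is an arbitrary choice (a principled stopping condition is discussed a few slides below).

```python
import numpy as np

R = np.array([0.0, 1.0])                      # R[s]
T = np.array([[[0.9, 0.1], [0.2, 0.8]],       # T[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9

V = np.zeros(len(R))                          # V_0(s) = 0 (could also use R)
for k in range(200):
    # Q[s, a] = R(s) + beta * sum_{s'} T(s, a, s') * V(s')
    Q = R[:, None] + beta * (T @ V)
    V = Q.max(axis=1)                         # Bellman backup: V_{k+1}(s) = max_a Q[s, a]

print(V)   # approximates V* for large enough k
```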

Page 15:

Convergence of Value Iteration

- Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup

- Value iteration is just the iterative application of B:

$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V(s')$

$V_0 = 0, \qquad V_{k+1} = B[V_k]$

Page 16:

Convergence: Fixed Point Property

- Bellman equation for the optimal value function

- Fixed Point Property: the optimal value function is a fixed point of the Bellman backup operator B
  - That is, B[V*] = V*

$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V^*(s')$

$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') \, V(s')$

Page 17:

Convergence: Contraction Property

- Let ||V|| denote the max-norm of V, which returns the maximum element of the vector
  - E.g., ||(0.1 100 5 6)|| = 100

- B[V] is a contraction operator with respect to the max-norm

- For any V and V', ||B[V] - B[V']|| ≤ β ||V - V'||
  - You will prove this.

- That is, applying B to any two value functions causes them to get closer together in the max-norm sense!
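
This is not a proof (the slide leaves the proof as an exercise), but the inequality can be spot-checked numerically; the toy MDP and the random value functions below are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9

def bellman_backup(V):
    # B[V](s) = R(s) + beta * max_a sum_{s'} T(s, a, s') V(s')
    return R + beta * (T @ V).max(axis=1)

for _ in range(5):
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    lhs = np.max(np.abs(bellman_backup(V1) - bellman_backup(V2)))   # ||B[V1] - B[V2]||
    rhs = beta * np.max(np.abs(V1 - V2))                            # beta * ||V1 - V2||
    print(lhs <= rhs + 1e-12)   # contraction inequality holds on these samples
```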

Page 18:

Page 19:

Convergence

- Using the properties of B we can prove convergence of value iteration

- Proof:
  1. For any V: ||V* - B[V]|| = ||B[V*] - B[V]|| ≤ β ||V* - V||
  2. So applying the Bellman backup to any value function V brings us closer to V* by a constant factor β:
     ||V* - V_{k+1}|| = ||V* - B[V_k]|| ≤ β ||V* - V_k||
  3. This means that ||V_k - V*|| ≤ β^k ||V* - V_0||
  4. Thus $\lim_{k \to \infty} \|V_k - V^*\| = 0$

Page 20:

Value Iteration: Stopping Condition

- Want to stop when we can guarantee that the value function is near optimal

- Key property (not hard to prove):

  If ||V_k - V_{k-1}|| ≤ ε then ||V_k - V*|| ≤ εβ/(1-β)

- Continue iterating until ||V_k - V_{k-1}|| ≤ ε
  - Select a small enough ε for the desired error guarantee
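
The same backup as in the earlier value-iteration sketch, but terminated by this stopping condition; ε and the toy MDP are my own illustrative choices.

```python
import numpy as np

R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta, epsilon = 0.9, 1e-6

V = np.zeros(len(R))
while True:
    V_new = R + beta * (T @ V).max(axis=1)          # Bellman backup
    if np.max(np.abs(V_new - V)) <= epsilon:        # ||V_k - V_{k-1}|| <= epsilon
        V = V_new
        break
    V = V_new

print("||V_k - V*|| <=", epsilon * beta / (1 - beta))   # guarantee from this slide
```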

Page 21:

How to Act

- Given a V_k from value iteration that closely approximates V*, what should we use as our policy?

- Use the greedy policy (one-step lookahead)

- Note that the value of the greedy policy may not be equal to V_k
  - Why?

$greedy[V_k](s) = \arg\max_a \sum_{s'} T(s, a, s') \, V_k(s')$
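
Extracting the greedy policy from a value function, continuing the toy numpy MDP used in the earlier sketches; the particular V_k values here are made up for illustration.

```python
import numpy as np

R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9
V_k = np.array([5.0, 9.0])        # pretend this came from value iteration

# greedy[V_k](s) = argmax_a sum_{s'} T(s, a, s') V_k(s')
greedy_policy = (T @ V_k).argmax(axis=1)
print(greedy_policy)              # one action index per state
```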

Page 22:

How to Act

- Use the greedy policy (one-step lookahead)

- We care about the value of the greedy policy, which we denote by V_g
  - This is how good the greedy policy will be in practice

- How close is V_g to V*?

$greedy[V_k](s) = \arg\max_a \sum_{s'} T(s, a, s') \, V_k(s')$

Page 23:

Value of Greedy Policy

- Define V_g to be the value of this greedy policy
  - This is likely not the same as V_k

- Property: If ||V_k - V*|| ≤ λ then ||V_g - V*|| ≤ 2λβ/(1-β)
  - Thus, V_g is not too far from optimal if V_k is close to optimal

- Our previous stopping condition allows us to bound λ based on ||V_{k+1} - V_k||

- Set the stopping condition so that ||V_g - V*|| ≤ Δ
  - How?

$greedy[V_k](s) = \arg\max_a \sum_{s'} T(s, a, s') \, V_k(s')$

Page 24:

Property: If ||V_k - V*|| ≤ λ then ||V_g - V*|| ≤ 2λβ/(1-β)

Property: If ||V_k - V_{k-1}|| ≤ ε then ||V_k - V*|| ≤ εβ/(1-β)

Goal: ||V_g - V*|| ≤ Δ

Answer: If ||V_k - V_{k-1}|| ≤ Δ(1-β)^2 / (2β^2) then ||V_g - V*|| ≤ Δ
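
The threshold follows by chaining the two properties:

$\|V_k - V_{k-1}\| \le \varepsilon \;\Rightarrow\; \|V_k - V^*\| \le \lambda = \frac{\varepsilon \beta}{1-\beta} \;\Rightarrow\; \|V_g - V^*\| \le \frac{2\lambda\beta}{1-\beta} = \frac{2\varepsilon\beta^2}{(1-\beta)^2},$

so requiring $\frac{2\varepsilon\beta^2}{(1-\beta)^2} \le \Delta$ gives $\varepsilon \le \frac{\Delta(1-\beta)^2}{2\beta^2}$.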

Page 25:

Policy Evaluation Revisited

- Sometimes policy evaluation is expensive due to matrix operations

- Can we have an iterative algorithm like value iteration for policy evaluation?

- Idea: given a policy π and MDP M, create a new MDP M[π] that is identical to M, except that in each state s we only allow the single action π(s)
  - What is the optimal value function V* for M[π]?

- Since the only valid policy for M[π] is π, V* = Vπ.

Page 26:

Policy Evaluation Revisited

- Running VI on M[π] will converge to V* = Vπ.
  - What does the Bellman backup look like here?

- The Bellman backup now only considers one action in each state, so there is no max
  - We are effectively applying a backup restricted by π

Restricted Bellman Backup:

$B_\pi[V](s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s') \, V(s')$

Page 27:

Iterative Policy Evaluation

- Running VI on M[π] is equivalent to iteratively applying the restricted Bellman backup

- The iterates often become close to Vπ for small k

Iterative Policy Evaluation:

$V_0 = 0, \qquad V_{k+1} = B_\pi[V_k]$

Convergence:

$\lim_{k \to \infty} V_k = V^\pi$
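
A sketch of iterative policy evaluation with the restricted backup; the toy MDP, fixed policy, and iteration count are my own illustration.

```python
import numpy as np

R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9
policy = np.array([1, 0])                     # pi(s)

n = len(R)
T_pi = T[np.arange(n), policy]                # T_pi[s, s'] = T(s, pi(s), s')

V = np.zeros(n)                               # V_0 = 0
for k in range(200):
    V = R + beta * T_pi @ V                   # restricted backup B_pi[V]

print(V)   # converges to V^pi; often already close after only a few iterations
```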

Page 28:

Optimization via Policy Iteration

- Policy iteration uses policy evaluation as a subroutine for optimization

- It iterates steps of policy evaluation and policy improvement

1. Choose a random policy π
2. Loop:
   (a) Evaluate Vπ
   (b) π' = ImprovePolicy(Vπ)   [given Vπ, returns a strictly better policy if π isn't optimal]
   (c) Replace π with π'
   Until no improving action is possible at any state

Page 29:

Policy Improvement

- Given Vπ, how can we compute a policy π' that is strictly better than a sub-optimal π?

- Idea: given a state s, take the action that looks the best assuming that we follow policy π thereafter
  - That is, assume the next state s' has value Vπ(s')

For each s in S, set  $\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V^\pi(s')$

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

Page 30:

Proposition: Vπ’ ≥ Vπ with strict inequality for sub-optimal π.

$\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V^\pi(s')$

For any two value functions V1 and V2, we write V1 ≥ V2 to indicate that for all states s, V1(s) ≥ V2(s).

Useful Properties for Proof:

1) For any V1 and V2, if V1 ≥ V2 then B_π[V1] ≥ B_π[V2]  (the restricted backup is monotone)

Page 31:

Proposition: Vπ’ ≥ Vπ with strict inequality for sub-optimal π.

$\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V^\pi(s')$

Proof:

Page 32:

Proposition: Vπ’ ≥ Vπ with strict inequality for sub-optimal π.

$\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V^\pi(s')$

Proof:

Page 33:

Optimization via Policy Iteration

1. Choose a random policy π
2. Loop:
   (a) Evaluate Vπ
   (b) For each s in S, set $\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V^\pi(s')$
   (c) Replace π with π'
   Until no improving action is possible at any state

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

Policy iteration goes through a sequence of improving policies.
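
Putting the pieces together, a minimal policy-iteration sketch with exact evaluation via a linear solve; the toy MDP and all names are my own illustration, and ties in the argmax are broken by lowest action index.

```python
import numpy as np

R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta = 0.9
n = len(R)

def evaluate(policy):
    # Solve (I - beta * T_pi) V = R for V^pi.
    T_pi = T[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)

policy = np.zeros(n, dtype=int)               # arbitrary initial policy
while True:
    V_pi = evaluate(policy)                   # (a) policy evaluation
    improved = (T @ V_pi).argmax(axis=1)      # (b) greedy improvement w.r.t. V^pi
    if np.array_equal(improved, policy):      # (c) stop when no state changes action
        break
    policy = improved

print(policy, evaluate(policy))               # optimal policy and its exact value
```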

Page 34:

Policy Iteration: Convergence

- Convergence is assured in a finite number of iterations
  - Since there is a finite number of policies and each step improves the value, it must converge to an optimal policy

- Gives the exact value of the optimal policy

Page 35:

Policy Iteration Complexity

- Each iteration runs in time polynomial in the number of states and actions

- There are at most |A|^n policies and PI never repeats a policy
  - So at most an exponential number of iterations
  - Not a very good complexity bound

- Empirically O(n) iterations are required; often it seems like O(1)
  - Challenge: try to generate an MDP that requires more than n iterations

- Still no polynomial bound on the number of PI iterations (open problem)!
  - But it may have been solved recently?

Page 36:

Value Iteration vs. Policy Iteration

- Which is faster, VI or PI?
  - It depends on the problem

- VI takes more iterations than PI, but PI requires more time on each iteration
  - PI must perform policy evaluation on each iteration, which involves solving a linear system

- VI is easier to implement since it does not require the policy evaluation step
  - But see the next slide

- We will see that both algorithms will serve as inspiration for more advanced algorithms

Page 37:

Modified Policy Iteration

- Modified Policy Iteration: replaces the exact policy evaluation step with inexact iterative evaluation
  - Uses a small number of restricted Bellman backups for evaluation

- Avoids the expensive policy evaluation step

- Perhaps easier to implement

- Often faster than PI and VI

- Still guaranteed to converge under mild assumptions on starting points

Page 38:

Modified Policy Iteration

1. Choose an initial value function V
2. Loop:
   (a) For each s in S, set $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \, V(s')$   [as in Policy Iteration]
   (b) Partial Policy Evaluation: repeat K times  $V \leftarrow B_\pi[V]$   [approx. evaluation]
   Until the change in V is minimal
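
A sketch of modified policy iteration under these assumptions (K inner backups per outer step); the toy MDP, K, and the convergence threshold are my own choices.

```python
import numpy as np

R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
beta, K, tol = 0.9, 5, 1e-8
n = len(R)

V = np.zeros(n)                               # 1. initial value function
while True:
    old_V = V.copy()
    policy = (T @ V).argmax(axis=1)           # (a) greedy policy w.r.t. current V
    T_pi = T[np.arange(n), policy]
    for _ in range(K):                        # (b) K restricted backups (approx. evaluation)
        V = R + beta * T_pi @ V
    if np.max(np.abs(V - old_V)) < tol:       # until change in V is minimal
        break

print(policy, V)
```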

Page 39:

Recap: things you should know

- What is an MDP?

- What is a policy?
  - Stationary and non-stationary

- What is a value function?
  - Finite-horizon and infinite-horizon

- How to evaluate policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?

- How to optimize policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
  - Why are they correct?