Improved Bounds for Policy Iteration in Markov Decision Problems

Shivaram Kalyanakrishnan
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
[email protected]

November 2017

Collaborators: Neeldhara Misra, Aditya Gopalan, Utkarsh Mall, Ritish Goyal, Anchit Gupta

Sequential Decision Making

[Figure: the agent-environment loop. The agent repeatedly senses the state, thinks, and acts; the environment responds with a reward and the next state.]

Example applications (images):
https://img.tradeindia.com/fp/1/524/panoramic-elevators-564.jpg
http://www.nature.com/polopoly_fs/7.33483.1453824868!/image/WEB_Go—1.jpg_gen/derivatives/landscape_630/WEB_Go—1.jpg

Markov Decision Problems (MDPs)

[Figure: a three-state MDP over s1, s2, s3 with two actions, RED and BLUE. Each arrow is labelled with a (transition probability, reward) pair, e.g. (0.5, 0), (1, 1), (0.5, 3), (0.25, −1), (0.5, −1), (1, 2), (0.75, −2). Discount factor γ = 0.9.]

Elements of an MDP

States (S)
Actions (A)
Transition probabilities (T)
Rewards (R)
Discount factor (γ)

Behaviour is encoded as a Policy π, which maps states to actions.

What is a “good” policy? One that maximises expected long-term reward.

Vπ is the Value Function of π. For s ∈ S,

Vπ(s) = Eπ[ r0 + γ·r1 + γ^2·r2 + … | start state = s ].
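
The value function above is an expectation over trajectories, so it can be illustrated directly by simulation. The Python/NumPy sketch below estimates Vπ(s) by averaging truncated discounted returns on a small made-up 3-state, 2-action MDP; the arrays T and R, the policy pi, and all other names are illustrative placeholders (this is not the MDP drawn in the figure). Later sketches in this transcript reuse these definitions.

import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 3, 2, 0.9              # |S|, |A|, discount factor

# Illustrative MDP (not the one in the figure):
# T[s, a] is a distribution over next states, R[s, a] the immediate reward.
T = np.array([[[0.50, 0.50, 0.00], [1.00, 0.00, 0.00]],
              [[0.00, 0.50, 0.50], [0.00, 0.00, 1.00]],
              [[0.25, 0.00, 0.75], [0.00, 1.00, 0.00]]])
R = np.array([[0.0, 1.0],
              [3.0, 2.0],
              [-1.0, -2.0]])

pi = np.array([0, 1, 0])             # a deterministic policy: state -> action

def mc_value(s, episodes=2000, horizon=200):
    """Estimate Vpi(s) = E[r0 + gamma*r1 + ...] by truncated rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(horizon):
            a = pi[state]
            ret += disc * R[state, a]
            disc *= gamma
            state = rng.choice(n, p=T[state, a])
        total += ret
    return total / episodes

print([round(mc_value(s), 2) for s in range(n)])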

Optimal Policies

Vπ satisfies a recursive equation: Vπ = Rπ + γ·Tπ·Vπ, which gives

Vπ = (I − γTπ)^(−1) Rπ.

π      Vπ(s1)   Vπ(s2)   Vπ(s3)
RRR     4.45     6.55    10.82
RRB    -5.61    -5.75    -4.05
RBR     2.76     4.48     9.12
RBB     2.76     4.48     3.48
BRR    10.0      9.34    13.10
BRB    10.0      7.25    10.0
BBR    10.0     11.0     14.45   ← Optimal policy
BBB    10.0     11.0     10.0

Every MDP is guaranteed to have an optimal policy π⋆, such that

∀π ∈ Π, ∀s ∈ S : Vπ⋆(s) ≥ Vπ(s).

What is the complexity of computing an optimal policy?

Note: an MDP with |S| = n states and |A| = k actions has a total of k^n policies.

One extra definition needed: the Action Value Function Qπ_a for a ∈ A, where Qπ_a = Ra + γ·Ta·Vπ.

Given π, a polynomial computation yields Vπ and Qπ_a for a ∈ A.
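
Both formulas on this slide translate directly into a few lines of linear algebra. The sketch below continues the earlier illustrative snippet (reusing T, R, n, k and gamma): it evaluates a policy exactly by solving (I − γTπ)Vπ = Rπ, computes Qπ_a = Ra + γ·Ta·Vπ, and finds an optimal policy for the toy MDP by enumerating all k^n deterministic policies. Since an optimal policy dominates every other policy in every state, maximising the sum of values over states is enough to pick one out.

from itertools import product

def evaluate(pi):
    """Exact Vpi: solve (I - gamma*T_pi) V = R_pi."""
    T_pi = T[np.arange(n), pi]            # n x n matrix, row s is T(s, pi(s), .)
    R_pi = R[np.arange(n), pi]            # length-n reward vector
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def q_values(V):
    """Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')."""
    return R + gamma * T @ V              # shape (n, k)

# Brute force over all k^n deterministic policies (feasible only for tiny MDPs).
best = max(product(range(k), repeat=n),
           key=lambda p: evaluate(np.array(p)).sum())
print("an optimal policy:", best)
print("its values:", np.round(evaluate(np.array(best)), 2))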

Policy Improvement

Given π,
Pick one or more improvable states, and in them,
Switch to an arbitrary improving action.
Let the resulting policy be π′.

[Figure: a strip of eight states s1, ..., s8 with the action chosen by π marked in each state. At the improvable states (the Q-values Qπ(s3) and Qπ(s7) are annotated) the improving actions are highlighted; a second strip shows the resulting policy π′ after the switches.]

Policy Improvement Theorem (H60, B12):
(1) If π has no improvable states, then it is optimal, else
(2) if π′ is obtained as above, then

∀s ∈ S : Vπ′(s) ≥ Vπ(s) and ∃s ∈ S : Vπ′(s) > Vπ(s).
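
In code, an improvement step needs only Vπ and the action values Qπ: a state is improvable when some action's Q-value strictly exceeds the Q-value of the policy's own action. A minimal sketch of one such step, continuing the earlier illustrative snippets and reusing evaluate and q_values; switching every improvable state to the first improving action found is one valid instance of the procedure described above.

def improvable(pi, tol=1e-10):
    """Map each improvable state to its list of improving actions under pi."""
    Q = q_values(evaluate(pi))
    out = {}
    for s in range(n):
        better = [a for a in range(k) if Q[s, a] > Q[s, pi[s]] + tol]
        if better:
            out[s] = better
    return out

def improve(pi):
    """Switch every improvable state to an (arbitrary) improving action.
    Returns the new policy pi', or None if pi has no improvable states."""
    switches = improvable(pi)
    if not switches:
        return None
    new_pi = pi.copy()
    for s, actions in switches.items():
        new_pi[s] = actions[0]            # any improving action will do
    return new_pi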

Policy Iteration (PI)

π ← Arbitrary policy.
While π has improvable states:
    π ← PolicyImprovement(π).

Different switching strategies lead to different routes to the top.

How long are the routes?!
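
Wrapped in a loop, the pseudocode above becomes the sketch below (continuing the earlier snippets and reusing improve); the iteration count it returns is the quantity the bounds on the next slide are about. On the 3-state toy MDP the loop terminates after at most a handful of improvement steps.

def policy_iteration(pi0):
    """Run PI from pi0; return the final policy and the number of iterations."""
    pi, iters = pi0.copy(), 0
    while True:
        nxt = improve(pi)
        if nxt is None:                   # no improvable states: pi is optimal
            return pi, iters
        pi, iters = nxt, iters + 1

pi_star, iters = policy_iteration(np.array([0, 0, 0]))
print("optimal policy:", pi_star, "found after", iters, "improvement steps")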

Switching Strategies and Bounds

Upper bounds on number of iterations

PI Variant                                 Type           k = 2        General k
Howard's PI [H60, MS99]                    Deterministic  O(2^n / n)   O(k^n / n)
Mansour and Singh's Randomised PI [MS99]   Randomised     1.7172^n     ≈ O((k/2)^n)
Batch-switching PI [KMG16a, GK17]          Deterministic  1.6479^n     k^(0.7207 n)
Recursive Simple PI [KMG16b]               Randomised     –            (2 + ln(k − 1))^n

Lower bounds on number of iterations

Ω(n)           Howard's PI on n-state, 2-action MDPs [HZ10].
Ω(1.4142^n)    Simple PI on n-state, 2-action MDPs [MC94].
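
To get a rough sense of the scale of these guarantees, the short standalone script below tabulates the k = 2 upper bounds from the table (and the Ω(1.4142^n) lower-bound growth rate for Simple PI) at a few values of n; the constants are transcribed as stated above.

for num_states in (10, 20, 30):
    bounds = {
        "Howard's PI, 2^n / n":                 2 ** num_states / num_states,
        "Mansour & Singh randomised, 1.7172^n": 1.7172 ** num_states,
        "Batch-switching PI, 1.6479^n":         1.6479 ** num_states,
        "Simple PI lower bound, 1.4142^n":      1.4142 ** num_states,
    }
    print(f"n = {num_states}")
    for name, value in bounds.items():
        print(f"  {name:<40s} {value:16,.0f}")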

Recursive Simple Policy Iteration

Given π,
Pick the improvable state with the highest index, and,
Switch to an improving action picked uniformly at random.
Let the resulting policy be π′.

[Figure: the same eight-state strip as before, showing π and the policy π′ obtained after one such switch.]

Expected number of iterations: (1 + H_{k−1})^n ≤ (2 + ln(k − 1))^n, where H_{k−1} is the (k−1)-th harmonic number.
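
The switching rule is the only change relative to the generic improvement step shown earlier: always pick the highest-indexed improvable state and randomise uniformly over its improving actions (the recursive structure used in the analysis is not shown here). A sketch of one such switch, reusing improvable from the earlier snippets, with states indexed 0 to n−1:

def rspi_step(pi, rng=None):
    """One Recursive Simple PI switch: highest-index improvable state,
    improving action chosen uniformly at random. None if pi is optimal."""
    if rng is None:
        rng = np.random.default_rng()
    switches = improvable(pi)
    if not switches:
        return None
    s = max(switches)                     # highest-index improvable state
    new_pi = pi.copy()
    new_pi[s] = rng.choice(switches[s])   # uniform over improving actions
    return new_pi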

Conclusion

Policy Iteration: widely used algorithm, more than half a century old.

Substantial gap exists between upper and lower bounds.

We furnish several exponential improvements to upper bounds.

Bears similarity to the Simplex algorithm for Linear Programming.

Howard's PI works much better in practice than the variants for which we have shown improved upper bounds!

Open problem: Is the number of iterations taken by Howard's PI on n-state, 2-action MDPs upper-bounded by the (n + 2)-nd Fibonacci number?

For references, see the tutorial:

Theoretical Analysis of Policy Iteration
Tutorial at IJCAI 2017
https://www.cse.iitb.ac.in/~shivaram/resources/ijcai-2017-tutorial-policyiteration/index.html

Thank you!