1/9
Improved Bounds for Policy Iteration in Markov Decision Problems
Shivaram Kalyanakrishnan
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
[email protected]
November 2017
Collaborators: Neeldhara Misra, Aditya Gopalan, Utkarsh Mall, Ritish Goyal, Anchit Gupta
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 1 / 9
Every MDP is guaranteed to have an optimal policy π⋆, such that
∀π ∈ Π, ∀s ∈ S: V^{π⋆}(s) ≥ V^π(s).
What is the complexity of computing an optimal policy?
Note: an MDP with |S| = n states and |A| = k actions has a total of k^n policies.
One extra definition needed: the action value function Q^π_a for a ∈ A:
Q^π_a = R_a + γ T_a V^π.
Given π, a polynomial computation yields V^π and Q^π_a for a ∈ A.
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 4 / 9
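The polynomial computation on this slide is just a linear solve: V^π is the fixed point of V^π = R_π + γ T_π V^π. A minimal sketch in Python, where the array names T, R, pi are my own conventions and not from the slides:

```python
# Policy evaluation sketch (hypothetical names): given a transition tensor
# T[a, s, s'], reward vector R[a, s], and a deterministic policy pi, solve
#   V^pi = R_pi + gamma * T_pi V^pi   <=>   (I - gamma * T_pi) V^pi = R_pi,
# then recover Q^pi_a = R_a + gamma * T_a V^pi for every action a.
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    """T: (k, n, n) transitions, R: (k, n) rewards, pi: (n,) action per state."""
    k, n, _ = T.shape
    T_pi = T[pi, np.arange(n), :]          # (n, n): row s uses action pi[s]
    R_pi = R[pi, np.arange(n)]             # (n,)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    Q = R + gamma * T @ V                  # (k, n): Q[a, s] = Q^pi_a(s)
    return V, Q
```

For γ < 1 the matrix I − γ T_π is invertible, so one n×n solve suffices; that is the polynomial computation the slide refers to.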
5/9
Policy Improvement
Given π,
Pick one or more improvable states, and in them,
Switch to an arbitrary improving action.
Let the resulting policy be π′.
[Figure: a policy π over states s1, s2, …, s8; improvable states and their improving actions are highlighted, and one improvement step yields π′.]
Policy Improvement Theorem (H60, B12):
(1) If π has no improvable states, then it is optimal; else
(2) if π′ is obtained as above, then
∀s ∈ S: V^{π′}(s) ≥ V^π(s) and ∃s ∈ S: V^{π′}(s) > V^π(s).
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 5 / 9
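One improvement step can be sketched directly from the Q-values of the current policy. A state s is improvable if some action beats π(s), i.e. max_a Q^π_a(s) > Q^π_{π(s)}(s). The theorem permits any improving action; the sketch below (my own array names, not from the slides) picks the greedy one:

```python
# Sketch of one policy-improvement step (hypothetical names). Given Q[a, s],
# the action values of pi, flag the improvable states and switch each of them
# to an improving action; the Policy Improvement Theorem then guarantees the
# new policy weakly dominates pi and strictly improves at some state.
import numpy as np

def improve_policy(Q, pi):
    """Q: (k, n) action values of pi; pi: (n,). Returns (pi', improvable mask)."""
    n = pi.shape[0]
    current = Q[pi, np.arange(n)]            # Q^pi_{pi(s)}(s) = V^pi(s)
    improvable = Q.max(axis=0) > current     # states with an improving action
    pi_new = pi.copy()
    pi_new[improvable] = Q[:, improvable].argmax(axis=0)  # one improving choice
    return pi_new, improvable
```

The argmax is only one legal choice; any action with Q^π_a(s) > V^π(s) satisfies the theorem's hypothesis.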
6/9
Policy Iteration (PI)
π ← Arbitrary policy.
While π has improvable states:
π ← PolicyImprovement(π).
Different switching strategies lead to different routes to the top.
How long are the routes?!
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 6 / 9
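The loop on this slide can be sketched end-to-end in a few lines; this is a self-contained illustration with my own array names, using greedy (Howard-style) switching as one possible strategy. Each iteration strictly improves some state's value, so with only k^n policies the loop must terminate at an optimal policy:

```python
# Self-contained Policy Iteration sketch (hypothetical names): evaluate the
# current policy by a linear solve, switch every improvable state to a greedy
# improving action, and stop when no improvable state remains.
import numpy as np

def policy_iteration(T, R, gamma, pi0):
    """T: (k, n, n) transitions, R: (k, n) rewards, pi0: (n,) initial policy."""
    k, n, _ = T.shape
    pi = pi0.copy()
    while True:
        # Evaluate pi: solve (I - gamma * T_pi) V = R_pi.
        T_pi = T[pi, np.arange(n), :]
        R_pi = R[pi, np.arange(n)]
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        Q = R + gamma * T @ V                     # Q[a, s]
        current = Q[pi, np.arange(n)]             # equals V^pi
        improvable = Q.max(axis=0) > current + 1e-10
        if not improvable.any():
            return pi, V                          # no improvable states: optimal
        pi[improvable] = Q[:, improvable].argmax(axis=0)  # Howard-style switch
```

Swapping the switching rule in the last line (which states, which actions) yields the different PI variants whose iteration counts the next slide bounds.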
7/9
Switching Strategies and Bounds
Upper bounds on number of iterations

PI Variant             Type           k = 2      General k
Howard’s PI            Deterministic  O(2^n/n)   O(k^n/n)           [H60, MS99]
Mansour and Singh’s
  Randomised PI        Randomised     1.7172^n   ≈ O((k/2)^n)       [MS99]
Batch-switching PI     Deterministic  1.6479^n   k^(0.7207n)        [KMG16a, GK17]
Recursive Simple PI    Randomised     –          (2 + ln(k − 1))^n  [KMG16b]

Lower bounds on number of iterations
Ω(n) for Howard’s PI on n-state, 2-action MDPs [HZ10].
Ω(1.4142^n) for Simple PI on n-state, 2-action MDPs [MC94].
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 7 / 9
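To get a feel for the gaps in the table, it helps to plug in small numbers. The snippet below (an illustration of my own, not from the slides; constants inside the O(·) are dropped) compares the trivial k^n policy count against the improved upper bounds for n = k = 10:

```python
# Illustration (not from the slides): compare the trivial k^n policy count
# against the improved upper bounds from the table, with O(.) constants dropped.
import math

def trivial(n, k):          return k ** n
def howard(n, k):           return k ** n / n            # O(k^n / n)
def batch_switching(n, k):  return k ** (0.7207 * n)     # k^(0.7207 n)
def recursive_simple(n, k): return (2 + math.log(k - 1)) ** n

n, k = 10, 10
bounds = {name: f(n, k) for name, f in
          [("trivial", trivial), ("Howard", howard),
           ("batch-switching", batch_switching),
           ("recursive simple", recursive_simple)]}
# Already at n = k = 10, the improved bounds are exponentially smaller than k^n.
```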
8/9
Recursive Simple Policy Iteration
Given π,
Pick the improvable state with the highest index, and
Switch to an improving action picked uniformly at random.
Let the resulting policy be π′.
[Figure: a policy π over states s1, s2, …, s8, and the policy π′ after one improvement step.]
Expected number of iterations: (1 + H_{k−1})^n ≤ (2 + ln(k − 1))^n.
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 8 / 9
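The switching rule on this slide differs from Howard's PI in exactly two places: which state is switched, and how the new action is chosen. A minimal sketch of one such step (my own array names, not from the slides):

```python
# Sketch of one Recursive Simple PI switch (hypothetical names): among the
# improvable states pick the one with the highest index, then switch it to an
# improving action chosen uniformly at random, as the slide prescribes.
import numpy as np

def rspi_step(Q, pi, rng):
    """Q: (k, n) action values of pi; pi: (n,). One switch, or None if optimal."""
    n = pi.shape[0]
    current = Q[pi, np.arange(n)]                # V^pi(s)
    improvable = np.flatnonzero(Q.max(axis=0) > current)
    if improvable.size == 0:
        return None                              # pi has no improvable states
    s = improvable.max()                          # highest-indexed improvable state
    improving = np.flatnonzero(Q[:, s] > current[s])
    pi_new = pi.copy()
    pi_new[s] = rng.choice(improving)             # uniform over improving actions
    return pi_new
```

The expected (1 + H_{k−1})^n bound comes from the uniform choice over improving actions at that single state.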
9/9
Conclusion
Policy Iteration: widely used algorithm, more than half a century old.
Substantial gap exists between upper and lower bounds.
We furnish several exponential improvements to upper bounds.
Bears similarity to Simplex algorithm for Linear Programming.
Howard’s PI works much better in practice than the variants for which we have
shown improved upper bounds!
Open problem: Is the number of iterations taken by Howard’s PI on n-state,
2-action MDPs upper-bounded by the (n + 2)-nd Fibonacci number?
For references, see the tutorial:
Theoretical Analysis of Policy Iteration. Tutorial at IJCAI 2017.
https://www.cse.iitb.ac.in/~shivaram/resources/ijcai-2017-tutorial-policyiteration/index.html
Thank you!
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 9 / 9