Logical Representations and Computational Methods for Markov Decision Processes
Craig Boutilier
Department of Computer Science
University of Toronto
NASSLI Lecture Slides (c) 2002, C. Boutilier
Planning in Artificial Intelligence
Planning has a long history in AI
• strong interaction with logic-based knowledge representation and reasoning schemes

Basic planning problem:
• Given: start state, goal conditions, actions
• Find: sequence of actions leading from start to goal
• Typically: states correspond to possible worlds; actions and goals specified using a logical formalism (e.g., STRIPS, situation calculus, temporal logic, etc.)

Specialized algorithms, planning as theorem proving, etc. often exploit the logical structure of the problem in various ways to solve it effectively
A Planning Problem
Difficulties for the Classical Model
Uncertainty
• in action effects
• in knowledge of system state
• a “sequence of actions that guarantees goal achievement” often does not exist

Multiple, competing objectives

Ongoing processes
• lack of well-defined termination criteria
Some Specific Difficulties
Maintenance goals: “keep lab tidy”
• goal is never achieved once and for all
• can’t be treated as a safety constraint

Preempted/Multiple goals: “coffee vs. mail”
• must address tradeoffs: priorities, risk, etc.

Anticipation of Exogenous Events
• e.g., wait in the mailroom at 10:00 AM
• on-going processes driven by exogenous events

Similar concerns: logistics, process planning, medical decision making, etc.
Markov Decision Processes
Classical planning models:
• logical representations of deterministic transition systems
• goal-based objectives
• plans as sequences

Markov decision processes generalize this view:
• controllable, stochastic transition system
• general objective functions (rewards) that allow tradeoffs with transition probabilities to be made
• more general solution concepts (policies)
Logical Representations of MDPs
MDPs provide a nice conceptual model
Classical representations and solution methods tend to rely on state-space enumeration
• combinatorial explosion if state given by set of possible worlds/logical interpretations/variable assignments
• Bellman’s curse of dimensionality
Recent work has looked at extending AI-style representational and computational methods to MDPs
• we’ll look at some of these (with a special emphasis on “logical” methods)
Course Overview
Lecture 1
• motivation
• introduction to MDPs: classical model and algorithms

Lecture 2
• decision trees and BDDs; situation calculus
• some simple ways to exploit logical structure: abstraction and decomposition
Course Overview (con’t)
Lecture 3
• decision-theoretic regression
• propositional view as variable elimination
• exploiting decision tree/BDD structure
• approximation
• first-order DTR with situation calculus

Lecture 4
• linear function approximation
• exploiting logical structure of basis functions
• discovering basis functions
Course Overview (con’t)
Lecture 5
• temporal logic for specifying non-Markovian dynamics
• model minimization
• wrap up; further topics
Markov Decision Processes
An MDP has four components, S, A, R, Pr:
• (finite) state set S (|S| = n)
• (finite) action set A (|A| = m)
• transition function Pr(s,a,t)
   – each Pr(s,a,·) is a distribution over S
   – represented by a set of n x n stochastic matrices (one per action)
• bounded, real-valued reward function R(s)
   – represented by an n-vector
   – can be generalized to include action costs: R(s,a)
   – can be stochastic (but replaceable by its expectation)
Model easily generalizable to countable or continuous state and action spaces
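To make the components concrete, here is a minimal sketch of the finite representation in Python: one n x n stochastic matrix per action and an n-vector of rewards. The 2-state, 2-action MDP and all its numbers are invented for illustration.

```python
# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# P[a][s][t] is Pr(s, a, t); each P[a] is an n x n stochastic matrix.
P = [
    [[0.9, 0.1], [0.1, 0.9]],  # action 0: mostly stay in the current state
    [[0.2, 0.8], [0.8, 0.2]],  # action 1: mostly switch state
]
# R[s] is the bounded, real-valued reward function as an n-vector.
R = [0.0, 1.0]  # only state 1 is rewarding

# Sanity check: each Pr(s, a, ·) must be a distribution over S (rows sum to 1).
for a in range(len(P)):
    for s in range(len(R)):
        assert abs(sum(P[a][s]) - 1.0) < 1e-9
```

The same list-of-matrices layout is reused in the algorithm sketches below only because it keeps the indexing Pr(s,a,t) literal; a dictionary or sparse encoding works just as well.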
System Dynamics
Finite state space S
• many or all policies have infinite expected reward
• some MDPs (e.g., zero-cost absorbing states) OK

“Trick”: introduce discount factor 0 ≤ β < 1
• future rewards discounted by β per time step

Motivation: economic? failure prob? convenience?
V_π(s) = E[ Σ_{t=0..∞} β^t R^t | π, s ]   (R^t = reward received at time t)

V_π(s) ≤ E[ Σ_{t=0..∞} β^t R_max ] = R_max / (1 − β)
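A quick numeric check of the bound, with β and R_max chosen arbitrarily for illustration: the (truncated) discounted sum of maximal rewards approaches R_max / (1 − β).

```python
beta, R_max = 0.9, 1.0  # arbitrary illustrative values
# Truncate the infinite sum at t = 1000; beta**1000 is negligible.
discounted_sum = sum(beta ** t * R_max for t in range(1000))
bound = R_max / (1 - beta)
assert abs(discounted_sum - bound) < 1e-6  # both are (essentially) 10.0
```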
Some Notes
Optimal policy maximizes value at each state
Optimal policies guaranteed to exist (Howard60)
Can restrict attention to stationary policies
• why change action at state s at new time t?
We define V*(s) = V_π(s) for some optimal π
Value Equations (Howard 1960)
Value equation for fixed policy value
Bellman equation for optimal value function
V_π(s) = R(s) + β Σ_{s'} Pr(s, π(s), s') · V_π(s')

V*(s) = R(s) + β max_a Σ_{s'} Pr(s, a, s') · V*(s')
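The two equations can be read directly as one-step backup functions. A sketch in plain Python, using a hypothetical 2-state, 2-action MDP (P[a][s][t] = Pr(s, a, t); all numbers invented for illustration):

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def policy_backup(V, pi):
    # V_pi(s) = R(s) + beta * sum_s' Pr(s, pi(s), s') * V(s')
    n = len(V)
    return [R[s] + beta * sum(P[pi[s]][s][t] * V[t] for t in range(n))
            for s in range(n)]

def bellman_backup(V):
    # V*(s) = R(s) + beta * max_a sum_s' Pr(s, a, s') * V(s')
    n = len(V)
    return [R[s] + beta * max(sum(P[a][s][t] * V[t] for t in range(n))
                              for a in range(len(P)))
            for s in range(n)]
```

Applied to the zero value function, either backup simply returns R, since all future value is zero: `bellman_backup([0.0, 0.0])` gives `[0.0, 1.0]`.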
Backup Operators
We can think of the fixed policy equation and the Bellman equation as operators in a vector space
• e.g., L_a(V) = V’ = R + β·P_a·V
• V_π is the unique fixed point of the policy backup operator L_π
• V* is the unique fixed point of the Bellman backup L*

We can compute V_π easily: policy evaluation
• simple linear system with n variables, n constraints
• solve V = R + β·P_π·V

Cannot do this for optimal policy
• max operator makes things nonlinear
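Since V = R + β·P_π·V is linear, policy evaluation amounts to solving (I − β·P_π)·V = R. A self-contained sketch with a tiny Gaussian-elimination solver, run on a hypothetical 2-state MDP (policy and numbers invented for illustration):

```python
def solve_linear(A, b):
    # Gauss-Jordan elimination with partial pivoting; adequate for tiny systems.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate_policy(pi, P, R, beta):
    # Solve (I - beta * P_pi) V = R exactly.
    n = len(R)
    A = [[(1.0 if s == t else 0.0) - beta * P[pi[s]][s][t] for t in range(n)]
         for s in range(n)]
    return solve_linear(A, R)

# Hypothetical 2-state, 2-action MDP; evaluate the policy pi = [1, 0].
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
V = evaluate_policy([1, 0], P, R, 0.9)

# The result satisfies the fixed-point equation V = R + beta * P_pi V.
for s in range(2):
    backed = R[s] + 0.9 * sum(P[[1, 0][s]][s][t] * V[t] for t in range(2))
    assert abs(V[s] - backed) < 1e-9
```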
Value Iteration
Can compute optimal policy using value iteration, just like FH problems (just include discount term)
• no need to store argmax at each stage (stationary)

V_k(s) = R(s) + β max_a Σ_{s'} Pr(s, a, s') · V_{k-1}(s')
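A minimal value-iteration loop for the finite representation, again on a hypothetical 2-state, 2-action MDP (numbers invented for illustration); each pass applies the update above to every state.

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def value_iteration(eps=1e-8):
    n = len(R)
    V = [0.0] * n
    while True:
        # V_k(s) = R(s) + beta * max_a sum_s' Pr(s, a, s') * V_{k-1}(s')
        Vn = [R[s] + beta * max(sum(P[a][s][t] * V[t] for t in range(n))
                                for a in range(len(P)))
              for s in range(n)]
        if max(abs(x - y) for x, y in zip(Vn, V)) <= eps:
            return Vn
        V = Vn

V_star = value_iteration()
assert V_star[1] > V_star[0]  # the rewarding state is worth more
```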
Convergence
L* is a contraction mapping in R^n
• || L*V – L*V’ || ≤ β || V – V’ ||

When to stop value iteration? when ||V_k − V_{k-1}|| ≤ ε
• ||V_{k+1} − V_k|| ≤ β ||V_k − V_{k-1}||
• this ensures ||V_k − V*|| ≤ εβ / (1 − β)

Convergence is assured
• any guess V: || V* − L*V || = || L*V* − L*V || ≤ β || V* − V ||
• so fixed point theorems ensure convergence
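The contraction property is easy to check numerically: applying the Bellman backup to any two value functions shrinks their sup-norm distance by at least a factor of β. The MDP and the two test value functions below are arbitrary illustrative choices.

```python
# Hypothetical 2-state, 2-action MDP.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def L_star(V):
    # One Bellman backup of V.
    n = len(V)
    return [R[s] + beta * max(sum(P[a][s][t] * V[t] for t in range(n))
                              for a in range(len(P)))
            for s in range(n)]

def sup_dist(U, V):
    # Sup (max) norm distance || U - V ||.
    return max(abs(u - v) for u, v in zip(U, V))

V, W = [3.0, -1.0], [0.5, 4.0]  # two arbitrary value functions
# Contraction: || L*V - L*W || <= beta * || V - W ||
assert sup_dist(L_star(V), L_star(W)) <= beta * sup_dist(V, W) + 1e-12
```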
How to Act
Given V* (or approximation), use greedy policy:

π*(s) = argmax_a Σ_{s'} Pr(s, a, s') · V*(s')

• if V is within ε of V*, then V(π_greedy) is within 2εβ/(1−β) of V*

There exists an ε s.t. the optimal policy is returned
• even if value estimate is off, greedy policy is optimal
• proving you are optimal can be difficult (methods like action elimination can be used)
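Greedy policy extraction, per the argmax above (R(s) and β can be omitted inside the argmax since they do not change which action wins at a state). The MDP and the test value function are hypothetical.

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]

def greedy_policy(V):
    # pi*(s) = argmax_a sum_s' Pr(s, a, s') * V(s')
    n = len(V)
    return [max(range(len(P)),
                key=lambda a: sum(P[a][s][t] * V[t] for t in range(n)))
            for s in range(n)]

# With any V that values state 1 more, the greedy policy heads for state 1
# (action 1 in state 0) and stays put once there (action 0 in state 1).
assert greedy_policy([0.0, 1.0]) == [1, 0]
```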
Policy Iteration
Given fixed policy, can compute its value exactly (solve the linear system V = R + β·P_π·V); then improve the policy greedily w.r.t. that value, and repeat

Convergence assured (Howard)
• intuitively: no local maxima in value space, and each policy must improve value; since there is a finite number of policies, will converge to optimal policy
Very flexible algorithm
• need only improve policy at one state (not each state)

Gives exact value of optimal policy

Generally converges much faster than VI
• each iteration more complex, but fewer iterations
• quadratic rather than linear rate of convergence
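A sketch of the evaluate/improve loop on a hypothetical 2-state MDP (numbers invented for illustration). To keep the sketch short, the evaluation step iterates V ← R + β·P_π·V to near-convergence rather than solving the linear system exactly; the result is the same here.

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def policy_iteration():
    n, m = len(R), len(P)
    pi = [0] * n
    while True:
        # Evaluate pi: iterate V <- R + beta * P_pi V to (near-)convergence.
        V = [0.0] * n
        while True:
            Vn = [R[s] + beta * sum(P[pi[s]][s][t] * V[t] for t in range(n))
                  for s in range(n)]
            if max(abs(x - y) for x, y in zip(Vn, V)) < 1e-10:
                V = Vn
                break
            V = Vn
        # Improve: act greedily with respect to V.
        new_pi = [max(range(m),
                      key=lambda a: sum(P[a][s][t] * V[t] for t in range(n)))
                  for s in range(n)]
        if new_pi == pi:   # no state changed action: pi is optimal
            return pi, V
        pi = new_pi

pi_star, V_pi = policy_iteration()
```

On this example, PI converges in two improvement steps to the policy that moves toward the rewarding state and then stays there.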
Modified Policy Iteration
MPI: a flexible alternative to VI and PI

Run PI, but don’t solve linear system to evaluate policy; instead do several iterations of successive approximation to evaluate policy

You can run SA until near convergence
• but in practice, you often only need a few backups to get an estimate of V(π) that allows improvement in π
• quite efficient in practice
• choosing number of SA steps is a practical issue
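A sketch: the same loop as PI, but the evaluation step is only k successive-approximation backups under the current policy, with a Bellman-residual stopping test. The value of k, the tolerance, and the 2-state MDP are all arbitrary illustrative choices.

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def modified_policy_iteration(k=5, eps=1e-8):
    n, m = len(R), len(P)
    V = [0.0] * n
    while True:
        # Greedy improvement step with respect to the current estimate V.
        pi = [max(range(m),
                  key=lambda a: sum(P[a][s][t] * V[t] for t in range(n)))
              for s in range(n)]
        # Partial evaluation: only k backups of V under pi, not a full solve.
        for _ in range(k):
            V = [R[s] + beta * sum(P[pi[s]][s][t] * V[t] for t in range(n))
                 for s in range(n)]
        # Stop when a full Bellman backup barely changes V.
        Vb = [R[s] + beta * max(sum(P[a][s][t] * V[t] for t in range(n))
                                for a in range(m))
              for s in range(n)]
        if max(abs(x - y) for x, y in zip(Vb, V)) <= eps:
            return pi, Vb

pi_mpi, V_mpi = modified_policy_iteration()
```

With k = 1 this degenerates to value iteration; letting the inner loop run to convergence recovers policy iteration.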
Asynchronous Value Iteration
Needn’t do full backups of VF when running VI

Gauss-Seidel: start with V_k; once you compute V_{k+1}(s), you replace V_k(s) before proceeding to the next state (assume some ordering of states)
• tends to converge much more quickly
• note: V_k no longer the k-stage-to-go VF

AVI: set some V_0; choose a random state s and do a Bellman backup at that state alone to produce V_1; choose another random state, and so on
• if each state is backed up frequently enough, convergence assured
• useful for online algorithms (reinforcement learning)
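A Gauss-Seidel sweep in code: identical to synchronous VI except that V is overwritten in place, so states later in the sweep already see the updated values. The 2-state MDP is again hypothetical; it converges to the same fixed point V*.

```python
# Hypothetical 2-state, 2-action MDP; P[a][s][t] = Pr(s, a, t).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.8, 0.2]]]
R = [0.0, 1.0]
beta = 0.9

def gauss_seidel_vi(eps=1e-8):
    n = len(R)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):  # fixed state ordering; V is updated in place
            v = R[s] + beta * max(sum(P[a][s][t] * V[t] for t in range(n))
                                  for a in range(len(P)))
            delta = max(delta, abs(v - V[s]))
            V[s] = v        # later states in this sweep use the new value
        if delta <= eps:
            return V

V_gs = gauss_seidel_vi()
assert V_gs[1] > V_gs[0]
```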
Some Remarks on Search Trees
Analogy of Value Iteration to decision trees
• decision tree (expectimax search) is really value iteration with computation focussed on reachable states

Real-time Dynamic Programming (RTDP)
• simply real-time search applied to MDPs
• can exploit heuristic estimates of value function
• can bound search depth using discount factor
• can cache/learn values
• can use pruning techniques
References
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, 1994.
D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, 1987.
R. Bellman, Dynamic Programming, Princeton University Press, 1957.
R. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.
C. Boutilier, T. Dean, S. Hanks, Decision Theoretic Planning: Structural Assumptions and Computational Leverage, Journal of Artif. Intelligence Research 11:1–94, 1999.
A. Barto, S. Bradtke, S. Singh, Learning to Act Using Real-Time Dynamic Programming, Artif. Intelligence 72(1–2):81–138, 1995.