Decentralized Decision Making in Partially Observable, Uncertain Worlds
Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst
Joint work with Martin Allen, Christopher Amato, Daniel Bernstein, Alan Carlin, Claudia Goldman, Eric Hansen, Akshat Kumar, Marek Petrik, Sven Seuken, Feng Wu, and Xiaojian Wu
IJCAI’11 Workshop on Decision Making in Partially Observable, Uncertain Worlds Barcelona, Spain July 18, 2011
Decentralized Decision Making

Challenge: How to achieve intelligent coordination of a group of decision makers in spite of stochasticity and partial observability?

Key objective: Develop effective decision-theoretic methods to address the uncertainty about the domain, the outcome of actions, and the knowledge, beliefs, and intentions of the other agents.
Problem Characteristics

- A group of decision makers or agents interact in a stochastic environment
- Each "episode" involves a sequence of decisions over a finite or infinite horizon
- The change in the environment is determined stochastically by the current state and the set of actions taken by the agents
- Each decision maker obtains different partial observations of the overall situation
- The decision makers have the same objectives
Applications

- Autonomous rovers for space exploration
- Protocol design for multi-access broadcast channels
- Coordination of mobile robots
- Decentralized detection and tracking
- Decentralized detection of hazardous weather events
Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion
Decentralized POMDP

A generalization of the POMDP involving multiple cooperating decision makers with different observation functions.

[Figure: two agents jointly act on the world; agent 1 takes action a1 and receives observation o1, agent 2 takes action a2 and receives observation o2, and the world generates a shared reward r.]
DEC-POMDPs

A DEC-POMDP is defined by a tuple 〈S, A1, A2, P, R, Ω1, Ω2, O〉, where:
- S is a finite set of domain states, with initial state s0
- A1, A2 are finite action sets
- P(s, a1, a2, s') is a state transition function
- R(s, a1, a2) is a (shared) reward function
- Ω1, Ω2 are finite observation sets
- O(a1, a2, s', o1, o2) is an observation function

The model generalizes straightforwardly to n agents.
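To make the tuple concrete, here is a minimal sketch of how a two-agent DEC-POMDP might be held in code; all names are illustrative, not from any particular solver:

```python
from dataclasses import dataclass
from typing import Callable

State = int
Action = int
Obs = int

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP <S, A1, A2, P, R, Omega1, Omega2, O>."""
    states: list[State]      # S, finite set of domain states
    s0: State                # initial state
    actions1: list[Action]   # A1
    actions2: list[Action]   # A2
    obs1: list[Obs]          # Omega1
    obs2: list[Obs]          # Omega2
    # P(s, a1, a2, s'): probability of moving to s' from s under (a1, a2)
    P: Callable[[State, Action, Action, State], float]
    # R(s, a1, a2): shared reward for the joint action in state s
    R: Callable[[State, Action, Action], float]
    # O(a1, a2, s', o1, o2): probability the agents observe (o1, o2) in s'
    O: Callable[[Action, Action, State, Obs, Obs], float]
```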
Formal Models
Example: Mobile Robot Planning

- States: grid cell pairs
- Actions: ↑, ↓, ←, →
- Transitions: noisy
- Goal: meet quickly
- Observations: red lines
Example: Cooperative Box-Pushing

Goal: push as many boxes as possible to the goal area; the larger box has a higher reward, but requires two agents to move it.
Solving DEC-POMDPs

- Each agent's behavior is described by a local policy δi
- A policy can be represented as a mapping from local observation sequences to actions, or from local memory states to actions
- Actions can be selected deterministically or stochastically
- The goal is to maximize expected reward over a finite horizon or a discounted infinite horizon
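Using the DecPOMDP structure sketched earlier, a brute-force evaluator of a joint policy over a finite horizon might look as follows; each local policy is assumed to be a dict from observation-sequence tuples to actions (exponential in the horizon, for illustration only):

```python
def evaluate(model, policy1, policy2, horizon):
    """Expected total (undiscounted) reward of a joint policy (delta1, delta2),
    where each local policy maps a tuple of local observations to an action."""
    def V(s, h1, h2, t):
        if t == horizon:
            return 0.0
        a1, a2 = policy1[h1], policy2[h2]
        value = model.R(s, a1, a2)
        for s2 in model.states:
            p = model.P(s, a1, a2, s2)
            if p == 0.0:
                continue
            for o1 in model.obs1:
                for o2 in model.obs2:
                    q = model.O(a1, a2, s2, o1, o2)
                    if q == 0.0:
                        continue
                    # Each agent extends only its own observation history
                    value += p * q * V(s2, h1 + (o1,), h2 + (o2,), t + 1)
        return value
    return V(model.s0, (), (), 0)
```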
Work on Decentralized Decision Making and DEC-POMDPs

- Team theory [Marschak 55; Tsitsiklis & Papadimitriou 82]
- Incorporating dynamics [Witsenhausen 71]
- Communication strategies [Varaiya & Walrand 78; Xuan et al. 01; Pynadath & Tambe 02]
- Approximation algorithms [Peshkin et al. 00; Guestrin et al. 01; Nair et al. 03; Emery-Montemerlo et al. 04]
- First exact DP algorithm [Hansen et al. 04]
- First policy iteration algorithm [Bernstein et al. 05]
- Many recent exact and approximate DEC-POMDP algorithms
Some Fundamental Questions

- Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
- What features of the problem domain affect the complexity, and how?
- Is optimal dynamic programming possible?
- Can dynamic programming be made practical?
- Is it beneficial to treat communication as a separate type of action?
- How can we exploit the locality of agent interaction to develop more scalable algorithms?
Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion
Previous Complexity Results

Finite horizon:
- MDP: P-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]
- POMDP: PSPACE-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]

Infinite-horizon discounted:
- MDP: P-complete [Papadimitriou & Tsitsiklis 87]
- POMDP: Undecidable [Madani et al. 99]
How Hard are DEC-POMDPs? [Bernstein, Givan, Immerman & Zilberstein, UAI 2000; MOR 2002]

- The complexity of finite-horizon DEC-POMDPs had long been hard to establish
- A static version of the problem, where a single set of decisions is made in response to a single set of observations, was shown to be NP-hard [Tsitsiklis and Athans, 1985]
- We proved that two-agent finite-horizon DEC-POMDPs are NEXP-complete
- But these are worst-case results! Are real-world problems easier?
What Features of the Domain Affect the Complexity, and How?

- Factored state spaces (structured domains)
- Independent transitions (IT)
- Independent observations (IO)
- Structured reward function (SR)
- Goal-oriented objectives (GO)
- Degree of observability (partial, full, jointly full)
- Degree and structure of interaction
- Degree of information sharing and communication
Complexity of Sub-Classes [Goldman & Zilberstein, JAIR 2004]

[Figure: map of sub-classes and their worst-case complexity; depending on which of the above properties hold, sub-classes range from P-complete through NP-complete up to NEXP-complete for the general problem.]
Memory-Bounded DP

Combining the two approaches:
- The DP algorithm is a bottom-up approach
- The search operates top-down
- The DP step can only eliminate a policy tree if it is dominated for every belief state
- But only a small subset of the belief space is actually reachable
- Furthermore, the combined approach allows the algorithm to focus on a small subset of joint policies that appear best
Memory-Bounded DP (cont.)
The MBDP Algorithm
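A rough skeleton of the MBDP control flow, combining bottom-up backups with top-down belief selection; this is a sketch, and one_step_trees, exhaustive_backup, heuristic_belief, and best_joint_trees are hypothetical helpers, not a real API:

```python
def mbdp(model, horizon, max_trees, heuristics):
    """Memory-bounded DP sketch: bottom-up backups, top-down belief selection."""
    trees1, trees2 = one_step_trees(model)  # depth-1 policy trees for each agent
    for level in range(horizon - 1, 0, -1):
        # Bottom-up: an exhaustive backup grows every tree by one level
        cand1 = exhaustive_backup(trees1, model, agent=1)
        cand2 = exhaustive_backup(trees2, model, agent=2)
        keep1, keep2 = [], []
        for _ in range(max_trees):
            # Top-down: sample a belief reachable at this depth using a heuristic
            belief = heuristic_belief(heuristics, model, depth=level)
            # Keep only the joint pair of trees that is best for that belief
            bt1, bt2 = best_joint_trees(cand1, cand2, belief, model)
            keep1.append(bt1); keep2.append(bt2)
        trees1, trees2 = keep1, keep2  # memory stays O(maxTrees) per agent
    return best_joint_trees(trees1, trees2, {model.s0: 1.0}, model)
```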
Generating "Good" Belief States

- MDP heuristic: obtained by solving the corresponding fully observable multiagent MDP
- Infinite-horizon heuristic: obtained by solving the corresponding infinite-horizon DEC-POMDP
- Random policy heuristic: can augment another heuristic by adding random exploration
- Heuristic portfolio: maintain a set of belief states generated by a set of different heuristics
- Recursive MBDP
Performance of MBDP
MBDP Successors

- Improved MBDP (IMBDP) [Seuken and Zilberstein, UAI 2007]
- MBDP with Observation Compression (MBDP-OC) [Carlin and Zilberstein, AAMAS 2008]
- Point-Based Incremental Pruning (PBIP) [Dibangoye, Mouaddib, and Chaib-draa, AAMAS 2009]
- PBIP with Incremental Policy Generation (PBIP-IPG) [Amato, Dibangoye, and Zilberstein, AAAI 2009]
- Constraint-Based Dynamic Programming (CBDP) [Kumar and Zilberstein, AAMAS 2009]
- Point-Based Backup for Decentralized POMDPs [Kumar and Zilberstein, AAMAS 2010]
- Point-Based Policy Generation (PBPG) [Wu, Zilberstein, and Chen, AAMAS 2010]
Key Ideas Behind These Algorithms

- Perform search in a reduced policy space
- Exact algorithms perform only lossless pruning
- Approximate algorithms rely on more aggressive pruning
- MBDP represents an exponential-size policy in linear space, O(maxTrees × T)
- The resulting policy is an acyclic finite-state controller
Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion
Infinite-Horizon DEC-POMDPs

- It is unclear how to define a compact belief state without fixing the policies of the other agents
- Value iteration does not generalize to the infinite-horizon case
- Policy iteration for POMDPs [Hansen 98; Poupart & Boutilier 04] can be generalized
- Basic idea: represent local policies using (deterministic or stochastic) finite-state controllers, and define a set of controller transformations that guarantee improvement and convergence
Policies as Controllers

- A finite-state controller represents each local policy
- Fixed memory
- Randomness is used to offset memory limitations
- Action selection: ψ : Qi → ΔAi
- Transitions: η : Qi × Ai × Oi → ΔQi
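A minimal sketch of a stochastic finite-state controller for one agent, with ψ and η stored as probability tables; the class and field names are illustrative, not from any particular implementation:

```python
import random

class StochasticFSC:
    """Local policy as a stochastic finite-state controller."""
    def __init__(self, psi, eta, q0):
        self.psi = psi  # psi[q][a] = P(a | q), action selection
        self.eta = eta  # eta[q][a][o][q'] = P(q' | q, a, o), node transition
        self.q = q0     # current memory node

    def _sample(self, dist):
        # dist is a dict mapping outcomes to probabilities
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def act(self):
        """Sample an action from the current node's action distribution."""
        return self._sample(self.psi[self.q])

    def observe(self, a, o):
        """Update the memory node after taking action a and observing o."""
        self.q = self._sample(self.eta[self.q][a][o])
```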
The value of a two-agent joint controller is given by the Bellman equation:

$$V(s, q_1, q_2) = \sum_{a_1, a_2} \psi_1(q_1, a_1)\, \psi_2(q_2, a_2) \Big[ R(s, a_1, a_2) + \gamma \sum_{s'} P(s, a_1, a_2, s') \sum_{o_1, o_2} O(a_1, a_2, s', o_1, o_2) \sum_{q'_1, q'_2} \eta_1(q_1, a_1, o_1, q'_1)\, \eta_2(q_2, a_2, o_2, q'_2)\, V(s', q'_1, q'_2) \Big]$$
Bounded policy iteration:

Repeat:
1) Evaluate the controller
2) Perform an exhaustive backup
3) Perform value-preserving transformations
Until the controller is ε-optimal for all states

Theorem: For any ε, bounded policy iteration returns a joint controller that is ε-optimal for all initial states in a finite number of iterations.
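In code, the loop above might look as follows; every helper here is a hypothetical placeholder for the corresponding step, not a real API:

```python
def bounded_policy_iteration(model, controllers, epsilon):
    """Sketch of the bounded PI loop; helpers are placeholders."""
    while True:
        # 1) Evaluate: solve the linear system for V(s, q1, q2)
        values = evaluate_controllers(model, controllers)
        # 2) Exhaustive backup: add nodes for every action/transition choice
        controllers = exhaustive_backup(controllers, model)
        # 3) Value-preserving transformations: reductions + bounded DP updates
        controllers = value_preserving_transforms(controllers, model)
        if bellman_error(model, controllers, values) <= epsilon:
            return controllers  # epsilon-optimal for all states
```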
Useful Transformations

- Controller reductions: shrink the controller without sacrificing value
- Bounded dynamic programming updates: increase value while keeping the size fixed
- Both can be done using polynomial-size linear programs
- Both generalize ideas from the POMDP literature, particularly the BPI algorithm [Poupart & Boutilier 03]
Controller Reduction

For some node q_i, find a convex combination of nodes in Q_i \ {q_i} that dominates q_i for all states and all nodes of the other controllers; merge q_i into the convex combination by changing the transition probabilities.

Theorem: A controller reduction is a value-preserving transformation.

The dominance condition requires weights P(q̂_i) such that, for every state s, all other-agent nodes q_{-i}, and every correlation-device state q_c:

$$V(s, q_i, q_{-i}, q_c) + \epsilon \;\le\; \sum_{\hat q_i} P(\hat q_i)\, V(s, \hat q_i, q_{-i}, q_c)$$
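One way to implement this dominance test is as a small linear program over the candidate weights; below is a sketch using scipy, where the value array V and the function name are illustrative, and the nodes of the other controllers (including any correlation device) are flattened into the last axis:

```python
import numpy as np
from scipy.optimize import linprog

def find_dominating_mix(V, qi):
    """V: array of shape (|S|, |Q_i|, |Q_other|) giving V(s, q_i, q_other).
    Returns (weights over the other nodes of Q_i, epsilon) if a convex
    combination dominates node qi for every state and other-agent node."""
    nS, nQ, nQo = V.shape
    others = [q for q in range(nQ) if q != qi]
    # Variables: one weight per candidate node, plus epsilon (last entry).
    # Maximize epsilon  <=>  minimize -epsilon.
    c = np.zeros(len(others) + 1); c[-1] = -1.0
    A_ub, b_ub = [], []
    for s in range(nS):
        for qo in range(nQo):
            # V(s,qi,qo) + eps <= sum_q p_q * V(s,q,qo)
            # rewritten as: -sum_q p_q * V(s,q,qo) + eps <= -V(s,qi,qo)
            A_ub.append(np.append(-V[s, others, qo], 1.0))
            b_ub.append(-V[s, qi, qo])
    A_eq = [np.append(np.ones(len(others)), 0.0)]  # weights sum to 1
    bounds = [(0, None)] * len(others) + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=bounds)
    if res.success and -res.fun >= 0:  # eps >= 0: node qi is dominated
        return dict(zip(others, res.x[:-1])), -res.fun
    return None
```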
Bounded DP Update

For some node q_i, find better parameters assuming that the old parameters will be used from the second step onwards; the new parameters must yield value at least as high for all states and all nodes of the other controllers.

Additional linear constraints:
- ensure the controllers are independent
- all probabilities sum to 1 and are non-negative
The controller optimization can also be written as a nonlinear program (NLP). Maximize the value of the initial joint node for the initial belief,

$$\max \; \sum_{s} b_0(s)\, z(\bar q_0, s),$$

subject to the Bellman constraints, for every joint node $\bar q$ and state $s$:

$$z(\bar q, s) = \sum_{\bar a} x(\bar q, \bar a) \Big[ R(s, \bar a) + \gamma \sum_{s'} P(s' \mid s, \bar a) \sum_{\bar o} O(\bar o \mid s', \bar a) \sum_{\bar q'} y(\bar q, \bar a, \bar o, \bar q')\, z(\bar q', s') \Big]$$

where the variables are $x(\bar q, \bar a) = P(\bar a \mid \bar q)$, $y(\bar q, \bar a, \bar o, \bar q') = P(\bar q' \mid \bar q, \bar a, \bar o)$, and $z(\bar q, s) = V(\bar q, s)$.
Independence Constraints

Independence constraints guarantee that action-selection and controller-transition probabilities for each agent depend only on local information:
- Action selection independence
- Controller transition independence
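One standard way to write these two constraints is as factorizations of the joint parameters into per-agent terms (a sketch; the exact algebraic form used in the original formulation may differ):

$$P(\bar a \mid \bar q) = \prod_i P(a_i \mid q_i) \qquad \text{(action selection)}$$

$$P(\bar q' \mid \bar q, \bar a, \bar o) = \prod_i P(q'_i \mid q_i, a_i, o_i) \qquad \text{(controller transitions)}$$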
Probability Constraints

Probability constraints guarantee that action-selection probabilities and controller-transition probabilities are non-negative and add up to 1. (Superscript f's represent arbitrary fixed values.)
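In terms of the NLP variables x and y above, these are the usual simplex constraints (a generic sketch; the linearized form with fixed values is not reproduced here):

$$\sum_{\bar a} x(\bar q, \bar a) = 1 \;\;\forall \bar q; \qquad \sum_{\bar q'} y(\bar q, \bar a, \bar o, \bar q') = 1 \;\;\forall (\bar q, \bar a, \bar o); \qquad x \ge 0, \; y \ge 0.$$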
Optimality

Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.

Advantages of the NLP approach:
- Efficient policy representation with fixed memory
- The NLP represents the optimal policy for a given size
- Takes advantage of a known start state
- Easy to implement using off-the-shelf solvers

Limitations:
- Difficult to solve optimally
Adding a Correlation Device

The NLP approach can be extended to include a correlation device. A new variable w(c, c') represents the transition function of the correlation device; action selection and controller transitions then depend on the new shared signal.

[Table: values and running times (in seconds) for each controller size using the NLP methods and DEC-BPI, with and without a 2-node correlation device, and BFS. An "x" indicates that the approach was not able to solve the problem.]
NLP Approach Summary

- The NLP defines the optimal fixed-size stochastic controller
- The approach shows consistent improvement over DEC-BPI using an off-the-shelf, locally optimal solver
- A small correlation device can have significant benefits
- Better performance may be obtained by exploiting the structure of the NLP
Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion
Exploiting the Locality of Interaction

In practical settings that involve many agents, each agent often interacts with a small number of "neighboring" agents (e.g., firefighting, sensor networks).

Algorithms designed to exploit this property include LID-JESP [Nair et al., AAAI 05], SPIDER [Varakantham et al., AAMAS 07], and FANS [Marecki et al., AAMAS 08].

FANS uses finite-state controllers (FSCs) for policy representation:
- Exploits FSCs for dynamic programming in policy evaluation and heuristic computations, providing significant speedups
- Introduces novel heuristics to automatically vary the FSC size across agents
- Performs policy search that exploits the locality of agent interactions
Constraint-Based Dynamic Programming (CBDP)

- Models the domain as a Networked Distributed POMDP (ND-POMDP): a restricted class of DEC-POMDPs characterized by a decomposable reward function (see the decomposition after this list)
- Uses point-based dynamic programming (similar to MBDP)
- Uses constraint-network algorithms to improve the efficiency of key steps:
  - computation of the heuristic function
  - belief sampling using the heuristic function
  - finding the best joint policy for a particular belief
- Provides orders-of-magnitude speedups over FANS
- Provides better solution quality on all test instances
- Provides strong theoretical guarantees on time and space complexity, enhancing scalability:
  - linear complexity in the planning-horizon length
  - linear complexity in the number of agents, which is necessary to solve large realistic problems
  - exponential only in a small parameter that depends on the level of interaction among the agents
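The decomposable reward that defines an ND-POMDP can be written as a sum over the hyper-links l of the interaction graph (a sketch following the ND-POMDP literature; here $s_u$ denotes the unaffectable, external part of the state, and $s_l$, $\bar a_l$ the states and actions of the agents on link l):

$$R(s, \bar a) = \sum_{l} R_l(s_l, s_u, \bar a_l)$$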
[Figure: sensor-network configuration; each sensor chooses a scanning direction (N, S, E, W), and targets move between locations loc1 and loc2.]
Sample Results

- A 7-agent configuration with 4 actions per agent; two adjacent agents are required to track a target
- The graphs show the solution quality (left) and running time (right) of our approach (CBDP) compared with the best existing method (FANS)
- FANS does not scale beyond horizon 7; CBDP has linear complexity in the horizon, and it provides better solution quality in less time
[Graphs: solution quality (left) and running time (right; seconds, log scale) of CBDP vs. FANS for horizons 2 through 10.]
New Scalable Approach [Kumar, Zilberstein, and Toussaint, IJCAI 2011]

- Extend an approach [Toussaint and Storkey, ICML 06] that maps planning under uncertainty (POMDP) problems into probabilistic inference
- Characterize general constraints on the interaction graph that facilitate scalable planning
- Introduce an efficient algorithm to solve such models using probabilistic inference
- Identify a number of existing models with such constraints
Value Factorization

- θ = the policy parameters of an agent
- The state space factors: s = (s1, . . . , sM)
- Example: consider four agents whose joint value factorizes as V = V12 + V23 + V34
Existing Models Satisfy VF

- Each agent/state variable can participate in multiple value factors
- The worst-case complexity remains NEXP-complete
- TI-DEC-MDP, ND-POMDP, and TD-POMDP satisfy value factorization
Computational Advantages

Applicability:
- In models that satisfy VF, inference in the EM framework can be done independently in each value factor
- Smaller value factors ⇒ more efficient inference
- Planning is no longer exponential; it is linear in the number of factors

Implementation:
- Distributed planning
- Efficient implementation using message passing
- Parallel computation of messages
Planning by Inference

Recasts planning as likelihood maximization in a DBN mixture with a binary reward variable r:

P(r = 1 | s, a1, a2) ∝ R(s, a1, a2)
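For this proportionality to define a proper probability, the reward is rescaled to [0, 1]; following Toussaint and Storkey, a common normalization is:

$$P(r = 1 \mid s, a_1, a_2) = \frac{R(s, a_1, a_2) - R_{\min}}{R_{\max} - R_{\min}}$$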
[Figure: the DBN mixture]
Exploiting the VF Property

- Exploit the additive nature of the value function for scalability
- The outer mixture simulates the VF property
- Each Vf(θf, sf) is evaluated using a time-dependent mixture

Theorem: Maximizing the likelihood of observing the variable r = 1 optimizes the joint policy.
The Expectation-Maximization Algorithm

- Observed data: r = 1; every other variable is hidden
- Use the EM algorithm to maximize the likelihood
- Implemented using message passing on the VF graph
- Example: 3 factors, {Ag1, Ag2}, {Ag2, Ag3}, and {Ag3, Ag4}
Properties of the EM Algorithm

Scalability:
- The μ message requires independent inference in each factor
- Agents/state variables can be involved in multiple factors, so complex systems can be modeled via simpler interactions
- Distributed planning via message passing

Complexity:
- Linear in the number of factors; exponential only in the number of agents/state variables in a factor

Generality:
- No additional assumptions (such as TOI) required: a general optimization recipe for models with the VF property

Open question: local optima
Experiments

- ND-POMDP domains involving target tracking in sensor networks with imperfect sensing
- Multiple targets; limited sensors with batteries
- Penalty of -1 per sensor for miscoordination or recharging the battery; positive reward (+80) per target scanned simultaneously by two adjacent sensors
Comparisons with the NLP Approach (5P Domain)
Scalability on Larger Benchmarks

- 15-agent and 20-agent domains; internal states = 5
Summary of the EM Approach

- Value factorization (VF) facilitates scalability
- Several existing weakly coupled models satisfy VF
- An EM algorithm can solve models with this property and yield good-quality solutions
- Scalability: the E-step decomposes according to value factors; smaller factors lead to more efficient inference
- Can be easily implemented using message passing among the agents
- Future work: explore techniques for even faster inference, and establish better error bounds
Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion
Back to Some Basic Questions

- Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
- What features of the problem domain affect the complexity, and how?
- Is optimal dynamic programming possible?
- Can dynamic programming be made practical?
- Is it beneficial to treat communication as a separate type of action?
- How can we exploit the locality of agent interaction to develop more scalable algorithms?
Questions?

Additional Information:
Resource-Bounded Reasoning Lab
University of Massachusetts, Amherst
http://rbr.cs.umass.edu