Context-Specific Multiagent Coordination and Planning with Factored MDPs
Carlos Guestrin, Shobha Venkataraman, Daphne Koller
Stanford University (one author: Stanford University -> CMU)

Multiagent Coordination Examples
Search and rescue, factory management, supply chain, firefighting, network routing, air traffic control. These problems share three challenges:
- Multiple, simultaneous decisions
- Limited observability
- Limited communication

Construction Crew Problem: Dynamic Resource Allocation
WANTED: agents that coordinate to build and maintain houses, but only when necessary!
Task precedence: Foundation -> {Electricity, Plumbing} -> Painting -> Decoration
- Agent 1: Foundation, Electricity, Plumbing
- Agent 2: Plumbing, Painting
- Agent 3: Electricity, Painting
- Agent 4: Decoration
Planning must be long-term: tasks last multiple time steps, failures cause chain reactions, and there are multiple houses.

Joint Decision Space
Represent the problem as an MDP:
- Action space: joint action a = {a1, ..., an} of all agents
- State space: joint state x of all state variables
- Reward function: total reward r
The action space is exponential in the number of agents, the state space is exponential in the number of variables, and a global decision requires complete observation.

Local Q-function Approximation
Approximate the joint Q-function by a sum of local terms:
Q(A1, ..., A4, X1, ..., X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
Each Qi is associated with one agent; e.g., Q3 is associated with Agent 3, which observes only X2 and X3. Limited observability: agent i only observes the variables appearing in Qi. The agents must jointly choose the action that maximizes Σi Qi.

Problems with the Table-based Coordination Graph
- Table size is exponential in the number of variables
- Messages are tables
- Agents communicate even when it is not necessary
- The coordination structure is fixed
What we want: use the structure in the tables and obtain a variable coordination structure. Exploit context-specific independence!

Summary of Algorithm
1. Pick local rule-based basis functions hi.
2. A single LP algorithm for factored MDPs obtains the Qi's.
3. A variable coordination graph computes the maximizing action.

Experimental Results: SysAdmin, Rule-based vs. Table-based
[Plot: running time vs. number of agents (5-25), bidirectional ring topology; csi (context-specific) and non-csi curves; fitted trend y = 0.53x^2 - 0.96x - 0.01, R^2 = 0.99.]
[Plot: running time vs. number of agents (5-30), server and reverse-star topologies; csi and non-csi curves; fitted trend y = 0.000049 exp(2.27x), R^2 = 0.999.]

Comparing to Apricodd [Boutilier et al. '96-'99]
Actual value of the resulting policies (identical for all three methods on every problem):

Problem    Optimal   Apricodd   Rule-based
Expon06    530.9     530.9      530.9
Expon08    77.09     77.09      77.09
Expon10    0.034     0.034      0.034
Linear06   531.4     531.4      531.4
Linear08   430.5     430.5      430.5
Linear10   348.7     348.7      348.7

[Plot: running times for the 'Linear' problems, 6-20 variables; Apricodd fits y = 0.1473x^3 - 0.8595x^2 + 2.5006x - 1.5964 (R^2 = 0.9997), rule-based fits y = 0.0254x^2 + 0.0363x + 0.0725 (R^2 = 0.9983).]
[Plot: running times for the 'Expon' problems, 6-12 variables, Apricodd vs. rule-based.]

Conclusions and Extensions
A multiagent planning algorithm with:
- Variable coordination structure
- Limited, context-specific communication
- Limited, context-specific observability
It solves large MDPs! Extensions to hierarchical and relational models.
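Step 3 of the algorithm summary, coordination by variable elimination on the coordination graph, can be sketched in a few lines. This is a minimal illustration, not the poster's implementation: it assumes binary actions and invented numeric Q-values, and uses the ring decomposition Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) from the local Q-function approximation (rewritten here with eliminations ordered A4, A3, A2, A1).

```python
import itertools

# Illustrative local Q-functions for one fixed state, on the 4-agent ring:
# Q1(a1,a4), Q2(a1,a2), Q3(a2,a3), Q4(a3,a4); binary actions {0, 1}.
Q1 = {(a1, a4): 2.0 * a1 - a4 for a1 in (0, 1) for a4 in (0, 1)}
Q2 = {(a1, a2): 1.0 * (a1 == a2) for a1 in (0, 1) for a2 in (0, 1)}
Q3 = {(a2, a3): 3.0 * a2 * a3 for a2 in (0, 1) for a3 in (0, 1)}
Q4 = {(a3, a4): 2.0 * a4 - a3 for a3 in (0, 1) for a4 in (0, 1)}


def brute_force_max():
    """Enumerate all joint actions (exponential in #agents; for checking)."""
    return max(Q1[a1, a4] + Q2[a1, a2] + Q3[a2, a3] + Q4[a3, a4]
               for a1, a2, a3, a4 in itertools.product((0, 1), repeat=4))


def eliminate_max():
    """Variable elimination: cost is exponential only in the induced
    width of the coordination graph, not in the number of agents."""
    # Eliminate A4: g1(a1, a3) = max_{a4} Q1(a1, a4) + Q4(a3, a4)
    g1 = {(a1, a3): max(Q1[a1, a4] + Q4[a3, a4] for a4 in (0, 1))
          for a1 in (0, 1) for a3 in (0, 1)}
    # Eliminate A3: g2(a1, a2) = max_{a3} Q3(a2, a3) + g1(a1, a3)
    g2 = {(a1, a2): max(Q3[a2, a3] + g1[a1, a3] for a3 in (0, 1))
          for a1 in (0, 1) for a2 in (0, 1)}
    # Eliminate A2, then A1.
    g3 = {a1: max(Q2[a1, a2] + g2[a1, a2] for a2 in (0, 1))
          for a1 in (0, 1)}
    return max(g3[a1] for a1 in (0, 1))
```

Each intermediate function gi is exactly the message an agent would send to its neighbors in the coordination graph; recording the argmax at each step recovers the maximizing joint action.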
Outline
- Introduction
- Context-specific coordination, given the Qi's
- Long-term planning: computing the Qi's
- Experimental results

Context-Specific Coordination, Given the Qi's
Local value rules represent context-specific structure, e.g.:
⟨q1 : Plumbing_not_done ∧ A1 = Plumb ∧ A2 = Plumb : -100⟩
Each agent i holds a set of rules Qi. The agents must coordinate to maximize the total value:
max_{a1, ..., an} Σi Qi

Rule-based Variable Elimination [Zhang and Poole '99]
A rule-based coordination graph finds the optimal action by maximizing out one agent at a time (e.g., maximizing out A1). Rules are simplified at three points:
A: simplification on instantiation of the state
B: simplification when passing messages
C: simplification on maximization
Further simplification is possible by approximation. The result is a variable agent communication structure: the coordination structure is dynamic.

[Figure: rule-based coordination graph over agents A1-A6 with state variable x. Initial value rules:
⟨a2 ∧ a3 ∧ x : 0.1⟩, ⟨a3 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a2 ∧ x : 5⟩, ⟨a1 ∧ a3 ∧ ¬x : 1⟩, ⟨a6 ∧ x : 7⟩, ⟨a1 ∧ a5 ∧ x : 4⟩, ⟨a5 ∧ a6 ∧ x : 2⟩, ⟨a1 ∧ a6 ∧ ¬x : 3⟩.
A: instantiate the current state (x = true); rules inconsistent with x vanish and x is dropped from the remaining contexts, leaving ⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a1 ∧ a2 ∧ a4 : 3⟩, ⟨a1 ∧ a2 : 5⟩, ⟨a6 : 7⟩, ⟨a1 ∧ a5 : 4⟩, ⟨a5 ∧ a6 : 2⟩.
B: eliminate variable A1; the rules mentioning a1 are combined into a message over A1's neighbors.
C: local maximization over the remaining agents.]

Long-Term Planning: Long-Term Utility = Value of the MDP
[Bellman et al. '63], [Tsitsiklis & Van Roy '96], [Koller & Parr '99,'00], [Guestrin et al. '01]
Q(x, a) = R(x, a) + γ Σ_{x'} P(x' | x, a) V(x')
The value can be computed by linear programming [Schweitzer and Seidmann '85]:
  minimize:   Σx V(x)
  subject to: V(x) ≥ Q(x, a),  ∀ x, a
One variable V(x) per state and one constraint per state-action pair: both exponential in number!

Decomposable Value Function
Approximate V by a linear combination of restricted-domain basis functions:
Ṽ(x) = Σi wi hi(x)
Each hi is a rule over a small part of the complex system, e.g.:
- the value of having two agents in the same house
- the value of two agents painting a house together
We must find weights w that give a good approximate value function.

Single LP Solution for Factored MDPs [Guestrin et al. '01]
  minimize:   Σx Σi wi hi(x)
  subject to: Σi wi hi(x) ≥ Σi Qi(x, a),  ∀ x, a
One variable wi per basis function, so there are only polynomially many LP variables; but there is still one constraint for every state and action. Rule-based variable elimination yields an LP exponentially smaller than the table-based one!

Factored MDP
[Figure: dynamic Bayesian network for the construction crew problem; e.g., Painting_i' depends on Plumbing_i, Painting_i, and the agent's action A2. Required tasks carry rewards; dependent tasks induce the transition arcs.]

Example problems (task precedence Foundation -> {Electricity, Plumbing} -> Painting -> Decoration):
Example 1: 2 agents, 1 house
- Agent 1 = {Foundation, Electricity, Plumbing}
- Agent 2 = {Plumbing, Painting, Decoration}
Example 2: 4 agents, 2 houses
- Agent 1 = {Painting, Decoration}; moves between houses
- Agent 2 = {Foundation, Electricity, Plumbing, Painting}, house 1
- Agent 3 = {Foundation, Electricity}, house 2
- Agent 4 = {Plumbing, Decoration}, house 2

Comparing the Approaches

                            Our rule-based approach          Apricodd
Algorithm based on          Linear programming               Value iteration
Independence exploited      Additive and context-specific    Only context-specific
"Basis function" repr.      Specified by user                Determined by algorithm

Coordination Graph [Guestrin et al. '01]
Use variable elimination for the maximization [Bertele & Brioschi '72]. Communication is limited to what the optimal action choice requires: the communication bandwidth equals the induced width of the coordination graph.
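The "simplification on instantiation of the state" step above can be sketched directly: a value rule is a context (conjunction of variable-value literals) paired with a scalar, and conditioning on the observed state drops inconsistent rules and removes satisfied state literals. The encoding below is an illustrative sketch of this idea, not the poster's data structure; all names are ours.

```python
def instantiate(rules, evidence):
    """Step A: condition value rules on the observed state variables.

    `rules` is a list of (context, value) pairs, where a context is a
    frozenset of (variable, required_value) literals. Rules whose context
    contradicts the evidence are dropped; satisfied state literals are
    removed from the surviving contexts, leaving rules over actions only.
    """
    out = []
    for context, value in rules:
        if any(var in evidence and evidence[var] != val
               for var, val in context):
            continue  # rule's context contradicts the current state
        reduced = frozenset((var, val) for var, val in context
                            if var not in evidence)
        out.append((reduced, value))
    return out


def value_of(rules, assignment):
    """Total value of a complete assignment: sum of all firing rules."""
    return sum(value for context, value in rules
               if all(assignment.get(var) == val for var, val in context))
```

For example, instantiating x = true turns ⟨a2 ∧ a3 ∧ x : 0.1⟩ into ⟨a2 ∧ a3 : 0.1⟩ and deletes any rule conditioned on ¬x; the rule-based elimination steps B and C then operate on these smaller, action-only rules.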
Computing the Maximizing Action: Coordination Graph
max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} ( Q3(A3,A4) + Q4(A2,A4) ) ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3) ]
For every action of A2 and A3, g1 records the maximum value achievable by A4. Here we need only 23 instead of 63 sum operations.

Compact LP Construction
The hi and Qi depend only on small sets of variables and actions, so a polynomial-time algorithm generates a compact LP. The constraint set
  subject to: Σi wi hi(x) ≥ Σi Qi(x, a),  ∀ x, a
is equivalent to the single nonlinear constraint
  0 ≥ max_{x,a} [ Σi Qi(x, a) - Σi wi hi(x) ],
which variable elimination decomposes into small linear constraints. For example,
  0 ≥ max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
becomes
  0 ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ],  with  g1(B,C) ≥ f3(C,D) + f4(B,D) for all values of B, C, D.
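The constraint decomposition above can be checked numerically: eliminating D introduces a new function g1(B,C), and the two-stage maximum equals the flat one. The sketch below verifies this on invented numeric f's (binary variables, values ours); in the actual LP, g1 would be a set of LP variables with one linear constraint per assignment of (B, C, D), rather than a precomputed table.

```python
import itertools

# Illustrative functions matching the example: f1(A,B), f2(A,C),
# f3(C,D), f4(B,D) over binary variables.
f1 = {(a, b): 1.5 * a - b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): 2.0 * (a != c) for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): c + 0.5 * d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): 3.0 * b * d for b in (0, 1) for d in (0, 1)}


def flat_max():
    """Right-hand side of the single exponential constraint:
    max over all (A, B, C, D) of f1 + f2 + f3 + f4."""
    return max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
               for a, b, c, d in itertools.product((0, 1), repeat=4))


def factored_max():
    """Eliminate D first. In the LP, g1(B, C) becomes new variables with
    constraints g1(b, c) >= f3(c, d) + f4(b, d) for every (b, c, d),
    replacing one constraint per full joint assignment."""
    g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
          for b in (0, 1) for c in (0, 1)}
    return max(f1[a, b] + f2[a, c] + g1[b, c]
               for a, b, c in itertools.product((0, 1), repeat=3))
```

Because maximization distributes over the terms that do not mention D, the two quantities agree exactly; this is what lets the compact LP replace exponentially many constraints with polynomially many small ones.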