
Context-Specific Multiagent Coordination and Planning with Factored MDPs
Carlos Guestrin, Shobha Venkataraman, Daphne Koller
Stanford University

Construction Crew Problem: Dynamic Resource Allocation

Joint Decision Space
Represent as an MDP:
Action space: joint action a for all agents
State space: joint state x of all agents
Reward function: total reward r

Action space is exponential: an action is an assignment a = {a1, …, an}
State space is exponential in the number of variables
A global decision requires complete observation


Context-Specific Structure

Summary: Context-Specific Coordination

Summary of Algorithm
1. Pick local rule-based basis functions h_i
2. Single LP algorithm for Factored MDPs obtains the Q_i's
3. Variable coordination graph computes the maximizing action

Construction Crew Problem

SysAdmin: Rule-based vs. Table-based

Search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
Multiple, simultaneous decisions; limited observability; limited communication.

Multiagent Coordination Examples

Comparing to Apricodd [Boutilier et al. ’96-’99]

Conclusions and Extensions

Multiagent planning algorithm: Variable coordination structure; Limited context-specific communication; Limited context-specific observability.

Solve large MDPs!

Extensions to hierarchical and relational models

Stanford University; Stanford University → CMU

Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration

WANTED: Agents that coordinate to build and maintain houses, but only when necessary!

Foundation → {Electricity, Plumbing} → Painting → Decoration

Local Q-function Approximation

[Figure: network of nodes M1-M4 with the local term Q3 highlighted.]

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)

Q3 is associated with Agent 3, which observes only X2 and X3.

Limited observability: agent i only observes variables in Qi

Must choose the action that maximizes Σ_i Q_i

Problems with Coordination Graph

Tasks last multiple time steps; failures cause chain reactions; multiple houses.

[Plots: SysAdmin running time (seconds) vs. number of agents, comparing csi (rule-based) and non-csi (table-based), for the Bidirectional Ring, Server, and Reverse Star topologies. Fitted curves shown: y = 0.53x^2 - 0.96x - 0.01 (R^2 = 0.99) and y = 0.000049 exp(2.27x) (R^2 = 0.999).]

'Expon' problems:
            Optimal   Apricodd   Rule-based
Expon06     530.9     530.9      530.9
Expon08     77.09     77.09      77.09
Expon10     0.034     0.034      0.034

'Linear' problems:
            Optimal   Apricodd   Rule-based
Linear06    531.4     531.4      531.4
Linear08    430.5     430.5      430.5
Linear10    348.7     348.7      348.7

[Plot: Running Times for the 'Linear' Problems. Time (in seconds) vs. No. of variables (6 to 20), comparing Apricodd and Rule-based. Fitted curves: y = 0.1473x^3 - 0.8595x^2 + 2.5006x - 1.5964 (R^2 = 0.9997) and y = 0.0254x^2 + 0.0363x + 0.0725 (R^2 = 0.9983).]

[Plot: Running Times for the 'Expon' Problems. Time (in seconds) vs. No. of variables (6 to 12), comparing Apricodd and Rule-based.]

Context-Specific Coordination Structure

Table size exponential in the number of variables; messages are tables.
Agents communicate even when it is not necessary; fixed coordination structure.
What we want: use the structure in the tables; variable coordination structure.
Exploit context-specific independence!

[Figure: coordination graph over agents A1-A4 with local rule sets Q1-Q4.]

Local value rules represent context-specific structure:
<q1 : Plumbing_not_done ∧ A1 = Plumb ∧ A2 = Plumb : -100>

Set of rules Q_i for each agent. Must coordinate to maximize the total value: max_{a1, …, ag} Σ_i Q_i
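To make the rule notation concrete, here is a minimal Python sketch (not the authors' code; the second rule and its value are invented for illustration) of evaluating a set of value rules for a given state/action context:

```python
# A value rule <context : value> contributes its value only when the joint
# state/action assignment is consistent with its context.
rules_q1 = [
    # The poster's example rule, with Plumbing_not_done encoded as Plumbing_done=False.
    ({"Plumbing_done": False, "A1": "Plumb", "A2": "Plumb"}, -100.0),
    # An invented second rule, for illustration only.
    ({"Plumbing_done": False, "A1": "Plumb"}, 20.0),
]

def rule_value(rules, assignment):
    """Sum the values of all rules whose context is consistent with `assignment`."""
    return sum(value for context, value in rules
               if all(assignment.get(var) == val for var, val in context.items()))

print(rule_value(rules_q1, {"Plumbing_done": False, "A1": "Plumb", "A2": "Plumb"}))  # -80.0
```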

Rule-based variable elimination [Zhang and Poole ’99]

Maximizing out A1

Rule-based coordination graph for finding optimal action

A - Simplification on instantiation of the state
B - Simplification when passing messages
C - Simplification on maximization
Simplification by approximation

Variable agent communication structure: the coordination structure is dynamic.
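To make the elimination step concrete, here is a simplified sketch of maxing one action variable out of a set of value rules (not the authors' rule-based algorithm: the rule values are invented, and the result is enumerated over the neighboring variables rather than kept in compact rule form):

```python
from itertools import product

def consistent(context, assignment):
    # A rule applies when every variable in its context matches the assignment.
    return all(assignment.get(v) == val for v, val in context.items())

def max_out(rules, var, domain=(0, 1)):
    """Eliminate `var`: replace all rules mentioning it by its best-response value."""
    touching = [r for r in rules if var in r[0]]
    untouched = [r for r in rules if var not in r[0]]
    scope = sorted({v for ctx, _ in touching for v in ctx} - {var})
    new_rules = []
    for values in product(domain, repeat=len(scope)):
        partial = dict(zip(scope, values))
        best = max(sum(val for ctx, val in touching
                       if consistent(ctx, {**partial, var: b}))
                   for b in domain)
        if best != 0.0:  # keep only informative entries
            new_rules.append((partial, best))
    return untouched + new_rules

# Value rules over binary action variables a1, a2, a5 (invented values).
rules = [({"a1": 1, "a2": 1}, 5.0), ({"a1": 1, "a5": 1}, 4.0), ({"a2": 1}, 0.1)]
print(max_out(rules, "a1"))
```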

Long-term utility = value of the MDP. Value computed by linear programming:
One variable V(x) for each state; one constraint for each state x and action a. Number of states and actions exponential!

minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x, a), ∀ x, a

Decomposable Value Function

Linear combination of restricted-domain basis functions: Ṽ(x) = Σ_i w_i h_i(x)
Each h_i is a rule over a small part of a complex system, e.g., the value of having two agents in the same house, or the value of two agents painting a house together.
Must find weights w giving a good approximate value function.

Single LP Solution for Factored MDPs

minimize: Σ_x Σ_i w_i h_i(x)
subject to: Σ_i w_i h_i(x) ≥ Σ_i Q_i(x, a), ∀ x, a

One variable w_i for each basis function; polynomially many LP variables.

One constraint for every state and action
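For intuition, here is a naive version of this LP on a tiny random MDP, with every state-action constraint enumerated explicitly. This is only a sketch using numpy and scipy.optimize.linprog; the point of the factored-LP construction above is precisely to obtain an equivalent, exponentially smaller LP without this enumeration.

```python
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] is a distribution over x'
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[x, a]

# Two hand-picked basis functions h_i(x); the approximation is V(x) ~ w1*h1(x) + w2*h2(x).
H = np.stack([np.ones(n_states), np.arange(n_states, dtype=float)], axis=1)

# Constraints: sum_i w_i h_i(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) * sum_i w_i h_i(x'),
# rewritten as (gamma * P[x,a] @ H - H[x]) @ w <= -R[x,a] for linprog.
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        A_ub.append(gamma * P[x, a] @ H - H[x])
        b_ub.append(-R[x, a])

# Objective: minimize sum_x V(x) = sum_i w_i * (sum_x h_i(x)).
res = linprog(c=H.sum(axis=0), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * H.shape[1])
print("basis weights w:", res.x)
```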

Factored MDP

[Figure: two-time-slice DBN with variables Plumbing_i and Painting_i at time t, Plumbing_i' and Painting_i' at time t+1, a reward node R, and action A2. Labels: Required Tasks, Dependent Tasks.]

Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration

[Schweitzer and Seidmann ‘85]

[Guestrin et al. ’01]

Rule-based variable elimination: exponentially smaller LP than table-based!

[Figure: three-panel example of the rule-based coordination graph over agents A1-A6, annotated with local value rules such as <a1 ∧ a2 ∧ x : 5> and <a6 ∧ x : 7>. Panel A: instantiating the current state (x = true) discards rules inconsistent with the observation and drops x from the rest. Panel B: variable A1 is selected for elimination, collecting the rules that mention a1. Panel C: a local maximization over A1 replaces those rules with new rules over the remaining agents, and the graph shrinks.]

Outline

Given long-term utilities Σ_i Q_i(x,a): local message passing computes the maximizing action; variable coordination structure.
Long-term planning to obtain Σ_i Q_i(x,a): linear programming approach; exploit context-specific structure.

[Bellman et al. ‘63], [Tsitsiklis & Van Roy ’96], [Koller & Parr ’99,’00], [Guestrin et al. ’01]

Q(x,a) = R(x,a) + γ Σ_{x'} P(x' | x,a) V(x')

Factored value function: V = Σ_i w_i h_i
Factored Q-function: Q = Σ_i Q_i
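As a brief worked step (following the factored-MDP references cited above; the scope notation C_j is introduced here for illustration and does not appear in the poster), substituting the factored value function into this backup keeps every term small:

```latex
Q(\mathbf{x},\mathbf{a})
  = R(\mathbf{x},\mathbf{a})
    + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}'\mid\mathbf{x},\mathbf{a}) \sum_j w_j h_j(\mathbf{x}')
  = R(\mathbf{x},\mathbf{a}) + \gamma \sum_j w_j\, g_j(\mathbf{x},\mathbf{a}),
\qquad
g_j(\mathbf{x},\mathbf{a}) = \sum_{\mathbf{x}'[C_j]} P\bigl(\mathbf{x}'[C_j]\mid\mathbf{x},\mathbf{a}\bigr)\, h_j\bigl(\mathbf{x}'[C_j]\bigr),
```

where C_j is the small scope of basis function h_j, so each backprojection g_j depends only on the DBN parents of C_j; grouping the local reward and g_j terms by agent then yields the local Q_i's.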

Foundation → {Electricity, Plumbing} → Painting → Decoration

Example 1: 2 agents, 1 house. Agent 1 = {Foundation, Electricity, Plumbing}; Agent 2 = {Plumbing, Painting, Decoration}.

Example 2: 4 agents, 2 houses. Agent 1 = {Painting, Decoration} (moves); Agent 2 = {Foundation, Electricity, Plumbing, Painting}, house 1; Agent 3 = {Foundation, Electricity}, house 2; Agent 4 = {Plumbing, Decoration}, house 2.

Actual value of resulting policies.

                                   Our rule-based approach           Apricodd
Algorithm based on                 Linear programming                Value iteration
Types of independence exploited    Additive and context-specific     Only context-specific
"Basis function" representation    Specified by user                 Determined by algorithm

Introduction
Context-Specific Coordination, Given Q_i's
Long-Term Planning, Computing Q_i's
Experimental Results

Use Coordination graph [Guestrin et al. ’01]

Use variable elimination for maximization: [Bertele & Brioschi ‘72]

Limited communication for optimal action choice

Comm. bandwidth = induced width of coord. graph

Here we need only 23, instead of 63 sum operations.

[Figure: coordination graph over agents A1-A4 with local Q-functions Q1-Q4.]

max_{A1,A2,A3,A4} Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4)
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g(A2,A3) ]

Computing Maximizing Action: Coordination Graph

For every action of A2 and A3, g(A2,A3) records the maximum value achievable by A4.
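The elimination above can be checked numerically. A minimal sketch (illustrative only: binary actions and invented Q values, not taken from the poster) that eliminates A4 first and compares against brute-force maximization over the joint action space:

```python
from itertools import product

acts = (0, 1)
Q1 = {(a1, a2): 2.0 * a1 + a2 for a1, a2 in product(acts, repeat=2)}
Q2 = {(a1, a3): 1.5 * a3 - a1 * a3 for a1, a3 in product(acts, repeat=2)}
Q3 = {(a3, a4): 3.0 * a3 * a4 for a3, a4 in product(acts, repeat=2)}
Q4 = {(a2, a4): a2 + 0.5 * a4 for a2, a4 in product(acts, repeat=2)}

# Eliminate A4 first: g(A2, A3) = max_{A4} [ Q3(A3, A4) + Q4(A2, A4) ].
g = {(a2, a3): max(Q3[a3, a4] + Q4[a2, a4] for a4 in acts)
     for a2, a3 in product(acts, repeat=2)}

# Maximize the remaining, smaller problem over A1, A2, A3.
ve = max(Q1[a1, a2] + Q2[a1, a3] + g[a2, a3] for a1, a2, a3 in product(acts, repeat=3))

# Brute force over the full joint action space, for comparison.
bf = max(Q1[a1, a2] + Q2[a1, a3] + Q3[a3, a4] + Q4[a2, a4]
         for a1, a2, a3, a4 in product(acts, repeat=4))

assert abs(ve - bf) < 1e-9
print("maximum total value:", ve)
```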

h_i and Q_i depend on small sets of variables and actions.

Polynomial-time algorithm generates compact LP

subject to: Σ_i w_i h_i(x) ≥ Σ_i Q_i(x, a), ∀ x, a
which is equivalent to the single nonlinear constraint: 0 ≥ max_{x,a} [ Σ_i Q_i(x, a) - Σ_i w_i h_i(x) ]

For example, with local functions over variables A, B, C, D:
0 ≥ max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
  = max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ] ]
Introducing new LP variables g1(B,C) with linear constraints g1(B,C) ≥ f3(C,D) + f4(B,D), this becomes:
0 ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ]
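A small numerical check of this transformation (illustrative only: f1-f4 are random functions of binary variables, not quantities from the poster). Choosing the smallest g1 that satisfies its linear constraints makes the linear system accept exactly when the original nonlinear max-constraint holds:

```python
from itertools import product
import random

random.seed(1)
dom = (0, 1)
f1, f2, f3, f4 = ({k: random.uniform(-1, 0) for k in product(dom, repeat=2)} for _ in range(4))

# Brute-force check of the nonlinear constraint 0 >= max f1 + f2 + f3 + f4.
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in product(dom, repeat=4)) <= 0

# Equivalent linear system: take the smallest g1 with g1(b,c) >= f3(c,d) + f4(b,d)
# for all b, c, d, then require 0 >= f1(a,b) + f2(a,c) + g1(b,c) for all a, b, c.
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in dom) for b, c in product(dom, repeat=2)}
linear = all(f1[a, b] + f2[a, c] + g1[b, c] <= 0 for a, b, c in product(dom, repeat=3))

assert brute == linear
print("constraint satisfied:", linear)
```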