Context-Specific Multiagent Coordination and Planning with Factored MDPs. Carlos Guestrin, Shobha Venkataraman, Daphne Koller. Stanford University.
Construction Crew Problem: Dynamic Resource Allocation
Joint Decision Space
Represent as MDP:
Action space: joint action a = {a1, …, an} for all agents
State space: joint state x of all agents
Reward function: total reward r
The action space is exponential in the number of agents; the state space is exponential in the number of state variables; a global decision requires complete observation.
Context-Specific Structure
Summary: Context-Specific Coordination
Summary of Algorithm
1. Pick local rule-based basis functions hi
2. Single LP algorithm for Factored MDPs obtains Qi’s
3. Variable coordination graph computes maximizing action
Construction Crew Problem
SysAdmin: Rule-based vs. Table-based
Multiagent Coordination Examples
Search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
Challenges: multiple, simultaneous decisions; limited observability; limited communication.
Comparing to Apricodd [Boutilier et al. ’96-’99]
Conclusions and Extensions
Multiagent planning algorithm: Variable coordination structure; Limited context-specific communication; Limited context-specific observability.
Solve large MDPs!
Extensions to hierarchical and relational models
Stanford University; Stanford University → CMU
Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration
WANTED: Agents that coordinate to build and maintain houses, but only when necessary!
Foundation → {Electricity, Plumbing} → Painting → Decoration
Local Q-function Approximation
[Figure: ring of four machines M1–M4, one per agent, with local Q-functions.]
Q(A1, …, A4, X1, …, X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
Q3 is associated with Agent 3, which observes only X2 and X3.
Limited observability: agent i only observes the variables in Qi.
Must choose the action maximizing ∑i Qi.
Problems with Coordination Graph
Tasks last multiple time steps; failures cause chain reactions; multiple houses.
[Plot: running time (seconds, 0–400) vs. number of agents (0–25) for the Bidirectional Ring topology, comparing csi and non-csi; fitted trend y = 0.53x^2 - 0.96x - 0.01, R^2 = 0.990.]
[Plot: running time (seconds, 0–500) vs. number of agents (0–30) for the Reverse Star topology with a server, comparing csi and non-csi; fitted trend y = 0.000049 exp(2.27x), R^2 = 0.9992.]
Problem    Optimal   Apricodd   Rule-based
Expon06    530.9     530.9      530.9
Expon08    77.09     77.09      77.09
Expon10    0.034     0.034      0.034

Problem    Optimal   Apricodd   Rule-based
Linear06   531.4     531.4      531.4
Linear08   430.5     430.5      430.5
Linear10   348.7     348.7      348.7
Running Times for the 'Linear' Problems
[Plot: time (seconds, 0–50) vs. number of variables (6–20); Apricodd fit y = 0.1473x^3 - 0.8595x^2 + 2.5006x - 1.5964 (R^2 = 0.9997), Rule-based fit y = 0.0254x^2 + 0.0363x + 0.0725 (R^2 = 0.9983).]
Running Times for the 'Expon' Problems
[Plot: time (seconds, 0–500) vs. number of variables (6–12), comparing Apricodd and Rule-based.]
Context-Specific Coordination Structure
Problems with the table-based approach: table size is exponential in the number of variables; messages are tables; agents communicate even when it is unnecessary; the coordination structure is fixed.
What we want: exploit the structure within the tables, and a variable coordination structure.
Exploit context-specific independence!
[Figure: coordination graph over agents A1–A4 with local Q-functions Q1–Q4.]
Local value rules represent context-specific structure:
⟨q1 : Plumbing_not_done ∧ A1 = Plumb ∧ A2 = Plumb : -100⟩
Each agent holds a set of rules Qi; the agents must coordinate to maximize the total value: max over a1, …, an of ∑i Qi.
Rule-based variable elimination [Zhang and Poole ’99]
Maximizing out A1
Rule-based coordination graph for finding optimal action
A: simplification on instantiation of the state. B: simplification when passing messages. C: simplification on maximization. Plus simplification by approximation.
Variable agent communication structure Coordination structure is dynamic
Long-term utility = value of the MDP. The value is computed by linear programming:
minimize: ∑x V(x)
subject to: V(x) ≥ Q(x, a), ∀ x, a
One variable V(x) for each state; one constraint for each state x and action a. The number of states and actions is exponential!
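The LP above can be illustrated without a solver: its optimum is the optimal value function V*, which is the pointwise-smallest feasible V (every constraint holds, and each state has a tight constraint at its greedy action). A minimal sketch on a hypothetical 2-state, 2-action MDP, using value iteration to reach the LP optimum:

```python
# Hypothetical MDP, chosen only to illustrate the LP characterization.
gamma = 0.9
R = {('s0', 'a0'): 1.0, ('s0', 'a1'): 0.0, ('s1', 'a0'): 0.0, ('s1', 'a1'): 2.0}
P = {('s0', 'a0'): {'s0': 0.8, 's1': 0.2}, ('s0', 'a1'): {'s0': 0.1, 's1': 0.9},
     ('s1', 'a0'): {'s0': 0.5, 's1': 0.5}, ('s1', 'a1'): {'s0': 0.3, 's1': 0.7}}
states, actions = ['s0', 's1'], ['a0', 'a1']

def q(V, x, a):
    # Q(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    return R[x, a] + gamma * sum(p * V[y] for y, p in P[x, a].items())

V = {x: 0.0 for x in states}
for _ in range(2000):  # value iteration converges to the LP optimum
    V = {x: max(q(V, x, a) for a in actions) for x in states}

# Feasibility: V(x) >= Q(x,a) for all x, a; tightness at the greedy action
# means no V(x) can be lowered, so this V minimizes sum_x V(x).
for x in states:
    assert all(V[x] >= q(V, x, a) - 1e-6 for a in actions)
    assert any(abs(V[x] - q(V, x, a)) < 1e-6 for a in actions)
```

The same feasibility-plus-tightness argument is why minimizing ∑x V(x) subject to V(x) ≥ Q(x, a) recovers V* exactly.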
Decomposable Value Function
Linear combination of restricted-domain basis functions: Ṽ(x) = ∑i wi hi(x)
Each hi is a rule over a small part of a complex system, e.g.: the value of having two agents in the same house; the value of two agents painting a house together.
Must find w giving a good approximate value function.
Single LP Solution for Factored MDPs
minimize: ∑x ∑i wi hi(x)
subject to: ∑i wi hi(x) ≥ ∑i Qi(x, a), ∀ x, a
One variable wi for each basis function: polynomially many LP variables.
One constraint for every state and action.
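The shape of this LP can be made concrete. A minimal sketch, with hypothetical basis functions h0–h2 and a stand-in for ∑i Qi over two binary state variables and one binary action; note the variable count depends only on the number of basis functions, not on the number of states:

```python
import itertools

# Hypothetical basis functions (h0 constant, h1/h2 indicators on X1/X2).
def h0(x): return 1.0
def h1(x): return 1.0 if x[0] else 0.0
def h2(x): return 1.0 if x[1] else 0.0
basis = [h0, h1, h2]

def q_total(x, a):  # hypothetical stand-in for sum_i Q_i(x, a)
    return (1.0 if x[0] and a else 0.0) + (0.5 if x[1] else 0.0)

# LP: minimize sum_x sum_i w_i h_i(x)
#     subject to sum_i w_i h_i(x) >= q_total(x, a) for every (x, a).
states = list(itertools.product([False, True], repeat=2))
objective = [sum(h(x) for x in states) for h in basis]   # one coefficient per w_i
constraints = [([h(x) for h in basis], q_total(x, a))    # one row per (x, a)
               for x in states for a in (False, True)]
```

Here there are 3 LP variables (one per wi) and 8 rows, one per state-action pair; the slide's point is that the rows themselves can then be compacted by exploiting factored structure.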
Factored MDP
[Figure: DBN fragment with state variables Plumbing_i and Painting_i, next-step variables Plumbing_i' and Painting_i', reward R, and action A2.]
Required tasks; dependent tasks.
Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration
[Schweitzer and Seidmann ‘85]
[Guestrin et al. ’01]
Rule-based variable elimination gives an exponentially smaller LP than table-based!
[Figure: coordination graph over agents A1–A6 with value rules attached:
⟨a2 ∧ a3 ∧ x : 0.1⟩  ⟨a3 ∧ a4 ∧ x : 3⟩  ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩
⟨a1 ∧ a2 ∧ x : 5⟩  ⟨a1 ∧ a3 ∧ ¬x : 1⟩  ⟨a6 ∧ x : 7⟩
⟨a1 ∧ a5 ∧ x : 4⟩  ⟨a5 ∧ a6 ∧ x : 2⟩  ⟨a1 ∧ a6 ∧ ¬x : 3⟩  ⟨a4 ∧ x : 1⟩]

A: Instantiate the current state, x = true. Rules conditioned on ¬x vanish, and the remaining rules drop x:
⟨a2 ∧ a3 : 0.1⟩  ⟨a3 ∧ a4 : 3⟩  ⟨a1 ∧ a2 ∧ a4 : 3⟩  ⟨a1 ∧ a2 : 5⟩  ⟨a6 : 7⟩  ⟨a1 ∧ a5 : 4⟩  ⟨a5 ∧ a6 : 2⟩  ⟨a4 : 1⟩

B: Eliminate variable A1. The rules mentioning A1 (⟨a1 ∧ a2 ∧ a4 : 3⟩, ⟨a1 ∧ a2 : 5⟩, ⟨a1 ∧ a5 : 4⟩) are collected and A1 is maximized out, shrinking the graph.

C: Local maximization over the remaining agents with the simplified rules, e.g. ⟨a2 : 5⟩, ⟨a5 : 4⟩, ⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a6 : 7⟩, ⟨a5 ∧ a6 : 2⟩, ⟨a4 : 1⟩.
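Step A of the rule-based scheme can be sketched directly: a value rule fires only when its context matches, so fixing the state kills contradicted rules and shortens the rest. A minimal sketch with hypothetical mini-rules (contexts map variable names to booleans; not the slide's full rule set):

```python
# A value rule <context : value> fires only when every literal in its
# context holds. Contexts are dicts mapping variable name -> bool.
def instantiate(rules, evidence):
    """Step A: fix state variables. Rules whose context contradicts the
    evidence vanish; satisfied state literals are dropped from the rest."""
    out = []
    for ctx, val in rules:
        if any(var in evidence and ctx[var] != evidence[var] for var in ctx):
            continue  # contradicted: this rule can never fire now
        out.append(({v: b for v, b in ctx.items() if v not in evidence}, val))
    return out

rules = [
    ({'a2': True, 'a3': True, 'x': True}, 0.1),
    ({'a1': True, 'a3': True, 'x': False}, 1.0),   # conditioned on not-x
    ({'a6': True, 'x': True}, 7.0),
]
simplified = instantiate(rules, {'x': True})
# The rule conditioned on x = False disappears; the survivors lose 'x'.
```

Steps B and C then run the same rule-splitting logic while maximizing agents out one at a time, which is where the variable coordination structure comes from.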
Outline
Given long-term utilities ∑i Qi(x, a), local message passing computes the maximizing action, with a variable coordination structure.
Long-term planning obtains the ∑i Qi(x, a): a linear programming approach that exploits context-specific structure.
[Bellman et al. ‘63], [Tsitsiklis & Van Roy ’96], [Koller & Parr ’99,’00], [Guestrin et al. ’01]
Q(x, a) = R(x, a) + γ ∑x' P(x' | x, a) V(x')
Factored value function: V = ∑i wi hi
Factored Q-function: Q = ∑i Qi
Foundation → {Electricity, Plumbing} → Painting → Decoration
Example 1: 2 agents, 1 house. Agent 1 = {Foundation, Electricity, Plumbing}; Agent 2 = {Plumbing, Painting, Decoration}.
Example 2: 4 agents, 2 houses. Agent 1 = {Painting, Decoration}, moves between houses; Agent 2 = {Foundation, Electricity, Plumbing, Painting}, house 1; Agent 3 = {Foundation, Electricity}, house 2; Agent 4 = {Plumbing, Decoration}, house 2.
Actual value of resulting policies
                                  Our rule-based approach        Apricodd
Algorithm based on                Linear programming             Value iteration
Types of independence exploited   Additive and context-specific  Only context-specific
"Basis function" representation   Specified by user              Determined by algorithm
Introduction; Context-Specific Coordination, Given the Qi's; Long-Term Planning, Computing the Qi's; Experimental Results
Use Coordination graph [Guestrin et al. ’01]
Use variable elimination for maximization: [Bertele & Brioschi ‘72]
Limited communication for optimal action choice
Comm. bandwidth = induced width of coord. graph
Here we need only 23, instead of 63 sum operations.
[Figure: coordination graph over agents A1–A4 with local Q-functions Q1–Q4.]
max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3) ]
max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
Computing Maximizing Action: Coordination Graph
For every action of A2 and A3, g1 records the maximum value attainable for A4.
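This maximization by variable elimination can be sketched in a few lines. A minimal sketch, assuming each local Q-function is stored as a (scope, table) pair over binary actions; the random tables are hypothetical stand-ins for learned Qi's:

```python
import itertools
import random

def max_out(funcs, agent, domain):
    """Eliminate one agent: combine every function mentioning it,
    maximize over its action, and return the reduced function set."""
    touching = [f for f in funcs if agent in f[0]]
    rest = [f for f in funcs if agent not in f[0]]
    new_scope = tuple(sorted({v for s, _ in touching for v in s} - {agent}))
    new_table = {}
    for acts in itertools.product(domain, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, acts))
        new_table[acts] = max(
            sum(tab[tuple({**ctx, agent: act}[v] for v in s)]
                for s, tab in touching)
            for act in domain)
    return rest + [(new_scope, new_table)]

def max_total(funcs, order, domain=(0, 1)):
    """Maximum of sum_i Q_i over all joint actions, via elimination."""
    for agent in order:
        funcs = max_out(funcs, agent, domain)
    return sum(tab[()] for _, tab in funcs)  # all scopes are empty now

# Structure from the slide: Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4).
random.seed(0)
scopes = [('A1', 'A2'), ('A1', 'A3'), ('A3', 'A4'), ('A2', 'A4')]
funcs = [(s, {acts: random.random()
              for acts in itertools.product((0, 1), repeat=2)})
         for s in scopes]
best = max_total(funcs, ['A4', 'A1', 'A2', 'A3'])
```

Eliminating A4 first reproduces the slide's g1(A2, A3) message; the message sizes (and hence communication) are bounded by the induced width of the graph rather than the full joint action space.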
hi and Qi depend on small sets of variables and actions
Polynomial-time algorithm generates compact LP
subject to: ∑i wi hi(x) ≥ ∑i Qi(x, a), ∀ x, a
Equivalently: 0 ≥ max_{x,a} [ ∑i Qi(x, a) − ∑i wi hi(x) ]
0 ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ] ]
Eliminating D introduces a new LP variable g1(B,C):
0 ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ]
g1(B,C) ≥ f3(C,D) + f4(B,D), for all (B, C, D)
++≥
top related