Page 1
Counterfactual Multi-Agent Policy Gradients
Shimon Whiteson
Dept. of Computer Science
University of Oxford
joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli
July 6, 2017
Page 2
Single-Agent Paradigm
Page 3
Multi-Agent Paradigm
Page 4
Multi-Agent Systems are Everywhere
Page 5
Types of Multi-Agent Systems
Cooperative:
- Shared team reward
- Coordination problem

Competitive:
- Zero-sum games
- Individual opposing rewards
- Minimax equilibria

Mixed:
- General-sum games
- Nash equilibria
- What is the question?
Page 6
Coordination Problems are Everywhere
Page 7
Multi-Agent MDP
All agents see the global state s
Individual actions: $u^a \in U$

State transitions: $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$

Shared team reward: $r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$
Equivalent to an MDP with a factored action space
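For intuition, a tiny Python sketch of the factorisation (the action names and agent count are invented for illustration): the joint action space is the product of each agent's individual action space, so it grows as $|U|^n$.

```python
from itertools import product

U = ["noop", "move_n", "move_s", "attack"]  # assumed per-agent actions
n_agents = 3

# Every joint action u is a tuple (u^1, ..., u^n) of individual actions.
joint_actions = list(product(U, repeat=n_agents))
print(len(joint_actions))  # |U|^n = 4^3 = 64
```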
Page 8
Dec-POMDP
Observation function: $O(s, a) : S \times A \to Z$

Action-observation history: $\tau^a \in T \equiv (Z \times U)^*$

Decentralised policies: $\pi^a(u^a \mid \tau^a) : T \times U \to [0, 1]$
Natural decentralisation: communication and sensory constraints
Artificial decentralisation: coping with joint action space
Centralised learning of decentralised policies
Page 9
Key Challenges
Curse of dimensionality in actions: with $n$ agents, the joint action space $\mathbf{U} = U^n$ grows exponentially
Multi-agent credit assignment
Modelling other agents’ information state
Page 10
Single-Agent Policy Gradient Methods
Optimise πθ with gradient ascent on expected return:
$$J_\theta = \mathbb{E}_{s \sim \rho^\pi(s),\, u \sim \pi_\theta(s, \cdot)}\left[r(s, u)\right]$$
Good when:
- Greedification is hard, e.g., continuous actions
- The policy is simpler than the value function
Policy gradient theorem [Sutton et al. 2000]:
$$\nabla_\theta J_\theta = \mathbb{E}_{s \sim \rho^\pi(s),\, u \sim \pi_\theta(s, \cdot)}\left[\nabla_\theta \log \pi_\theta(u \mid s)\, Q^\pi(s, u)\right]$$
REINFORCE [Williams 1992]:
$$\nabla_\theta J_\theta \approx g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, R_t$$
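As a concrete illustration, here is a minimal REINFORCE sketch in PyTorch, assuming a discrete action space and a hypothetical `policy` module that maps states to action logits (names and shapes are illustrative, not from the paper):

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient matches
    g(tau) = sum_t grad log pi_theta(u_t | s_t) * R_t."""
    logits = policy(states)                                   # (T, |U|)
    log_probs = Categorical(logits=logits).log_prob(actions)  # (T,)
    # Negated so that minimising the loss ascends the expected return.
    return -(log_probs * returns).sum()
```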
Page 11
Single-Agent Actor-Critic Methods [Sutton et al. 2000]

Reduce variance in $g(\tau)$ by learning a critic $Q(s, u)$:
$$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, Q(s_t, u_t)$$
Page 12
Single-Agent Baselines
Further reduce variance with a baseline b(s):
$$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, \big(Q(s_t, u_t) - b(s_t)\big)$$

$b(s) = V(s) \implies Q(s, u) - b(s) = A(s, u)$, the advantage function:

$$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, A(s_t, u_t)$$

The TD error $r_t + \gamma V(s_{t+1}) - V(s_t)$ is an unbiased estimate of $A(s_t, u_t)$:

$$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, \big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$$
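A small NumPy sketch of the TD-error form of the advantage estimate, assuming a hypothetical `values` array that includes a bootstrap value for the final state:

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """One-step TD errors r_t + gamma * V(s_{t+1}) - V(s_t), used as
    unbiased advantage estimates. `values` has length T + 1: it holds
    V(s_0) ... V(s_T), including the bootstrap value for the final state."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]
```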
Page 13
Single-Agent Deep Actor-Critic Methods
Actor and critic are both deep neural networks
- Convolutional and recurrent layers
- Actor and critic share layers
Both trained with stochastic gradient descent
- Actor trained on the policy gradient
- Critic trained on TD($\lambda$) or Sarsa($\lambda$):
$$L_t(\psi) = \left(y^{(\lambda)} - C(\cdot_t, \psi)\right)^2$$

$$y^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

$$G_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} r_{t+k} + \gamma^n C(\cdot_{t+n}, \psi)$$
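A sketch of how the TD($\lambda$) targets $y^{(\lambda)}$ can be computed with the standard backward recursion, which is an equivalent reformulation of the weighted $n$-step sum above, truncated at the episode end (the array layout is an assumption for illustration):

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.8):
    """TD(lambda) targets via the backward recursion
    y_t = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * y_{t+1}).
    `rewards[t]` is the reward after acting at step t; `values` has
    length T + 1 so the final target bootstraps on the last value."""
    T = len(rewards)
    targets = np.zeros(T)
    next_target = values[T]  # bootstrap for the final step
    for t in reversed(range(T)):
        next_target = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_target)
        targets[t] = next_target
    return targets
```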
Page 14
Independent Actor-Critic
Inspired by independent Q-learning [Tan 1993]:
- Each agent learns independently with its own actor and critic
- Treats other agents as part of the environment

Speed up learning with parameter sharing:
- Different inputs, including the agent index $a$, induce different behaviour
- Still independent: critics condition only on $\tau^a$ and $u^a$

Variants:
- IAC-V: TD-error gradient using $V(\tau^a)$
- IAC-Q: advantage-based gradient using $A(\tau^a, u^a) = Q(\tau^a, u^a) - V(\tau^a)$

Limitations:
- Nonstationary learning
- Hard to learn to coordinate
- Multi-agent credit assignment
Page 15
Counterfactual Multi-Agent Policy Gradients
Centralised critic: stabilise learning to coordinate
Counterfactual baseline: tackle multi-agent credit assignment
Efficient critic representation: scale to large NNs
Page 16
Centralised Critic
Centralisation → greedification over the joint action space is hard → use actor-critic
$$g^a(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u^a_t \mid \tau^a_t)\, \big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$$
[Figure: centralised critic architecture. Actor 1 and actor 2 each map their local observation $o^a_t$, via hidden state $h^a$, to an action $u^a_t$ in the environment; a central critic receives $s_t$ and $r_t$ and sends each actor its advantage $A^a_t$.]
Page 17
Wonderful Life Utility [Wolpert & Tumer 2000]
Page 18
Difference Rewards [Tumer & Agogino 2007]
Per-agent shaped reward:

$$D^a(s, \mathbf{u}) = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$$

where $c^a$ is a default action

Important property:

$$D^a(s, (\mathbf{u}^{-a}, \dot{u}^a)) > D^a(s, \mathbf{u}) \implies r(s, (\mathbf{u}^{-a}, \dot{u}^a)) > r(s, \mathbf{u})$$
Limitations:
- Needs an extra simulation to estimate the counterfactual $r(s, (\mathbf{u}^{-a}, c^a))$ (see the sketch below)
- Needs expertise to choose $c^a$
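A minimal sketch of the difference-reward computation, assuming access to a hypothetical simulator reward function `reward(s, u)` that can be re-queried counterfactually; the second call is exactly the extra simulation noted above:

```python
def difference_reward(reward, s, u, agent, default_action):
    """D^a(s, u) = r(s, u) - r(s, (u^{-a}, c^a)).

    reward         : hypothetical simulator reward function r(s, u)
    u              : joint action as a tuple, one entry per agent
    default_action : the default action c^a for `agent`
    """
    u_counterfactual = list(u)
    u_counterfactual[agent] = default_action  # replace u^a with c^a
    # The second reward call is the extra counterfactual simulation.
    return reward(s, u) - reward(s, tuple(u_counterfactual))
```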
Page 19
Counterfactual Baseline
Use a centralised critic $Q(s, \mathbf{u})$ to estimate difference rewards:

$$g^a(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u^a_t \mid \tau^a_t)\, A^a(s_t, \mathbf{u}_t)$$

$$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a)\, Q(s, (\mathbf{u}^{-a}, u'^a))$$
The baseline marginalises out $u^a$

The critic obviates the need for extra simulations

Marginalising the action obviates the need for a default action
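A sketch of the counterfactual advantage $A^a(s, \mathbf{u})$, assuming the critic already provides the vector of $Q$-values over agent $a$'s candidate actions with the other agents' actions held fixed:

```python
import numpy as np

def coma_advantage(q_values, pi_a, action_a):
    """Counterfactual advantage A^a(s, u) for agent a.

    q_values : shape (|U|,), Q(s, (u^{-a}, u'^a)) for each candidate
               action u'^a of agent a (other agents' actions fixed).
    pi_a     : shape (|U|,), agent a's policy pi^a(. | tau^a).
    action_a : index of the action u^a actually taken.
    """
    baseline = np.dot(pi_a, q_values)  # marginalise out agent a's action
    return q_values[action_a] - baseline
```

Since the baseline is just a dot product between the policy and this $Q$-vector, no extra simulations or default actions are needed.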
Page 20
Efficient Critic Representation
[Figure: COMA architecture. (a) Actors and the central critic interacting with the environment. (b) Each actor: a GRU taking $(o^a_t, a, u^a_{t-1})$ and hidden state $h^a_{t-1}$, outputting the policy $\pi(h^a_t)$. (c) The critic: takes $(\mathbf{u}^{-a}_t, s_t, o^a_t, a, \mathbf{u}_{t-1})$ and outputs $\{Q(u^a{=}1, \mathbf{u}^{-a}_t, \ldots), \ldots, Q(u^a{=}|U|, \mathbf{u}^{-a}_t, \ldots)\}$, from which the advantage $A^a_t$ is computed.]
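A PyTorch sketch of such a critic: one forward pass outputs $Q(s, (\mathbf{u}^{-a}, \cdot))$ for all $|U|$ candidate actions of agent $a$, so computing the counterfactual baseline costs a single network evaluation. Layer sizes and the exact input composition here are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class COMACritic(nn.Module):
    """Efficient critic sketch: outputs one Q-value per candidate
    action u^a of agent a, with the other agents' actions fixed."""

    def __init__(self, input_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one Q-value per candidate u^a
        )

    def forward(self, critic_input):
        # critic_input concatenates s_t, o^a_t, the agent index a,
        # the other agents' actions u^{-a}_t, and the last actions u_{t-1}.
        return self.net(critic_input)  # shape (batch, |U|)
```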
Page 21
StarCraft
Page 22
StarCraft Micromanagement [Synnaeve et al. 2016]
Page 23
Centralised Performance
         Local Field of View (FoV)                                    Full FoV, Central Control
map      heur.  IAC-V  IAC-Q  cnt-V  cnt-QV  COMA(mean)  COMA(best)   heur.  DQN   GMEZO
3m       .35    .47    .56    .83    .83     .87         .98          .74    -     -
5m       .66    .63    .58    .67    .71     .81         .95          .98    .99   1.0
5w       .70    .18    .57    .65    .76     .82         .98          .82    .70   .74
2d_3z    .63    .27    .19    .36    .39     .47         .65          .68    .61   .90
Page 24
Decentralised StarCraft Micromanagement
Page 25
Heuristic Performance
         Local Field of View (FoV)                                    Full FoV, Central Control
map      heur.  IAC-V  IAC-Q  cnt-V  cnt-QV  COMA(mean)  COMA(best)   heur.  DQN   GMEZO
3m       .35    .47    .56    .83    .83     .87         .98          .74    -     -
5m       .66    .63    .58    .67    .71     .81         .95          .98    .99   1.0
5w       .70    .18    .57    .65    .76     .82         .98          .82    .70   .74
2d_3z    .63    .27    .19    .36    .39     .47         .65          .68    .61   .90
Page 26
Baseline Algorithms
IAC-V: independent actor-critic with $V(\tau^a)$

IAC-Q: independent actor-critic with $A(\tau^a, u^a) = Q(\tau^a, u^a) - V(\tau^a)$

Central-V: centralised critic $V(s)$ with a TD-error-based gradient

Central-QV:
- Centralised critics $Q(s, \mathbf{u})$ and $V(s)$
- Advantage gradient $A(s, \mathbf{u}) = Q(s, \mathbf{u}) - V(s)$
- COMA, but with $b(s) = V(s)$
Page 27
Results (3m, 5m, 5w, 2d_3z)

[Figure: four panels of learning curves, one per map, plotting average win % against training episodes for COMA, central-V, central-QV, IAC-V, and IAC-Q, with the heuristic shown for reference.]
Page 28
Compared to Centralised Controllers
         Local Field of View (FoV)                                    Full FoV, Central Control
map      heur.  IAC-V  IAC-Q  cnt-V  cnt-QV  COMA(mean)  COMA(best)   heur.  DQN   GMEZO
3m       .35    .47    .56    .83    .83     .87         .98          .74    -     -
5m       .66    .63    .58    .67    .71     .81         .95          .98    .99   1.0
5w       .70    .18    .57    .65    .76     .82         .98          .82    .70   .74
2d_3z    .63    .27    .19    .36    .39     .47         .65          .68    .61   .90
Page 29
Future Work
Factored centralised critics for many agents
Multi-agent exploration
StarCraft macromanagement
Page 30
Paper
Counterfactual Multi-Agent Policy Gradients
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson
https://arxiv.org/abs/1705.08926
Page 31
Thank You Microsoft!
This work was made possible thanks to a generous donation of Azure cloud credits from Microsoft.