MULTI-AGENT REINFORCEMENT LEARNING
Sparse Interactions
1
REINFORCEMENT LEARNING
• Agent acting in an unknown environment, learning to maximise a numerical reward signal
[Figure: agent-environment interaction loop; at each timestep t the agent takes action a(t), the environment returns the next state s(t+1) and reward r(t+1)]
2
MARKOV DECISION PROCESS
• Single agent!
• $M = \langle S, A, T, R \rangle$
• $S$: the set of states of the agent
• $A$: the set of actions the agent can take
• $T : S \times A \times S \rightarrow [0, 1]$: the transition function
• $R : S \times A \times S \rightarrow \mathbb{R}$: the reward function
3
Q-LEARNING
• Model-free reinforcement learning algorithm
• Stores a Q-value for every state-action pair
• Update rule (see the sketch below):
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r_t + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
4
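A minimal sketch of this update in Python (tabular, using NumPy). The environment interface (`env.reset()`, `env.step(a)`) and the hyperparameter values are illustrative assumptions, not something specified on the slides.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise exploit the current Q-values.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

On the gridworld examples that follow, each agent would simply run a loop like this on its own local state.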
SIMPLE EXAMPLE
5
RL WITH BOLTZMANN EXPLORATION
[Figure: steps to goal per episode for Agent 1 and Agent 2; left panel over 1,000 episodes, right panel over 5,000 episodes]
6
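The Boltzmann (softmax) exploration referred to here can be sketched as follows; the temperature value is an illustrative assumption.

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s,a) / temperature)."""
    prefs = q_values / temperature
    prefs = prefs - prefs.max()              # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# A high temperature explores almost uniformly; a low temperature exploits.
action = boltzmann_action(np.array([0.2, 1.5, 0.7, 0.1]), temperature=0.5)
```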
RL WITH ε-GREEDY (ε = 0.9)
[Figure: steps to goal per episode for Agent 1 and Agent 2; left panel over 1,000 episodes, right panel over 5,000 episodes]
7
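For comparison, an ε-greedy selection sketch. Note that ε = 0.9 on this slide is read here as the probability of taking the greedy action (the remaining 0.1 is spread over random actions); that convention is an assumption.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.9):
    """Assumed convention: with probability epsilon take the greedy action,
    otherwise pick a uniformly random one."""
    if np.random.rand() < epsilon:
        return int(np.argmax(q_values))
    return np.random.randint(len(q_values))
```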
MULTI-AGENT REINFORCEMENT LEARNING
• Agents influence each other
• Possibly conflicting interests
• Observations
• Expensive communication
[Figure: n agents each submit an action a1 ... an, forming the joint action a(t); the environment returns the joint state s(t+1) and the rewards r1(t+1) ... rn(t+1)]
8
MARKOV GAMES
• $n$: the number of agents
• $S = \{s_1, \ldots, s_N\}$: a finite set of states
• $A = A_1, \ldots, A_n$, with $A_k$ the action set of agent $k$
• $T : S \times A_1 \times \ldots \times A_n \times S \rightarrow [0, 1]$: the transition function
• $R_k : S \times A_1 \times \ldots \times A_n \times S \rightarrow \mathbb{R}$: the reward function of agent $k$
(a minimal container for this tuple is sketched below)
[Figure: timeline of states s(t), s(t+1), s(t+2), ..., joint actions a1(t) ... an(t), and per-agent rewards r1(t+1) ... rn(t+1), r1(t+2) ... rn(t+2)]
9
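A minimal container for this tuple, as one might code it; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
JointAction = Tuple[int, ...]   # one action per agent

@dataclass
class MarkovGame:
    n_agents: int
    states: List[State]                                              # S = {s_1, ..., s_N}
    action_sets: List[List[int]]                                     # A_k for each agent k
    transition: Callable[[State, JointAction], Dict[State, float]]   # T(s, a): distribution over S
    rewards: List[Callable[[State, JointAction, State], float]]      # R_k(s, a, s')
```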
SPARSE INTERACTIONS
• 1 agent: transitions & rewards only depend on that agent
• 2 agents, far away and not interacting with each other: transitions & rewards are independent of the state/action of the other agent
• 2 agents, close to each other and interacting: transitions & rewards are dependent
• Assumptions: agents can do something useful alone, and interactions are sparse (e.g. air traffic control, automated warehouses, ...)
[Figure: gridworld examples with goals G, G1 and G2]
10
TAXONOMY BASED ON STRATEGIC INTERACTIONS
[Figure: 2x2 taxonomy with columns "local state" / "joint state" and rows "independent actions" / "joint action (view or selection)"; entries: single agent RL; JAL; Nash-Q, CE-Q, ...; SuperAgent; MMDP-ILA (Vrancx et al. 2008); MG-ILA (Vrancx et al. 2008); the sparse-interaction approaches sit in between: Utile Coordination (Kok et al. 2005), Learning of Coordination (Melo et al. 2009), 2Observe (De Hauwere et al. 2009), CQ-Learning (De Hauwere et al. 2010), FCQ-Learning (De Hauwere et al. 2011)]
• State and actions must be communicated among agents
• State-action space is exponential in the number of agents
11
INTUITION OF SPARSE INTERACTIONS
When should agents observe the state information of other agents to avoid coordination problems?
[Decision diagram: Can another agent influence me? If no, act independently, as if single-agent; if yes, use a multi-agent technique to coordinate]
[Figure: gridworld examples with goals G, G1 and G2]
12
MODELING INTERACTIONS
• Dynamics of the system are a Markov game
• Model sparse interactions as a DEC-SIMDP (Melo et al., 2010), sketched in code below:
$\Gamma = \langle M_k, (M^{I,l}, S^{I,l}) \rangle$
• $M_k$: an MDP for each agent $k$ in the absence of other agents (containing local states)
• $(M^{I,l}, S^{I,l})$: a team Markov game for the local interaction between $K$ agents in $L$ interaction states (containing system states)
[Figure: gridworlds with goals G, G1 and G2 illustrating local states versus interaction states]
13
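One possible way to hold this decomposition in code (a sketch; the class and field names are assumptions, not notation from Melo et al.):

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set

@dataclass
class DecSIMDP:
    """Sparse-interaction decomposition: one local MDP per agent plus small
    team Markov games that only apply in the interaction states."""
    local_mdps: List[object]                              # M_k: agent k acting alone (local states)
    interaction_games: Dict[FrozenSet[int], object]       # M^{I,l}: joint model for a group of agents
    interaction_states: Dict[FrozenSet[int], Set[int]]    # S^{I,l}: system states where that game applies

    def in_interaction(self, group: FrozenSet[int], system_state: int) -> bool:
        # Outside these states every agent just follows its own local MDP M_k.
        return system_state in self.interaction_states.get(group, set())
```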
OUTLINE
• Learning of Coordination
• 2Observe
• CQ-Learning
• FCQ-Learning
• Transfer learning
14
Learning of Coordination
15
LEARNING OF COORDINATION
• Add a pseudo COORDINATE action (see the sketch below)
• External active perception
• Cost for coordination
16
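A rough sketch of the idea: besides its normal actions the agent gets a pseudo COORDINATE action; choosing it triggers active perception of nearby agents at a small cost, and only then is joint information used. The helper names (`perceive_nearby_agent`, `coordinated_action`) and the cost value are hypothetical, not the exact algorithm of Melo et al.

```python
COORDINATE = "coordinate"     # pseudo-action added to the agent's normal action set
PERCEPTION_COST = -0.1        # assumed cost for using active perception

def act(agent, local_state, env):
    action = agent.select(local_state)        # Q-based choice over A ∪ {COORDINATE}
    if action != COORDINATE:
        return env.step(action)
    # Active perception: is another agent close enough to matter?
    other_state = env.perceive_nearby_agent(agent)
    if other_state is None:
        chosen = agent.select_local(local_state)                  # nobody around: act alone
    else:
        chosen = agent.coordinated_action(local_state, other_state)
    s_next, r, done = env.step(chosen)
    return s_next, r + PERCEPTION_COST, done                      # coordination has a cost
```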
THE ALGORITHM
17
RESULTS
18
2Observe
19
PROBLEM SETTING
• Learn when to act upon sensory input
• Adaptive obstacle avoidance
• Save energy
20
INTERACTIONS AS A FUNCTION
• State space contains sensor data
• Sensor information is only partly relevant
• Interaction area is relative to the agent
• Special kind of sparse interactions, modeled as a DEC-LIMDP (Section 4.2)
• $I_k : S_k \rightarrow S_1 \times \ldots \times S_M$
• Approximating this function using a generalized learning automaton: 2Observe
21
SOLUTION METHOD: 2OBSERVE
[Decision diagram: Can another agent influence me? If no, act independently, as if single-agent; if yes, use a multi-agent technique to coordinate]
• GLA approximating the interaction function (see the sketch below)
• Single-agent Q-learning selecting actions based on local state information
• Communication protocol between the agents to avoid a collision in the next timestep
22
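A compact sketch of a generalized learning automaton (GLA) filling the first role: a parameterised stochastic unit that maps sensor features (for instance the other agent's relative position) to a binary "coordinate or not" decision and is updated with a REINFORCE-style rule. The feature choice, learning rate and exact update form are illustrative assumptions, not the precise 2Observe update.

```python
import numpy as np

class GLA:
    """Generalized learning automaton: P(coordinate | x) = sigmoid(w . x)."""

    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features + 1)        # last weight acts as a bias
        self.lr = lr

    def decide(self, features):
        x = np.append(features, 1.0)
        p = 1.0 / (1.0 + np.exp(-self.w @ x))
        action = int(np.random.rand() < p)       # 1 = coordinate, 0 = act independently
        return action, p, x

    def update(self, action, p, x, reward):
        # REINFORCE-style update: reinforce decisions that led to reward.
        self.w += self.lr * reward * (action - p) * x

# Usage: features could be the relative position (dx, dy) of the other agent.
gla = GLA(n_features=2)
action, p, x = gla.decide(np.array([1.0, -2.0]))
gla.update(action, p, x, reward=1.0)
```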
EXPERIMENTAL SETTING
• Reach the goal
• Avoid collisions
23
EXPERIMENTAL RESULTS (TUNNELTOGOAL)
[Figure: steps to goal, collisions, and coordinations per episode (10,000 episodes) for Independent Q-learning, Joint-state learners, MMDP and 2Observe; the third panel shows the 2Observe coordinations]
24
EXPERIMENTAL RESULTS (2) (TUNNELTOGOAL)
• Interactions are relative to the agent
• The GLA can approximate this interaction area
25
CQ-Learning
26
PROBLEM SETTING
• Agents only interact where their policies interfere
• Locally adapt the policy
[Figure: gridworld examples with goals G, G1 and G2]
27
REPRESENTATION IDEA
[Figure: local state space (a grid with cells 1-9); states 4 and 6 are expanded into the augmented states 4-1, 4-2, 4-3, 6-1, 6-2, which can later be generalised again]
28
SOLUTION METHOD: CQ-LEARNING
[Decision diagram: Can another agent influence me? If no, act independently, as if single-agent; if yes, use a multi-agent technique to coordinate]
• Statistical test on the rewards
• Single-agent Q-learning selecting actions based on local state information
• Q-learning based on the combination of local state information and the state information of another agent
29
CQ-LEARNING: STATISTICAL TESTS
• Agents have been learning alone in the environment
• Agent k acts independently, using only local state information (s^k), in a multi-agent environment
• Perform a statistical test against this single-agent baseline (expected reward)
• Sample rewards based on the state information of other agents & perform the same test (see the sketch below)
[Figure: reward samples per local state s^k_1 ... s^k_4; the samples for s^k_4 are split further by the other agent's state s^l_1, s^l_2, s^l_3, and s^k_4 is expanded into the augmented state ⟨s^k_4, s^l_3⟩]
30
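A sketch of such a test with SciPy: first check whether the rewards observed for a local state-action pair differ from the single-agent baseline, then test each other-agent state separately to decide which information to add. The use of Welch's t-test and the significance level are illustrative assumptions; the slides only say "statistical test".

```python
from scipy import stats

def differs_from_baseline(baseline_rewards, observed_rewards, alpha=0.05):
    """Do the rewards in the multi-agent setting differ from the single-agent baseline?"""
    _, p_value = stats.ttest_ind(baseline_rewards, observed_rewards, equal_var=False)
    return p_value < alpha

def states_to_include(baseline_rewards, rewards_by_other_state, alpha=0.05):
    """Return the other-agent states whose presence changes the reward distribution;
    for those, the local state s^k is expanded into <s^k, s^l>."""
    return [other for other, samples in rewards_by_other_state.items()
            if differs_from_baseline(baseline_rewards, samples, alpha)]
```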
CQ-LEARNING BASELINE FOR STATISTICAL TESTS
• $W^k_1$: initial rewards (sliding window) for a particular state-action pair, collected while learning alone
• $W^k_2$: rewards observed for that pair in the multi-agent environment
• Compare $W^k_1$ against $W^k_2$
[Figure: steps to goal per episode for Agent 1 and Agent 2 while learning alone (the baseline)]
31
EXPERIMENTAL RESULTS (1)

Grid game 2 (min steps: 3)
Alg      #states       #actions   #coll   #steps
Indep    9             4          2.7     22.2 ± 17.9
JS       81            4          0.1     4.0 ± 0.2
JSA      81            16         0.0     4.7 ± 0.1
LOC      9.9 ± 0.5     5          0.1     4.0 ± 0.4
CQ       10 ± 0.0      4          0.0     3.6 ± 0.3
CQ NI    10.9 ± 2.0    4          0.1     4.0 ± 0.3

ISR (min steps: 4)
Alg      #states       #actions   #coll   #steps
Indep    43            4          0.4     9.3 ± 44.8
JS       1849          4          0.1     5.7 ± 1.6
JSA      1849          16         0.0     7.6 ± 1.4
LOC      51.3 ± 82.3   5          0.2     6.7 ± 7.5
CQ       49.0 ± 2.3    4          0.1     5.1 ± 0.7
CQ NI    49.9 ± 7.8    4          0.1     6.0 ± 1.9
32
EXPERIMENTAL RESULTS (2)
• Sample run
[Figure: gridworld with goals G1 and G2 showing a sample run]
33
FCQ-Learning
34
PROBLEM SETTING
• Reflected in the immediate reward signal
• Too late to solve the problem
[Figure: depending on the order in which the agents reach their goals, the reward is +20 or +10]
35
DETECTING RELEVANT STATES
• Changes in reward signal are reflected in the Q-values
[Figure: evolution of the Q-values per episode for the state-action pairs (12,4), (13,4), (14,4), (15,4), (10,3), (9,2), (8,2), (7,2), (6,2), (1,3) on the path to goal G]
36
FCQ-LEARNING: STATISTICAL TESTS
• Agent k has been learning alone, and its Q-values have converged
• Agent k acts independently, using only local state information (s^k), in a multi-agent environment
• Performs a statistical test against the learned single-agent Q-values
• Samples rewards (Monte Carlo) and performs a comparison test to determine what information should be included (see the sketch below)
[Figure: reward samples per local state s^k_1 ... s^k_4 compared against the learned Q-values; the samples for s^k_4 are split further by the other agent's state s^l_1, s^l_2, s^l_3, and s^k_4 is expanded into the augmented state ⟨s^k_4, s^l_3⟩]
37
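A rough sketch of the detection step, assuming the comparison is between sampled (Monte Carlo) returns and the converged single-agent Q-value; the one-sample t-test and the threshold are illustrative choices, not the exact tests of FCQ-Learning.

```python
from scipy import stats

def q_value_drifted(converged_q, sampled_returns, alpha=0.05):
    """Has the observed return from this state-action pair drifted away from the
    Q-value learned alone? (signals a coordination problem further down the line)"""
    _, p_value = stats.ttest_1samp(sampled_returns, popmean=converged_q)
    return p_value < alpha

def information_to_include(converged_q, returns_by_other_state, alpha=0.05):
    """Pick the other-agent state information that explains the drift."""
    return [other for other, returns in returns_by_other_state.items()
            if q_value_drifted(converged_q, returns, alpha)]
```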
EXPERIMENTAL RESULTS

Grid game 2
Algorithm   #states        #actions   #collisions   #steps             reward
Indep       9              4          2.4 ± 0.0     22.7 ± 30.4        -24.3 ± 35.6
JS          81             4          0.1 ± 0.0     6.3 ± 0.3          18.2 ± 0.6
LOC         9.0 ± 0.0      5          1.8 ± 0.0     10.3 ± 2.7         -6.8 ± 8.0
FCQ         19.4 ± 4.4     4          0.1 ± 0.0     8.1 ± 13.9         17.6 ± 3.7
FCQ NI      21.7 ± 3.1     4          0.1 ± 0.0     7.1 ± 6.9          17.9 ± 0.7

Bottleneck
Algorithm   #states        #actions   #collisions   #steps             reward
Indep       43             4          n.a.          n.a.               n.a.
JS          1849           4          0.0 ± 0.0     23.3 ± 30.8        13.1 ± 36.1
LOC         54.0 ± 0.8     5          1.7 ± 0.6     167.2 ± 19,345.1   -157.5 ± 10,327.0
FCQ         124.5 ± 32.8   4          0.1 ± 0.0     17.3 ± 1.3         16.6 ± 0.4
FCQ NI      135.0 ± 88.7   4          0.2 ± 0.0     19.2 ± 5.6         15.4 ± 2.3
38
EXPERIMENTAL RESULTS
• Order to reach the goal:
  • Red Agent (+20)
  • Blue Agent (+20)
  • Green Agent (+20)
39
Transfer Learning
[Figure: source agent and target agent, each running the 2Observe algorithm: a generalized learning automaton decides whether coordination is needed; if not, single-agent Q-learning; if so, coordination through communication]
40
TRANSFER LEARNING
“Transfer of learning occurs when learning in one context enhances (positive transfer) or undermines (negative transfer) a
related performance in another context.”
(D. Perkins, G. Salomon, Transfer of Learning, 1992, International Encyclopedia of Education)
41
MOTIVATIONS FOR TRANSFER LEARNING
• Learning tabula rasa can be extremely slow
• Lots of data / time may be needed
• Every algorithm has biases: why use an uninformed bias?
• Humans always use past knowledge
• What knowledge is relevant?
• How can it be effectively leveraged?
42
TRANSFER LEARNING WITH 2OBSERVE
[Figure: source agent and target agent, each running the 2Observe algorithm; the GLA approximating the interaction function decides whether another agent can influence the agent: if not, single-agent Q-learning selects actions based on local state information; if so, a communication protocol between the agents avoids a collision in the next timestep; the learned knowledge is transferred from the source agent to the target agent]
43
RESULTS
[Figure: steps to goal per iteration (10,000 iterations) for Agent 1, Agent 2 and Agent 3, three panels]
44
RESULTS (COORDINATION)
[Figure: number of collisions and coordinations per iteration (10,000 iterations), three panels]
45
GENERALISATION WITH CQ-LEARNING
[Figure: a neural network takes Δ(x), Δ(y) and the actions a1, a2 as input and outputs 0 | 1; the local state space (cells 1-9) is expanded into the augmented states 4-1, 4-2, 4-3, 6-1, 6-2 and then generalised; the generalisation learned with 2Observe is compared with the one learned with CQ-learning (see the sketch below)]
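The classifier in this figure (inputs Δ(x), Δ(y), a1, a2; output 0 or 1) can be sketched as a tiny feed-forward network; the layer size, the training data and the scikit-learn usage below are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Inputs: [delta_x, delta_y, action_agent1, action_agent2]; target: 1 = coordination needed.
# The training pairs are made up purely to show the shapes involved.
X = np.array([[ 1.0,  0.0, 2, 3],
              [ 0.0,  1.0, 1, 0],
              [ 5.0,  4.0, 0, 2],
              [-6.0,  3.0, 3, 1]])
y = np.array([1, 1, 0, 0])

net = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    solver="lbfgs", random_state=0)
net.fit(X, y)

# At run time the agent asks the network whether the current situation is dangerous.
dangerous = net.predict([[1.0, -1.0, 0, 2]])[0]   # 0 | 1
```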
GENERALISATION WITH CQ-LEARNING (2)
[Figure: comparison of a safe initialisation and a danger initialisation]
47
GENERALISATION WITH CQ-LEARNING (2)
[Figure: learned generalisation per action: NORTH, SOUTH, EAST, WEST]
48
TRANSFER LEARNING WITH CQ-LEARNING
[Figure: in the source task, CQ-learning augments the state spaces of agent k and agent l; a rule learning system (RIPPER) generalises the augmented states; the trained classifier and the Q_aug-table are transferred to the target task, where the classifier decides whether coordination is needed: if not, single-agent Q-learning; if so, Q-learning initialised with the Q_aug-table (see the sketch below)]
49
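A sketch of how the transferred artefacts might be used in the target task: the trained classifier gates between the plain local Q-table and Q-values initialised from the transferred Q_aug-table. All names (`classifier`, `q_local`, `q_aug`) are placeholders, not the thesis' implementation.

```python
import numpy as np

def select_action(classifier, q_local, q_aug, local_state, other_state, joint_features):
    """Target-task action selection after transfer (illustrative sketch)."""
    if classifier.predict([joint_features])[0] == 0:
        # Coordination is not needed: plain single-agent Q-learning on the local state.
        return int(np.argmax(q_local[local_state]))
    # Coordination is needed: use Q-values initialised from the transferred Q_aug-table.
    return int(np.argmax(q_aug[(local_state, other_state)]))
```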
TRANSFER LEARNING WITH CQ-LEARNING (2)
50
RESULTS
[Figure: steps to goal per episode (1,000 episodes) for CQ-learning and transfer learning, three panels]
51
RESULTS (2)
[Figure: collisions per episode (1,000 episodes) for CQ-learning and transfer learning, three panels]
52
CONCLUSIONS
• In multi-agent environments with sparse interactions, learning these interaction states improves the learning process
• Interaction states can be learned through increased penalties for miscoordination
• GLA can approximate interaction areas relative to the agent
• Interaction states can be identified using statistical tests on the reward signal (immediate + future)
• Information about interaction states can be generalized and transferred between agents and environments
53