Page 1
Convergent Learning in Unknown Graphical Games
Dr Archie Chapman, Dr David Leslie, Dr Alex Rogers and Prof Nick Jennings
School of Mathematics, University of Bristol, and School of Electronics and Computer Science, University of Southampton
[email protected]
Page 2
Playing games?
            Rock      Paper     Scissors
Rock        (0,0)     (-1,1)    (1,-1)
Paper       (1,-1)    (0,0)     (-1,1)
Scissors    (-1,1)    (1,-1)    (0,0)
(entries are row player's payoff, column player's payoff)
Page 4
Playing games?
Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment.
Berkeley Engineering
Page 5
Learning in games
• Adapt to observations of past play
• Hope to converge to something “good”
• Why?!
  • Bounded rationality justification of equilibrium
  • Robust to behaviour of “opponents”
  • Language to describe distributed optimisation
Page 6
Notation
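• Players $i \in \{1, \dots, N\}$
• Discrete action sets $A^i$
• Reward functions $r^i : A^1 \times \cdots \times A^N \to \mathbb{R}$
• Mixed strategies $\pi^i \in \Delta^i = \Delta(A^i)$
• Joint mixed strategy space $\Delta = \Delta^1 \times \cdots \times \Delta^N$
• Reward functions extend to $r^i : \Delta \to \mathbb{R}$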
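The extension of the reward functions to mixed strategies can be written out explicitly. This is a sketch under the standard assumption, not stated on the slide, that players randomise independently, so the extended reward is the expectation of the pure-action reward under the product distribution:

```latex
r^i(\pi) \;=\; \sum_{a \in A^1 \times \cdots \times A^N}
  \Bigl( \prod_{j=1}^{N} \pi^j(a^j) \Bigr)\, r^i(a),
\qquad \pi = (\pi^1, \dots, \pi^N) \in \Delta .
```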
Page 7
Best response / Equilibrium
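• The mixed strategies of all players other than i are denoted $\pi^{-i}$
• Best response of player i is $b^i(\pi^{-i}) = \operatorname{argmax}_{\pi^i \in \Delta^i} r^i(\pi^i, \pi^{-i})$
• An equilibrium is a $\pi$ satisfying, for all i, $\pi^i \in b^i(\pi^{-i})$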
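As a concrete illustration of best responses and the equilibrium condition, the sketch below checks them in rock-paper-scissors. The payoff convention (win 1, loss -1, draw 0) and the function names are assumptions made for this example. It uses the fact that a mixed strategy is a best response exactly when it puts weight only on pure best responses.

```python
# Row player's payoffs in rock-paper-scissors (assumed convention:
# win = 1, loss = -1, draw = 0). The game is symmetric, so the column
# player's payoff for action b against a row mix uses the same matrix.
ACTIONS = ["Rock", "Paper", "Scissors"]
PAYOFF = [
    [0, -1, 1],   # Rock against Rock, Paper, Scissors
    [1, 0, -1],   # Paper
    [-1, 1, 0],   # Scissors
]

def expected_rewards(opponent_mix):
    """Expected reward of each pure action against a mixed strategy."""
    return [sum(p * PAYOFF[a][b] for b, p in enumerate(opponent_mix))
            for a in range(len(ACTIONS))]

def best_responses(opponent_mix, tol=1e-12):
    """Indices of the pure actions that attain the maximum expected reward."""
    values = expected_rewards(opponent_mix)
    best = max(values)
    return [a for a, v in enumerate(values) if v >= best - tol]

def is_equilibrium(mix_1, mix_2, tol=1e-12):
    """True if each mix puts weight only on pure best responses to the other."""
    br_1, br_2 = best_responses(mix_2), best_responses(mix_1)
    ok_1 = all(p <= tol or a in br_1 for a, p in enumerate(mix_1))
    ok_2 = all(p <= tol or a in br_2 for a, p in enumerate(mix_2))
    return ok_1 and ok_2
```

For example, `best_responses([1, 0, 0])` picks out Paper as the reply to an opponent who always plays Rock, and the uniform mix for both players passes `is_equilibrium`.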
Page 8
Fictitious play
• Estimate strategies of other players
• Select best action given estimates (using the game matrix)
• Update estimates
Page 9
Belief updates
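• Belief about strategy of player i is the MLE: $\sigma^i_t(a^i) = \kappa^i_t(a^i)/t$, where $\kappa^i_t(a^i)$ counts how often player i has played $a^i$ up to time t
• Online updating: $\sigma_{t+1} \in \sigma_t + \frac{1}{t+1}\left(b(\sigma_t) - \sigma_t\right)$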
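The belief update can be sketched in code. The following is a minimal two-player fictitious play loop on rock-paper-scissors, not the authors' implementation. The payoff convention, the lowest-index tie-breaking rule, the initial actions and all function names are assumptions made for this example.

```python
# Row player's payoffs in rock-paper-scissors (assumed: win 1, loss -1, draw 0).
# The game is symmetric, so both players can use the same matrix.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def best_response(belief):
    """Pure best response to a belief about the opponent's mixed strategy."""
    values = [sum(p * PAYOFF[a][b] for b, p in enumerate(belief)) for a in range(3)]
    return values.index(max(values))  # ties broken by lowest index (an assumption)

def update_belief(belief, action, t):
    """Online MLE update after t observations: move 1/(t+1) of the way to the new action."""
    return [p + ((1.0 if a == action else 0.0) - p) / (t + 1)
            for a, p in enumerate(belief)]

def fictitious_play(steps, first_actions=(0, 0)):
    """Run fictitious play; belief_about[i] is the empirical frequency of player i's actions."""
    belief_about = [[1.0 if a == first_actions[i] else 0.0 for a in range(3)]
                    for i in range(2)]
    for t in range(1, steps):
        # Each player best-responds to its belief about the other player.
        actions = (best_response(belief_about[1]), best_response(belief_about[0]))
        belief_about = [update_belief(belief_about[i], actions[i], t) for i in range(2)]
    return belief_about
```

Because rock-paper-scissors is zero-sum, the beliefs returned by `fictitious_play` drift towards the uniform mixed strategy as `steps` grows, even though the actions played keep cycling.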
Page 10
Stochastic approximation
• Processes of the form $X_{t+1} \in X_t + \alpha_{t+1}\left(F(X_t) + M_{t+1} + e_{t+1}\right)$, where $E(M_{t+1} \mid X_t) = 0$ and $e_t \to 0$
• F is set-valued (convex and u.s.c.)
• Limit points are chain-recurrent sets of the differential inclusion $\dot{X} \in F(X)$
Page 11
Best-response dynamics
• Fictitious play has M and e identically 0, and $\alpha_t = 1/t$
• Limit points are limit points of the best-response differential inclusion $\dot{\sigma} \in b(\sigma) - \sigma$
• In potential games (and zero-sum games and some others) the limit points must be Nash equilibria
Page 12
Generalised weakened fictitious play
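• Consider any process such that $\sigma_{t+1} \in \sigma_t + \alpha_{t+1}\left(b^{\epsilon_t}(\sigma_t) - \sigma_t + M_{t+1}\right)$, where $\alpha_t \to 0$ and $\epsilon_t \to 0$, and also an interplay between $\alpha$ and M.
• Convergence properties are unchanged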
Page 13
Fictitious play
• Estimate strategies of other players
• Select best action given estimates (using the game matrix)
• Update estimates
Page 14
Learning the game
            Rock      Paper     Scissors
Rock        (?,?)     (?,?)     (?,?)
Paper       (?,?)     (?,?)     (?,?)
Scissors    (?,?)     (?,?)     (?,?)
• Observed reward: $R^i_t = r^i(a_t) + e^i_t$
Page 15
Reinforcement learning
• Track the average reward for each joint action
• Play each joint action frequently enough
• Estimates will be close to the expected value
• Estimated game converges to the true game
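The averaging step can be sketched as an incremental sample mean per joint action. This is an illustration under assumptions not taken from the slides: joint actions are drawn uniformly at random so that each one recurs, observation noise is Gaussian, and the names `update_estimate` and `estimate_game` are hypothetical.

```python
import random

def update_estimate(estimates, counts, joint_action, reward):
    """Incremental sample mean of the rewards observed at one joint action."""
    counts[joint_action] = counts.get(joint_action, 0) + 1
    old = estimates.get(joint_action, 0.0)
    estimates[joint_action] = old + (reward - old) / counts[joint_action]

def estimate_game(true_reward, joint_actions, rounds, noise_sd=1.0, seed=0):
    """Estimate one player's reward table from noisy observations."""
    rng = random.Random(seed)
    estimates, counts = {}, {}
    for _ in range(rounds):
        a = rng.choice(joint_actions)  # uniform exploration (an assumption)
        observed = true_reward[a] + rng.gauss(0.0, noise_sd)  # assumed Gaussian noise
        update_estimate(estimates, counts, a, observed)
    return estimates
```

With enough rounds, every entry of the returned table sits close to the corresponding entry of `true_reward`, which is the sense in which the estimated game approaches the true game.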
Page 16
Q-learned fictitious play
• Estimate strategies of other players
• Select best action given estimates (using the estimated game matrix in place of the game matrix)
• Update estimates
Page 17
Theoretical result
Theorem – If all joint actions are played infinitely often then beliefs follow a GWFP
Proof: The estimated game converges to the true game, so selected strategies are $\epsilon$-best responses.
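A rough sketch of how reward estimation and belief updating might be combined for two players follows. It is an illustration, not the authors' algorithm. The constant exploration probability is a simplification that keeps every joint action recurring, whereas the theorem concerns responses whose error vanishes. Gaussian noise, the default value of 0 for unvisited joint actions and all names are assumptions.

```python
import random

def q_learned_fictitious_play(reward, n_actions, rounds, explore=0.1, noise_sd=0.1, seed=0):
    """Two players learn their rewards and each other's strategy at the same time.

    reward(i, a) is the true mean reward to player i at joint action a;
    the players only ever observe noisy samples of it.
    """
    rng = random.Random(seed)
    q = [{}, {}]        # q[i][a]: player i's running mean reward at joint action a
    visits = [{}, {}]   # visits[i][a]: number of samples behind q[i][a]
    counts = [[0] * n_actions for _ in range(2)]  # counts[i][k]: times player i played k

    def choose(i):
        other = 1 - i
        total = sum(counts[other])
        # Belief about the opponent: empirical frequencies (uniform before any data).
        belief = [c / total if total else 1.0 / n_actions for c in counts[other]]
        if rng.random() < explore:  # exploration keeps all joint actions recurring
            return rng.randrange(n_actions)

        def value(k):
            joint = lambda m: (k, m) if i == 0 else (m, k)
            return sum(belief[m] * q[i].get(joint(m), 0.0) for m in range(n_actions))

        return max(range(n_actions), key=value)  # best response in the estimated game

    for _ in range(rounds):
        a = (choose(0), choose(1))
        for i in range(2):
            counts[i][a[i]] += 1
            sample = reward(i, a) + rng.gauss(0.0, noise_sd)
            visits[i][a] = visits[i].get(a, 0) + 1
            old = q[i].get(a, 0.0)
            q[i][a] = old + (sample - old) / visits[i][a]
    beliefs = [[c / rounds for c in counts[i]] for i in range(2)]
    return beliefs, q
```

In a common-payoff game such as `reward = lambda i, a: float(a[0] + a[1])`, where action 1 is always better, both players' empirical frequencies concentrate on action 1 once the relevant table entries have been sampled.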
Page 18
Playing games?
Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment.
Berkeley Engineering
Page 19
It’s impossible!
• N players, each with A actions
• Game matrix has $A^N$ entries to learn
• Each individual must estimate the strategy of every other individual
• It’s just not possible for realistic game scenarios
Page 20
Marginal contributions
• Marginal contribution of player i is
total system reward – system reward if i absent
• If every player's marginal contribution is maximised, the system is at a (local) optimum
• Marginal contribution might depend only on the actions of a small number of neighbours
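The definition and the locality observation can be illustrated with a small coverage example. The reward model here is an assumption made for the sketch: the system reward is the total value of events observed by at least one sensor. The function names and the data layout are hypothetical.

```python
def global_reward(coverage, event_value):
    """Total value of events observed by at least one sensor.

    coverage maps each sensor to the set of events it observes
    under its current action.
    """
    observed = set().union(*coverage.values())
    return sum(event_value[j] for j in observed)

def marginal_contribution(i, coverage, event_value):
    """Total system reward minus the system reward with sensor i absent."""
    without_i = {k: v for k, v in coverage.items() if k != i}
    return global_reward(coverage, event_value) - global_reward(without_i, event_value)

def local_marginal_contribution(i, coverage, event_value):
    """The same quantity, computed only from sensors whose coverage overlaps i's."""
    neighbours = [v for k, v in coverage.items() if k != i and v & coverage[i]]
    seen_by_neighbours = set().union(*neighbours)
    return sum(event_value[j] for j in coverage[i] - seen_by_neighbours)
```

Under this assumed model the two computations agree, because an event watched by sensor i adds to the system reward only if no overlapping sensor also observes it. Sensors with no overlap never enter the local calculation.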
Page 22
Sensors – rewards
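• Global reward for action a is $U_g(a)$ [formula garbled in extraction; surviving terms: events $j$, $E_j$, indicator $I\{j \text{ is observed}\}$, $n_j(a)$, "events and observation"]
• Marginal reward for i is $r^i(a) = U_g(a) - U_g(a^{-i})$ [remainder garbled; surviving terms: sum over events $j$ observed by $i$, $E_j$, $n_j(a)$]
• Actually use $R^i_t$ [formula garbled; surviving terms: sum over events $j$ observed by $i$, $E^t_j$, $n_j(a^t)$]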
Page 23
Marginal contributions
Page 24
Local learning
• Estimate strategies of other players
• Select best action given estimates (using the estimated game matrix in place of the game matrix)
• Update estimates
Page 25
Local learning
• Estimate strategies of neighbours
• Select best action given estimates (using the estimated game matrix for local interactions)
• Update estimates
Page 26
Theoretical result
Theorem – If all joint actions of local games are played infinitely often then beliefs follow a GWFP
Proof: The estimated game converges to the true game, so selected strategies are $\epsilon$-best responses.
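The practical gain from restricting learning to local games is a matter of counting table entries. The sketch below compares the two sizes. The function names and the example numbers (20 sensors, 3 actions, 2 neighbours) are hypothetical and chosen only for illustration.

```python
def full_game_entries(n_players, n_actions):
    """Entries in a full reward table: one per joint action of all players."""
    return n_actions ** n_players

def local_game_entries(n_neighbours, n_actions):
    """Entries when a reward depends only on a player's own action and its neighbours'."""
    return n_actions ** (n_neighbours + 1)

# Hypothetical example: 20 sensors with 3 actions each, 2 neighbours per sensor.
full = full_game_entries(20, 3)    # 3**20 joint actions
local = local_game_entries(2, 3)   # 3**3 local joint actions
```

Playing every local joint action infinitely often is a far weaker requirement than doing so for every joint action of the whole system, since the local table does not grow with the total number of players.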
Page 28
So what?!
• Convergence to (local) optimum with only noisy information and local communication
• Individual rationality: always choose an action to maximise expected reward
• Robustness: If an individual doesn’t “play cricket”, the others will reach a “Nash response”.
Page 29
Summary
• Learning the game while playing is essential
• This can be accommodated within the GWFP framework
• Exploiting the neighbourhood structure of marginal contributions is essential for feasibility