Multiagent (Deep) Reinforcement Learning
MARTIN PILÁT ([email protected])
Reinforcement learning
The agent needs to learn to perform tasks in an environment
No prior knowledge about the effects of its actions
Maximizes its utility
Mountain Car problem
◦ Typical RL toy problem
◦ Agent (car) has three actions – left, right, none
◦ Goal – get up the mountain (yellow flag)
◦ Weak engine – the car cannot just go to the right, it needs to gain speed by going downhill first
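A minimal sketch of interacting with this environment through the OpenAI Gym interface referenced at the end of these slides; the environment name "MountainCar-v0" and the classic reset/step signatures are assumptions about the installed Gym version, and the random policy is only a placeholder for a learned one.

```python
import gym  # OpenAI Gym, see the references; API details may differ between versions

env = gym.make("MountainCar-v0")   # state = (position, velocity), actions: 0 = left, 1 = none, 2 = right
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random action – a learning agent chooses here instead
    obs, reward, done, info = env.step(action)  # reward is -1 per step until the flag is reached
    total_reward += reward
print("episode return:", total_reward)
```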
Reinforcement learning
Formally defined using a Markov Decision Process (MDP) $(S, A, R, p)$
◦ $S_t \in S$ – state space
◦ $A_t \in A$ – action space
◦ $R_t \in R$ – reward space
◦ $p(s', r \mid s, a)$ – probability that performing action $a$ in state $s$ leads to state $s'$ and gives reward $r$
Agent's goal: maximize the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$
The agent learns its policy $\pi(A_t = a \mid S_t = s)$
◦ Gives the probability of using action $a$ in state $s$
State value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Action value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
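As a small illustration of the recursive definition of the return, the sketch below computes $G_t$ for every step of a finite episode from a list of rewards; the reward values and $\gamma$ are made up for the example.

```python
# G_t = R_{t+1} + gamma * G_{t+1}, computed backwards from the end of the episode;
# rewards[t] plays the role of R_{t+1}, the reward received after acting at time t.
def discounted_returns(rewards, gamma):
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([-1.0, -1.0, 0.0, 10.0], gamma=0.9))  # approx. [5.39, 7.1, 9.0, 10.0]
```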
Q-Learning
Learns the $q$ function directly using the Bellman equation
$Q(S_t, A_t) \leftarrow (1 - \alpha)\, Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_a Q(S_{t+1}, a) \right)$
During learning, a sampling policy is used (e.g. the $\epsilon$-greedy policy – use a random action with probability $\epsilon$, otherwise choose the best action)
Traditionally, $Q$ is represented as a (sparse) matrix
Problems
◦ In many problems, the state space (or action space) is continuous – some kind of discretization must be performed
◦ Can be unstable
Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8, no. 3–4 (1992): 279–292.
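A minimal tabular Q-learning sketch under the assumption of discrete (or already discretized, hashable) states, a discrete action space, and the classic Gym-style reset/step interface; all hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)            # Q[(s, a)], zero by default – a dictionary instead of a sparse matrix
    actions = range(env.action_space.n)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy sampling policy
            if random.random() < epsilon:
                a = random.choice(list(actions))
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done, _ = env.step(a)
            # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'))
            bootstrap = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * bootstrap)
            s = s_next
    return Q
```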
Deep Q-Learning
$q$ function represented as a deep neural network
Experience replay
◦ stores previous experience (state, action, new state, reward) in a replay buffer – used for training
Target network
◦ Separate network that is only rarely updated
Optimizes the loss function
$L(\theta) = \mathbb{E}\left[\left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2\right]$
◦ $\theta$, $\theta^-$ – parameters of the network and of the target network
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.
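A sketch of one DQN training step on a minibatch sampled from the replay buffer, written against PyTorch; `q_net` and `target_net` are assumed to be networks mapping a batch of states to one Q-value per action, and the tensor shapes and dtypes are assumptions of this example.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch sampled from the replay buffer (actions as long tensor, dones as float tensor)
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                             # the target network gives a fixed target
        max_next = target_net(next_states).max(dim=1).values          # max_a' Q(s', a'; theta^-)
        target = rewards + gamma * (1.0 - dones) * max_next
    return F.mse_loss(q_sa, target)                                   # squared TD error from the slide
```

`target_net` is synchronized with `q_net` only once in a while (e.g. by copying the weights every few thousand steps), which is what the slide means by "rarely updated".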
Deep Q-Learning
Successfully used to play single-player Atari games
Complex input states – video of the game
Action space quite simple – discrete
Rewards – changes in the game score
Better than human-level performance
◦ Human-level performance measured against an "expert" who played each game for around 20 episodes of max. 5 minutes, after 2 hours of practice.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.
Actor-Critic Methods
The actor (policy) is trained using a gradient that depends on a critic (an estimate of the value function)
The critic is a value function
◦ After each action, it checks whether things went better or worse than expected
◦ The evaluation is the error $\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$
◦ It is used to evaluate the action selected by the actor
◦ If $\delta$ is positive (the outcome was better than expected), the probability of selecting $A_t$ should be strengthened (otherwise lowered)
Both the actor and the critic can be approximated using a NN
◦ Policy ($\pi(s, a)$) update: $\Delta\theta = \alpha \nabla_\theta (\log \pi_\theta(s, a))\, q(s, a)$
◦ Value ($q(s, a)$) update: $\Delta w = \beta \left( r(s, a) + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right) \nabla_w q(s_t, a_t)$
Works in continuous action spaces
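A one-step actor-critic update sketched in PyTorch, using the TD error $\delta$ from the slide as the critic's signal; `actor` (state → action logits) and `critic` (state → scalar value) are assumed modules, the tensors are unbatched for clarity, and the optimizers are created by the caller.

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, done, gamma=0.99):
    v_s = critic(s)
    with torch.no_grad():
        v_next = torch.zeros_like(v_s) if done else critic(s_next)
    delta = r + gamma * v_next - v_s                      # was the outcome better or worse than expected?

    critic_loss = delta.pow(2).mean()                     # move v(s) towards r + gamma v(s')
    log_prob = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -(delta.detach() * log_prob).mean()      # strengthen a if delta > 0, weaken it otherwise

    actor_opt.zero_grad(); critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step(); critic_opt.step()
```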
Multiagent Learning
Learning in multi-agent environments is more complex – agents need to coordinate with other agents
Example – level-based foraging
◦ The goal is to collect all items as fast as possible
◦ An item can be collected if the sum of the agents' levels is greater than the item's level
Goals of Learning
Minimax profile
◦ For zero-sum games – $(\pi_i, \pi_j)$ is a minimax profile if $U_i(\pi_i, \pi_j) = -U_j(\pi_i, \pi_j)$
◦ Guaranteed utility against a worst-case opponent
Nash equilibrium
◦ A profile $(\pi_1, \dots, \pi_n)$ is a Nash equilibrium if $\forall i\ \forall \pi_i': U_i(\pi_i', \pi_{-i}) \le U_i(\pi)$
◦ No agent can improve its utility by unilaterally deviating from the profile (every agent plays a best response to the other agents)
Correlated equilibrium
◦ Agents observe signals $x_i$ with joint distribution $p(x_1, \dots, x_n)$ (e.g. a recommended action)
◦ A profile $(\pi_1, \dots, \pi_n)$ is a correlated equilibrium if no agent can improve its expected utility by deviating from the recommended actions
◦ A NE is a special type of CE – one with no correlation
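As a small worked example of the Nash condition, the sketch below checks whether a pure-strategy profile of a two-player matrix game is a Nash equilibrium; the prisoner's-dilemma payoffs are only an illustration.

```python
import numpy as np

def is_nash(U1, U2, i, j):
    """(i, j) is a NE iff neither player gains by unilaterally deviating."""
    return U1[i, j] >= U1[:, j].max() and U2[i, j] >= U2[i, :].max()

# Prisoner's dilemma: action 0 = cooperate, 1 = defect
U1 = np.array([[3, 0],
               [5, 1]])
U2 = U1.T
print(is_nash(U1, U2, 1, 1))  # True  - (defect, defect) is the unique NE
print(is_nash(U1, U2, 0, 0))  # False - each player would rather defect
```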
Goals of Learning
Pareto optimum
◦ A profile $(\pi_1, \dots, \pi_n)$ is Pareto-optimal if there is no other profile $\pi'$ such that $\forall i: U_i(\pi') \ge U_i(\pi)$ and $\exists i: U_i(\pi') > U_i(\pi)$
◦ Cannot improve one agent without making another agent worse off
Social Welfare & Fairness
◦ The welfare of a profile is the sum of the agents' utilities, fairness is the product of their utilities
◦ A profile is welfare (fairness) optimal if it has the maximum possible welfare (fairness)
No-Regret
◦ Given history $H^t = (a^0, \dots, a^{t-1})$, agent $i$'s regret for not having taken action $a_i$ is
$R_i(a_i \mid H^t) = \sum_{\tau=0}^{t-1} \left[ u_i(a_i, a_{-i}^\tau) - u_i(a_i^\tau, a_{-i}^\tau) \right]$
◦ Policy $\pi_i$ achieves no-regret if $\forall a_i: \lim_{t \to \infty} \frac{1}{t} R_i(a_i \mid H^t) \le 0$
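A direct transcription of this definition into code; the representation of the history as a list of joint actions and the `utility` callback are assumptions of the example, and the function returns the average regret (the $1/t$ factor is already included).

```python
def average_regret(history, i, a_i, utility):
    """Average regret of agent i for not having always played a_i over `history`.

    history  - list of joint actions (a_1, ..., a_n) that were actually played
    utility  - function returning agent i's utility for a joint action
    """
    total = 0.0
    for joint in history:
        alt = list(joint)
        alt[i] = a_i                       # keep the others' actions, swap in a_i for agent i
        total += utility(tuple(alt)) - utility(joint)
    return total / len(history)            # no-regret: this should not stay above 0 as t grows
```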
Joint Action Learning
Learns $Q$-values for joint actions $a \in A$
◦ joint action of all agents $a = (a_1, \dots, a_n)$, where $a_i$ is the action of agent $i$
$Q^{t+1}(a^t, s^t) = (1 - \alpha)\, Q^t(a^t, s^t) + \alpha u_i^t$
◦ $u_i^t$ – utility received after joint action $a^t$
Uses an opponent model to compute the expected utilities of actions
◦ $E(a_i) = \sum_{a_{-i}} P(a_{-i})\, Q^{t+1}((a_i, a_{-i}), s^{t+1})$ – joint action learning
◦ $E(a_i) = \sum_{a_{-i}} P(a_{-i} \mid a_i)\, Q^{t+1}((a_i, a_{-i}), s^{t+1})$ – conditional joint action learning
Opponent models are estimated from the history as relative frequencies of the actions played (conditional frequencies in CJAL)
$\epsilon$-greedy sampling
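A stateless sketch of a joint-action learner for a repeated two-agent game (the single-state simplification and the class interface are choices made for this example): the opponent model is the relative frequency of the opponent's past actions, and actions are chosen $\epsilon$-greedily with respect to the expected utilities.

```python
import random
from collections import defaultdict

class JointActionLearner:
    def __init__(self, my_actions, opp_actions, alpha=0.1, epsilon=0.1):
        self.Q = defaultdict(float)        # Q[(a_i, a_j)] for joint actions
        self.counts = defaultdict(int)     # opponent model: how often the opponent played a_j
        self.my_actions, self.opp_actions = list(my_actions), list(opp_actions)
        self.alpha, self.epsilon = alpha, epsilon

    def expected_utility(self, a_i):
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        # E(a_i) = sum_{a_j} P(a_j) Q(a_i, a_j), with P(a_j) the relative frequency so far
        return sum(self.counts[a_j] / total * self.Q[(a_i, a_j)] for a_j in self.opp_actions)

    def act(self):
        if random.random() < self.epsilon:                  # epsilon-greedy sampling
            return random.choice(self.my_actions)
        return max(self.my_actions, key=self.expected_utility)

    def update(self, a_i, a_j, utility):
        self.counts[a_j] += 1                               # update the opponent model
        self.Q[(a_i, a_j)] = (1 - self.alpha) * self.Q[(a_i, a_j)] + self.alpha * utility
```

Conditional joint action learning would instead keep separate opponent counts for each of the agent's own actions and use the conditional frequencies $P(a_j \mid a_i)$.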
Policy Hill Climbing
Learns the policy $\pi_i$ directly
Hill-climbing in policy space
◦ $\pi_i^{t+1}(s^t, a_i^t) = \pi_i^t(s^t, a_i^t) + \delta$ if $a_i^t$ is the best action according to $Q(s^t, a_i^t)$
◦ $\pi_i^{t+1}(s^t, a_i^t) = \pi_i^t(s^t, a_i^t) - \frac{\delta}{|A_i| - 1}$ otherwise
Parameter $\delta$ is adaptive – smaller when winning and larger when losing (the "win or learn fast" principle)
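The hill-climbing step above, sketched for one state; `pi[s]` is the agent's probability vector over its own actions and `Q[s]` the corresponding Q-values, and the clipping and renormalization at the end is a practical detail added here to keep a valid distribution.

```python
import numpy as np

def phc_update(pi, Q, s, delta):
    n = len(pi[s])
    best = int(np.argmax(Q[s]))              # the currently best action according to Q
    for a in range(n):
        if a == best:
            pi[s][a] += delta                # push probability towards the greedy action
        else:
            pi[s][a] -= delta / (n - 1)      # and take it evenly from the other actions
    np.clip(pi[s], 0.0, 1.0, out=pi[s])      # keep pi[s] a valid probability distribution
    pi[s] /= pi[s].sum()
```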
Counterfactual Multi-agent Policy Gradients
Centralized training and decentralized execution (more information is available during training)
The critic conditions on the current observed state and the actions of all agents
The actors condition on their own observed state
Credit assignment – based on difference rewards
◦ The reward of agent $i$ ~ the difference between the reward received by the system if joint action $a$ was used, and the reward received if agent $i$ had used a default action
◦ Requires an assignment of default actions to agents
◦ COMA – marginalizes over all possible actions of agent $i$
Used to train micro-management of units in StarCraft
Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
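A sketch of COMA's counterfactual baseline for a single agent: the caller evaluates the centralized critic for every possible action of agent $i$ (with the other agents' actions kept fixed) and the agent's own policy is used to marginalize, so no default action is needed. The tensor interface is an assumption of this example.

```python
import torch

def counterfactual_advantage(q_all_actions_i, pi_i, taken_action):
    """q_all_actions_i[a] = Q(s, (a, a_-i)) from the centralized critic, for each action a of agent i;
    pi_i[a] = probability that agent i's current policy assigns to action a."""
    baseline = (pi_i * q_all_actions_i).sum()         # marginalize over agent i's own actions
    return q_all_actions_i[taken_action] - baseline   # advantage used in agent i's policy gradient
```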
Ad hoc Teamwork
Typically, the whole team of agents is provided by a single organization/team
◦ There is some pre-coordination (communication, coordination, …)
Ad hoc teamwork
◦ A team of agents provided by different organizations needs to cooperate
◦ RoboCup Drop-In Competition – mixed players from different teams
◦ Many algorithms are not suitable for ad hoc teamwork
◦ Need many iterations of the game – typically only a limited amount of time is available
◦ Designed for self-play (all agents use the same strategy) – but there is no control over the other agents in ad hoc teamwork
Ad hoc Teamwork
Type-based methods
◦ Assume different types of agents
◦ Based on the interaction history – compute a belief over the types of the other agents
◦ Play own actions based on the beliefs
◦ Can also add parameters to the types
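A sketch of the belief update used by type-based methods: a Bayesian update of the probability of each type after observing the other agent's action. The `action_prob(state, action)` interface of the type models is an assumption of this example.

```python
def update_belief(belief, type_models, state, observed_action):
    """belief: dict type -> probability; type_models: dict type -> model with action_prob(state, action)."""
    posterior = {t: belief[t] * m.action_prob(state, observed_action)   # prior * likelihood
                 for t, m in type_models.items()}
    total = sum(posterior.values())
    if total == 0.0:                       # no type explains the action: keep the previous belief
        return dict(belief)
    return {t: p / total for t, p in posterior.items()}
```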
Other problems in MAL
Analysis of emergent behaviors
◦ Typically no new learning algorithms, but single-agent learning algorithms evaluated in a multi-agent environment
◦ Emergent language – teach agents to use some language
◦ E.g. the signaling game – two agents are shown two images, one of them (the sender) is told which image is the target and can send a message (from a fixed vocabulary) to the receiver; both agents receive a positive reward if the receiver identifies the correct image
Learning communication
◦ Agents can typically exchange vectors of numbers for communication
◦ Maximization of a shared utility by means of communication in a partially observable environment
Learning cooperation
Agents modelling agents
References and Further Reading
◦ Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
◦ Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.
◦ Albrecht, Stefano, and Peter Stone. "Multiagent Learning – Foundations and Recent Trends." http://www.cs.utexas.edu/~larg/ijcai17_tutorial/
  ◦ Nice presentation about general multi-agent learning (slides available)
◦ OpenAI Gym. https://gym.openai.com/
  ◦ Environments for reinforcement learning
◦ Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." ArXiv:1509.02971 [Cs, Stat], September 9, 2015. http://arxiv.org/abs/1509.02971.
  ◦ Actor-Critic method for reinforcement learning with continuous actions
◦ Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is Multiagent Deep Reinforcement Learning the Answer or the Question? A Brief Survey." ArXiv:1810.05587 [Cs], October 12, 2018. http://arxiv.org/abs/1810.05587.
  ◦ A survey on multiagent deep reinforcement learning
◦ Lazaridou, Angeliki, Alexander Peysakhovich, and Marco Baroni. "Multi-Agent Cooperation and the Emergence of (Natural) Language." ArXiv:1612.07182 [Cs], December 21, 2016. http://arxiv.org/abs/1612.07182.
  ◦ Emergence of language in multiagent communication