Multiagent (Deep) Reinforcement Learning
MARTIN PILÁT ([email protected])

Page 1:

Multiagent (Deep) Reinforcement Learning
MARTIN PILÁT ([email protected])

Page 2:

Reinforcement learning
The agent needs to learn to perform tasks in an environment

No prior knowledge about the effects of its actions

Maximizes its utility

Mountain Car problem (→)
◦ Typical RL toy problem

◦ Agent (car) has three actions – left, right, none

◦ Goal – get up the mountain (yellow flag)

◦ Weak engine – cannot just go to the right, needs to gain speed by going downhill first
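
The references at the end of the slides point to OpenAI Gym, which contains this problem; a minimal interaction sketch (assuming the classic gym API and the MountainCar-v0 environment, with a random policy standing in for a learned one) could look like this:

    import gym

    env = gym.make("MountainCar-v0")
    obs = env.reset()                       # observation = (position, velocity)
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # 0 = push left, 1 = none, 2 = push right; random here
        obs, reward, done, _ = env.step(action)
        total_reward += reward              # -1 per step until the flag (or the time limit) is reached
    print("episode return:", total_reward)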

Page 3:

Reinforcement learning
Formally defined using a Markov Decision Process (MDP): (S, A, R, p)

◦ s_t ∈ S – state space

◦ a_t ∈ A – action space

◦ r_t ∈ R – reward space

◦ p(s′, r | s, a) – probability that performing action a in state s leads to state s′ and gives reward r

Agent's goal: maximize the discounted return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = R_{t+1} + γG_{t+1}

Agent learns its policy π(A_t = a | S_t = s)
◦ Gives the probability of using action a in state s

State value function: V_π(s) = E_π[G_t | S_t = s]

Action value function: Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
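
A tiny sketch of the return recursion G_t = R_{t+1} + γG_{t+1}, computed backwards over a hypothetical reward sequence:

    # Hypothetical episode rewards R_1, ..., R_4 and discount factor gamma
    rewards = [1.0, 0.0, 0.0, 10.0]
    gamma = 0.9

    returns, G = [], 0.0
    for r in reversed(rewards):     # G_t = R_{t+1} + gamma * G_{t+1}, evaluated from the end
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    print(returns)                  # returns[0] = R_1 + gamma*R_2 + gamma^2*R_3 + gamma^3*R_4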

Page 4:

Q-Learning
Learns the Q function directly using the Bellman equation

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a))

During learning, a sampling policy is used (e.g. the ε-greedy policy – use a random action with probability ε, otherwise choose the best action)

Traditionally, Q is represented as a (sparse) matrix

Problems
◦ In many problems, the state space (or action space) is continuous → some kind of discretization must be performed

◦ Can be unstable

Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8, no. 3-4 (1992): 279-292.
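
A minimal sketch of the tabular update and ε-greedy sampling above, on a made-up 5-state chain world (not from the slides); the Q table is a sparse dictionary rather than a full matrix:

    import random
    from collections import defaultdict

    N, ALPHA, GAMMA, EPS = 5, 0.1, 0.9, 0.1
    Q = defaultdict(float)                    # Q[(state, action)], sparse

    def step(s, a):                           # toy chain: action 0 = left, 1 = right
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1   # reward only at the goal state

    for episode in range(200):
        s, done = 0, False
        while not done:
            # epsilon-greedy sampling policy (ties broken randomly)
            if random.random() < EPS or Q[(s, 0)] == Q[(s, 1)]:
                a = random.randrange(2)
            else:
                a = 0 if Q[(s, 0)] > Q[(s, 1)] else 1
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
            Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target   # the update rule above
            s = s2

    print([max((0, 1), key=lambda a: Q[(s, a)]) for s in range(N)])  # greedy action per state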

Page 5:

Deep Q-Learning
Q function represented as a deep neural network

Experience replay
◦ stores previous experience (state, action, new state, reward) in a replay buffer – used for training

Target network
◦ Separate network that is rarely updated

Optimizes the loss function

L(θ_i) = E[(r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i))²]

◦ θ, θ^− – parameters of the network and of the target network

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.
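
A minimal PyTorch sketch of the loss above with a frozen target network (a random placeholder batch stands in for samples from the replay buffer; shapes and sizes are illustrative, not the paper's architecture):

    import torch
    import torch.nn as nn

    n_obs, n_actions, batch, gamma = 4, 2, 32, 0.99
    q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())    # target network: rarely synced copy (theta^-)

    # One minibatch (s, a, r, s', done) as it would be sampled from the replay buffer
    s = torch.randn(batch, n_obs)
    a = torch.randint(n_actions, (batch,))
    r = torch.randn(batch)
    s2 = torch.randn(batch, n_obs)
    done = torch.zeros(batch)

    with torch.no_grad():                              # targets built from the frozen parameters
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta) for the actions taken
    loss = ((y - q) ** 2).mean()                       # L(theta) = E[(y - Q(s, a; theta))^2]
    loss.backward()                                    # a gradient step on theta would follow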

Page 6:

Deep Q-Learning
Successfully used to play single-player Atari games

Complex input states – video of the game

Action space quite simple – discrete

Rewards – changes in game score

Better than human-level performance
◦ Human level measured against an "expert" who played the game for around 20 episodes of max. 5 minutes each, after 2 hours of practice for each game.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.

Page 7:

Actor-Critic Methods
The actor (policy) is trained using a gradient that depends on a critic (an estimate of the value function)

Critic is a value function
◦ After each action, it checks whether things have gone better or worse than expected

◦ The evaluation is the error δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t)

◦ It is used to evaluate the action selected by the actor
◦ If δ is positive (the outcome was better than expected), the probability of selecting a_t should be strengthened (otherwise lowered)

Both actor and critic can be approximated using a NN
◦ Policy (π(s, a)) update – Δθ = α ∇_θ(log π_θ(s, a)) q(s, a)

◦ Value (q(s, a)) update – Δw = β (R(s, a) + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)) ∇_w q(s_t, a_t)

Works in continuous action spaces
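
A minimal one-step actor-critic sketch in PyTorch (sizes and numbers are illustrative; here the critic is a state-value network, so the TD error δ plays the role of the critic signal in the policy update):

    import torch
    import torch.nn as nn

    n_obs, n_actions = 4, 3
    actor = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, n_actions))   # pi_theta
    critic = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, 1))          # V_w
    opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
    opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-2)

    def update(s, a, r, s2, gamma=0.99):
        s = torch.as_tensor(s, dtype=torch.float32)
        s2 = torch.as_tensor(s2, dtype=torch.float32)
        # TD error delta = r + gamma*V(s') - V(s): did things go better or worse than expected?
        delta = r + gamma * critic(s2).detach()[0] - critic(s)[0]
        critic_loss = delta ** 2                      # moves V(s) toward the TD target
        log_prob = torch.log_softmax(actor(s), dim=-1)[a]
        actor_loss = -log_prob * delta.detach()       # strengthen a if delta > 0, weaken otherwise
        opt_actor.zero_grad(); opt_critic.zero_grad()
        (actor_loss + critic_loss).backward()
        opt_actor.step(); opt_critic.step()

    update(s=[0.1, -0.2, 0.0, 0.5], a=1, r=1.0, s2=[0.0, 0.0, 0.1, 0.4])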

Page 8:

Multiagent Learning
Learning in multi-agent environments is more complex – the agent needs to coordinate with other agents

Example – level-based foraging (→)
◦ The goal is to collect all items as fast as possible

◦ Agents can collect an item if the sum of their levels is greater than the item's level

Page 9:

Goals of Learning
Minmax profile

◦ For zero-sum games – (π_i, π_j) is a minimax profile if U_i(π_i, π_j) = −U_j(π_i, π_j)

◦ Guaranteed utility against a worst-case opponent

Nash equilibrium
◦ Profile (π_1, …, π_n) is a Nash equilibrium if ∀i ∀π_i′: U_i(π_i′, π_{−i}) ≤ U_i(π)

◦ No agent can improve its utility by unilaterally deviating from the profile (every agent plays a best response to the other agents)

Correlated equilibrium
◦ Agents observe signals x_i with joint distribution ξ(x_1, …, x_n) (e.g. a recommended action)

◦ Profile (π_1, …, π_n) is a correlated equilibrium if no agent can improve its expected utility by deviating from the recommended actions

◦ NE is a special type of CE – no correlation

Page 10:

Goals of Learning
Pareto optimum

◦ Profile (π_1, …, π_n) is Pareto-optimal if there is no other profile π′ such that ∀i: U_i(π′) ≥ U_i(π) and ∃i: U_i(π′) > U_i(π)

◦ One agent's utility cannot be improved without making another agent worse off

Social Welfare & Fairness
◦ The welfare of a profile is the sum of the agents' utilities, fairness is the product of the utilities

◦ A profile is welfare- or fairness-optimal if it has the maximum possible welfare/fairness

No-Regret
◦ Given a history H^t = (a^0, …, a^{t−1}), agent i's regret for not having taken action a_i is

R_i(a_i | H^t) = Σ_t [u_i(a_i, a_{−i}^t) − u_i(a_i^t, a_{−i}^t)]

◦ Policy π_i achieves no-regret if ∀a_i: lim_{t→∞} (1/t) R_i(a_i | H^t) ≤ 0.
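
A small sketch of the regret above for a two-action matrix game, with made-up payoffs and a made-up history:

    # u[(a_i, a_-i)] is agent i's payoff; history holds the joint actions actually played
    u = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}
    history = [("C", "D"), ("C", "C"), ("D", "D")]

    def regret(a_i):
        # R_i(a_i | H^t) = sum_t [ u_i(a_i, a_-i^t) - u_i(a_i^t, a_-i^t) ]
        return sum(u[(a_i, other)] - u[(own, other)] for own, other in history)

    print({a: regret(a) for a in ("C", "D")})   # no-regret: the time-average of these stays <= 0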

Page 11:

Joint Action Learning
Learns Q-values for joint actions a ∈ A

◦ joint action of all agents a = (a_1, …, a_n), where a_i is the action of agent i

Q_{t+1}(a_t, s_t) = (1 − α) Q_t(a_t, s_t) + α u_i^t

◦ u_i^t – utility received after joint action a_t

Uses an opponent model to compute expected utilities of actions
◦ E(a_i) = Σ_{a_{−i}} P(a_{−i}) Q_{t+1}((a_i, a_{−i}), s_{t+1}) – joint action learning

◦ E(a_i) = Σ_{a_{−i}} P(a_{−i} | a_i) Q_{t+1}((a_i, a_{−i}), s_{t+1}) – conditional joint action learning

Opponent models are estimated from the history as relative frequencies of the actions played (conditional frequencies in CJAL)

πœ– – greedy sampling

Page 12:

Policy Hill Climbing
Learns the policy π_i directly

Hill-climbing in policy space

β—¦ πœ‹π‘–π‘‘+1 = πœ‹π‘–

𝑑 𝑠𝑖𝑑 , π‘Žπ‘–π‘‘ + 𝛿 if π‘Žπ‘–

𝑑 is the best action according to 𝑄(𝑠𝑑 , π‘Žπ‘–π‘‘)

β—¦ πœ‹π‘–π‘‘+1 = πœ‹π‘–

𝑑 𝑠𝑖𝑑 , π‘Žπ‘–π‘‘ βˆ’

1

𝐴𝑖 βˆ’1otherwise

The parameter δ is adaptive – smaller when winning and larger when losing (the WoLF principle: learn fast when losing)
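
A minimal sketch of the PHC update above for a single state (δ is fixed here, whereas the WoLF variant adapts it); clipping and renormalization keep π a valid distribution:

    def phc_update(pi, Q_s, delta):
        """pi: {action: probability} for state s, Q_s: {action: Q(s, a)}."""
        best = max(Q_s, key=Q_s.get)
        k = len(pi)
        for a in pi:
            if a == best:
                pi[a] = min(1.0, pi[a] + delta)              # push probability toward the best action
            else:
                pi[a] = max(0.0, pi[a] - delta / (k - 1))    # and away from the others
        total = sum(pi.values())
        return {a: p / total for a, p in pi.items()}

    print(phc_update({"L": 0.5, "R": 0.5}, {"L": 0.2, "R": 0.7}, delta=0.1))   # -> {'L': 0.4, 'R': 0.6}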

Page 13:

Counterfactual Multi-agent Policy Gradients
Centralized training and decentralized execution (more information is available during training)

Critic conditions on the current observed state and the actions of all agents

Actors condition on their observed state

Credit assignment – based on difference rewards
◦ Reward of agent i ≈ the difference between the reward received by the system if joint action a was used, and the reward received if agent i had used a default action

◦ Requires assignment of default actions to agents

◦ COMA – marginalizes over all possible actions of agent i instead

Used to train micro-management of units in StarCraft

Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
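
A small sketch of the COMA baseline: instead of a default action, the counterfactual advantage marginalizes the centralized critic over agent i's own actions (the values below are made up):

    def counterfactual_advantage(Q_joint, pi_i, a_taken):
        """Q_joint[a]: critic's Q(s, (a, a_-i)) with the other agents' actions held fixed;
        pi_i[a]: agent i's policy in the current state."""
        baseline = sum(pi_i[a] * Q_joint[a] for a in Q_joint)   # marginalize over agent i's actions
        return Q_joint[a_taken] - baseline

    Q_joint = {"move": 1.0, "shoot": 2.5, "wait": 0.5}
    pi_i = {"move": 0.3, "shoot": 0.5, "wait": 0.2}
    print(counterfactual_advantage(Q_joint, pi_i, "shoot"))     # positive -> reinforce "shoot"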

Page 14:

Counterfactual Multi-agent Policy Gradients

Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.

Page 15:

Ad hoc Teamwork
Typically, the whole team of agents is provided by a single organization/team.

◦ There is some pre-coordination (communication, coordination, …)

Ad hoc teamwork
◦ A team of agents provided by different organizations needs to cooperate

◦ RoboCup Drop-In Competition – mixed players from different teams

◦ Many algorithms are not suitable for ad hoc teamwork
◦ They need many iterations of the game – typically only a limited amount of time is available

◦ Designed for self-play (all agents use the same strategy) – no control over other agents in ad hoc teamwork

Page 16:

Ad hoc Teamwork
Type-based methods

◦ Assume different types of agents

◦ Based on interaction history – compute a belief over the types of the other agents

◦ Play own actions based on the beliefs

◦ Can also add parameters to types
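
A small sketch of the belief update behind type-based methods, assuming a hypothetical set of types whose policies give the likelihood of the observed action:

    def update_belief(belief, likelihood, observed_action):
        """belief: P(type); likelihood[type][action]: P(action | type) from that type's policy."""
        posterior = {t: belief[t] * likelihood[t].get(observed_action, 1e-9) for t in belief}
        z = sum(posterior.values())
        return {t: p / z for t, p in posterior.items()}

    belief = {"aggressive": 0.5, "defensive": 0.5}
    likelihood = {"aggressive": {"attack": 0.8, "defend": 0.2},
                  "defensive": {"attack": 0.1, "defend": 0.9}}
    print(update_belief(belief, likelihood, "attack"))   # belief shifts toward the "aggressive" type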

Page 17:

Other problems in MAL
Analysis of emergent behaviors

◦ Typically no new learning algorithms, but single-agent learning algorithms evaluated in a multi-agent environment

◦ Emergent language
◦ Teach agents to use some language

◦ E.g. the signaling game – two agents are shown two images, one of them (the sender) is told the target and can send a message (from a fixed vocabulary) to the receiver; both agents receive a positive reward if the receiver identifies the correct image (a toy version is sketched after this list)

Learning communication
◦ Agents can typically exchange vectors of numbers for communication

◦ Maximization of shared utility by means of communication in a partially observable environment

Learning cooperation

Agents modelling agents
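
A toy version of the signaling game described above (random sender and receiver stand in for the learned agents; the vocabulary and images are placeholders):

    import random

    vocabulary = ["a", "b"]
    images = [0, 1]

    def play_episode(sender_policy, receiver_policy):
        target = random.choice(images)            # the sender is told which image is the target
        message = sender_policy(target)           # message drawn from the fixed vocabulary
        guess = receiver_policy(message)          # the receiver points at one of the images
        return 1.0 if guess == target else 0.0    # shared reward for both agents

    # Untrained agents act randomly; learning should converge to a consistent "language"
    print(play_episode(lambda t: random.choice(vocabulary),
                       lambda m: random.choice(images)))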

Page 18:

References and Further Reading
◦ Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.

◦ Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.

◦ Albrecht, Stefano, and Peter Stone. "Multiagent Learning - Foundations and Recent Trends." http://www.cs.utexas.edu/~larg/ijcai17_tutorial/

◦ A nice presentation about general multi-agent learning (slides available)

◦ OpenAI Gym. https://gym.openai.com/

◦ Environments for reinforcement learning

◦ Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." ArXiv:1509.02971 [Cs, Stat], September 9, 2015. http://arxiv.org/abs/1509.02971.

◦ An actor-critic method for reinforcement learning with continuous actions

◦ Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is Multiagent Deep Reinforcement Learning the Answer or the Question? A Brief Survey." ArXiv:1810.05587 [Cs], October 12, 2018. http://arxiv.org/abs/1810.05587.

◦ A survey on multiagent deep reinforcement learning

◦ Lazaridou, Angeliki, Alexander Peysakhovich, and Marco Baroni. "Multi-Agent Cooperation and the Emergence of (Natural) Language." ArXiv:1612.07182 [Cs], December 21, 2016. http://arxiv.org/abs/1612.07182.

◦ Emergence of language in multiagent communication