EQUILIBRIUM AND CONTROL IN COMPLEX
INTERCONNECTED SYSTEMS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Sachin Adlakha
August 2010
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/kx619tt4623
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrea Goldsmith, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Ramesh Johari
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Sanjay Lall
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Large-scale complex systems such as power grids, transportation systems, and social
networks are reshaping every aspect of modern society. Despite their ubiquitous na-
ture, the design and understanding of such complex networks is still very challenging.
Decision making in such systems is complicated by the fact that an agent’s optimal
choice depends on the choices made by other agents in the system. In a smart grid,
the power consumption of an individual user could depend on the demand profile of
other users, some of whom may be physically far away. An investment decision by an
agent in an online auction is affected by the strategic choices of other agents partic-
ipating in the auction. Thus, a node’s decision is affected by the presence and the
actions of other nodes in the system. The multitude of dependencies arising in such
environments leads to an extremely complicated decision making process for a single
agent.
Often in complex systems, the decision maker has partial information about the
state of the system. For example, a centralized load balancer in a server farm obtains
the state of the queues via a communication network. This network introduces delays
and losses which result in partial information at the decision maker. This further
complicates the decision making process.
In this thesis, we study equilibrium and control in complex interconnected systems.
In the first part of the thesis, we investigate centralized decision making in a networked
system in the presence of delays. Specifically, we show that even in the presence of delays,
a centralized decision maker can make optimal decisions with only a subset of the
past history of the system. This history depends on the structure of the system as well
as the associated delay pattern. From a practical point of view, these results show
that one can make optimal decisions with only finite memory about the past, thus
eliminating the need to store the entire history. Thus, for example, a centralized load
balancer in a server farm can use algorithms based on only a finite past to evenly
distribute load across multiple servers.
In the second part of the thesis, we look at decentralized decision making in a
reactive environment. We describe a mean field approach to decision making in large-
scale systems. The basic premise of this approach is to treat other agents as a single
entity with some aggregate behavior. We develop a unified framework to study mean
field equilibrium in large-scale stochastic games. Under a set of simple assumptions,
we prove the existence of a mean field equilibrium. A key insight developed from our
result is that existence is closely related to how well a mean field equilibrium
approximates the actual behavior. Thus, a single agent can make near-optimal
decisions based only on aggregate behavior of other agents.
We conclude the thesis with various interesting extensions and open challenges in
the design and understanding of complex interconnected systems.
Acknowledgments
This thesis is a culmination of a journey that started at Stanford about five years ago.
I was fortunate to have the guidance and friendship of several people who made this
journey enjoyable. First and foremost, I thank my adviser, Prof. Andrea Goldsmith,
for being a wonderful adviser and a mentor. She took a leap of faith by agreeing
to fund me from the first day, thus enabling me to come to Stanford. Without her
confidence and trust, I would not have been at Stanford, much less write this thesis.
She has always been very encouraging and supportive of me as I built collaborations
with different people and developed my research interests. She also made the Wireless
Systems Laboratory feel more like home. She was very generous in inviting us to
various parties at her home. It was a joy to meet her family - discussions with
Arturo sharpened my thought process and Daniel and Nicole’s company was always
a welcome change from the daily grind of research. For that, I sincerely thank her
and her family.
Early on in my research career, I was fortunate to work and interact with Prof.
Sanjay Lall. His breadth and depth of knowledge constantly amazed me and made
me realize how little I know. Besides being a great researcher, he is also a wonderful
and a very generous teacher. It is from him that I learned the art of learning. He
personally spent countless hours teaching me everything I know about control systems
and Markov decision processes. He also mentored me and taught me how to write
good papers and give good talks. Every paper I ever write, every talk I give in future
will always bear his signature. For all his time and efforts, I will always be thankful
to him.
My sincere thanks are also due to my co-adviser Prof. Ramesh Johari. Ramesh's
enthusiasm for research, his drive for perfection, and his sheer ability to work hard
constantly amazed me. During my entire graduate school career, he was the bench-
mark I strove to achieve. He constantly challenged my limits and helped me realize
my potential. Besides working on research, he spent a lot of time giving me guidance
and career advice. His guidance allowed me to understand my strengths, realize my
weaknesses, and helped me push my limits. My experience at Stanford would have
never been the same, had I not had the pleasure of working with him. For the count-
less hours he spent thinking about my work, for genuinely caring about my work and
my career, and for making me realize my potential, I shall forever be indebted to him.
A significant portion of this thesis is based on the work of Prof. Gabriel Weintraub
of Columbia University. He not only provided the seeds of this work; he also helped
guide me through it. Gabriel comes from a very different background and has a
unique perspective on research. He was generous with his time and shared his ideas
with me. For his guidance and help at every step of my work, I express my deepest
gratitude.
My Ph.D. at Stanford would almost not have happened had it not been for one
person, who had more faith in me than I had in myself. Convinced that Stanford was
beyond my reach, I had almost decided not to apply. It was only at the urging of Ram
that I finally decided to take the chance. He even promised to pay the application
fee which he still owes me. But what I owe him can never be repaid. He has been a
true friend, believing in me more than I ever believed in myself. During all the ups
and downs of this grueling journey, he was a constant source of encouragement and
support. Mere words of thanks can never do justice to all that he has done for me.
My stay at Stanford gave me an opportunity to make some wonderful friends. I
would like to thank Mayank Jain for being extraordinarily helpful at every step of
the way. He was always generous with his time - spending several hours going over the
details of my work with me. He is also the reason that I survived Stanford without
ever owning a car. His generosity will always be remembered. Part of this research
started as a course project that was jointly done with Vineet Abhishek. Even though
he left Stanford for greener pastures, the seeds we jointly sowed as a course project
flourished as part of my thesis. The joint project also provided an opportunity to
know him better and to grow as friends. Several arguments and discussions over tea,
and our regular dinners at “Treehouse” shall always be fondly remembered.
Life at Stanford would have never been the same without the company of several
friends. Forum Parmar - whose infectious laughter lightened the mood of the most serious
of conversations, Mridul Agarwal - who provided me company on the various hiking
trips we made, Dinkar Gupta - whose extraordinary culinary skills and stimulating
company provided for some wonderful dinner nights, Saurabh Jain - who dazzled us
with some wonderful desserts, and Kannan, Kadambari and Abhay - who provided
wonderful company, made these past five years worth living and remembering. During
the last few years, I also had the pleasure of knowing Suchitra Vijayan, first as Ram's wife
and then as a very caring friend. The times spent complaining about Ram, discussing
politics, and exchanging recipes will always be fondly remembered.
The daily grind of school was made more bearable by the presence and company
of Vinay Majjigi who provided me company every time I needed a break. He was
very generous in sharing with us his Mom’s food which made me miss home a tad
bit less. Dan O’Neill offered me some very valuable advice and shared his years of
experience and his unique perspective on life. It would not be an exaggeration to say
that various discussions with him will certainly have an impact on whatever future
career I pursue.
The members of the Wireless Systems Laboratory provided a very intellectually
stimulating environment. Ivana Maric was a very patient and adjusting office mate
who suffered as we converged on the right temperature in our office. Bruno Sinopoli
jump started my research career as soon as I joined Stanford. Various members of
the Wireless Systems Laboratory (both past and present) made this a fun place to
be.
Special thanks are due to Maria Kazandjieva for being my running buddy and
for her wonderful company, to Michelle Hewlett, Samar Fahmy, Sara Lefort, Hattie
Dong and Thomas John for providing a reason (other than work) to come to office
every day, to Sophie and Jonathan for being friends from the first day I came to the
US, to Patrick Burke for helping me with every computer related issue, and to Pat
Oshiro, Bernadette Aguiao and to Joice DeBolt for making bureaucratic work less of
a hassle.
During the last five years, Sanjay Bhal and his family opened the doors of their
heart and provided me a home away from home. Poorvi Bhabhi ensured that I never
missed home cooked food. Kuhoo’s angelic face and innocent remarks made me forget
the stress of work and life. Kaustubh and Prisha never made me miss my nephew
and niece in India. The friendship and their warm hospitality made even the hardest
periods of life bearable.
Last, but not least, my deepest gratitude is for my family, who were always
there to support me through various challenges of my life. My sisters (Pooja, Prerna,
and Kashika) and my brothers (Ashish and Ketan), who endured my absence as I
focused more on my work, were always very supportive. My parents endured several
hardships so that I could follow my passion – they always supported me at every stage
of my life, and sacrificed their dreams so that I could achieve mine. This journey would
never have been possible without their love, support and encouragement. This thesis
is dedicated to them!
Contents
Abstract iv
Acknowledgments vi
1 Introduction 1
1.1 Networked Control Systems with Delayed Information . . . . . . . . . 3
Note that $\bar{A}$ in equation (2.5), $\bar{g}_t$ in equation (2.6), and the conditional probability in equation (2.7) are independent of the POMDP policy $K$. Furthermore, equation (2.5) shows that, given the action sequence or the policy, the evolution of $\xi_t$ is Markov.
CHAPTER 2. NETWORKED MARKOV DECISION PROCESSES 19
From the above definition, it is clear that associated with any POMDP is a sufficient information state MDP $(\bar{A}, \bar{g})$. Let $h^{\text{i-mdp}}_t$ be the history of the sufficient information state MDP at time $t$. Then, we have
$$h^{\text{i-mdp}}_t = \big(u_{0:t-1},\ \xi_{0:t}\big).$$
We will use $i^{\text{i-mdp}}_t$ to denote a realization of $h^{\text{i-mdp}}_t$ as
$$i^{\text{i-mdp}}_t = \big(a_{0:t-1},\ q_{0:t}\big).$$
As before, we define a sufficient information state MDP policy as a mapping from the history of the information state MDP to an action at time $t$. Let $K_t$ be a sufficient information state MDP policy. As before, we can interpret $K_t$ as
$$K_t(a_t, i_t) = \text{Prob}\big(u_t = a_t \mid h^{\text{i-mdp}}_t = i_t\big).$$
The following theorem shows that we can find an optimal POMDP policy by consid-
ering the associated MDP over the sufficient information state.
Theorem 8. Consider a POMDP $(A, C, g)$ and let $\mathcal{P}_{\text{pomdp}}$ be the set of all POMDP policies. Let $(\bar{A}, \bar{g})$ be the sufficient information state MDP associated with the given POMDP and let $\mathcal{P}_{\text{i-mdp}}$ be the set of all sufficient information state MDP policies. Then, for any $T$, we have
$$\min_{\substack{K_t \in \mathcal{P}_{\text{pomdp}} \\ t = 0, 1, \dots, T}}\ \sum_{t=0}^{T} \mathbb{E}\big[g_t(z_t, a_t)\big] \;=\; \min_{\substack{K_t \in \mathcal{P}_{\text{i-mdp}} \\ t = 0, 1, \dots, T}}\ \sum_{t=0}^{T} \mathbb{E}\big[\bar{g}_t(q_t, a_t)\big].$$
Proof. The proof follows from standard dynamic programming techniques as given
in Chapter 6 of [40].
From the above theorem, it is clear that one can find an optimal policy for a
POMDP by transforming it into a sufficient information state MDP. Given an optimal
sufficient information state policy $K^{\text{opt}}$, one may immediately compute the optimal
POMDP policy by composing $K^{\text{opt}}$ with $\gamma$. The optimal sufficient information state
policy $K^{\text{opt}}$ may be found using the standard dynamic programming recursion. From [48],
we know that the optimal policy for an MDP is a function of its current state. In
other words, the optimal policy for a POMDP is just a function of its sufficient
information state ξt. One such sufficient information state is the entire history of
the POMDP, where γt is an identity function [40]. As we show below, for a certain
class of POMDPs (in particular, for networked MDPs), the sufficient information state
includes only a finite past history of observations and control actions. In other words,
for a certain class of POMDPs, the function γt is a projection operator. Also note
that the above theorem can be easily extended to the infinite horizon case (both
average cost as well as discounted cost), as long as the limiting value of the sum of
the costs is well defined. For the discounted infinite horizon case, we can incorporate
the discount factor in the time dependent cost function.
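The standard dynamic programming recursion invoked above can be sketched for a generic finite MDP. This is a minimal illustration only; the transition matrices and stage costs below are hypothetical placeholders, not taken from the text.

```python
import numpy as np

def finite_horizon_dp(A, g, T):
    """Backward dynamic programming for a finite MDP.

    A[a][s, s2] = Prob(next state s2 | current state s, action a),
    g[s, a]     = per-stage cost (terminal cost taken as zero).
    Returns the optimal policy (T x n_states actions) and the value V_0.
    """
    n_actions, n_states = len(A), g.shape[0]
    V = np.zeros(n_states)                      # value-to-go after stage T
    policy = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        # Q[s, a] = immediate cost + expected value-to-go
        Q = g + np.stack([A[a] @ V for a in range(n_actions)], axis=1)
        policy[t] = np.argmin(Q, axis=1)
        V = Q[np.arange(n_states), policy[t]]
    return policy, V

# Toy 2-state, 2-action instance (numbers are made up):
A = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # action 0
     np.array([[0.5, 0.5], [0.5, 0.5]])]    # action 1
g = np.array([[0.0, 0.3],                   # state 0 costs
              [1.0, 0.6]])                  # state 1 costs
policy, V0 = finite_horizon_dp(A, g, T=5)
```

For the discounted case mentioned above, multiplying the value-to-go term by a factor in $(0, 1)$ inside the recursion gives the same structure.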
2.3 Networked Markov Decision Processes
A networked Markov decision process (N-MDP) is a weighted directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \dots, n\}$ is a finite set of vertices and $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ is a set of edges. Each vertex $i \in \mathcal{V}$ represents a Markov decision process. An edge $(i, j) \in \mathcal{E}$ if the MDP at vertex $i$ directly affects the MDP at vertex $j$. Associated with each edge $(i, j) \in \mathcal{E}$ is a non-negative integer weight, $M_{ij}$, which specifies the delay for the dynamics of vertex $i$ to propagate to vertex $j$. We assume without loss of generality that $(i, i) \notin \mathcal{E}$.

Associated with each $j \in \mathcal{V}$, let $\text{Pa}_j$ be the set of all vertices with an incoming edge to vertex $j$, specifically
$$\text{Pa}_j = \{i \in \mathcal{V} \mid (i, j) \in \mathcal{E}\}.$$
Similarly, for each $j \in \mathcal{V}$, let $\text{Ch}_j$ be the set of all vertices connected by an edge outgoing from vertex $j$, specifically
$$\text{Ch}_j = \{i \in \mathcal{V} \mid (j, i) \in \mathcal{E}\}.$$
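In code, the parent and child sets follow directly from the edge list. A minimal sketch using the edges of the four-vertex graph of Figure 2.2; the delay values attached to the edges are made up for illustration:

```python
def parents_children(edges, n):
    """Pa[j]: vertices with an edge into j; Ch[j]: vertices j points into.

    edges: dict mapping (i, j) -> delay M_ij, with vertices numbered 1..n.
    """
    Pa = {j: {i for (i, k) in edges if k == j} for j in range(1, n + 1)}
    Ch = {j: {k for (i, k) in edges if i == j} for j in range(1, n + 1)}
    return Pa, Ch

# Edges of the Figure 2.2 graph; the delay values are illustrative only.
M = {(1, 2): 1, (2, 1): 1, (2, 3): 2, (4, 2): 1, (3, 4): 2, (4, 3): 2}
Pa, Ch = parents_children(M, n=4)
```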
Thus, $\text{Pa}_j$ is the set of vertices that affect the system at node $j$ and $\text{Ch}_j$ is the set of vertices that are affected by the system at node $j$. At each time $t$, the state of the MDP at vertex $i$ belongs to a finite set $\mathcal{X}^i$. The decision or control action taken at vertex $i$ is drawn out of a finite set $\mathcal{U}^i$.

Remark. In the remainder of this section, we denote $\mathcal{X}^{-i} = \prod_{j \in \text{Pa}_i} \mathcal{X}^j$. Also denote $\mathcal{X}^{(n)} = \prod_{i=1}^{n} \mathcal{X}^i$ as the Cartesian product of the state spaces corresponding to all vertices. Similarly, let $\mathcal{U}^{(n)} = \prod_{i=1}^{n} \mathcal{U}^i$.
Definition 9. A networked Markov decision process is a tuple $(A, g)$ where

1. $A$ is a set of transition matrices $\{A^i_t, t \ge 0 \mid i \in \mathcal{V}\}$ with $A^i_0 : \mathcal{X}^i \to [0, 1]$ for all $i \in \mathcal{V}$, such that for all $z \in \mathcal{X}^i$, we have
$$A^i_0(z) \ge 0 \quad \text{and} \quad \sum_{z} A^i_0(z) = 1.$$
For $t > 0$, we have $A^i_t : \mathcal{X}^i \times \mathcal{X}^i \times \mathcal{X}^{-i} \times \mathcal{U}^i \to [0, 1]$ such that, for all $i \in \mathcal{V}$ and for all $a \in \mathcal{U}^i$ and $z \in \mathcal{X}^{-i}$, we have
$$A^i_t(z_1, z_2, z, a) \ge 0 \quad \forall\, z_1, z_2 \in \mathcal{X}^i, \qquad \sum_{z_1} A^i_t(z_1, z_2, z, a) = 1 \quad \forall\, z_2 \in \mathcal{X}^i.$$

2. $g$ is a sequence $g_0, g_1, \dots$ with $g_t : \mathcal{X}^{(n)} \times \mathcal{U}^{(n)} \to [0, 1]$.
As an example of a networked Markov decision process, consider a networked system consisting of four subsystems as shown in Figure 2.1. The system dynamics are
$$x^i_{t+1} = f^i\big(x^i_t,\ \{x^j_{t-M_{ji}} \mid j \in \text{Pa}_i\},\ u^i_t,\ w^i_t\big), \qquad (2.8)$$
for all $i \in \mathcal{V}$. Here $u^i_t \in \mathcal{U}^i$ is the control action applied to subsystem $i$ at time $t$. The random variables $x^i_0, w^i_t$ for $t \ge 0$ and $i \in \mathcal{V}$ are independent, i.e., the noise processes are independent across both time and subsystems. The directed graph corresponding to this networked MDP is shown in Figure 2.2.
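A minimal simulation in the style of equation (2.8), for two scalar subsystems with mutual propagation delays; the update rules, noise levels, and delay values are invented for illustration:

```python
import random

def simulate(T, M21, M12, seed=0):
    """Simulate two coupled subsystems in the style of equation (2.8):
    x1[t+1] depends on x1[t] and x2[t - M21]; x2[t+1] on x2[t] and
    x1[t - M12]. States before time 0 are taken to be 0; the update
    rules and noise levels are hypothetical."""
    rng = random.Random(seed)
    x1, x2 = [0.0], [0.0]
    for t in range(T):
        x2_delayed = x2[t - M21] if t - M21 >= 0 else 0.0
        x1_delayed = x1[t - M12] if t - M12 >= 0 else 0.0
        w1, w2 = rng.gauss(0, 0.1), rng.gauss(0, 0.1)
        x1.append(0.5 * x1[t] + 0.25 * x2_delayed + w1)  # f^1 (illustrative)
        x2.append(0.5 * x2[t] + 0.25 * x1_delayed + w2)  # f^2 (illustrative)
    return x1, x2

x1, x2 = simulate(T=20, M21=2, M12=3)
```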
Figure 2.1: A network of interconnected subsystems with delays. Subsystem $i$ is denoted by $S_i$, the network propagation delay from $S_i$ to $S_j$ is denoted by $M_{ij}$, and the measurement delay from $S_i$ to the controller is denoted by $N_i$.
Associated with this system is a networked MDP $(A, g)$ as defined below. For $p \in \mathcal{X}^i$, let $A^i_0(p) = \text{Prob}(x^i_0 = p)$ be the probability mass function of the initial state of subsystem $i \in \mathcal{V}$. The initial states $x^1_0, \dots, x^n_0$ are chosen independently. For $t > 0$, let
$$A^i_t(z, p, q, a) = \text{Prob}\big(x^i_t = z \mid x^i_{t-1} = p,\ \{x^j_{t-1-M_{ji}} = q^j \mid j \in \text{Pa}_i\},\ u^i_{t-1} = a\big), \qquad (2.9)$$
be the conditional probability mass function of the state $x^i_t$ given the previous states $x^i_{t-1}$ and $\{x^j_{t-1-M_{ji}} \mid j \in \text{Pa}_i\}$ and the applied input $u^i_{t-1}$. It is easy to verify that the sequence $A$ satisfies the properties in Definition 9. The sequence $g_t(x_t, u_t)$ represents the cost at time $t$ and depends on the state of the system $x_t = (x^1_t, \dots, x^n_t)$ as well as the action $u_t = (u^1_t, \dots, u^n_t)$ applied at time $t$.
In a networked MDP, the controller needs to choose a control action corresponding to each vertex $i \in \mathcal{V}$. The actions are chosen based on the information available to the controller at time $t$.

Figure 2.2: Directed graph for the network of Figure 2.1.

Associated with each vertex $i \in \mathcal{V}$ of a networked MDP, we have a non-negative integer $N_i$ which specifies the delay in receiving the state measurement from system $i$. We define $h^{\text{n-mdp}}_t$ to be the information available to the decision-maker at time $t$, given by
$$h^{\text{n-mdp}}_t = \big(x^1_{0:t-N_1},\ u^1_{0:t-1},\ \dots,\ x^n_{0:t-N_n},\ u^n_{0:t-1}\big).$$
Also define $i^{\text{n-mdp}}_t$ to be a realization of $h^{\text{n-mdp}}_t$ as
$$i^{\text{n-mdp}}_t = \big(z^1_{0:t-N_1},\ a^1_{0:t-1},\ \dots,\ z^n_{0:t-N_n},\ a^n_{0:t-1}\big).$$
Thus, the observations received by the decision-maker at time t consist of the state
of the subsystem i delayed by Ni time steps. A networked MDP policy specifies the
decisions taken at time t.
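The information pattern $h^{\text{n-mdp}}_t$ is mechanical to assemble: the controller sees each subsystem's state trajectory truncated by its measurement delay, plus all past actions. A sketch with placeholder trajectories and delays:

```python
def delayed_history(t, states, actions, N):
    """For each subsystem i, return the states x^i_{0:t-N_i} (empty when
    t - N[i] < 0) together with the past actions u^i_{0:t-1}."""
    return {i: (states[i][:max(0, t - N[i] + 1)], actions[i][:t])
            for i in states}

# Hypothetical trajectories for two subsystems over times 0..5.
states = {1: [10, 11, 12, 13, 14, 15], 2: [20, 21, 22, 23, 24, 25]}
actions = {1: [0, 1, 0, 1, 0], 2: [1, 1, 0, 0, 1]}
h = delayed_history(t=4, states=states, actions=actions, N={1: 0, 2: 2})
```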
Definition 10 (Networked-MDP Policy). A networked MDP policy is a sequence $K = (K_0, K_1, \dots)$ where
$$K_0 : \mathcal{U}^{(n)} \times \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i} \to [0, 1]$$
and
$$K_t : \mathcal{U}^{(n)} \times \prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i} \times \prod_{i=1}^{n} (\mathcal{U}^i)^{t} \to [0, 1],$$
for all $t \in \mathbb{Z}_{++}$, such that
$$K_0(a, z) \ge 0 \quad \forall\, a \in \mathcal{U}^{(n)},\ \forall\, z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i}, \qquad \sum_{a} K_0(a, z) = 1 \quad \forall\, z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i},$$
and for all $t \in \mathbb{Z}_{++}$, for all $a_1 \in \mathcal{U}^{(n)}$, $z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i}$, and $a_2 \in \prod_{i=1}^{n} (\mathcal{U}^i)^{t}$, we have
$$K_t(a_1, z, a_2) \ge 0, \qquad \sum_{a_1} K_t(a_1, z, a_2) = 1.$$
Note that for all times $t$, the product $\prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i}$ in the above definition is taken over those $i$ for which $t + 1 - N_i$ is strictly positive. For the networked systems as given in equation (2.8), a general mixed control policy is defined as a sequence of transition matrices $\{K_t, t \ge 0\}$ given by
$$K_t(a_t, i_t) = \text{Prob}\big(u_t = a_t \mid h^{\text{n-mdp}}_t = i_t\big).$$
2.3.1 Networked MDP as a POMDP
In networked MDPs, although the controller receives state information from the subsystems, these states are delayed by different amounts. Thus, a networked MDP can be written as a POMDP. Consider a networked MDP as given in Definition 9. Let us define a new state $\bar{x}_t = \{x^i_{t-b':t} \mid i \in \mathcal{V}\}$, where we choose $b' = \max_{i,j \in \mathcal{V}} M_{ij} + \max_{i \in \mathcal{V}} N_i$. The state $\bar{x}$ is chosen such that in the resulting system the observation at time $t$ is only a function of the current state at time $t$. It is easy to check that there exists a function $\bar{f}$ such that
$$\bar{x}_{t+1} = \bar{f}(\bar{x}_t, u_t, w_t).$$
Associated with this function is a transition probability mass function $\bar{A}_t(\bar{z}_{t+1}, \bar{z}_t, a_t)$, where $\bar{z}_t$ is the realization of the state $\bar{x}_t$. The observation at any time $t$ is given as
$$y_t = h(\bar{x}_t).$$
Corresponding to this observation process is a probability mass function $\bar{C}_t(s_t, \bar{z}_t)$, where $s_t$ is the realization of the observation $y_t$ and is given as
$$s_t = \{z^i_{t-N_i} \mid i \in \mathcal{V}\}.$$
The cost function is given as
$$\bar{g}_t(\bar{x}_t, u_t) = g_t(x_t, u_t). \qquad (2.10)$$
It is easy to check that the functions $\bar{A}_t$, $\bar{C}_t$, and $\bar{g}_t$ satisfy the properties given in Definition 4. The networked MDP can thus be written as a POMDP $(\bar{A}, \bar{C}, \bar{g})$.
As shown in the above subsection, we can write any networked MDP as a POMDP.
In the next chapter, we compute the sufficient information state (as defined in Defi-
nition 7) for networked MDPs.
Chapter 3
Information State for Networked MDPs
In this chapter, we establish the main result associated with the information state
for networked MDPs. This result establishes that the sufficient information state
for networked Markov decision processes consists only of a finite number of past
observations. As we will see, these finite numbers, or bands, depend only on the network
structure and the associated delays. We begin by making the following definitions.
Definition 11. Let
$$d_i = \max\Big\{N_i,\ \max_{k \in \text{Pa}_i} (N_k - M_{ki} - 1)\Big\} \qquad (3.1)$$
and define the integers $b_i$ by
$$b_i = \max\Big\{d_i,\ \max_{k \in \text{Ch}_i} (d_k + M_{ik})\Big\} - N_i. \qquad (3.2)$$

Remark. In the remainder of this chapter, we use the following additional notation. We define a new function $P_t$ for $t \ge 0$ by
$$P_t = A^1_{0:t} A^2_{0:t} \cdots A^n_{0:t}.$$
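Equations (3.1) and (3.2) can be evaluated directly from the delay data. A sketch; the two-vertex graph and its delay values are illustrative:

```python
def compute_bands(M, N):
    """Bands of Definition 11.

    M: dict (i, j) -> propagation delay M_ij;  N: dict i -> measurement delay.
    d_i = max(N_i, max_{k in Pa_i} (N_k - M_ki - 1))        -- eq. (3.1)
    b_i = max(d_i, max_{k in Ch_i} (d_k + M_ik)) - N_i      -- eq. (3.2)
    """
    V = N.keys()
    Pa = {i: [k for (k, j) in M if j == i] for i in V}
    Ch = {i: [k for (j, k) in M if j == i] for i in V}
    d = {i: max([N[i]] + [N[k] - M[(k, i)] - 1 for k in Pa[i]]) for i in V}
    b = {i: max([d[i]] + [d[k] + M[(i, k)] for k in Ch[i]]) - N[i] for i in V}
    return d, b

# Two vertices with illustrative delays.
d, b = compute_bands(M={(1, 2): 1, (2, 1): 3}, N={1: 2, 2: 0})
```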
Define
$$\alpha_t = \big\{z^i_{0:t-N_i},\ a^i_{0:t-1} \mid i \in \mathcal{V}\big\}, \qquad \beta_t = \big\{z^i_{t-N_i-b_i:t-N_i},\ a^i_{t-d_i:t-1} \mid i \in \mathcal{V}\big\}.$$
Furthermore, the notation $z \notin \alpha_t$ means the set
$$\{z \mid z \notin \alpha_t\} = \big\{z^i_{t-N_i+1:t} \mid i \in \mathcal{V}\big\},$$
and the notations $z \notin \beta_t$ and $a \notin \beta_t$ mean the sets
$$\{z \mid z \notin \beta_t\} = \big\{z^i_{0:t-N_i-b_i-1} \mid i \in \mathcal{V}\big\}, \qquad \{a \mid a \notin \beta_t\} = \big\{a^i_{0:t-d_i-1} \mid i \in \mathcal{V}\big\}.$$
Recall that any list of variables $x_{t_1:t_2}$ with $t_2 < t_1$ is interpreted as empty.
The following theorem is the main result for networked MDPs. It defines a sufficient information state for a networked Markov decision process. It shows that a networked MDP can be converted into a fully observable MDP with a state that is bounded and does not grow with time. Note that a networked MDP can be written as a POMDP $(\bar{A}, \bar{C}, \bar{g})$, with state $\bar{x}$.

Theorem 12. Consider a networked Markov decision process. Then,
$$\xi_t = \big\{u^i_{t-d_i:t-1},\ x^i_{t-N_i-b_i:t-N_i} \mid i \in \mathcal{V}\big\} \qquad (3.3)$$
is a sufficient information state for the networked MDP.
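Given the bands, the sufficient information state of equation (3.3) is a pair of sliding windows per vertex. A sketch with placeholder trajectories and band values:

```python
def info_state(t, x, u, N, b, d):
    """xi_t from equation (3.3): for each vertex i, the actions
    u^i_{t-d_i:t-1} and the states x^i_{t-N_i-b_i:t-N_i}.
    Indices below 0 are clipped, modelling the empty-list convention."""
    xi = {}
    for i in x:
        acts = u[i][max(0, t - d[i]):t]
        lo, hi = max(0, t - N[i] - b[i]), max(0, t - N[i] + 1)
        xi[i] = (acts, x[i][lo:hi])
    return xi

# One vertex with N=1, b=1, d=2 (illustrative values): at t=5 the
# controller needs the actions u_{3:4} and the states x_{3:4}.
x = {1: list(range(10))}
u = {1: list(range(100, 110))}
xi = info_state(5, x, u, N={1: 1}, b={1: 1}, d={1: 2})
```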
To prove this theorem, we check the conditions of a sufficient information state
as given in Definition 7. The following key lemma shows that ξt as defined in equa-
tion (3.3) satisfies the first condition of a sufficient information state as given in
equation (2.5).
Lemma 13. Consider a networked Markov decision process $(A, g)$ and a networked MDP policy $K$. Define
$$\bar{A}_{t+1}(q_{t+1}, q_t, a_t) \triangleq \text{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_t = q_t,\ u_t = a_t\big),$$
where we have used the notation $\gamma_t(s_{0:t}, a_{0:t-1}) = q_t$.

Proof. Note that the sequence $\xi_{0:t}$ consists of the variables $\{x^i_{0:t-N_i}, u^i_{0:t-1} \mid i \in \mathcal{V}\}$. Also, from Section 2.3.1, we know that $y_t = \{x^i_{t-N_i} \mid i \in \mathcal{V}\}$. The lemma follows trivially from these two facts.
Proof of Theorem 12. From Lemmas 13, 14, and 15, we get that ξt as defined in
equation (3.3) is a sufficient information state for a networked MDP.
Figure 3.1: A networked Markov decision process with action delays. The control action delay to subsystem $S_i$ is denoted by $P_i$.
3.1 Networked MDP with Action Delays
In this section, we extend our result to the case where the control action does not take effect immediately. Consider a networked Markov decision process as shown in Figure 3.1. The system dynamics are
$$x^i_{t+1} = f^i\big(x^i_t,\ \{x^j_{t-M_{ji}} \mid j \in \text{Pa}_i\},\ u^i_{t-P_i},\ w^i_t\big),$$
for all $i \in \mathcal{V}$. Here $u^i_{t-P_i}$ is the control action chosen at time $t - P_i$, which takes effect at subsystem $i$ at time $t$.
To obtain a sufficient information state for a networked MDP with action delays, we convert this system into a networked MDP with no action delays. To do this, let us define a new state $\tilde{x}^i_t = (x^i_t, u^i_{t-P_i:t-1})$ for all $i \in \mathcal{V}$. As before, if any $P_i = 0$, we interpret the list $u^i_{t-P_i:t-1}$ as empty and thus $\tilde{x}^i_t = x^i_t$. This new state is chosen such that the state evolution of each subsystem at time $t + 1$ depends on the current state and action at time $t$. Thus, a networked MDP with action delays can be reformulated as a networked MDP with no action delays with system dynamics given as
$$\tilde{x}^i_{t+1} = \tilde{f}^i\big(\tilde{x}^i_t,\ \{\tilde{x}^j_{t-M_{ji}} \mid j \in \text{Pa}_i\},\ u^i_t,\ w^i_t\big),$$
for all $i \in \mathcal{V}$. Using Theorem 12, we know that a sufficient information state for this new system consists of the past states $\tilde{x}^i_{t-b_i-N_i:t-N_i}$ and past control actions $u^i_{t-d_i:t-1}$ for all $i \in \mathcal{V}$. Let us define a new band $\tilde{d}_i$ as
$$\tilde{d}_i = \begin{cases} d_i & \text{if } P_i = 0, \\ b_i + N_i + P_i & \text{otherwise.} \end{cases} \qquad (3.14)$$
Using this definition, it is easy to check that a sufficient information state for a networked MDP with action delays consists of the past states $x^i_{t-b_i-N_i:t-N_i}$ and past control actions $u^i_{t-\tilde{d}_i:t-1}$ for all $i \in \mathcal{V}$. This gives us the following theorem.
Theorem 16. Consider a networked Markov decision process with action delays.
Then,
$$\xi_t = \big\{u^i_{t-\tilde{d}_i:t-1},\ x^i_{t-N_i-b_i:t-N_i} \mid i \in \mathcal{V}\big\}$$
is a sufficient information state for a networked MDP with action delays.
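The modified band of equation (3.14) is a one-line adjustment on top of the band computation. The numbers below reuse the bands $b_1 = 3$, $b_2 = 1$, $d_1 = 0$ of the two-subsystem example in Section 3.3.1 ($d_2 = 1$ follows from equation (3.1)); the action delay $P_1 = 2$ is hypothetical:

```python
def action_delay_band(d, b, N, P):
    """d-tilde from equation (3.14): d_i when P_i = 0, else b_i + N_i + P_i."""
    return {i: d[i] if P[i] == 0 else b[i] + N[i] + P[i] for i in d}

# Bands from the Section 3.3.1 example; the action delay P_1 = 2 is made up.
d_tilde = action_delay_band(d={1: 0, 2: 1}, b={1: 3, 2: 1},
                            N={1: 0, 2: 1}, P={1: 2, 2: 0})
```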
3.2 Discussion
From Theorem 12, we note that every networked MDP has a sufficient information state $\xi_t$ given by equation (3.3), which depends on only a finite history of the states and control actions. Thus, from Definition 7 we have that associated with every networked MDP is a tuple $(\bar{A}, \bar{g})$, where $\bar{A}_t$ is the transition matrix given by
$$\bar{A}_{t+1}(q_{t+1}, q_t, a_t) = \text{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_t = q_t,\ u_t = a_t\big),$$
and $\bar{g}_t$ is the cost function associated with this new MDP. The cost function is given by equation (2.6). From Theorem 8, we note that an optimal controller for the original POMDP can be found by considering the associated sufficient information state MDP. An optimal controller can be found using dynamic programming [48, 20] over the state space $\mathcal{Q}$ generated by $\xi_t$. This holds for the finite horizon, infinite horizon average cost, and infinite horizon discounted cost models. In the next subsection we show that the previously known results on single systems with delayed state [8] can be obtained as a special case of our main result.
3.2.1 Single System with Delayed State Observations and
Action Delays
We consider control of a single system with a delayed state measurement. This is precisely the information pattern considered in [8], and we show that in this case the above results imply those of [8]. We have dynamics
$$x_{t+1} = f(x_t, u_t, w_t),$$
where, since the system is composed of exactly one subsystem, we have $x^1_t = x_t$. The controller must choose $u_t$ at time $t$, when it has access to $u_0, \dots, u_{t-1}$ and $x_0, x_1, \dots, x_{t-N_1}$. Then, from Definition 11 we have
$$d_1 = N_1 \quad \text{and} \quad b_1 = 0,$$
and so the optimal control action $u_t$ is a memoryless function of $u_{t-N_1}, \dots, u_{t-1}$ and $x_{t-N_1}$. Thus, the optimal controller applied at time $t$ is a function of the last observed state and the previous $N_1$ actions, which is exactly the result of [8].
A single system with both observation and action delays was analyzed in [39]. Consider a single system with both a delayed state measurement of $N_1$ steps and a delay of $P_1$ steps in the control action. From Theorem 16, we know that the control action at time $t$ is a function of $u_{t-N_1-P_1}, \dots, u_{t-1}$ and the state $x_{t-N_1}$, which is exactly the result obtained in [39].
Figure 3.2: A network of two interconnected subsystems with delays. Here the control input is only applied to subsystem 1.
3.3 Numerical Examples
In this section we consider two numerical examples where we compute the optimal
controller for networked Markov decision processes. In the first example, we study
linear scalar systems with delays. For a special class of such systems, one can compute
controllers using an approach based on the Youla parametrization, in combination
with convex optimization, as in [24]. We observe that for a certain class of systems, the
optimal controller has exactly the same amount of past history as given in Theorem 12.
This shows that the bands computed in the main theorem are tight in the sense that there
are systems where using any less information would yield sub-optimal
controllers. As a second example, we study controller design for two interacting
queues. Using the knowledge of the bands as computed from Theorem 12, we use
dynamic programming to explicitly compute the optimal controller. The knowledge
of the bands allows us to greatly simplify the computation of the optimal controller.
3.3.1 Linear Systems with Delays
As a first example, we compute an optimal controller for the special case of a linear
scalar system with delays. For simplicity, we consider a two system case as shown
in Figure 3.2. Note that the control action is only applied to subsystem 1. For this
system, the controller is only required to store $b_i + 1$ values of the state of system $i$ and $d_1$ values of the past inputs, where
$$b_1 = \max\{0,\ N_2 + M_{12} - N_1\}, \quad b_2 = \max\{0,\ N_1 + M_{21} - N_2\}, \quad d_1 = \max\{N_1,\ N_2 - M_{21} - 1\}. \qquad (3.15)$$
The system dynamics are given by
$$x^1_{t+1} = f^1(x^1_t,\ u_t,\ x^2_{t-M_{21}},\ w^1_t), \qquad x^2_{t+1} = f^2(x^2_t,\ x^1_{t-M_{12}},\ w^2_t). \qquad (3.16)$$
The information available to the controller at time $t$ is given as
$$y_t = \big(a_{0:t-1},\ z^1_{0:t-N_1},\ z^2_{0:t-N_2}\big).$$
The system under consideration has a continuous state space, and the results pre-
sented above may be extended to this scenario under appropriate technical assump-
tions on the probability measures. Specifically, we consider system dynamics which
are a special case of those in equation (3.16), given by
$$x^1_{t+1} = x^1_t + 0.25\, x^2_{t-2} + u_t + w^1_t, \qquad x^2_{t+1} = 0.25\, x^1_{t-2} + x^2_t + w^2_t.$$
The noise processes $w^1_t$ and $w^2_t$ are zero mean, unit variance white Gaussian noise processes. The initial states $x^1_0$ and $x^2_0$ are independent of each other and are normally distributed with variance $10^{-5}$. The objective is to minimize the cost
$$J = \mathbb{E}\left(\left(\sum_{t=0}^{T-1} \big(\|x_t\|^2 + \|u_t\|^2\big)\right) + \|x_T\|^2\right),$$
which is a standard quadratic cost. We will use a time horizon of T = 10. The
propagation and measurement delays are

    M_{12} = 2,  M_{21} = 2,  and  N_1 = 0,  N_2 = 1,

so that the controller receives the observations from subsystem 2 after a single time-
step delay. For this system, equation (3.15) gives the memory requirements of the
optimal controller as

    b_1 = 3,  b_2 = 1  and  d_1 = 0.
Therefore at each time t the optimal input u_t is given by a memoryless function of
y^mem_t, that is, of the data x^1_{t-3}, x^1_{t-2}, x^1_{t-1}, x^1_t, x^2_{t-2}, x^2_{t-1}.
To compute the optimal controller for this problem, we use an approach based on
the Youla parametrization, in combination with convex optimization, as in [24]. A
similar approach is used in [49] to compute optimal decentralized controllers. The
optimal controller for this problem is

    (u_0, u_1, ..., u_9)^T = -F (x^1_0, x^1_1, ..., x^1_9)^T - G (x^2_0, x^2_1, ..., x^2_9)^T,
where F and G are 10 × 10 lower-triangular matrices, each scaled by 1/10, whose
nonzero entries form narrow bands at and below the diagonal. [Numeric entries of
F and G not reproduced.]
Hence we have

    μ(y^mem_t) = - Σ_{s=0}^{T-1} F_{ts} x^1_s - Σ_{s=0}^{T-1} G_{ts} x^2_s.

It is apparent from the above matrices that the control input at time t depends only on
the past history of x according to the memory limits b_1, b_2 and d_1, as in Equations (3.1)
and (3.2).
Figure 3.3: A system of two interacting queues. Here the solid line represents jobs of
type R which enter system 1, and are then transported to system 2 after a delay of
M_{12}. Similarly, the dashed line represents jobs of type B which enter system 2 and
are transported to system 1 after a delay of M_{21}. Of the two queues at each system,
the top queue is the high-priority queue.
3.3.2 Controller Design for Finite State Systems
We consider a network of two interconnected queues as shown in Figure 3.3. Our
example is inspired by the model of interacting queues studied in [30]; however our
objective here is to illustrate the computation of an optimal controller for such sys-
tems. As opposed to previous works, we introduce delays between queues as well as
delays in receiving queue state information at a centralized controller. We assume,
however, that any control inputs take effect immediately.
Informally, the system description is as follows. Jobs of type R arrive at system 1
while jobs of type B arrive at system 2. The arrival process at each system is in-
dependent and identically distributed over time. Furthermore, we assume that the
arrival processes at the two systems are independent of each other. At each system,
the server maintains two kinds of queues, the high priority queue and the low priority
queue. Jobs of type R are placed in the high priority queue at system 1, where they
are processed and are moved to system 2 after a delay of M12 time units. At system
2, these jobs are placed in the low priority queue. On the other hand, jobs of type B
enter the high priority queue at system 2 and after being processed at system 2, they
are moved to system 1 after a delay of M21 time units. At system 1, these jobs are
placed in the low priority queue. At each system, if a queue is full, incoming
jobs are dropped.
The server at each system has two modes of operation, a slow and a fast mode.
In the slow mode, in each time unit, the server serves one job from the high priority
queue (provided the queue is non-empty). In the fast mode, as long as the queues
are non-empty, the server serves one job from each of the high and the low priority
queues. After being processed, the high priority jobs are moved to the other system,
while the low priority jobs exit the system. A centralized controller receives delayed
information about the total number of jobs in each queue, and decides which mode
each server should operate in. At each time step, a cost depending on the number of
jobs in each queue and the mode of operation of each server is incurred.
To describe the above system mathematically, we let x^i_R(t) be the number of R
jobs in queue i at time t. Similarly, we let x^i_B(t) be the number of B jobs in queue i
at time t. Here both x^i_R(t) and x^i_B(t) are in the set {0, 1, 2, ..., Q}, where Q is the
queue length at each system. For simplicity, we assume that all the queues are of the
same length. The control action is u^i(t) ∈ {0, 1}, for i = 1, 2. Here u^i(t) = 0 represents
the slow mode of the server. The system dynamics are

    x^1_R(t + 1) = max{min{x^1_R(t) + w^1(t), Q} - 1, 0},
    x^1_B(t + 1) = max{min{x^1_B(t) + 1_{x^2_B(t-M_{21})>0}, Q} - u^1(t), 0},
    x^2_B(t + 1) = max{min{x^2_B(t) + w^2(t), Q} - 1, 0},
    x^2_R(t + 1) = max{min{x^2_R(t) + 1_{x^1_R(t-M_{12})>0}, Q} - u^2(t), 0},
where 1_{x>0} is the indicator function. Here w^i(t), i = 1, 2, are the number of jobs that
arrive at each system at time t. We let the state of each system be the number of R
and B jobs at each time, i.e., x^i(t) = (x^i_R(t), x^i_B(t)) for i ∈ {1, 2}. It is easy to
check that there exist functions f^1 and f^2 such that the state dynamics are given by

    x^1(t + 1) = f^1(x^1(t), x^2(t - M_{21}), u^1(t), w^1(t)),
    x^2(t + 1) = f^2(x^2(t), x^1(t - M_{12}), u^2(t), w^2(t)).
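As a sketch (the helper below and its argument names are ours, not from the dissertation), one step of the queue recursion above can be written directly from the max–min updates:

```python
def queue_step(x1R, x1B, x2B, x2R, u1, u2, w1, w2, x2B_delayed, x1R_delayed, Q):
    """One step of the two-queue dynamics (hypothetical helper).

    x2B_delayed = x^2_B(t - M21) and x1R_delayed = x^1_R(t - M12) are the
    delayed states seen over the network; u1, u2 in {0, 1} are server modes.
    """
    nx1R = max(min(x1R + w1, Q) - 1, 0)   # high priority: one job served each step
    nx1B = max(min(x1B + (1 if x2B_delayed > 0 else 0), Q) - u1, 0)  # served in fast mode only
    nx2B = max(min(x2B + w2, Q) - 1, 0)
    nx2R = max(min(x2R + (1 if x1R_delayed > 0 else 0), Q) - u2, 0)
    return nx1R, nx1B, nx2B, nx2R

# Q = 1, both servers in slow mode; a processed B job arrives from system 2:
print(queue_step(0, 0, 0, 0, u1=0, u2=0, w1=0, w2=0,
                 x2B_delayed=1, x1R_delayed=0, Q=1))   # (0, 1, 0, 0)
```

In the example the arriving B job lands in system 1's low-priority queue and waits there, since the slow mode (u^1 = 0) serves only the high-priority queue.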
Let g_s(x^1(t), x^2(t)) be the cost associated with the state and g_a(u^1(t), u^2(t)) be
the cost associated with the actions. We assume that the state cost is

    g_s(x^1(t), x^2(t)) = (x^1_R(t) + x^1_B(t) + x^2_R(t) + x^2_B(t))^2.

The action cost is

    g_a(u^1(t), u^2(t)) = (u^1(t) + 1 + u^2(t) + 1)^2,

where we assume that for u^i(t) = 0, the cost incurred is 1 unit. The total cost at
time t is thus

    g(x^1(t), x^2(t), u^1(t), u^2(t)) = (1 - α) g_s(x^1(t), x^2(t)) + α g_a(u^1(t), u^2(t)),
where α is the weighting factor. The objective is to minimize the infinite horizon
discounted cost

    J = E( Σ_{t=0}^{∞} β^t g(x^1(t), x^2(t), u^1(t), u^2(t)) )
      = E( Σ_{t=0}^{∞} β^t ((1 - α) g_s + α g_a) )
      = (1 - α) J_s + α J_a,

where β is the discount factor. Here J_s and J_a are the infinite horizon discounted
costs associated with the state and action.
For purposes of numerical computation, we let Q = 1 in our specific example. The
state space of each system is {(0, 0), (0, 1), (1, 0), (1, 1)}, where the first element in the
tuple represents the number of R jobs. The arrival process at both systems is assumed
to be Bernoulli, with the probability of arrival at the first system given by
Prob(w^1(t) = 1) = 0.1 and the probability of arrival at the second system given by
Prob(w^2(t) = 1) = 0.3. The inter-subsystem propagation delays and observation delays
are chosen to be

    M_{12} = 2,  M_{21} = 1,  and  N_1 = 2,  N_2 = 1.
Figure 3.4: Infinite horizon discounted action cost J_a (averaged over all initial states)
vs. the infinite horizon discounted state cost J_s. The curve is plotted by varying the
weighting factor α.
Using equations (3.1) and (3.2), we find

    b_1 = 1,  b_2 = 2  and  d_1 = 2,  d_2 = 1.

For the discount factor β = 0.75, Figure 3.4 shows the tradeoff curve for J_a vs. J_s.
This curve shows the tradeoff between the action cost and the state cost for different
values of the weighting factor α.
This section illustrates that the knowledge of the bands simplifies the computation
of the optimal controller. Without the knowledge of these bands, one would compute
the optimal controller by treating this networked MDP as a POMDP and using dy-
namic programming over the belief state. The knowledge of these bands allows us to
write this networked MDP as a fully observed MDP over the sufficient information
state and greatly simplifies the computation of the optimal controller.
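Once the networked MDP is written as a fully observed MDP over the sufficient information state, standard discounted value iteration applies. The sketch below is a generic solver, not the dissertation's computation; assembling P and g over the enumerated memory tuples i_t^mem is assumed to have been done elsewhere.

```python
import numpy as np

def value_iteration(P, g, beta, tol=1e-9):
    """Discounted value iteration over an enumerated finite state space.

    P[a] is an S x S transition matrix and g[a] an S-vector of one-step
    costs for joint action a; for a networked MDP these would be assembled
    over the sufficient information state rather than the full belief space.
    Returns the value function and a greedy policy (action index per state).
    """
    S = next(iter(P.values())).shape[0]
    V = np.zeros(S)
    while True:
        Q = np.array([g[a] + beta * P[a] @ V for a in sorted(P)])
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)
        V = V_new
```

The point of the bands is precisely that S here is the (finite) number of distinct memory tuples, so this iteration replaces dynamic programming over the continuum of belief states.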
As shown in Theorem 12, the sufficient information state for networked MDPs
depends only on a finite past history of the observations. This finite history, or the bands,
depends only on the network structure and the associated delays. In the next chapter,
we look at a special case of networked MDPs over a finite time horizon. Based on the
ideas from Bayesian networks, we provide an alternate proof of Theorem 12. This
alternate proof provides an intuitive explanation for the bands given in Definition 11.
In particular, it shows that the finiteness of the bands occurs because given the finite
history of states and actions, the current state of the system is independent of the
remaining states and actions.
Chapter 4
A Bayesian Network Approach to
Network MDPs
In this chapter, we restrict our attention to networked MDPs over a finite time hori-
zon. For such a special class of networked MDPs, we provide an alternate proof of
Theorem 12 based on the ideas from Bayesian networks. We show that the finite
history of states and actions that was obtained in the previous chapter is exactly
the same as the information required to estimate the current state of the system.
This, along with the separation principle, provides an alternate proof and additional
insights into the finite memory of the controllers for networked MDPs. It shows that
the finiteness of the bands occurs because given the finite history of states and actions,
the current state of the system is independent of the remaining states and actions.
We begin by describing concepts from Bayesian networks.
4.1 Bayesian Networks
A Bayesian network [37], N_b = (G_b, P_b), consists of
• A directed acyclic graph Gb = (Vb, Eb), and
• A set of conditional probability distributions Pb.
Here the subscript b stands for Bayesian and is used to distinguish the Bayesian
network graph from the networked MDP graph G as defined in the previous section.
Associated with each vertex v ∈ V_b of the graph G_b is a random variable X_v taking
values in a particular set. A directed edge e ∈ E_b between vertices describes the
conditional dependence between the random variables corresponding to the vertices.
If there is a directed edge from a vertex v1 to v2, we say that v2 is a child of v1
and that v1 is a parent of v2. The set of parent vertices of a vertex v is denoted by
parent(v).
The set of probability distributions P_b contains one distribution P(X_v | X_parent(v))
for every v ∈ V_b. The joint distribution of all the variables X_k, k = 1, ..., n, is given
as

    Prob(X_1, ..., X_n) = Π_{k=1}^{n} Prob(X_k | parents(X_k)).
An example of a Bayesian network is shown in Figure 4.1. Here the graph G_b consists
of vertices {A, B, C, D, E, F} and edges {A → C, B → C, C → D, C → E, D → F}.
The set of probabilities is given as

    P_b = {P(A), P(B), P(C|A,B), P(D|C), P(E|C), P(F|D)}.

Note that since the variables A and B have no parents, the probability set contains
their unconditional probabilities.
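The factorization above can be sketched numerically. Only the graph structure below comes from Figure 4.1; the conditional probability tables are made up for illustration.

```python
# Parent lists for the network of Figure 4.1; all variables are binary here.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"], "F": ["D"]}

def joint_prob(assign, cpt):
    """Prob(X_1, ..., X_n) as the product over k of Prob(X_k | parents(X_k)).

    cpt[v] maps the tuple (value of v, values of v's parents) to a probability.
    """
    p = 1.0
    for v, pa in parents.items():
        p *= cpt[v][(assign[v],) + tuple(assign[u] for u in pa)]
    return p
```

For instance, if every conditional entry is 0.5, any full assignment of the six binary variables has joint probability 0.5^6.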
Figure 4.1: A Bayesian network with 6 variables.
d-Separation. As mentioned before, the graph G_b encodes the conditional de-
pendencies between the variables. Conditional independence between variables is de-
termined by the property of d-separation. If two variables X and Y are d-separated
in the graph by a third variable Z, then the variables X and Y are conditionally
independent given the variable Z.

Definition 17. A path π in the graph G_b = (V_b, E_b) is said to be d-separated by a set
of nodes Z ⊆ V_b if and only if one of the following holds:

• π contains a chain i → z → j such that i, j ∈ π and z ∈ Z,

• π contains a fork i ← z → j such that i, j ∈ π and z ∈ Z, or

• π contains an inverted fork (or a collider) i → z ← j such that i, j ∈ π and
neither z nor any of its descendants are in Z.
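A common way to test d-separation algorithmically is the classical ancestral-graph reduction (a standard construction, not taken from the dissertation): restrict the DAG to the ancestors of the three sets, moralize it, drop edge directions, and test plain graph separation.

```python
def d_separated(edges, X, Y, Z):
    """Test whether disjoint node sets X and Y are d-separated by Z in a DAG.

    Uses the ancestral-graph reduction: keep only ancestors of X, Y and Z,
    marry the parents of each retained node, drop directions, and check that
    every undirected path from X to Y passes through Z.
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        parents.setdefault(u, set())
    # Ancestral closure of X, Y and Z.
    keep, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))
    # Moralize and drop directions.
    adj = {v: set() for v in keep}
    for u, v in edges:
        if u in keep and v in keep:
            adj[u].add(v)
            adj[v].add(u)
    for v in keep:
        ps = sorted(parents.get(v, ()))
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j])
                adj[ps[j]].add(ps[i])
    # Search for an X-to-Y path that avoids Z.
    seen, stack = set(), list(X)
    while stack:
        v = stack.pop()
        if v in Y:
            return False
        if v not in seen and v not in Z:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return True

# Network of Figure 4.1: D and E are d-separated by C, while conditioning on
# the collider C connects A and B.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "F")]
print(d_separated(edges, {"D"}, {"E"}, {"C"}))   # True
print(d_separated(edges, {"A"}, {"B"}, set()))   # True
print(d_separated(edges, {"A"}, {"B"}, {"C"}))   # False
```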
The concept of d-separation is closely tied to that of a Markov blanket. Before
we define the Markov blanket, we introduce some notation.
Remark: Consider a set of variables X = {X_1, ..., X_n}. Denote by P(X) the
set consisting of all parents of variables in the set X, not including the variables
themselves. Similarly, we denote by CH(X) (and PCH(X)) the set consisting of
all children (parents of children) of variables in the set X, not including the variables
themselves.
Definition 18 (Markov Blanket). The Markov blanket of a set of variables X =
{X_1, ..., X_n} (denoted by MB(X)) is given as

    MB(X) = P(X) ∪ CH(X) ∪ PCH(X).        (4.1)
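Equation (4.1) translates directly into a few lines. The helper below is ours, not from the dissertation; it computes MB(X) for the example network of Figure 4.1.

```python
def markov_blanket(edges, X):
    """MB(X) = P(X) ∪ CH(X) ∪ PCH(X), as in equation (4.1), for a DAG given
    as a list of directed edges; the variables of X themselves are excluded."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
    mb = set()
    for x in X:
        mb |= parents.get(x, set())          # P(X)
        for c in children.get(x, set()):     # CH(X)
            mb.add(c)
            mb |= parents.get(c, set())      # PCH(X)
    return mb - set(X)

# Network of Figure 4.1:
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "F")]
print(sorted(markov_blanket(edges, {"C"})))   # ['A', 'B', 'D', 'E']
print(sorted(markov_blanket(edges, {"D"})))   # ['C', 'F']
```

For the node D, for example, the blanket is its parent C together with its child F, since F has no other parents.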
The following theorem (see [37] for the proof) states that the variables in the set
X are independent of the rest of the graph given its Markov blanket.
Theorem 19. Given a finite Bayesian network and two distinct variables X and
Y ∉ MB(X), we have

    Prob(X | MB(X), Y) = Prob(X | MB(X)).
The Markov blanket of the set of variables shields the variables from the rest of
the graph. Thus, the Markov blanket is the only knowledge required to predict the
value of the variables. Furthermore, if all the variables in a Markov blanket of X are
known, then X is d-separated from the rest of the graph [37].
4.2 Networked MDPs as Bayesian Networks
In this section, we model networked Markov decision processes as Bayesian networks
in a natural way. Consider a networked MDP given by a graph G = (V, E), where
we let V = {1, ..., n}. As before, for each i ∈ V we have x^i_t ∈ X^i. For the remainder
of this chapter, we consider the evolution of the networked MDP over a finite
horizon T. Associated with this networked MDP, we can construct a finite Bayesian
network N_b = (G_b, P_b). The vertex set V_b is given as

    V_b = {v^state_{i,t} | i ∈ V, t = 0, 1, ..., T} ∪ {v^action_{i,t} | i ∈ V, t = 0, 1, ..., T - 1}.
Associated with a vertex v^state_{i,t} is the random variable x^i_t, taking values in the finite
set X^i, that corresponds to the state of subsystem i at time t. Similarly, associated
with a vertex v^action_{i,t} is the random variable u^i_t, taking values in the finite set U^i, that
corresponds to the control action applied to subsystem i at time t. The edge set E_b
consists of the following edges:

    E_b = {v^state_{i,t} → v^state_{i,t+1},  v^state_{j,t-M_{ji}} → v^state_{i,t+1},  v^action_{i,t} → v^state_{i,t+1},
           v^state_{i,0:t-N_i} → v^action_{k,t},  v^action_{i,0:t-1} → v^action_{k,t} | j ∈ I^i, i, k ∈ V, t ∈ N}.
Here v^state_{i,0:t-N_i} → v^action_{k,t} is interpreted as a directed edge v^state_{i,τ} → v^action_{k,t} for
every τ = 0, ..., t - N_i. An edge v^state_{j,t-M_{ji}} → v^state_{i,t+1} means that the random variable
x^j_{t-M_{ji}} affects the random variable x^i_{t+1}. Similar interpretations exist for other edges
in the edge set E_b. The set of conditional probability densities P_b consists of all the
transition probabilities, that is,

    P_b = {A^i_t | i ∈ V, t = 0, ..., T} ∪ {K_t | t = 0, ..., T - 1}.
For a finite time horizon T, let S_T be the set of random variables given as

    S_T = {x^i_t | i ∈ V, t = 0, 1, ..., T} ∪ {u^i_t | i ∈ V, t = 0, 1, ..., T - 1}.

The joint probability density function of all the variables in the set S_T can then be
written as

    Prob(S_T) = A^1_{0:T} A^2_{0:T} ... A^n_{0:T} K_{0:T-1}.
Figure 4.2: A network of two interconnected subsystems with delays. Subsystem i is
denoted by S_i, the network propagation delay from S_i to S_j is denoted by M_{ij}, and
the measurement delay from S_i to the controller is denoted N_i.
As an example, consider the networked system of Figure 4.2. The system dynamics
equations are given as

    x^1_{t+1} = f^1(x^1_t, x^2_{t-M_{21}}, u^1_t, w^1_t),
    x^2_{t+1} = f^2(x^2_t, x^1_{t-M_{12}}, u^2_t, w^2_t).        (4.2)
For the purpose of this example, we choose M_{12} = 2 and M_{21} = 1. Thus, the transition
probability matrices are given as

    A^1_t(z^1_t, z^1_{t-1}, z^2_{t-2}, a^1_{t-1})
        = Prob(x^1_t = z^1_t | x^1_{t-1} = z^1_{t-1}, x^2_{t-2} = z^2_{t-2}, u^1_{t-1} = a^1_{t-1}),        (4.3)

and

    A^2_t(z^2_t, z^2_{t-1}, z^1_{t-3}, a^2_{t-1})
        = Prob(x^2_t = z^2_t | x^2_{t-1} = z^2_{t-1}, x^1_{t-3} = z^1_{t-3}, u^2_{t-1} = a^2_{t-1}).        (4.4)
Associated with this networked control system is a Bayesian network as shown
in Figure 4.3. The directed acyclic graph G_b consists of a vertex for each state of
the two systems and for the two control actions applied at each time t. A directed edge
between two vertices v_1 and v_2 exists if the variable corresponding to vertex v_1 affects
the variable corresponding to vertex v_2. For example, a directed edge exists between the
vertex corresponding to x^2_{t-2} and the vertex corresponding to x^1_t. Similarly, a directed
edge exists between the vertex corresponding to control action u^2_{t-1} and the vertex
corresponding to x^2_t. The set of probability distributions P_b consists of the transition
probabilities A^1_t, A^2_t and K_t for all t ≥ 0.
4.3 Alternate Proof of the Information State for
Networked MDPs
In this section, we provide an alternate proof of the finiteness of the information state
for networked MDPs. We start by making the following definition.
Definition 20. Define

    h^mem_t = (x^1_{t-N_1-b_1:t-N_1}, u^1_{t-d_1:t-1}, ..., x^n_{t-N_n-b_n:t-N_n}, u^n_{t-d_n:t-1})        (4.5)

to be the finite history of observations at time t, and denote

    i^mem_t = (z^1_{t-N_1-b_1:t-N_1}, a^1_{t-d_1:t-1}, ..., z^n_{t-N_n-b_n:t-N_n}, a^n_{t-d_n:t-1})

to be a realization of h^mem_t. Further define the set H^mem_t as

    H^mem_t = Π_{i=1}^{n} (X^i)^{b_i+1} × Π_{i=1}^{n} (U^i)^{d_i}.
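To make the indexing in equation (4.5) concrete, a small sketch (names ours, not from the dissertation) can enumerate which time indices of each subsystem's states and actions enter h_t^mem:

```python
def memory_indices(t, N, b, d):
    """Time indices entering h_t^mem of equation (4.5).

    N, b, d map subsystem i to its observation delay N_i and band widths
    b_i, d_i; states x^i are kept for times t-N_i-b_i through t-N_i, and
    actions u^i for times t-d_i through t-1.
    """
    return {i: {"states": list(range(t - N[i] - b[i], t - N[i] + 1)),
                "actions": list(range(t - d[i], t))}
            for i in N}

# Queue example of Section 3.3.2 at t = 10 (N = (2, 1), b = (1, 2), d = (2, 1)):
print(memory_indices(10, {1: 2, 2: 1}, {1: 1, 2: 2}, {1: 2, 2: 1}))
```

For subsystem 1 this yields states at times 7–8 and actions at times 8–9; for subsystem 2, states at times 7–9 and the single action at time 9.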
Figure 4.3: The Bayesian network associated with the 2-subsystem networked MDP
of Figure 4.2. Here the circles represent the states of the two subsystems and the
squares represent the control inputs. For this Bayesian network, we chose M_{21} = 1
and M_{12} = 2. The edges from state variables to control inputs have been omitted for
visual clarity.
From the separation principle [12], we know that the optimal control action is a
function of the belief state. We define the set of belief states at time t as follows.

Definition 21. Let M_t be the set defined as

    M_t = {Λ_t : X^(n) × H_t → [0, 1] | Λ_t(z_t, i_t) ≥ 0, Σ_{z_t} Λ_t(z_t, i_t) = 1},

where we denote by X^(n) = Π_{i=1}^{n} X^i the Cartesian product of the state spaces
corresponding to all vertices.

Here, Λ_t(z_t, i_t) is interpreted as the conditional probability density of the current
state of the system given the entire observation history at time t. That is,

    Λ_t(z_t, i_t) = Prob(x_t = z_t | h_t = i_t).
Let F_t : H_t → M_t be an operator that maps the entire observation history at
time t to an element in M_t. That is, the operator F_t maps the observation history
to a belief state. Furthermore, let T_t : M_t → A be the operator that maps the
belief state to a control action. From the separation principle [12], we know that the
optimal control K*_t, as a function of the observation history i_t, is given as

    K*_t = T_t ∘ F_t.

That is, K*_t(a_t, i_t) = T_t(a_t, Λ_t(·, i_t)).
To prove the main theorem, we show that for networked MDPs, there exists an
optimal controller that depends only on i^mem_t. Let P : H_t → H^mem_t be the projection
operator that projects the entire observation history to a truncated history as defined
in equation (4.5). The following theorem shows that there exists an operator
F^mem_t : H^mem_t → M_t such that

    F_t = F^mem_t ∘ P.

Theorem 22. For a networked Markov decision process, there exist Λ*_0, ..., Λ*_T such
that

    Λ_t(z_t, i_t) = Λ*_t(z_t, i^mem_t)  ∀ t = 0, 1, ..., T.        (4.6)

Thus, there exists an optimal controller K*_0, ..., K*_{T-1} such that

    K*_t(a_t, i_t) = T_t(a_t, Λ*_t(·, i^mem_t))
                   = K_t(a_t, i^mem_t)  ∀ t = 0, 1, ..., T - 1.        (4.7)

Thus, the b_i's are bounds on the length of the observation history that an optimal
estimator needs to maintain beyond its current observation.
Before we present the proof of Theorem 22, we first prove a key lemma.

Lemma 23. Suppose there exist optimal K*_j, j = t + 1, ..., T - 1, such that

    K*_j(a_j, i_j) = K_j(a_j, i^mem_j)

for all a_j. Then

    K*_t(a_t, i_t) = K_t(a_t, i^mem_t)

for all a_t.
Proof. From the separation principle [12], we know that

    K*_t(a_t, i_t) = T_t(a_t, Λ_t(·, i_t)).

Thus, to prove the lemma it suffices to show that Λ_t(z_t, i_t) = Λ*_t(z_t, i^mem_t). At time
t, the controller knows i_t = {z^i_{0:t-N_i}, a^i_{0:t-1} | i ∈ V}. Let

    S^u_t = (x^1_{t-N_1+1:t}, ..., x^n_{t-N_n+1:t})

be the states that are unknown at the controller at time t. Here the superscript u is
used to indicate that these states are unknown to the controller at time t. Note that
states of subsystem i are part of S^u_t if and only if N_i ≥ 1. This is because if N_i = 0,
then the current state of subsystem i is known to the controller. Let

    Z^u_t = (z^1_{t-N_1+1:t}, ..., z^n_{t-N_n+1:t})

be a realization of S^u_t. Let L_t(Z^u_t, i_t) be the joint conditional probability of the
variables in the set S^u_t given i_t. That is,

    L_t(Z^u_t, i_t) = Prob(S^u_t = Z^u_t | h_t = i_t).
Define

    L*_t(Z^u_t, i^mem_t) = Prob(S^u_t = Z^u_t | h^mem_t = i^mem_t).

If we can show that there exists L*_t such that

    L_t(Z^u_t, i_t) = L*_t(Z^u_t, i^mem_t),        (4.8)

then it follows that

    Λ_t(z_t, i_t) = Σ_{z^i_{t-N_i+1:t-1} | i ∈ V} L_t(Z^u_t, i_t)
                  = Σ_{z^i_{t-N_i+1:t-1} | i ∈ V} L*_t(Z^u_t, i^mem_t)
                  = Λ*_t(z_t, i^mem_t).        (4.9)

Thus, to prove the lemma it suffices to find an L*_t satisfying equation (4.8). To prove
the existence of an L*_t, we show that the Markov blanket of the set S^u_t consists of the
variables in i^mem_t. Theorem 19 then proves the existence of L*_t.
Note that S^u_t contains x^j_{t-τ_j} for τ_j = 0, 1, ..., N_j - 1 and j = 1, 2, ..., n. From
equation (4.1), we know that the Markov blanket of S^u_t consists of the parents, children,
and parents of children of the variables in the set S^u_t. We focus on a single variable
x^j_{t-τ_j} and find its parents, its children, and all the parents of its children.

To find the parents of x^j_{t-τ_j}, we look at the transition probability of this variable.
From equation (2.9), we note that x^j_{t-τ_j} depends on

    P(x^j_{t-τ_j}) = {x^j_{t-τ_j-1}, u^j_{t-τ_j-1}, x^s_{t-(τ_j+1+M_{sj})} | s ∈ I^j},        (4.10)

and hence these variables are the parents of x^j_{t-τ_j}.
To find the children of x^j_{t-τ_j}, consider the set O^j of outgoing vertices of subsystem
j and let p ∈ O^j. Consider A^p_{t-t'} and note that this transition probability contains
x^j_{t-t'-1-M_{jp}}. Thus, x^j_{t-τ_j} is a parent of x^p_{t-t'} for all p ∈ O^j if
t - t' - 1 - M_{jp} = t - τ_j, which gives t' = τ_j - 1 - M_{jp}.

Note that the children of x^j_{t-τ_j} also consist of all the control variables that depend
on x^j_{t-τ_j}. From the assumption in the lemma, we know that K*_{t+1:T-1} are only
a function of the finite past history of states given by i^mem. Thus, a directed edge
exists between x^j_{t-τ_j} and u_{t-t'} for all t' = τ_j - N_j - b_j : τ_j - N_j. Thus, the
children of x^j_{t-τ_j} consist of

    CH(x^j_{t-τ_j}) = {x^j_{t-τ_j+1}, x^p_{t-τ_j+M_{jp}+1} | p ∈ O^j}
                    ∪ {u^k_{t-τ_j+N_j : t-τ_j+N_j+b_j} | k ∈ V}.        (4.11)
To find the parents of children of x^j_{t-τ_j}, we find the parents of the variables given in
equation (4.11). From the transition probability equation (2.9), we note that the parents
of x^p_{t-τ_j+M_{jp}+1} include

    {x^p_{t-τ_j+M_{jp}}, u^p_{t-τ_j+M_{jp}}, x^r_{t-τ_j+M_{jp}-M_{rp}} | r ∈ I^p}.

To find the parents of {u^k_{t-τ_j+N_j : t-τ_j+N_j+b_j} | k ∈ V}, we note that from the
assumption in the lemma, these control inputs only depend on i^mem_t. Thus, the parents of
{u^k_{t-τ_j+N_j : t-τ_j+N_j+b_j} | k ∈ V} consist of

    {x^i_{t-τ_j+N_j-b_i-N_i : t-τ_j+N_j+b_j-N_i}, u^i_{t-τ_j+N_j-d_i : t-τ_j+N_j+b_j-1} | i ∈ V}.

Thus we have

    PCH(x^j_{t-τ_j}) = {x^s_{t-τ_j-M_{sj}}, u^j_{t-τ_j}, x^p_{t-τ_j+M_{jp}}, u^p_{t-τ_j+M_{jp}},
                        x^r_{t-τ_j+M_{jp}-M_{rp}} | s ∈ I^j, r ∈ I^p, p ∈ O^j}
                     ∪ {x^i_{t-τ_j+N_j-b_i-N_i : t-τ_j+N_j+b_j-N_i},
                        u^i_{t-τ_j+N_j-d_i : t-τ_j+N_j+b_j-1} | i ∈ V}.        (4.12)
Let us denote the set of parents, children, and parents of children of
x^j_{t-N_j+1:t} by M^j. From equations (4.10), (4.11), (4.12), we get that the set M^j
contains

    M^j = {x^j_{t-N_j:t+1}, x^s_{t-(N_j+M_{sj}):t-M_{sj}}, x^p_{t-(N_j-1-M_{jp}):t+M_{jp}+1},
           x^r_{t-(N_j-1-M_{jp}+M_{rp}):t-(M_{rp}-M_{jp})},
           x^i_{t-N_i-b_i+1:t-N_i+b_j+N_j} | s ∈ I^j, p ∈ O^j, r ∈ I^p, i ∈ V}
        ∪ {u^j_{t-N_j:t}, u^k_{t+1:t+N_j+b_j}, u^p_{t-(N_j-1-M_{jp}):t+M_{jp}},
           u^i_{t-(d_i-1):t+N_j+b_j-1} | p ∈ O^j, k, i ∈ V}.

Let us denote M = ∪_{j∈V} M^j. Note that u^k_{t-s_k} ∈ M if s_k ≥ N_k, or
s_k ≥ N_j - 1 - M_{jk} for all j ∈ I^k, or s_k ≥ d_k - 1. From Definition 11, this implies
that

    s_k = max{N_k, d_k - 1, N_j - M_{jk} - 1 | j ∈ I^k} = d_k.

Similarly, x^k_{t-q_k} ∈ S if and only if x^k_{t-q_k} ∈ M. This happens if one of the
following conditions holds.

1. q_k ≥ N_k.

2. q_k ≥ N_j + M_{kj} such that k ∈ I^j for some j ∈ V. This happens for all j ∈ O^k.

3. q_k ≥ N_j - 1 - M_{jk} such that k ∈ O^j for some j ∈ V. That is, if
q_k = N_j - 1 - M_{jk} for all j ∈ I^k.

4. For the last term, we need to find all j ∈ V such that for all p ∈ O^j, we
have k ∈ I^p. This happens for all j ∈ I^p, such that p ∈ O^k. Thus we have
q_k ≥ N_j - 1 - M_{jp} + M_{kp} for all p ∈ O^k and all j ∈ I^p.

5. q_k ≥ b_k + N_k - 1.

Thus, we get that

    q_k = max{N_k, N_s + M_{ks}, N_r - 1 - M_{rk},
              N_p - 1 - M_{ps} + M_{ks}, b_k + N_k - 1 | p ∈ I^s, s ∈ O^k, r ∈ I^k}.

Using the definitions of b_k and d_k, it is easy to verify that q_k = b_k + N_k. This proves
that the Markov blanket of the variables S^u_t consists only of i^mem_t. Thus, there exists
L*_t such that equation (4.8) is satisfied. The lemma then follows from equation (4.9).
Proof of Theorem 22. To prove the main theorem, we first show that at time T - 1,
the belief state is only a function of i^mem_{T-1}. To see this, note that at time T - 1, the
set of unknown states at the controller, S^u_T, has no children. Thus, using a simplified
version of the argument given in the proof of Lemma 23, it is easy to verify that there
exists Λ*_{T-1} such that

    Λ_{T-1}(z_{T-1}, i_{T-1}) = Λ*_{T-1}(z_{T-1}, i^mem_{T-1}).

Thus, there exists an optimal controller K*_{T-1} such that

    K*_{T-1}(a_{T-1}, i_{T-1}) = T_{T-1}(a_{T-1}, Λ*_{T-1}(·, i^mem_{T-1}))
                               = K_{T-1}(a_{T-1}, i^mem_{T-1}).

The proof of the theorem then follows from an inductive argument using Lemma 23.
In the previous chapters, we studied networked Markov decision processes with delays
between subsystems. We showed that for networked MDPs, a sufficient information
state is a function of a finite number of past system states and past controller
inputs. The number of past states as well as past inputs depends only on the un-
derlying graph structure of the networked Markov decision process as well as the
associated delays. We also gave explicit bounds on the number of past states and
inputs required to compute an optimal control action for networked MDPs with de-
lays. We also showed that this bound has interesting connections to the Markov blanket
in Bayesian networks. This allows us to look at complex networked systems from the
viewpoint of Bayesian networks and provides additional insights into how the delays
between subsystems affect the overall controller performance.
The results of the previous chapters allow us to look at complex interconnected sys-
tems that have a centralized controller or decision maker. In several systems of
interest, the presence of a centralized decision maker is infeasible. Even if one can
envision a centralized decision maker, it might be costly for every subsystem to trans-
mit its state to the decision maker. In the next chapter, we look at a stochastic game
model of complex interacting systems. In such models, each system makes optimal
decisions in a decentralized manner. We study a new notion of equilibrium in such
systems that allows us to compute decentralized policies or strategies for systems with
a large number of players.
Chapter 5
A Mean Field Approach to
Studying Large Systems
In several complex systems, a large number of agents interact with each other without
the presence of a centralized authority. Even in systems where a centralized authority
may be present, it might be costly for each player or subsystem to transmit its state
to the centralized authority. Imagine a wireless network where a large number of
devices are performing power control to maximize their capacity. Even if there is a
central base station, it is costly for each device to continuously update its channel
state or queue backlog for the base station to perform the power control. Thus, in such
scenarios, each agent (or player) interacts with the other agents in a decentralized
manner to achieve its own objectives. A natural framework to study such systems is
that of stochastic games. Stochastic games [51] have been used to study interactions
between players in stochastic dynamic environments. However, such games can
typically be solved only for a very small number of players, since the computational
complexity involved in finding optimal equilibrium policies is very large [25]. This
limits their application to models with small dimensions.
In Chapters 5 – 7, we study a mean field approach to understanding systems with a
large number of interacting players [38, 35, 56, 1, 2, 43, 26, 23]. Mean field theory
has been used in statistical physics to deal with the combinatorial complexity of large
interactions. The basic idea is to treat the other particles or agents as a single entity
with some average behavior. Applied to engineering problems, this greatly simplifies
decision making by a single agent: a single agent can make its decision based on
the average behavior of the other agents. The equilibrium consistency condition requires
that the average behavior of the agents arise from the individual trajectories. Just as
in statistical physics, the mean field behavior allows us to decouple the interactions
between agents and enables us to come up with simple decision making policies.
In this and subsequent chapters, we develop a unified framework to study the mean
field equilibrium behavior of large scale stochastic games. In particular, we prove
that under a set of simple assumptions on the model, a mean field equilibrium always
exists. Furthermore, as a simple consequence of this existence theorem, we show that
from the viewpoint of a single agent, a near optimal decision making policy is one that
reacts only to the average behavior of its environment. This result unifies previously
known results on mean field equilibria in large scale systems. In developing this
unified framework, we isolate and highlight the key modeling parameters which make
the mean field approach feasible. As a first step in studying the mean field approach,
we begin by defining our model for stochastic games.
5.1 Stochastic Game Model
In this section, we describe our stochastic game model. Compared to standard
stochastic games in the literature [51], in our model, every player has an individual
state. Players are coupled through their payoffs and state transitions. A stochastic
game has the following elements:
Time. The game is played in discrete time. We index time periods by t =
0, 1, 2, ....
Players. There are m players in the game; we use i to denote a particular player.
State. The state of player i at time t is denoted by x_{i,t} ∈ X, where X ⊆ Z^d is
a subset of the d-dimensional integer lattice. We use x_{-i,t} to denote the states of all
players except player i at time t.

Action. The action taken by player i at time t is denoted by a_{i,t} ∈ A, where
A ⊆ R^q is a subset of q-dimensional Euclidean space.
Transition Probabilities. The state of a player evolves in a Markov fashion. For-
mally, let h_t = {x_0, a_0, ..., x_{t-1}, a_{t-1}} denote the history up to time t. Conditional
on h_t, players' states at time t are independent of each other. Player i's state x_{i,t} at
time t depends on the past history h_t only through the state of player i at time t - 1,
x_{i,t-1}; the states of the other players at time t - 1, x_{-i,t-1}; and the action taken by
player i at time t - 1, a_{i,t-1}. We represent the distribution of the next state as a
transition kernel P, where:

    P(x'_i | x_i, a_i, x_{-i}) = Prob(x_{i,t+1} = x'_i | x_{i,t} = x_i, a_{i,t} = a_i, x_{-i,t} = x_{-i}).        (5.1)

Note that the evolution of players' states may be coupled: in general, the next state
of player i depends not only on the current state of player i, but also on the current
states of players other than i.
Payoff. In a given time period, if the state of player i is x_i, the state of the other
players is x_{-i}, and the action taken by player i is a_i, then the single period payoff
to player i is π(x_i, a_i, x_{-i}) ∈ R. Note that the players are coupled via their payoff
function, since the payoff to player i depends on the state of every other player.

Discount Factor. The players discount their future payoffs by a discount factor
0 < β < 1. Thus, player i's infinite horizon payoff is given by:

    Σ_{t=0}^{∞} β^t π(x_{i,t}, a_{i,t}, x_{-i,t}).
In the model described above, each player's payoff function and transition kernel
depend on the states of all players. In a variety of games, this coupling between
players is independent of the players' identities. The notion of anonymity captures
scenarios where the interaction between players is via aggregate information about
the state. Let f^{(m)}_{−i,t}(y) denote the fraction of players (excluding player i) whose
state is y at time t, i.e.:

f^{(m)}_{−i,t}(y) = (1/(m−1)) ∑_{j≠i} 1{x_{j,t} = y}, (5.2)
where 1{x_{j,t} = y} is the indicator that the state of player j at time t is y. We
refer to f^{(m)}_{−i,t} as the population state at time t (from player i's point of view).
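As a concrete illustration, the population state in (5.2) is simply the empirical distribution of the other players' states. The sketch below uses hypothetical scalar integer states; none of it is taken from the thesis.

```python
from collections import Counter

def population_state(states, i):
    """Empirical distribution f^(m)_{-i,t} over the states of all
    players except player i (equation (5.2))."""
    others = [x for j, x in enumerate(states) if j != i]
    counts = Counter(others)
    return {y: c / len(others) for y, c in counts.items()}

# Hypothetical example: m = 5 players with scalar integer states.
states = [0, 1, 1, 2, 1]
f = population_state(states, i=0)   # excludes player 0's state
```

Here `f[y]` is the fraction of the other m − 1 players whose state is y, so the values sum to one.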
Definition 24 (Anonymous Stochastic Game). A stochastic game is called an anony-
mous stochastic game if the payoff function π(x_{i,t}, a_{i,t}, x_{−i,t}) and transition kernel
P(x′_{i,t} | x_{i,t}, a_{i,t}, x_{−i,t}) depend on x_{−i,t} only through f^{(m)}_{−i,t}. In an abuse of notation,
we write π(x_{i,t}, a_{i,t}, f^{(m)}_{−i,t}) for the payoff to player i, and P(x′_{i,t} | x_{i,t}, a_{i,t}, f^{(m)}_{−i,t}) for the
transition kernel for player i.
For the remainder of these chapters, we focus our attention on anonymous stochastic
games. For ease of notation, we often drop the subscripts i and t when denoting a
generic transition kernel and a generic payoff function; i.e., we denote a generic
transition kernel by P(· | x, a, f) and a generic payoff function by π(x, a, f), where f
represents the population state of the players other than the player under consideration.
Our results require a topology on population states; we consider the topology in-
duced by the 1-p norm. Given p > 0, the 1-p norm of a function f : X → R is given
by:

‖f‖_{1-p} = ∑_{x∈X} ‖x‖_p^p |f(x)|,
where ‖x‖_p is the usual p-norm of a vector. When X is finite, ‖f‖_{1-p} induces
the same topology as the standard Euclidean norm. However, when X is infinite,
the 1-p-norm weights larger states higher than smaller states. In many applications,
other players at larger states have a greater impact on the payoff; in such settings,
continuity of the payoff in f in the 1-p-norm naturally controls for this effect.
Formally, let F be the set of all possible population states on X with finite 1-p
norm, i.e.:

F = { f : X → [0, 1] | f(x) ≥ 0, ∑_{x∈X} f(x) = 1, ‖f‖_{1-p} < ∞ }. (5.3)
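For intuition, the 1-p norm and the defining conditions of F in (5.3) can be sketched in a few lines. This is a hypothetical illustration with scalar integer states, so ‖x‖_p^p reduces to |x|^p:

```python
def norm_1p(f, p):
    """1-p norm of a population state f on scalar integer states:
    sum_x |x|**p * |f(x)|."""
    return sum(abs(x) ** p * abs(fx) for x, fx in f.items())

def in_F(f, p):
    """Check the defining conditions of F in (5.3)."""
    nonneg = all(fx >= 0 for fx in f.values())
    sums_to_one = abs(sum(f.values()) - 1.0) < 1e-9
    finite = norm_1p(f, p) < float("inf")
    return nonneg and sums_to_one and finite

f = {0: 0.5, 1: 0.25, 3: 0.25}
n = norm_1p(f, p=2)   # 0.5*0 + 0.25*1 + 0.25*9 = 2.5
```

Note how the weight |x|^p makes mass placed at the larger state 3 dominate the norm, matching the remark that larger states are weighted more heavily.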
In addition, we let F^{(m)} denote the set of all population states in F over m−1 players,
i.e.:

F^{(m)} = { f ∈ F : there exists x ∈ X^{m−1} with f(y) = (1/(m−1)) ∑_j 1{x_j = y} }.
5.2 Markov Perfect Equilibrium (MPE)
In studying stochastic games, attention is typically restricted to the smaller class of
Markov strategies, where the action of a player at each time is a function of
only the current state of every player [29]. In the context of anonymous stochastic
games, a Markov strategy depends on the current state of the player as well as the
current population state. Because a player using such a strategy tracks the evolution
of the other players, we refer to such strategies in our context as cognizant strategies.
Definition 25. Let M be the set of cognizant strategies available to a player. That
is,
M = { µ | µ : X × F → A }. (5.4)
Consider an m-player anonymous stochastic game. At every time t, player i
chooses an action a_{i,t} that depends on its current state and on the current population
state f^{(m)}_{−i,t} ∈ F^{(m)}. Letting µ_i ∈ M denote the cognizant strategy used by player i,
we have a_{i,t} = µ_i(x_{i,t}, f^{(m)}_{−i,t}). The next state of player i is randomly drawn according
to the kernel P:

x_{i,t+1} ∼ P(· | x_{i,t}, µ_i(x_{i,t}, f^{(m)}_{−i,t}), f^{(m)}_{−i,t}). (5.5)
We let µ denote the vector of strategies chosen by the players. We also let µ^{(m)}
denote the strategy vector in which every player has chosen the same strategy µ.
Let V^{(m)}(x, f | µ′, µ^{(m−1)}) be the expected net present value for a player with
initial state x and initial population state f ∈ F^{(m)}, given that the player
follows strategy µ′ and every other player follows strategy µ. In particular, we
have

V^{(m)}(x, f | µ′, µ^{(m−1)}) ≜ E[ ∑_{t=0}^{∞} β^t π(x_{i,t}, a_{i,t}, f^{(m)}_{−i,t}) | x_{i,0} = x, f^{(m)}_{−i,0} = f; µ_i = µ′, µ_{−i} = µ^{(m−1)} ]. (5.6)

Note that the state sequence x_{i,t} and the population state sequence f^{(m)}_{−i,t} evolve
according to the dynamics (5.5).
We focus our attention on symmetric Markov perfect equilibrium (MPE), in which
all players use the same cognizant strategy µ. In an abuse of notation, we write
V^{(m)}(x, f | µ^{(m)}) to refer to the expected discounted value given in equation (5.6)
when every player follows the same cognizant strategy µ.
Definition 26 (Markov Perfect Equilibrium). The vector of cognizant strategies
µ^{(m)} is a symmetric Markov perfect equilibrium (MPE) if for all initial states x ∈
X and population states f ∈ F^{(m)} we have

sup_{µ′∈M} V^{(m)}(x, f | µ′, µ^{(m−1)}) = V^{(m)}(x, f | µ^{(m)}).
Thus, a Markov perfect equilibrium is a profile of cognizant strategies that si-
multaneously maximize the expected discounted payoff for every player, given the
strategies of the other players. It is well known that computing a Markov perfect
equilibrium for a stochastic game is computationally challenging in general [25]. This
is because to find an optimal cognizant strategy, each player must track and fore-
cast the exact evolution of the entire population state. In certain scenarios, it may
be infeasible to exchange or learn this information at every step because of limited
communication capacity between players or limited cognitive ability. In the next
section, we describe a recently proposed scheme for approximating Markov perfect
equilibrium.
5.3 Mean Field Equilibrium (MFE)
In a game with a large number of players, we might expect that fluctuations of players’
states “average out” and hence the actual population state remains roughly constant
over time. Because the effect of other players on a single player’s payoff and transition
probabilities is only via the population state, it is intuitive that, as the number of
players increases, a single player has negligible effect on the outcome of the game.
Based on this intuition, a scheme for approximating MPE has been proposed via a
solution concept we call mean field equilibrium, or MFE [38, 35, 56, 1, 2, 43, 26, 23].
Mean field equilibrium is also referred to as "oblivious equilibrium" in [56] and as
"Nash certainty equivalence control" in [35].
In MFE, each player optimizes its payoff based on only the long-run average
population state. Thus, rather than keeping track of the exact population state, a single
player's immediate action depends only on its own current state. We call such players
oblivious, and refer to their strategies as oblivious strategies. Formally, we let MO
denote the set of (stationary, nonrandomized) oblivious strategies, defined as follows.
Definition 27. Let MO be the set of oblivious strategies available to a player. That
is,
MO = { µ | µ : X → A }. (5.7)
Given a strategy µ ∈ MO, an oblivious player i takes the action a_{i,t} = µ(x_{i,t}) at
time t; as before, the next state of the player is randomly distributed according to
the kernel P:

x_{i,t+1} ∼ P(· | x_{i,t}, µ(x_{i,t}), f). (5.8)

Note that because we are considering a mean field model, the player's state evolves
according to the transition kernel with the population state fixed at f.
We define the oblivious value function V(x | µ, f) to be the expected net present
value for an oblivious player with initial state x, when the long-run average popula-
tion state is f and the player uses the oblivious strategy µ. We have

V(x | µ, f) ≜ E[ ∑_{t=0}^{∞} β^t π(x_{i,t}, a_{i,t}, f) | x_{i,0} = x; µ ]. (5.9)

Note that the state sequence x_{i,t} is determined by the strategy µ according to the
dynamics (5.8).
We define the optimal oblivious value function V∗(x | f) as

V∗(x | f) = sup_{µ∈MO} V(x | µ, f).
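For a fixed conjectured population state f, computing V∗(x | f) is a single-agent dynamic programming problem, so standard value iteration applies. The following is a minimal sketch on a hypothetical finite model (three states, two actions, made-up payoff and kernel that depend on f only through its mean); it is an illustration, not the thesis's model.

```python
X = [0, 1, 2]
A = [0, 1]
beta = 0.9
f = {0: 0.2, 1: 0.5, 2: 0.3}                  # conjectured population state
f_mean = sum(x * fx for x, fx in f.items())   # = 1.1

def pi(x, a, f_mean):
    # Made-up payoff: high states pay more when the population mean is
    # high; action 1 costs 0.5.
    return x * f_mean - 0.5 * a

def P(x_next, x, a, f_mean):
    # Made-up deterministic kernel: action 1 moves the state up (capped).
    target = min(x + 1, 2) if a == 1 else x
    return 1.0 if x_next == target else 0.0

# Value iteration for the optimal oblivious value function V*(x | f).
V = {x: 0.0 for x in X}
for _ in range(500):
    V = {x: max(pi(x, a, f_mean)
                + beta * sum(P(y, x, a, f_mean) * V[y] for y in X)
                for a in A)
         for x in X}
```

With these made-up primitives the iteration converges to V(2) = 22, V(1) = 20.4, V(0) = 17.86, and the maximizing actions recover an optimal oblivious strategy, i.e. a member of the correspondence P(f) defined next.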
Given a population state f , an oblivious player computes an optimal strategy by
maximizing their oblivious value function. Note that because an oblivious player does
not track the evolution of the population state, under reasonable assumptions their
optimal strategy is only a function of their current state—i.e., it must be oblivious
even if optimizing over cognizant strategies. We capture this optimization step via
the correspondence P defined next.
Definition 28. The correspondence P : F → MO maps a distribution f ∈ F to
the set of optimal oblivious strategies for a player. That is, µ ∈ P(f) if and only if
V(x | µ, f) = V∗(x | f) for all x, where V is the oblivious value function given by
equation (5.9).
Note that P maps a distribution to a stationary, nonrandomized oblivious strategy.
This is typically without loss of generality, since in most models of interest there
always exists such an optimal strategy. We later establish under our assumptions
that P(f) is nonempty.
Now suppose that the population state is f , and all players are oblivious and play
using a stationary strategy µ. We expect that the long run population state should
in fact be an invariant distribution of the Markov process with transition kernel (5.8).
We capture this relationship via the correspondence D, defined next.
Definition 29. The correspondence D : MO × F → F maps an oblivious strategy
µ and a population state f to the set of invariant distributions D(µ, f) associated with
the dynamics (5.8).

Note that the image of the correspondence D is empty if the strategy does not
admit an invariant distribution. We later establish conditions under which D(µ, f) is
nonempty.
We can now define mean field equilibrium. If every agent conjectures that f is
the long run population state, then every agent would prefer to play an optimal
oblivious strategy µ. On the other hand, if every agent plays µ and the population
state is in fact f , then we should expect the long run population state of all players
to be an invariant distribution of (5.8). Mean field equilibrium requires a consistency
condition: the equilibrium population state f must in fact be an invariant distribution
of the dynamics (5.8) under the strategy µ and the same population state f .
Definition 30 (Mean Field Equilibrium). An oblivious strategy µ ∈ MO and a dis-
tribution f ∈ F constitute a mean field equilibrium if µ ∈ P(f) and f ∈ D(µ, f).
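The definition suggests a natural heuristic for computing an MFE: alternate between the best-response map P and the invariant-distribution map D until the population state stops changing. The self-contained sketch below does this on a toy two-state model with made-up payoff and kernel (all primitives are assumptions for illustration; the damped iteration carries no general convergence guarantee).

```python
beta, c = 0.9, 0.3
X, A = [0, 1], [0, 1]

def pi(x, a, f1):
    # Made-up payoff: state 1 pays f1 (the mass of others in state 1);
    # taking action 1 costs c.
    return x * f1 - c * a

def p_up(a):
    # Made-up kernel: next state is 1 w.p. 0.2 + 0.5*a, else 0.
    return 0.2 + 0.5 * a

def best_response(f1):
    """Approximate an optimal oblivious strategy in P(f) by value iteration."""
    V = {0: 0.0, 1: 0.0}
    for _ in range(300):
        V = {x: max(pi(x, a, f1)
                    + beta * (p_up(a) * V[1] + (1 - p_up(a)) * V[0])
                    for a in A)
             for x in X}
    return {x: max(A, key=lambda a: pi(x, a, f1)
                   + beta * (p_up(a) * V[1] + (1 - p_up(a)) * V[0]))
            for x in X}

def invariant(mu):
    """Stationary mass on state 1 under mu (the map D for this chain)."""
    q0, q1 = p_up(mu[0]), p_up(mu[1])
    return q0 / (q0 + 1.0 - q1)     # solves f1 = (1 - f1)*q0 + f1*q1

f1 = 0.5
for _ in range(100):                # damped fixed-point iteration on f
    f1 = 0.5 * f1 + 0.5 * invariant(best_response(f1))
mu = best_response(f1)
```

In this toy model the iteration settles at f1 = 0.2 with µ choosing action 0 in both states, and the consistency condition of Definition 30 holds: µ is a best response to f, and f is invariant under µ.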
The notion of mean field equilibrium provides a simple approach to understanding
behavior in large population stochastic dynamic games. However, this notion is not
very meaningful unless we can guarantee that a mean field equilibrium exists in a
wide variety of stochastic games. Even if a mean field equilibrium were to exist in
a particular game of interest, it is natural to wonder whether such an equilibrium
is a good approximation to Markov perfect equilibrium in games with finitely many
players. MFE is unlikely to be useful in practice without conditions that guarantee
it approximates equilibria in finite systems well. Below we address these two fun-
damental questions: the existence of MFE and whether it provides any meaningful
approximation to MPE.
As we shall show below, an important contribution of our thesis is to relate ap-
proximation to existence of MFE. The approximation theorem we provide requires
continuity assumptions on the model primitives; as we demonstrate later, these same
continuity conditions are required (together with convexity and compactness condi-
tions) to ensure an MFE actually exists. Thus we obtain the valuable insight that
approximation is essentially a corollary of existence. This is practically valuable: es-
tablishing that MFE is a good approximation is effectively a free byproduct, once the
conditions ensuring its existence have been verified.
We begin by studying the approximation result. We first define the appropriate
notion of approximation and show that under very mild assumptions a mean field
equilibrium (if it exists) approximates Markov perfect equilibrium as the number of
players in the game becomes large.
Chapter 6
MFE as an Approximation to MPE
As discussed in the previous chapter, a mean field equilibrium is of practical value only
if it approximates equilibria in finite systems well. In this chapter, we establish one of
our main results: under a parsimonious set of assumptions, a mean field equilibrium is
a good approximation to Markov perfect equilibrium as the number of players grows
large.
6.1 The Asymptotic Markov Equilibrium (AME)
Property
We begin by formalizing the approximation property of interest, referred to as the
asymptotic Markov equilibrium (AME) property. Intuitively, this property requires
that a mean field equilibrium strategy is approximately optimal even when compared
against Markov strategies, as the number of players grows large.
Definition 31 (Asymptotic Markov Equilibrium). A mean field equilibrium (µ, f)
possesses the asymptotic Markov equilibrium (AME) property if for all states x and
sequences of cognizant strategies µ_m ∈ M, we have:

lim sup_{m→∞} [ V^{(m)}(x, f^{(m)} | µ_m, µ^{(m−1)}) − V^{(m)}(x, f^{(m)} | µ^{(m)}) ] ≤ 0, (6.1)
almost surely, where the initial population state f (m) is derived by sampling each other
player’s initial state independently from the probability mass function f .
Note that V^{(m)}(x, f^{(m)} | µ′, µ^{(m−1)}) is the actual value function of a player as
defined in equation (5.6), when the player uses a cognizant strategy µ′ and every
other player plays the oblivious strategy µ. In particular, we have

V^{(m)}(x, f^{(m)} | µ_m, µ^{(m−1)}) ≜ E[ ∑_{t=0}^{∞} β^t π(x_{i,t}, a_{i,t}, f^{(m)}_{−i,t}) | x_{i,0} = x, f^{(m)}_{−i,0} = f^{(m)}; µ_i = µ_m, µ_{−i} = µ^{(m−1)} ],

where the state evolution of the players is given by:

x_{i,t+1} ∼ P(· | x_{i,t}, µ_m(x_{i,t}, f^{(m)}_{−i,t}), f^{(m)}_{−i,t}),
x_{j,t+1} ∼ P(· | x_{j,t}, µ(x_{j,t}), f^{(m)}_{−i,t}) for all j ≠ i.
Similarly, V^{(m)}(x, f^{(m)} | µ^{(m)}) is the actual value function of a player as defined
in equation (5.6) when every player plays the oblivious strategy µ. AME requires
that the error when using the MFE strategy approaches zero almost surely with
respect to the randomness in the initial population state. This definition can be
shown to be stronger than the definition considered by [56], where AME is defined
only in expectation with respect to randomness in the initial population state.1
We emphasize that the AME property is essentially a continuity property in the
population state f. Under reasonable assumptions, we show that the time-t population
state in the system with m players, f^{(m)}_{−i,t}, approaches f almost surely for all t as
m→∞. Therefore, informally, if the payoffs satisfy an appropriate continuity prop-
erty in f , we should expect the AME property to hold. This observation is significant,
because as noted above, continuity is also an essential prerequisite to existence. It is
for this reason that, under fairly general assumptions, the AME property is essentially
a corollary to existence.
1 Under our assumptions on the model, convergence in expectation can be established via an application of the bounded convergence theorem. In particular, by Lemma 41 it follows that |V^{(m)}(x, f | µ′, µ)| ≤ C(x, 0) < ∞ for all f, µ′, and µ.
Before proceeding, we require some additional notation. Without loss of generality,
we can view the state Markov process in terms of the increments from the current
state. Specifically, if the current state is x and action a is taken, we can write:

x_{i,t+1} = x_{i,t} + ξ_{i,t}, (6.2)

where ξ_{i,t} is a random increment distributed according to the probability mass
function Q(· | x, a, f), where

Q(z′ | x, a, f) = P(x + z′ | x, a, f).

Note that Q(z′ | x, a, f) is positive only for those z′ such that x + z′ ∈ X.
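In code, the increment kernel Q of (6.2) is simply the transition kernel re-indexed by the step z′ = x′ − x. A small sketch with a hypothetical scalar-state kernel (everything below is made up for illustration):

```python
def make_Q(P, X):
    """Increment kernel of (6.2): Q(z | x, a, f) = P(x + z | x, a, f),
    zero whenever x + z falls outside the state space X."""
    def Q(z, x, a, f):
        return P(x + z, x, a, f) if (x + z) in X else 0.0
    return Q

# Hypothetical kernel on X = {0, 1, 2}: a in [0, 1] is the probability
# of moving up one state; the top state 2 is absorbing.
X = {0, 1, 2}

def P(x_next, x, a, f):
    up = min(x + 1, 2)
    if up == x:                        # at the top state: stay put
        return 1.0 if x_next == x else 0.0
    if x_next == up:
        return a
    return (1.0 - a) if x_next == x else 0.0

Q = make_Q(P, X)
```

Here Q(1, 0, a, f) = a is the probability of an upward increment from state 0, and Q(1, 2, a, f) = 0 because 2 + 1 lies outside X, matching the remark above.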
We make the following assumptions over model primitives; these ensure the model
is appropriately “continuous” in the limit.
Assumption 1 (Continuity).

1. Compact action set. The set of feasible actions for a player, denoted by A, is compact.

2. Bounded increments. There exists M ≥ 0 such that, for all z with ‖z‖∞ > M,
Q(z | x, a, f) = 0 for all x, a, and f.
3. Payoff and kernel continuity. The payoff π(x, a, f) is jointly continuous in a ∈ A
and f ∈ F for fixed x ∈ X (where F is endowed with the 1-p norm), and the
kernel P(x′ | x, a, f) is jointly continuous in a ∈ A and f ∈ F for each x, x′ ∈ X
(where F is endowed with the 1-p norm).2
4. Growth rate bound. There exist constants K and n ∈ Z_+ such that

sup_{a∈A, f∈F} |π(x, a, f)| ≤ K(1 + ‖x‖∞)^n

for every x ∈ X, where ‖·‖∞ is the sup norm.
2 Here we view P(x′ | x, a, f) as a real-valued function of a and f, for fixed x, x′; note that since we have also assumed bounded increments, this notion of continuity is equivalent to assuming that P(· | x, a, f) is jointly continuous in a and f with respect to the topology of weak convergence on distributions over X.
The most consequential of these assumptions are that the model exhibits bounded
increments, and that the payoff growth rate can be bounded. These are not particu-
larly severe restrictions; for a wide range of economic models of interest, it is reason-
able to assume increments are bounded. Further, the polynomial growth rate bound
on the payoff is quite weak, and serves to exclude the possibility of strategies that
yield infinite expected discounted payoff.
Theorem 32 (AME). Let (µ, f) be a mean field equilibrium with f ∈ F, and suppose
Assumption 1 holds. Then the AME property holds for (µ, f).
The proof of the AME property exploits the fact that the 1-p norm of f must
be finite (since f ∈ F) to show that ‖f^{(m)}_{−i,t} − f‖_{1-p} → 0 almost surely as m → ∞;
i.e., the population state of the other players approaches f almost surely. Continuity of
the payoff π in f, together with the growth rate bounds in Assumption 1, then yields the
desired result. The proof of the AME property is provided in the appendix.
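The key step, almost-sure convergence of the empirical population state to f in the 1-p norm, can be illustrated numerically. The sketch below samples m − 1 initial states i.i.d. from a made-up f with finite 1-p norm and compares the 1-p distance for a small and a large population (a hypothetical demonstration, not part of the proof):

```python
import random

random.seed(0)

def dist_1p(g, f, p):
    # 1-p distance between two population states on scalar integer
    # states: sum_x |x|**p * |g(x) - f(x)|.
    keys = set(g) | set(f)
    return sum(abs(x) ** p * abs(g.get(x, 0.0) - f.get(x, 0.0))
               for x in keys)

f = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}     # hypothetical limit state
states, weights = zip(*f.items())

def empirical(m):
    # Empirical population state of m - 1 players sampled i.i.d. from f.
    draws = random.choices(states, weights=weights, k=m - 1)
    return {x: draws.count(x) / (m - 1) for x in set(draws)}

d_small = dist_1p(empirical(20), f, p=1)
d_large = dist_1p(empirical(200000), f, p=1)
```

With m = 200000 the distance is far smaller than with m = 20, consistent with the law-of-large-numbers behavior the proof relies on.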
In the next chapter we establish the existence of MFE. The existence result uses
the same continuity assumption (along with additional assumptions) as the AME
property. This shows that the approximation result is a corollary of the existence
result.
Chapter 7
Existence of Mean Field
Equilibrium
The notion of mean field equilibrium allows us to approximate Markov perfect equi-
librium in large stochastic dynamic games. This notion is vacuous unless we can
guarantee that mean field equilibrium exists in a wide variety of games. In this
chapter, we study the existence of mean field equilibria. From Definition 30, we
observe that (µ, f) is a mean field equilibrium if and only if f is a fixed point of
Φ(f) = D(P(f), f) and µ ∈ P(f). Thus our approach is to find conditions un-
der which the correspondence Φ has a fixed point; in particular, we aim to apply
Kakutani's fixed point theorem to Φ to find an MFE.
Kakutani’s fixed point theorem requires three essential pieces: (1) compactness of
the range of Φ; (2) convexity of both the domain of Φ, as well as Φ(f) for each f ; and
(3) appropriate continuity properties of the operator Φ. As emphasized in the last
chapter, a central technical observation is that the same continuity properties needed
to establish the AME property are essential to proving existence of a MFE.
We start with the following restatement of Kakutani’s theorem.
Theorem 33 (Kakutani). Suppose there exists a set FC ⊆ F such that:
1. FC is convex and compact (in the 1-p norm), with Φ(FC) ⊂ FC;
2. Φ(f) is convex and nonempty for every f ∈ FC; and
3. Φ has a closed graph on FC.
Then there exists a mean field equilibrium (µ, f) with f ∈ FC.
In the remainder of this section, we find exogenous conditions on model primitives
to ensure these requirements are met. We tackle them in reverse order. We first show
that under Assumption 1, Φ has a closed graph. Next, we study conditions under
which Φ(f) can be guaranteed to be convex. Finally, we provide conditions on model
primitives under which there exists a compact, convex set FC with Φ(FC) ⊂ FC .
The conditions we provide are mild, and yet also suffice to guarantee that Φ(f) is
nonempty.
7.1 Closed Graph
In this section we establish that exactly the same continuity assumptions embodied
in Assumption 1 also suffice to ensure that Φ has a closed graph. We begin with
the following lemma.
Lemma 34. For each f , P(f) is compact; further, the correspondence P is upper
hemicontinuous on F.
Proof. By Assumption 1, π(x, a, f) is jointly continuous in a and f. Lemma 42
establishes that the optimal oblivious value function V∗(x | f) is continuous in f,
and so, as in the proof of that lemma, it follows that for a fixed state x,

π(x, a, f) + β ∑_{x′} V∗(x′ | f) P(x′ | x, a, f)

is finite and jointly continuous in a and f. Define the set P_x(f) ⊂ A as the set of
actions that achieve the maximum on the right hand side of (A.3); this is nonempty
since A is compact (Assumption 1) and the right hand side is continuous in a. By
Berge's maximum theorem, for each x the correspondence P_x is upper hemicontinuous
with compact values [3].
By Lemma 42, µ ∈ P(f) if and only if µ(x) ∈ P_x(f) for each x. Note that we have
endowed the set of strategies with the topology of pointwise convergence. The range
space of P is an infinite product of the compact action space A (Assumption 1) over
the countable state space. Hence by Tychonoff's theorem [3], the range space of P is
compact. Further, since P_x is compact-valued, it follows that P is compact-valued.
Since P_x is compact-valued and upper hemicontinuous, the Closed Graph Theorem
ensures that P_x has a closed graph [3]. This in turn ensures that P has a closed graph;
again by the Closed Graph Theorem, we conclude that P is upper hemicontinuous.
Proposition 35. Suppose that Assumption 1 holds. Then Φ has a closed graph on
F; i.e., the set {(f, g) : g ∈ Φ(f)} ⊂ F × F is closed (where F is endowed with the 1-p
norm).
Proof. Suppose f_k → f in the 1-p norm, and that g_k → g in the 1-p norm, where
g_k ∈ Φ(f_k) for all k. We must show that g ∈ Φ(f). For each k, let µ_k ∈ P(f_k) be an
optimal oblivious strategy such that g_k ∈ D(µ_k, f_k). As in the proof of Lemma 34,
the range space of P is compact in the topology of pointwise convergence; therefore,
taking subsequences if necessary, we can assume without loss of generality that µ_k
converges pointwise to some strategy µ ∈ MO. By upper hemicontinuity of P (Lemma
34), we have µ ∈ P(f).
By the definition of D, it follows that for all x:

g_k(x) = ∑_{x′} g_k(x′) P(x | x′, µ_k(x′), f_k). (7.1)
Since P(x | x′, a, f) is jointly continuous in the action and the population state (Assumption
1), it follows that for all x and x′:

P(x | x′, µ_k(x′), f_k) → P(x | x′, µ(x′), f)

as k → ∞. Further, if g_k → g in the 1-p norm, then in particular g_k(x) → g(x) for
all x. Finally, observe that for all a and f, we have P(x | x′, a, f) = 0 for all states x′
such that ‖x′ − x‖∞ > M, since increments are bounded (Assumption 1). Thus:
∑_{x′} g_k(x′) P(x | x′, µ_k(x′), f_k) → ∑_{x′} g(x′) P(x | x′, µ(x′), f)

as k → ∞. Taking the limit as k → ∞ on both sides of (7.1) yields:
g(x) = ∑_{x′} g(x′) P(x | x′, µ(x′), f), (7.2)
which establishes that g ∈ D(µ, f). Since we had µ ∈ P(f), we conclude g ∈ Φ(f),
as required.
7.2 Convexity
Next, we develop conditions to ensure that Φ(f) is nonempty and convex. We start
by considering a simple model, where the action set A is the simplex of randomized
actions on a base set of pure actions. Formally, we have the following definition.
Definition 36. An anonymous stochastic game has a finite action space if there exists
a finite set S such that the following three conditions hold:

1. A consists of all probability distributions over S: A = { a ≥ 0 : ∑_s a(s) = 1 }.

2. π(x, a, f) = ∑_s a(s) π(x, s, f), where π(x, s, f) is the payoff evaluated at state
x, population state f, and pure action s.

3. P(x′ | x, a, f) = ∑_s a(s) P(x′ | x, s, f), where P(x′ | x, s, f) is the kernel eval-
uated at states x′ and x, population state f, and pure action s.
Essentially, the preceding definition allows the inclusion of randomized strategies in
our search for a mean field equilibrium. This mirrors Nash's original approach
to establishing existence of equilibrium in static games, where randomization
induces convexity on the strategy space. We show next that in any game with finite
action spaces, the set Φ(f) is always convex.
Proposition 37. Suppose Assumption 1 holds. In any anonymous stochastic game
with a finite action space, Φ(f) is convex for all f ∈ F.
Proof. Fix f ∈ F, and let g_1, g_2 be elements of Φ(f). Let µ_1, µ_2 ∈ P(f) be strategies
such that g_i ∈ D(µ_i, f), i = 1, 2. Then for i = 1, 2 and all x′ ∈ X, we have:

g_i(x′) = ∑_x g_i(x) P(x′ | x, µ_i(x), f).
Fix δ with 0 ≤ δ ≤ 1, and for each x define g(x) by:

g(x) = δ g_1(x) + (1 − δ) g_2(x).

We must show g ∈ Φ(f). Define a new strategy µ as follows: for each x such that
g(x) > 0,

µ(x) = [δ g_1(x) µ_1(x) + (1 − δ) g_2(x) µ_2(x)] / g(x).

For each x such that g(x) = 0, let µ(x) = µ_1(x).
We claim that µ ∈ P(f), i.e., µ is an optimal oblivious strategy given f ; and that
g ∈ D(µ, f), i.e., that g is an invariant distribution given strategy µ and population
state f . This suffices to establish that g ∈ Φ(f).
To establish the claim, first observe that under Definition 36, the right hand side
of (A.3) is linear in a. Thus any convex combination of two optimal actions is also
an optimal action. This establishes that for every x, µ(x) achieves the maximum on
the right hand side of (A.3); so we conclude µ ∈ P(f).
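The invariance part of the mixing construction can be checked numerically on a toy two-state chain. Everything below is made up for illustration (we verify only that g = δg_1 + (1 − δ)g_2 is invariant under the mixed strategy µ, not the optimality of µ_1 and µ_2); linearity of the kernel in a, as in Definition 36, is exactly what makes the algebra go through.

```python
delta = 0.4
S = [0, 1]
q = {0: 0.2, 1: 0.7}       # made-up pure-action up-probabilities

def P_pure(x_next, x, s):
    # Toy kernel: move to state 1 w.p. q[s], to state 0 otherwise,
    # independent of the current state x.
    return q[s] if x_next == 1 else 1.0 - q[s]

def P_mixed(x_next, x, a):
    # Randomized-action kernel, linear in a (cf. Definition 36).
    return sum(a[s] * P_pure(x_next, x, s) for s in S)

mu1 = {0: [1.0, 0.0], 1: [1.0, 0.0]}   # always pure action 0
mu2 = {0: [0.0, 1.0], 1: [0.0, 1.0]}   # always pure action 1
g1 = {0: 0.8, 1: 0.2}                  # invariant under mu1
g2 = {0: 0.3, 1: 0.7}                  # invariant under mu2

g = {x: delta * g1[x] + (1 - delta) * g2[x] for x in (0, 1)}
mu = {x: [(delta * g1[x] * mu1[x][s]
           + (1 - delta) * g2[x] * mu2[x][s]) / g[x] for s in S]
      for x in (0, 1)}

# g should be invariant under the mixed strategy mu:
check = {y: sum(g[x] * P_mixed(y, x, mu[x]) for x in (0, 1))
         for y in (0, 1)}
```

Running the check confirms `check` equals `g` coordinate-wise, i.e. the convex combination g is indeed an invariant distribution under µ, as the proof asserts.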