stacks.stanford.edukx619tt4623/thesis_adlakha... · I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation

EQUILIBRIUM AND CONTROL IN COMPLEX

INTERCONNECTED SYSTEMS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF ELECTRICAL

ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Sachin Adlakha

August 2010

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/kx619tt4623

© 2010 by Sachin Adlakha. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

ii



http://purl.stanford.edu/kx619tt4623

I certify that I have read this dissertation and that, in my opinion, it is fully adequatein scope and quality as a dissertation for the degree of Doctor of Philosophy.

Andrea Goldsmith, Primary Adviser


Ramesh Johari


Sanjay Lall

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file inUniversity Archives.

iii

Abstract

Large-scale complex systems such as power grids, transportation systems, and social

networks are reshaping every aspect of modern society. Despite their ubiquitous na-

ture, the design and understanding of such complex networks is still very challenging.

Decision making in such systems is complicated by the fact that an agent’s optimal

choice depends on the choices made by other agents in the system. In a smart grid,

the power consumption of an individual user could depend on the demand profile of

other users, some of who may be physically far away. An investment decision by an

agent in an online auction is affected by the strategic choices of other agents partic-

ipating in the auction. Thus, a node’s decision is affected by the presence and the

actions of other nodes in the system. The multitude of dependencies arising in such

environments lead to an extremely complicated decision making process for a single

agent.

Often in complex systems, the decision maker has partial information about the

state of the system. For example, a centralized load balancer in a server farm obtains

the state of the queues via a communication network. This network introduces delays

and losses which result in partial information at the decision maker. This further

complicates the decision making process.

In this thesis, we study equilibrium and control in complex interconnected systems.

In the first part of the thesis, we investigate centralized decision making in a networked

system in presence of delays. Specifically, we show that even in the presence of delays,

a centralized decision maker can make optimal decisions with only a subset of the

past history of the system.This history depends on the structure of the system as well

as the associated delay pattern. From a practical point of view, these results show

iv

that one can make optimal decisions with only finite memory about the past, thus

eliminating the need to store the entire history. Thus, for example, a centralized load

balancer in a server farm can use algorithms based on only a finite past to evenly

distribute load across multiple servers.

In the second part of the thesis, we look at decentralized decision making in a

reactive environment. We describe a mean field approach to decision making in large-

scale systems. The basic premise of this approach is to treat other agents as a single

entity with some aggregate behavior. We develop a unified framework to study mean

field equilibrium in large-scale stochastic games. Under a set of simple assumptions,

we prove the existence of a mean field equilibrium. A key insight developed from our

result shows that the existence result is closely related to the approximation of mean

field equilibrium to the actual behavior. Thus, a single agent can make near optimal

decisions based only on aggregate behavior of other agents.

We conclude the thesis with various interesting extensions and open challenges in

the design and understanding of complex interconnected systems.

v

Acknowledgments

This thesis is a culmination of a journey that started at Stanford about five years ago.

I was fortunate to have the guidance and friendship of several people who made this

journey enjoyable. First and foremost, I thank my adviser, Prof. Andrea Goldsmith,

for being a wonderful adviser and a mentor. She took a leap of faith by agreeing

to fund me from the first day, thus enabling me to come to Stanford. Without her

confidence and trust, I would not have been at Stanford, much less write this thesis.

She has always been very encouraging and supportive of me as I built collaborations

with different people and developed my research interests. She also made the Wireless

Systems Laboratory feel more like home. She was very generous in inviting us to

various parties at her home. It was a joy to meet her family - discussions with

Arturo sharpened my thought process and Daniel and Nicole’s company was always

a welcome change from the daily grind of research. For that, I sincerely thank her

and her family.

Early on in my research career, I was fortunate to work and interact with Prof.

Sanjay Lall. His breadth and depth of knowledge constantly amazed me and made

me realize how little I know. Besides being a great researcher, he is also a wonderful

and a very generous teacher. It is from him that I learned the art of learning. He

personally spent countless hours teaching me everything I know about control systems

and Markov decision processes. He also mentored me and taught me how to write

good papers and give good talks. Every paper I ever write, every talk I give in future

will always bear his signature. For all his time and efforts, I will always be thankful

to him.

My sincere thanks are also due to my co-adviser Prof Ramesh Johari. Ramesh’s

vi

enthusiasm for research, his drive for perfection, and his sheer ability to work hard

constantly amazed me. During my entire graduate school career, he was the bench-

mark I strove to achieve. He constantly challenged my limits and helped me realize

my potential. Besides working on research, he spent a lot of time giving me guidance

and career advice. His guidance allowed me to understand my strengths, realize my

weaknesses, and helped me push my limits. My experience at Stanford would have

never been the same, had I not had the pleasure of working with him. For the count-

less hours he spent thinking about my work, for genuinely caring about my work and

my career, and for making me realize my potential, I shall forever be indebted to him.

A significant portion of this thesis is based on the work of Prof. Gabriel Weintraub

of Columbia University. He not only provided the seeds of this work, he also helped

me guide through it. Gabriel comes from a very different background and has a very

unique perspective on research. He was generous with his time and shared his ideas

with me. For his guidance and help at every step of my work, I express my deepest

gratitude.

My Ph.D at Stanford was almost not going to happen had it not been for one

person, who had more faith in me than I had in myself. Convinced that Stanford was

beyond my reach, I had almost decided not to apply. It was only at the urging of Ram

that I finally decided to take the chance. He even promised to pay the application

fee which he still owes me. But what I owe him can never be repaid. He has been a

true friend, believing in me more than I ever believed in myself. During all the ups

and downs of this grueling journey, he was a constant source of encouragement and

support. Mere words of thanks can never do justice to all that he has done for me.

My stay at Stanford gave me an opportunity to make some wonderful friends. I

would like to thank Mayank Jain for being extra-ordinarily helpful at every step of

the way. He was always generous with his time - spending several hours going over the

details of my work with me. He is also the reason that I survived Stanford without

ever owning a car. His generosity will always be remembered. Part of this research

started as a course project that was jointly done with Vineet Abhishek. Even though

he left Stanford for greener pastures, the seeds we jointly sowed as a course project,

flourished as part of my thesis. The joint project also provided an opportunity to

vii

know him better and to grow as friends. Several arguments and discussions over tea,

and our regular dinners at “Treehouse” shall always be fondly remembered.

Life at Stanford would have never been the same without the company of several

friends. Forum Parmar - whose infectious laughter lightened the mood of most serious

of all conversations, Mridul Agarwal - who provided me company at various hiking

trips we made, Dinkar Gupta - whose extraordinary culinary skills and stimulating

company provided for some wonderful dinner nights, Saurabh Jain - who dazzled us

with some wonderful desserts, and Kannan, Kadambari and Abhay - who provided

wonderful company, made these past five years worth living and remembering. During

last few years, I also had the pleasure of knowing Suchitra Vijayan, first as Ram’s wife

and then as a very caring friend. The times spent complaining about Ram, discussing

politics, and exchanging recipes will always be fondly remembered.

The daily grind of school was made more bearable by the presence and company

of Vinay Majjigi who provided me company every time I needed a break. He was

very generous in sharing with us his Mom’s food which made me miss home a tad

bit less. Dan O’Neill offered me some very valuable advice and shared his years of

experience and his unique perspective on life. It would not be an exaggeration to say

that various discussions with him will certainly have an impact on whatever future

career I pursue.

The members of the Wireless Systems Laboratory provided a very intellectually

stimulating environment. Ivana Maric was a very patient and adjusting office mate

who suffered as we converged on the right temperature in our office. Bruno Sinopoli

jump started my research career as soon as I joined Stanford. Various members of

the Wireless Systems Laboratory (both past and present) made this a fun place to

be.

Special thanks are due to Maria Kazandjieva for being my running buddy and

for her wonderful company, to Michelle Hewlett, Samar Fahmy, Sara Lefort, Hattie

Dong and Thomas John for providing a reason (other than work) to come to office

every day, to Sophie and Jonathan for being friends from the first day I came to the

US, to Patrick Burke for helping me with every computer related issue, and to Pat

Oshiro, Bernadette Aguiao and to Joice DeBolt for making bureaucratic work less of

viii

a hassle.

During the last five years, Sanjay Bhal and his family opened the doors of their

heart and provided me a home away from home. Poorvi Bhabhi ensured that I never

missed home cooked food. Kuhoo’s angelic face and innocent remarks made me forget

the stress of work and life. Kaustubh and Prisha never made me miss my nephew

and niece in India. The friendship and their warm hospitality made even the hardest

periods of life bearable.

Last, but not the least, my deepest gratitude is for my family who were always

there to support me through various challenges of my life. My sisters (Pooja, Prerna,

and Kashika) and my brothers (Ashish and Ketan), who endured my absence as I

focused more on my work, were always very supportive. My parents endured several

hardships so that I could follow my passion – they always supported me at every stage

of my life, and sacrificed their dreams so that I could achieve mine. This journey would

never have been possible without their love, support and encouragement. This thesis

is dedicated to them!

ix

Contents

Abstract iv

Acknowledgments vi

1 Introduction 1

1.1 Networked Control Systems with Delayed Information . . . . . . . . . 3

1.1.1 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Managing Interactions via a Mean Field Approach . . . . . . . . . . . 6

2 Networked Markov Decision Processes 10

2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Partially Observed Markov Decision Processes . . . . . . . . . . . . . 15

2.2.1 Information State for POMDPs . . . . . . . . . . . . . . . . . 17

2.3 Networked Markov Decision Processes . . . . . . . . . . . . . . . . . 20

2.3.1 Networked MDP as a POMDP . . . . . . . . . . . . . . . . . 24

3 Information State for Networked MDPs 26

3.1 Networked MDP with Action Delays . . . . . . . . . . . . . . . . . . 34

3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.1 Single System with Delayed State Observations and Action Delays 35

3.3 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.1 Linear Systems with Delays . . . . . . . . . . . . . . . . . . . 37

3.3.2 Controller Design for Finite State Systems . . . . . . . . . . . 40

x

4 A Bayesian Network Approach to Network MDPs 45

4.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Networked MDPs as Bayesian Networks . . . . . . . . . . . . . . . . 48

4.3 Alternate Proof of the Information State for Networked MDPs . . . . 50

5 A Mean Field Approach to Studying Large Systems 59

5.1 Stochastic Game Model . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.2 Markov Perfect Equilibrium (MPE) . . . . . . . . . . . . . . . . . . . 63

5.3 Mean Field Equilibrium (MFE) . . . . . . . . . . . . . . . . . . . . . 65

6 MFE as an Approximation to MPE 69

6.1 The Asymptotic Markov Equilibrium (AME) Property . . . . . . . . 69

7 Existence of Mean Field Equilibrium 73

7.1 Closed Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7.2 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7.3 Compactness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

8 Conclusions and Future Work 82

A Proofs 85

A.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

A.2 Proof of AME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.3 Compactness: Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Bibliography 116

xi

List of Figures

2.1 A network of interconnected subsystems with delays. Subsystem i is

denoted by Si, the network propagation delay from Si to Sj is denoted

by Mij and the measurement delay from Si to the controller is denoted

by Ni. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2 Directed graph for the network of Figure 2.1. . . . . . . . . . . . . . . 23

3.1 A networked Markov decision process with action delays. The control

action delay to subsystem Si is denoted by Pi. . . . . . . . . . . . . . 33

3.2 A network of two interconnected subsystems with delays. Here the

control input is only applied to subsystem 1. . . . . . . . . . . . . . . 36

3.3 A system of two interacting queues. Here the solid line represent jobs

of type R which enter system 1, and are then transported to system

2 after a delay of M12. Similarly, the dashed line represents jobs of

type B which enter system 2 and are transported to system 1 after a

delay of M21. Of the two queues at each system, the top queue is the

high-priority queue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4 Infinite horizon discounted action cost Ja (averaged over all initial

states) vs. the infinite horizon discounted state cost Js. The curve

is plotted by varying the weighting factor α. . . . . . . . . . . . . . . 43

4.1 A Bayesian network with 6 variables . . . . . . . . . . . . . . . . . . 46

xii

4.2 A network of two interconnected subsystems with delays. Subsystem

i is denoted by Si, the network propagation delay from Si to Sj is

denoted by Mij and the measurement delay from Si to the controller

is denoted Ni. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 The Bayesian network associated with the 2-subsystems networked

MDP of Figure 4.2. Here the circle represents the state of the two sub-

systems and the square represents the control input. For this Bayesian

network, we chose M21 = 1 and M12 = 2. The edges from state vari-

ables to control inputs have been omitted for visual clarity. . . . . . . 51

xiii

Chapter 1

Introduction

Complex networks pervade every aspect of modern society. Nearly any large scale

system can be modeled as a network of interconnected agents or subsystems dynam-

ically evolving over time: the power grid, social networks, modern economies and

financial systems, automated transportation systems, and “smart” environments all

exhibit this characteristic. A defining feature of a complex network is a high degree

of interconnectedness. This connectivity, which is one of the most attractive features

of complex networks, is also a source of its biggest challenge. Two key characteris-

tics of complex networks are: (a) the environment in which they operate is reactive,

and (b) each node in a complex network usually has partial information about its

environment. Understanding these features is important to the design of complex

networks.

• Reactive Environment: Imagine a power grid consisting of a variety of gen-

erators and storage devices catering to the needs of many users, including a

substantial fraction with plug-in vehicles. A user’s decision to charge a hybrid

vehicle could depend upon other vehicles being charged in the same area. Simi-

larly, the operating point of a generator could depend upon the presence of other

energy storage or generating devices. Thus, any node’s decision is affected by

the presence and the actions of other nodes in the system. The environment

in which a node operates consists not only of the dynamics associated with

the physical environment but also the dynamics associated with the actions of

1

CHAPTER 1. INTRODUCTION 2

other nodes. From the point of view of a single node, the environment is re-

active: other nodes will respond and react to this node’s behavior. A similar

phenomenon arises in wireless systems where the transmit power of a device

depends upon the channel characteristics as well as the transmit power of other

devices operating in the region. The multitude of dependencies arising in such

environments lead to an extremely complicated decision making process for a

single agent.

• Partial Information: Decision making in any complex system is also made

difficult by the lack of complete information available to any decision maker.

Imagine a server farm with thousands of servers that handle incoming requests.

A centralized decision maker aims to allocate the incoming requests to these

servers to effectively balance the load in the server farm. In order to do this,

the decision maker requires the current state of the queue at each server. This

information is usually transmitted to the decision maker via a control network.

Because of the inherent delays and losses in the network, it is possible that the

decision maker would not have the current state of each server. In other circum-

stances, noise in the feedback process can corrupt the information available to

the decision maker. Such artifacts of networks further complicate the decision

making in complex systems.

In this thesis, we focus on the design and understanding of complex networks

from the viewpoint of a single agent. A common theme of this research has been

the answer to the following question: how does a single agent make an optimal

decision in the presence of a reactive environment and/or with partial

information about the state of the environment? The thesis is divided into

two fairly distinct parts that addresses this question. In the first part of the thesis,

we study the problem of decision making in the presence of partial information and

the latter part addresses the issues of reactive environments and their role in decision

making for complex networks.


1.1 Networked Control Systems with Delayed In-

formation

In order to better understand the design of complex systems, it is important to un-

derstand how the presence of delay affects decision making. In order to gain this

understanding, we model a complex network as an interconnected network of sub-

systems. Each subsystem is modeled as a Markov decision process (MDP) and this

network of MDPs is referred to as a networked Markov decision process. Such a net-

work of Markov decision processes is a common model for a variety of distributed

control problems, such as distributed vehicle coordination [22], communication sys-

tems, queuing networks [41, 9] and distributed scheduling over multiple servers [5],

[4], [11]. The subsystems are coupled to each other via communication links that are

noise free, pure delay lines. Thus, in our network model we do not consider packet

losses or noisy observations. We also assume that the delays are fixed but may be

different for each interconnection. We assume that each subsystem has a finite state

space and its state evolution is affected by the delayed state of its neighbors. A

centralized controller receives delayed state measurements from each subsystem and

computes an optimal control action to be applied to each subsystem. The control

action applied to each subsystem takes effect after a certain delay.

Although the controller receives state information from each subsystem, each of

these states is delayed by different amounts. Since the current state of each subsystem

is not available to the controller, this system can be represented as a partially observed

Markov decision process (POMDP). Optimal control design for POMDPs has been

studied extensively in the literature [12, 40, 53]. There are two standard approaches

to optimal control of POMDPs. The first approach generates a policy which is a

function of the entire history of observation; this history is called an information

state. As the time increases, the number of observations increase and hence the

information state grows without bound. In the second approach, the controller is a

function of the posterior distribution of the current state of the system conditioned on

the entire observation history. This distribution is called the belief state and finding

an optimal controller requires solving a dynamic program on a space of belief states.


This posterior distribution or the belief state is infinite dimensional and hence the

computation of a controller is very challenging [53]. Thus, in general, the computation

required to compute an optimal controller is prohibitively large. We are therefore

motivated to find a representation of the belief state that is as small as possible.

Our main contribution in this section of the thesis is to show that for networked

Markov decision processes, the optimal controller is a stateless function of a finite

number of past observations. Thus, networked MDPs can be reduced to an MDP

with a sufficient information state that does not grow with time. This sufficient

information is a subset of the entire information state and it captures all relevant

information required for the optimal control. This significantly reduces the computa-

tion complexity associated with obtaining an optimal controller for networked MDPs.

More interestingly, we show that the amount of information required to make this op-

timal decision is related to the concept of a Markov blanket in Bayesian networks. This

interesting connection between networked control systems and Bayesian networks is

an exciting new common thread between these fields, and it opens doors to using

ideas from Bayesian networks to better understand networked systems with delayed

information. The bound on the number of past observations required to compute an

optimal controller is shown to be tight. In an extension of standard terminology for

linear systems, we refer to these numbers as bands and call the controller with finite

memory banded. We show that for networked MDPs, the bands depend only upon the

network structure and the associated delays. From a practical point of view, these

results show that one can make optimal decisions with only finite memory about the

past, thus eliminating the need to store the entire history. Thus, for example, a cen-

tralized load balancer in a server farm can use algorithms based on only a finite past

to evenly distribute load across multiple servers.

1.1.1 Prior Work

The optimal control of Markov decision processes (MDPs) originates with the work

of Bellman [17]. The term MDP refers to optimal control problems in which the

system state is available for measurement, and such systems are analyzed in [48].


Control problems for which the full state is not observable are called partially observed

Markov decision processes (POMDP), and are considered in [40]. The separation

structure and dynamic programming recursion for such problems are well-known [12,

53]. For finite-state systems on a finite horizon, the optimal state-feedback controller

is memoryless. However, when only partial observations of the state are available,

the optimal controller has a separation structure, and is a function of the posterior

distribution of the current state given all the observations [40, 12, 53]. In the special

case where the system dynamics are linear and the noise is Gaussian, the optimal

controller can be computed recursively in an efficient manner, leading to the classical

linear quadratic Gaussian (LQG) control formulation.

Markov decision processes with delays have also been studied extensively in the

literature. Altman and Nain [8] consider an MDP with delayed state availability and

use it to solve a communication network design problem. Bander and White [13]

extended this result to the case where a partial state observation is available after

a delay. Markov decision processes with control action delays are considered in [6].

In [39], the authors unified these results by considering an MDP with observation

delays, action delays as well as cost delays. They also extended the result to the case

of random delays. Optimal control for linear systems with control action delay was

also considered in [16].

The previous works considered a single Markov decision process with delays in

observations and action. Among the earliest works on distributed systems with delays

is [54] where a separation structure for the one-step delay sharing pattern for a system

with general nonlinear dynamics was obtained. Algorithms to compute the optimal

controller for such a system were obtained in [33, 34] by essentially reducing the

problem to a centralized control problem. An optimal controller is then synthesized

using standard algorithms. More general decentralized control of MDPs has been

shown to be intractable in [18]. An optimal controller for the one-step delay sharing

pattern for LQG was obtained in [50] and [42]. Optimal control of linear systems with

one-step delay sharing was also studied in [55] in an input-output framework. More

generally, in [31, 32] it was shown that for dynamic LQG teams with a partially nested

information structure, the optimal controller at any time is linear. See [15, 14] for a


more complete bibliography on results in team theory. Markov decision processes with

delayed feedback have been used to study flow control in queuing networks [41, 9].

Stochastic games with one-step delayed sharing information pattern have also been

used to study distributed power control in networks [7].

1.2 Managing Interactions via a Mean Field Ap-

proach

In several complex systems, a large number of agents interact with each other without

the presence of a central decision maker (or a controller). In applications as diverse

as power control in wireless networks [28], competition among firms in a market [27]

or non-cooperative control systems [35], one often finds a large number of agents

interacting with each other. In many of these applications, the agents have conflicting

objectives and they interact with each other in a stochastic dynamic environment.

In the absence of any centralized authority in such systems, a natural mathematical

framework to study such complex systems is that of stochastic games.

Stochastic games [51] are dynamic games with probabilistic transitions played by

one or more players. In a stochastic game, players compete with each other in a

number of time steps. Each player has a state which describes all parameters of

interest to the player. The state of a player evolves according to a stochastic process.

Such games provide a framework to model dynamic behavior of agents in a stochastic

environment without the presence of a centralized authority or a controller.

Markov perfect equilibrium is a commonly used equilibrium concept for stochastic

games [51]. In MPE, strategies of players depend only on the current state of all

players, and not on the past history of the game. In general, finding an MPE is

analytically intractable; MPE is typically obtained numerically using dynamic pro-

gramming (DP) algorithms [46]. As a result, the complexity associated with MPE

computation increases rapidly with the number of players, the size of the state space,

and the size of the action sets [25]. This limits its application to problems with small

dimensions. Several techniques have been proposed in the literature to deal with the


complexity of large scale systems [47, 10, 52, 36].

Recently, a scheme for approximating MPE for such large scale games has been

proposed in different application domains via a solution concept we call mean field

equilibrium, or MFE [38, 35, 56, 1, 2, 43, 26, 23]. Mean field equilibrium is also

referred to as “oblivious equilibrium” in [56], and as “Nash certainty equivalence

control” in [35].In mean field equilibrium, a player optimizes given only the long-

run average statistics of other players, rather than the entire instantaneous vector

of its competitors’ states. MFE resolves the computational difficulties associated

with MPE: in MFE, a player is reacting to far simpler aggregate statistics of the

behavior of other players. Further, MFE computation is significantly simpler than

MPE computation, since each player only needs to solve a one-dimensional dynamic

program; thus MFE is appealing from a rationality standpoint as well, as it does

not require agents to track a complex state vector in a system with many interacting

players.

The notion of mean field equilibrium provides a simple approach to understanding

behavior in large population stochastic dynamic games. However, this notion is not

very meaningful unless we can guarantee that a mean field equilibrium exists in a

wide variety of stochastic games. Even if a mean field equilibrium were to exist in a

particular game of interest, it is natural to wonder whether such an equilibrium is a

good approximation to Markov perfect equilibrium in finite games. MFE is unlikely

to be useful in practice without conditions that guarantee it approximates equilibria

in finite systems well. Below we address these two fundamental questions on this

topic: the existence of MFE and whether it provides any meaningful approximation

to MPE.

As we shall show below, an important contribution of our work is to relate ap-

proximation to existence of MFE. The approximation theorem we provide requires

continuity assumptions on the model primitives; as we demonstrate in the second

part of this thesis, these same continuity conditions are required (together with con-

vexity and compactness conditions) to ensure an MFE actually exists. Thus we obtain

the valuable insight that approximation is essentially a corollary of existence. This is

practically valuable: establishing MFE is a good approximation is effectively a free


byproduct, once the conditions ensuring its existence have been verified.

The mean field approach developed in our research has implications for a variety

of engineering applications such as cognitive radio networks, smart grids, etc. Imag-

ine a network of cognitive radios looking for spectrum holes. In order to construct

algorithms for effective spectrum search, a radio must also take into account the pres-

ence of other cognitive devices. Our research suggests design directions for simple

spectrum search algorithms that might only react to average statistics of the envi-

ronment, rather than finely detailed knowledge of other nodes’ behavior. Similarly,

imagine a network of electric cars that are all trying to charge their batteries from the

power grid. In order to construct simple recharging algorithms for electric cars, each

car might base its recharging schedule on some average load profile of the electric

grid. These policies in turn could help prevent the occurrence of load spikes on the

electric grid, since the equilibrium condition requires that these policies give rise to

the initially conjectured average load profile.

The rest of the thesis is organized as follows. In Chapter 2, we describe our model

for networked control systems. As discussed above, we model networked control sys-

tems as a networked Markov decision process. We begin this chapter by formally

defining Markov decision processes (MDPs), partially observed Markov decision pro-

cesses (POMDPs), and networked Markov decision processes (N-MDPs). We also

formally define the information state for POMDPs and show that in general net-

worked MDPs can be written as POMDPs. The first main result - that computes the

sufficient information state for networked MDPs and shows that it is finite set set of

the past history, is proved in Chapter 3. We also show that our model and results

encompass previously know results and we study two numerical examples, where we

compute an actual controller for networked MDPs with delayed state information. In

order to gain more insight into our results, we study a special case of our main result

(particularly, we look at finite time horizon networked MDPs) in Chapter 4. In this

chapter, we provide an alternate proof of our main result using ideas from Bayesian

networks. This alternate approach provides additional insights into the finite mem-

ory of the controllers for networked MDPs. It shows that the finiteness of the bands

occurs because given the finite history of states and actions, the current state of the


system is independent of the remaining states and actions.

In Chapter 5, we describe our model of stochastic game and define the notion of

Markov perfect equilibrium and the mean field equilibrium. In the next chapter, we

define our notion of approximation to MPE and prove that as the number of players in

a game becomes large, the mean field equilibrium is a good approximation to Markov

perfect equilibrium. In Chapter 7, we prove that under a simple set of assumptions,

a mean field equilibrium exists in a wide variety of stochastic games. The existence

result is based on Kakutani’s theorem and we check each of the three conditions of

the theorem in several sections of this chapter. The following chapter then concludes

the thesis and provides a list of interesting and open challenges that are pertinent to

the understanding and design of complex systems.

Chapter 2

Networked Markov Decision

Processes

In this chapter, we present our model of networked Markov decision processes. The

subsystems in a networked MDP are coupled to each other via communication links

that are noise free, pure delay lines. We also assume that the delays are fixed but

may be different for each interconnection. A centralized controller receives delayed

state measurements from each subsystem and computes an optimal control action to

be applied to each subsystem. The control action applied to each subsystem takes

effect after a certain delay. As mentioned before, each subsystem is modeled as a

Markov decision process. Before we formally define a Markov decision process, we

formalize our notation.

Notation. In chapters 2 – 4, we use the following notation. We use superscripts

to denote particular subsystems and subscripts for the time index. Thus xit denotes

the state of the subsystem i at time t. For simplicity, we omit the superscript 1 if

there is only one subsystem. Similarly, we denote by yit the observation received from

subsystem i at time t and by uit the control input applied to subsystem i at time t. We

also denote by z, s and a the realization of the state x, observation y and control action

u. We define xit1:t2

:=(xi

t1, . . . , xi

t2

)to refer to the list of variables corresponding to

the subsystem i from time t1 to t2. If t2 < t1, we interpret this as an empty list. The

10

CHAPTER 2. NETWORKED MARKOV DECISION PROCESSES 11

notation x0:t = z0:t is interpreted as an element-wise equality, i.e., x0 = z0, x1 = z1,

etc. To denote the list of variables corresponding to all subsystems, we define xt :=

(x1t , . . . , x

nt ). Similarly, we denote ut := (u1

t , . . . , unt ) as the control action applied to all

subsystems at time t. We define Ai0:t to be the product of the variables corresponding

to times 0, . . . , t, that is Ai0:t := Ai

0Ai1 . . . Ai

t. We will see below that each Ait is a

function of several variables and the product Ai0A

i1 . . . Ai

t is interpreted as a product

of several functions. Similarly, we can define the product A10:tA

20:t . . . A

n0:t as a product

of functions A10A

11 . . . A1

t A20A

21 . . . A2

t . . . An0A

n1 . . . An

t . For a set X , we denote X n to be

the n-fold cartesian product of the set, that is X n = X × · · · × X n-times. We write

Z+ for the set of non-negative integers and write Z

++ for the set of positive integers.

2.1 Markov Decision Processes

A Markov decision process provides a framework for sequential decision making in

a stochastic environment. The decision (also known as the action) taken at time t

affects the evolution of the future system. The goal of the decision-maker is to choose

a sequence of actions to optimize a predetermined criterion. We assume that the

decisions are made at discrete times t ∈ Z+.

At each decision time t, the system occupies a state. We denote the set of all

possible states by a finite set X . At each time t, the decision-maker choses a decision

from a finite set denoted by U . Formally,

Definition 1 (MDP). A Markov decision process is a tuple (A, g) where,

1. A is a sequence A0, A1 . . . with A0 : X → [0, 1], such that A0(z) ≥ 0 for all

z ∈ X and∑

z A0(z) = 1.

For t ≥ 1, we have At : X × X × U → [0, 1], such that

At(z1, z2, a) ≥ 0, ∀ z1, z2 ∈ X and a ∈ U ,∑

z1

At(z1, z2, a) = 1, ∀ z2 ∈ X and a ∈ U .

2. g is a sequence g0, g1, . . . with gt : X × U → R.


Roughly speaking, A0 is the distribution of the initial state, At is the transition

kernel at time t and gt is the cost at time t. As an example of an MDP, consider a

discrete time dynamic system, where the state of the system at time t ≥ 0 is denoted

by xt. The system dynamics are

xt+1 = f (xt, ut, wt) . (2.1)

Here ut is the control action or the decision taken at time t. The random variables

wt for t ≥ 0 are independent noise processes. The initial state x0 is chosen to be

independent of the noise process wt. Associated with this dynamic system is an

MDP (A, g) defined as follows. For all p ∈ X , let A0(p) = Prob(x0 = p) be the

probability mass function of the initial state of the system. For t > 0, let

At (zt, zt−1, at−1) = Prob(xt = zt | xt−1 = zt−1, ut−1 = at−1

)(2.2)

be the conditional probability of state xt given the previous state xt−1 and the applied

input ut−1. It is easy to verify that the sequence A satisfies all the properties as given

in Definition 1. The sequence gt (xt, ut) represents the cost at time t and it depends

on the current state xt of the system as well as the action ut taken at time t. Note

that the dynamic system presented in equation (2.1) and the transition matrix given

in equation (2.2) are both canonical representations of a Markov decision process. In

fact, given any MDP one can represent it either as a dynamic system or as a transition

matrix [48, 20, 21, 4].

As mentioned before, the decision-maker (i.e., the controller) needs to choose an

action ut at time t. This action is chosen based upon the information available to

the controller at that time. We define hmdpt to be the information available to the

controller at time t, given by

hmdpt =

(u0:t−1, x0:t

).


We will also use imdpt to denote a realization of hmdp

t as

imdpt =

(a0:t−1, z0:t

).

Here the sequences z and a specify the values of a realization of x and u, respectively.

An MDP policy (also known as the control policy) specifies the decision or control

action to be taken at each time t.

Definition 2 (MDP Policy). An MDP policy is a sequence K = (K0, K1, . . . ) where

K0 : U × X → [0, 1] and Kt : U × X t+1 × U t → [0, 1] for all t ∈ Z++ such that

K0(a, z) ≥ 0 ∀ a ∈ U and z ∈ X ,∑

a

K0(a, z) = 1 ∀ z ∈ X .

and for all t ∈ Z++ we have

Kt(a1, z, a2) ≥ 0, ∀ a1 ∈ U , z ∈ X t+1 and a2 ∈ Ut,

∑

a1

Kt(a1, z, a2) = 1, ∀ z ∈ X t+1 and a2 ∈ Ut.

Thus, Kt is a history dependent randomized policy, which maps the history of

the MDP to an action at time t. For the discrete time dynamic system given in

equation (2.1), we can interpret the MDP policy as

Kt(at, it) = Prob(ut = at | hmdpt = it).

The MDP policies as described above are called as mixed policies since the decision at

time t is specified by a probability distribution which is a function of the information

available to the controller.

Stochastic Process Generated by an MDP. Consider an MDP (A, g) and an

MDP policy K. Associated with (A, g) and K is a stochastic process that is induced

by the MDP and its policy. For MDPs evolving over a finite time horizon T , we can


define the sample space of the stochastic process as

Ω = X × U × X . . .U × X = X × UT−1 ×X .

A typical element ω ∈ Ω is given by a sequence of states and actions. For example,

for an infinite horizon model, a typical sample path would be given as

ω = z0, a0, z1, a1, . . . .

Definition 3 (MDP Stochastic Process). Suppose (A, g) is an MDP and K is an

MDP policy. Define the state process xt(ω) and the action process ut(ω) by

Prob (x0:t = z0:t, u0:t = a0:t) = A0(z0)t∏

k=1

Ak (zk, zk−1, ak−1)t∏

k=0

Kk (ak, z0:k, a0:k−1) .

(2.3)

Note that this implies that for all t we have

Prob (xt = zt|x0:t−1 = z0:t−1, u0:t−1 = a0:t−1) = Prob (xt = zt|xt−1 = zt−1, ut−1 = at−1) ,

Prob (xt = zt|xt−1 = zt−1, ut−1 = at−1) = At (zt, zt−1, at−1) ,

Prob (ut = at|x0:t = z0:t, u0:t−1 = a0:t−1) = Kt (at, z0:t, a0:t−1) .

The above equations show that the state xt is conditionally independent of the past

states and actions given the current state xt−1 and the current action ut−1. Thus,

given the policy, the state evolution is Markov, justifying the name.

As mentioned before, the goal of the Markov decision process formulation is to

make sequential decisions in a stochastic environment. The controller’s objective is

to choose an MDP policy K so as to minimize a cost function. Typically, the cost

function has the form

JK (A, g) , E

T∑

t=0

gt (xt, ut) .

Here, the expectation is taken over the noise processes and is with respect to the prob-

ability measure defined in equation (3). The notation JK (A, g) represents the cost of


an MDP (A, g) under an MDP policy K. In this sense, the sequence g represents the

cost function or the objective that the decision-maker wishes to minimize.

2.2 Partially Observed Markov Decision Processes

A POMDP is an extension of an MDP, where the state of the system is not fully

observable [40, 45]. Thus, the decision-maker needs to make the decision with only

partial knowledge of the state of the system. The set of all possible observations as

seen by the decision-maker is denoted by a finite set Y .

Definition 4 (POMDP). A partially observed Markov decision process is a tuple

(A,C, g) where,

1. (A, g) is a Markov decision process.

2. C is a sequence C0, C1 . . . with Ct : Y × X → [0, 1], such that

Ct (s, z) ≥ 0, ∀ s ∈ Y ,∀ z ∈ X ,∑

s

Ct (s, z) = 1, ∀ z ∈ X .

Intuitively, Ct is the observation received by the controller at time t. Akin to

MDPs, the decision in a POMDP is made based on the information available to the

decision-maker. We define hpomdpt to be the information available to the controller at

time t, given by

hpomdpt =

(u0:t−1, y0:t

).

Also, we use ipomdpt to denote a realization of hpomdp

t as

ipomdpt =

(a0:t−1, s0:t

).

Definition 5 (POMDP Policy). A POMDP policy is a sequence K = (K0, K1, . . . )


where K0 : U × Y → [0, 1] and Kt : U × Y t+1 × U t → [0, 1] for all t ∈ Z++ such that

K0(a, s) ≥ 0 ∀ a ∈ U and s ∈ Y ,∑

a

K0(a, s) = 1 ∀ s ∈ Y .

and for all t ∈ Z++ we have

Kt(a1, s, a2) ≥ 0, ∀ a1 ∈ U , s ∈ Y t+1 and a2 ∈ Ut,

∑

a1

Kt(a1, s, a2) = 1, ∀ s ∈ Y t+1 and a2 ∈ Ut.

For partially observed discrete time dynamic processes, the POMDP policy gives

a probability distribution over possible actions or controls as a function of the infor-

mation available to the decision-maker. That is

Kt(at, it) = Prob(ut = at | hpomdpt = it).

Stochastic Process Generated by a POMDP. Consider a POMDP (A,C, g)

and a POMDP policy K. Associated with every (A,C, g) and K is a stochastic

process that is induced by it. For a POMDP evolving over a finite horizon T , we can

define the sample space of the stochastic process as

Ωpomdp = X × Y × U × X . . .Y × U × X

= X × Y × UT−1 ×X

A typical sample path for the infinite horizon POMDP would be given as

ω = z0, s0, a0, z1, s1, a1, . . . .

Definition 6 (POMDP Stochastic Process). Consider a POMDP (A,C, g) along with

a POMDP policy K. Define the state process xt(ω), the observation process yt(ω) and


the action process ut(ω) by

Prob (x0:t = z0:t, y0:t = s0:t, u0:t = a0:t) = A0:tC0:tK0:t. (2.4)

Here we have suppressed the arguments for notational compactness. Note that

this implies that for all t we have

Prob (xt = zt|x0:t−1 = z0:t−1, u0:t−1 = a0:t−1) = Prob (xt = zt|xt−1 = zt−1, ut−1 = at−1) ,

Prob (xt = zt|xt−1 = zt−1, ut−1 = at−1) = At (zt, zt−1, at−1) ,

Prob (yt = st|xt = zt) = Ct (st, zt) ,

Prob (ut = at|y0:t = s0:t, u0:t−1 = a0:t−1) = Kt (at, s0:t, a0:t−1) .

Similar to MDPs, given a fixed policy, the state evolution process xt and the observa-

tion process yt are both Markov. The POMDP policy only depends on the observation

vector y and not on the actual state vector x, justifying the partially observed part

of the name.

The cost function for POMDPs is given as

JK (A,C, g) = E

T∑

t=0

gt (xt, ut) .

where the expectation is taken with respect to the marginal probability measure

derived from equation (2.4). Here JK (A,C, g) represents the cost of a POMDP

(A,C, g) under a POMDP policy K. The objective of a decision-maker is to find a

POMDP policy which minimizes the expected cost.

2.2.1 Information State for POMDPs

An information state for a POMDP represents all the information about the history

of the POMDP that is relevant to the selection of an optimal control. A POMDP

can be reformulated as an MDP using the information state. For a POMDP, the

information state consists of either a complete history of observations and actions or


their corresponding sufficient statistics [40]. We define the term sufficient infor-

mation state to mean a function of the past observations of the POMDP that is

detailed enough to permit an optimal controller to use the history processed through

this function as its only input. Using the sufficient information state, a POMDP can

be converted into an MDP with observable state such that the optimal controller for

this MDP also minimizes the cost function for the original POMDP.

Definition 7. Suppose (A,C, g) is a POMDP and define a sequence of functions

γt : U t × Y t+1 → Q.

Let ξt = γt (u0:t−1, y0:t). Then ξt is called a sufficient information state for the

POMDP if there exists an MDP (A, g) over the state space Q and the action space U

such that, for all POMDP policies K, we have

1. A is a sequence such that

At+1 (qt+1, qt, at) = Prob (ξt+1 = qt+1 | ξ0:t = q0:t, u0:t = a0:t) . (2.5)

2. g is a sequence g0, g1 . . . such that

gt (qt, at) = E (gt (xt, at) | ξt = qt, ut = at) . (2.6)

3. For all t ≥ 0, we have

Prob(

xt = zt | ξt = γt (s0:t, a0:t−1) , . . . , ξ0 = γ0(s0), u0:t−1 = a0:t−1

)

=

Prob (xt = zt | y0:t = s0:t, u0:t−1 = a0:t−1) . (2.7)

Note that A in equation (2.5), gt in equation (2.6) and the conditional probability

in equation (2.7) are independent of the POMDP policy K. Furthermore, equa-

tion (2.5) shows that given the action sequence or the policy the evolution of ξt is

Markov.


From the above definition, it is clear that associated with any POMDP is a suffi-

cient information state MDP (A, g). Let hi-mdpt be the history of the sufficient infor-

mation state MDP at time t. Then, we have

hi-mdpt =

(u0:t−1, ξ0:t

).

We will use ii-mdpt to denote a realization of hi-mdp

t as

ii-mdpt =

(a0:t−1, q0:t

).

As before, we define a sufficient information state MDP policy as a mapping from the

history of the information state MDP to an action at time t. Let Kt be a sufficient

information state MDP policy. As before, we can interpret Kt as

Kt(at, it) = Prob(ut = at | h

i-mdpt = it

).

The following theorem shows that we can find an optimal POMDP policy by consid-

ering the associated MDP over the sufficient information state.

Theorem 8. Consider a POMDP (A,C, g) and let Ppomdp be the set of all POMDP

policies. Let (A, g) be the sufficient information state MDP associated with the given

POMDP and let Pi-mdp be the set of all sufficient information state MDP policies.

Then, for any T , we have

minKt∈Ppomdp

t=0,1,···T

T∑

t=0

E[gt(zt, at)

]= min

Kt∈Pi-mdp

t=0,1,···T

T∑

t=0

E[gt(qt, at)

]

Proof. The proof follows from standard dynamic programming techniques as given

in Chapter 6 of [40].

From the above theorem, it is clear that one can find an optimal policy for a

POMDP by transforming it into a sufficient information state MDP. Given an optimal

sufficient information state policy Kopt one may immediately compute the optimal

POMDP policy by composing Kopt with γ. The optimal sufficient information state


policy Kopt may be found using standard dynamic programming recursion. From [48],

we know that the optimal policy for an MDP is a function of its current state. In

other words, the optimal policy for a POMDP is just a function of its sufficient

information state ξt. One such sufficient information state is the entire history of

the POMDP, where γt is an identity function [40]. As we show below, for a certain

class of POMDPs (in particular for networked MDPs), the sufficient information state

includes only finite past history of observations and control actions. In other words,

for a certain class of POMDPs, the function γt is a projection operator. Also note

that the above theorem can be easily extended to the infinite horizon case (both

average cost as well as discounted cost), as long as the limiting value of the sum of

the costs is well defined. For the discounted infinite horizon case, we can incorporate

the discount factor in the time dependent cost function.

2.3 Networked Markov Decision Processes

A networked Markov decision process (N-MDP) is a weighted directed graph G =

(V , E), where V = 1, . . . , n is a finite set of vertices and E ⊂ V × V is a set of

edges. Each vertex i ∈ V represents a Markov decision process. An edge (i, j) ∈ E

if the MDP at vertex i directly affects the MDP at vertex j. Associated with each

edge (i, j) ∈ E is a non-negative integer weight, Mij, which specifies the delay for the

dynamics of vertex i to propagate to vertex j. We assume without loss of generality

that (i, i) /∈ E .

Associated with each j ∈ V, let Paj be the set of all vertices with an incoming

edge to vertex j, specifically

Paj = i ∈ V | (i, j) ∈ E .

Similarly, for each j ∈ V, let Chj be the set of all vertices connected by an edge

outgoing from vertex j, specifically

Chj = i ∈ V | (j, i) ∈ E .


Thus, Paj is the set of vertices that affect the system at node j and Chj is the set of

vertices that are affected by the system at node j. At each time t, the state of the

MDP at vertex i belongs to a finite set X i. The decision or the control action taken

at vertex i is drawn out of a finite set U i.

Remark. In the remainder of this section, we denote X−i =∏

j∈Pai X j. Also de-

note X (n) =∏n

i=1Xi as the cartesian product of state space corresponding to all

vertices. Similarly, let U (n) =∏n

i=1 Ui.

Definition 9. A networked Markov decision process is a tuple (A, g) where

1. A is a set of transition matrices Ait, t ≥ 0 | i ∈ V with Ai

0 : X i → [0, 1] for

all i ∈ V, such that for all z ∈ X i, we have

Ai0 (z) ≥ 0 and

∑

z

Ai0 (z) = 1.

For t ≥ 0, we have At : X i × X i × X−i × U i → [0, 1] such that, for all i ∈ V

and for all a ∈ U i and z ∈ X−i we have

Ait(z1, z2, z, a) ≥ 0, ∀ z1, z2 ∈ X

i,∑

z1

Ait(z1, z2, z, a) = 1, ∀ z2 ∈ X

i.

2. g is a sequence g0, g1, . . . with gt : X (n) × U (n) → [0, 1].

As an example of a networked Markov decision process, consider a networked

system consisting of four subsystems as shown in Figure 2.1. The system dynamics

are

xit+1 = f i

(xi

t, xjt−Mji

| j ∈ Pai, uit, w

it

), (2.8)

for all i ∈ V. Here uit ∈ U

i is the control action applied to subsystem i at time t. The

random variables xi0, w

it for t ≥ 0 and i ∈ V are independent, i.e., the noise processes

are independent across both time and subsystems. The directed graph corresponding

to this networked MDP is shown in Figure 2.2.


S4 S3

S2

S1

Controller

M43

M34

M23M42

M21 M12N4 N3

N2

N1

Figure 2.1: A network of interconnected subsystems with delays. Subsystem i isdenoted by Si, the network propagation delay from Si to Sj is denoted by Mij andthe measurement delay from Si to the controller is denoted by Ni.

Associated with this system is a networked MDP (A, g) as defined below. For p ∈

X i, let Ai0(p) = Prob(xi

0 = p) be the probability mass functions of the initial state of

subsystem i ∈ V. The initial states x10, . . . , x

n0 are chosen independently. For t > 0,

let

Ait(z, p, q, a) = Prob

(

xit = z | xi

t−1 = p, xjt−1−Mji

= qj | j ∈ Pai, uit−1 = a

)

, (2.9)

be the conditional probability mass function of state xit given the previous states xi

t−1

and xjt−1−Mji

| j ∈ Pai and the applied input uit−1. It is easy to verify that the

sequence A satisfies the properties in Definition 9. The sequence gt(xt, ut) represents

the cost at time t and depends on the state of the system xt = (x1t , . . . , x

nt ) as well as

the action ut = (u1t , . . . , u

nt ) applied at time t.

In a networked MDP, the controller needs to choose a control action corresponding

to each vertex i ∈ V. The actions are chosen based on the information available to


4 3

2

1

Figure 2.2: Directed graph for the network of Figure 2.1.

the controller at time t. Associated with each vertex i ∈ V of a networked MDP,

we have a non-negative integer Ni which specifies the delay in receiving the state

measurement from system i. We define hn-mdpt to be the information available to the

decision-maker at time t, given by

hn-mdpt =

(x1

0:t−N1, u1

0:t−1, . . . , xn0:t−Nn

, un0:t−1

).

Also define in-mdpt to be a realization of hn-mdp

t as

in-mdpt =

(z10:t−N1

, a10:t−1, . . . , z

n0:t−Nn

, an0:t−1

).

Thus, the observations received by the decision-maker at time t consist of the state

of the subsystem i delayed by Ni time steps. A networked MDP policy specifies the

decisions taken at time t.

Definition 10 (Networked-MDP Policy). A networked MDP policy is a sequence

K = (K0, K1, . . . ) where

K0 : U (n) ×n∏

i=1

(X i)1−Ni → [0, 1]

and

Kt : U (n) ×n∏

i=1

(X i)t+1−Ni ×

n∏

i=1

(U i)t→ [0, 1],


for all t ∈ Z++ such that

K0(a, z) ≥ 0 ∀a ∈ U (n), ∀z ∈n∏

i=1

(X i)1−Ni ,

∑

a

K0(a, z) = 1 ∀z ∈n∏

i=1

(X i)1−Ni ,

and for all t ∈ Z++, for all a1 ∈ U

(n), z ∈∏n

i=1 (X i)t+1−Ni and a2 ∈

∏ni=1 (U i)

twe

have

Kt (a1, z, a2) ≥ 0,∑

a1

Kt (a1, z, a2) = 1.

Note that for all times t, the product∏n

i=1 (X i)t+1−Ni in the above definition is

taken over those i for which t+ 1−Ni is strictly positive. For the networked systems

as given in equation (2.8), a general mixed control policy is defined as a sequence of

transition matrices Kt, t ≥ 0 given by

Kt(at, it) = Prob(ut = at | hn-mdpt = it).

2.3.1 Networked MDP as a POMDP

In networked MDPs, although the controller receives state information from the sub-

systems, these states are delayed by different amounts. Thus, a networked MDP can

be written as a POMDP. Consider a networked MDP as given in Definition 9. Let

us define a new state xt =xi

t−b′:t | i ∈ V, where we choose b′ = maxi,j∈V Mij +

maxi∈V Ni. The state x is chosen such that in the resulting system the observation at

time t is only a function of the current state at time t. It is easy to check that there

exists a function f such that

xt+1 = f (xt, ut, wt) .


Associated with this function is a transition probability mass function At (zt+1, zt, at),

where zt is the realization of the state xt. The observation at any time t is given as

yt = h(xt).

Corresponding to this observation process is a probability mass function Ct (st, zt),

where st is the realization of the observation yt and is given as

st =zi

t−Ni| i ∈ V

.

The cost function is given as

gt (xt, ut) = gt (xt, ut) . (2.10)

It is easy to check that the functions At, Ct and gt satisfy the properties given in

Definition 4. The networked MDP can thus be written as a POMDP(A, C, g

).

As shown in the above subsection, we can write any networked MDP as a POMDP.

In the next chapter, we compute the sufficient information state (as defined in Defi-

nition 7) for networked MDPs.

Chapter 3

Information State for Networked

MDPs

In this chapter, we establish the main result associated with the information state

for networked MDPs. This result establishes that the sufficient information state

for networked Markov decision processes consists only of a finite number of past

observations. As we will see, these finite numbers or bands depend only the network

structure and the associated delays. We begin by making the following definitions.

Definition 11. Let

di = maxNi, maxk∈Pai

(Nk −Mki − 1) (3.1)

and define the integers bi by

bi = maxdi, maxk∈Chi

(dk + Mik) −Ni (3.2)

Remark. In the remainder of this chapter, we use the following additional notation.

We define a new function Pt for t ≥ 0 by

Pt = A10:tA

20:t . . . A

n0:t.

26

CHAPTER 3. INFORMATION STATE FOR NETWORKED MDPS 27

Define

αt = zi0:t−Ni

, ai0:t−1 | i ∈ V,

βt = zit−Ni−bi:t−Ni

, ait−di:t

| i ∈ V.

Furthermore, the notation z /∈ αt means the set

z | z /∈ αt = zit−Ni+1:t | i ∈ V,

and the notation z /∈ βt and a /∈ βt mean the sets

z | z /∈ βt = zi0:t−Ni−bi−1 | i ∈ V.

a | a /∈ βt = ai0:t−di−1 | i ∈ V.

Recall that any list of variables xt1:t2 with t2 < t1 is interpreted as empty.

The following theorem is the main result for networked MDPs. It defines a suf-

ficient information state for a networked Markov decision process. It shows that a

networked MDP can be converted into a fully observable MDP with a state that is

bounded and does not grow with time. Note that a networked MDP can be written

as a POMDP(A, C, g

), with state x.

Theorem 12. Consider a networked Markov decision process. Then,

ξt =ui

t−di:t−1, xit−Ni−bi:t−Ni

| i ∈ V

. (3.3)

is a sufficient information state for the networked MDP.

To prove this theorem, we check the conditions of a sufficient information state

as given in Definition 7. The following key lemma shows that ξt as defined in equa-

tion (3.3) satisfies the first condition of a sufficient information state as given in

equation (2.5).

Lemma 13. Consider a networked Markov decision process (A, g) and a networked


MDP policy K. Define

At+1(qt+1, qt, at) , Prob(ξt+1 = qt+1 | ξt = qt, ut = at

).

Then, ξt satisfies the following Markov property

At+1 = Prob(ξt+1 = qt+1 | ξ0:t = q0:t, u0:t = a0:t

),

and A is independent of the policy K.

Proof. Using Bayes’ rule, we can write

L = Prob(ξt+1 = qt+1 | ξ0:t = q0:t, u0:t = a0:t

)=

Prob(ξ0:t+1 = q0:t+1, u0:t = a0:t

)

Prob (ξ0:t, = q0:tu0:t = a0:t).

(3.4)

Note that the above conditional probability is well defined when the sequence of

random variables (ξ0:t = q0:t, u0:t = a0:t) has a non-zero probability. For the event,

where this sequence has zero probability, we define the conditional probability to be

zero. Also note that the sequence ξ0:t consists of the variables xi0:t−Ni

, ui0:t−1 | i ∈ V.

Let us denote the denominator of equation (3.4) by Lden. Then,

Lden =∑

z /∈αt

PtK0:t, (3.5)

where we have used the definition of ξ0:t and the notation that Pt = A10:t . . . A

n0:t. Note

that the transition kernel Ait has arguments

zit, z

it−1, a

it−1, z

kt−1−Mki

| k ∈ Pai.

We first show that some of the Ait’s are independent of the variables being summed

over. Consider an arbitrary s ≥ 0, and suppose Ait−s depends upon at least one


of z1t−N1+1:t, . . . , z

nt−Nn+1:t. Then, we must have

t−Ni + 1 ≤ t− s or

t−Ni + 1 ≤ t− s− 1 or

t−Nk + 1 ≤ t− s− 1−Mki for some k ∈ Pai

where each inequality arises from the corresponding argument of Ait−s. This implies

that

s ≤ Ni − 1 or s ≤ maxNk − 1−Mki | k ∈ Pai − 1.

Hence for each i, the largest such s is exactly equal to di−1 where di is defined by equa-

tion (3.1). Thus if s ≥ di then Ait−s does not depend on any of z1

t−N1+1:t, . . . , znt−Nn+1:t.

In other words, Ai0:t−di

are independent of all the variables of summation. Further-

more, note that K0:t only depend on the variables in αt and hence are independent

of the variables of summation. Thus, we can write the denominator of equation (3.4)

as

Lden = A10:t−d1

. . . An0:t−dn

K0:t

∑

z /∈αt

A1t−d1+1:t . . . A

nt−dn+1:t. (3.6)

Let us denote the numerator of equation (3.4) as Lnum. Then,

Lnum =∑

z /∈αt+1

Pt+1K0:t. (3.7)

Following the same argument as above, it is easy to verify that if s ≥ di − 1, then

Ait−s does not depend on any of z1

t−N1+2:t+1, . . . , znt−Nn+2:t+1. Thus, Ai

0:t−di+1 are inde-

pendent of the variables of summation of Lnum. We can thus write Lnum as

Lnum = A10:t−d1

. . . An0:t−dn

K0:t

∑

z /∈αt+1

A1t−d1+1:t+1 . . . An

t−dn+1:t+1.


Canceling the common factors from the numerator and denominator gives

L =

∑

z /∈αt+1A1

t−d1+1:t+1 . . . Ant−dn+1:t+1

∑

z /∈αtA1

t−d1+1:t . . . Ant−dn+1:t

. (3.8)

Using Bayes’ rule, we can write

R = Prob(ξt+1 = qt+1 | ξt = qt, ut = at

)=

Prob(ξt+1 = qt+1, ξt = qt, ut = at

)

Prob (ξt = qt, ut = at). (3.9)

As before, if the denominator is zero, we define the conditional probability to be zero.

In that particular case, the lemma is trivially true. Let Rden denote the denominator

of equation (3.9). Using the definition of ξt, we can write the denominator as,

Rden =∑

a/∈βt

∑

z /∈βt

∑

z /∈αt

PtK0:t.

As before Ait−di

and K0:t are independent of the variables of summation z /∈ αt and

hence we can write Rden as

Rden =∑

a/∈βt

∑

z /∈βt

A10:t−d1

. . . An0:t−dn

K0:t ×∑

z /∈αt

A1t−d1+1:t . . . A

nt−dn+1:t

︸︷︷︸

Rden

.

Let us determine explicitly what variables Rden depends on. For notational conve-

nience, let us denote

T = A1t−d1+1:t . . . A

nt−dn+1:t.

If T depends on zis then we must have

t− di ≤ s or

t− dk −Mik ≤ s for some k ∈ Chi.

The first inequality holds if zis occurs in Ai

t−di+1···t and the second holds if it occurs


in Akt−dk+1···t. If Rden depends on on zi

t−Ni−r then,

t− di ≤ t−Ni − r or

t− dk −Mik ≤ t−Ni − r for some k ∈ Chi,

and these conditions imply that

r ≤ di −Ni or

r ≤ maxdk + Mik | k ∈ Chi −Ni.

Using the definition of bi in equation (3.2), these two inequalities imply that r ≤ bi.

Thus Rden depends on ait−di:t−1 | ∈ V and zi

t−Ni−bi:t−Ni| i ∈ V and hence is

independent of variables a /∈ βt and z /∈ βt. Thus, we can write

Rden =

∑

a/∈βt

∑

z /∈βt

A10:t−d1

. . . An0:t−dn

K0:t

(∑

z /∈αt

A1t−d1+1:t . . . A

nt−dn+1:t

)

. (3.10)

Let Rnum denote the numerator of the equation (3.9). Then,

Rnum =∑

a/∈βt

∑

z /∈βt

∑

z /∈αt+1

Pt+1K0:t.

Using the same argument as above we can write the numerator as

Rnum =

∑

a/∈βt

∑

z /∈βt

A10:t−d1

. . . An0:t−dn

K0:t

∑

z /∈αt+1

A1t−d1+1:t+1 . . . An

t−dn+1:t+1

.

(3.11)

From equation (3.10) and equation (3.11) we have

R =

∑

z /∈αt+1A1

t−d1+1:t+1 . . . Ant−dn+1:t+1

∑

z /∈αtA1

t−d1+1:t . . . Ant−dn+1:t

. (3.12)

The result follows from equations (3.8) and (3.12).


The next lemma evaluates the cost function gt for the induced MDP and shows

that it is independent of the POMDP policy.

Lemma 14. The cost function as defined in equation (2.6) is independent of the

POMDP policy K.

Proof. From equation (2.6), we have that

gt (qt, at) = E (gt (xt, at) | ξt = qt, ut = at) ,

Using the definition of gt from equation (2.10), we get that

gt (qt, at) = E (gt (xt, at) | ξt = qt, ut = at) ,

=∑

zt

gt(zt, at)Prob (zt, qt, at)

Prob (qt, at).

Using the definition of ξt we get that

Prob (qt, at) =∑

a/∈βt

∑

z /∈βt

∑

z /∈αt

PtK0:t.

Thus, we get that gt is given as

gt (qt, at) =

∑

a/∈βt

∑

z /∈βt

∑

z /∈αtgt(zt, at)PtK0:t

∑

a/∈βt

∑

z /∈βt

∑

z /∈αtPtK0:t

,

=

∑

z /∈αtgt(zt, at)A

1t−d1+1:t . . . A

nt−dn+1:t

∑

z /∈αtA1

t−d1+1:t . . . Ant−dn+1:t

.

where the last equality follows from a similar argument as given for equation (3.10).

Thus the cost function is independent of the POMDP policy K.

The following lemma shows that the conditional probability density function for

the state at time t is the same for the induced MDP and the original POMDP.


Lemma 15. For all t ≥ 0, we have

Prob (xt = zt | ξ0:t = q0:t, u0:t−1 = a0:t−1) =

Prob (xt = zt | y0:t = s0:t, u0:t−1 = a0:t−1) , (3.13)

where we have used the notation γt (s0:t, a0:t−1) = qt.

Proof. Note that the sequence ξ0:t consists of the variables xi0:t−Ni

, ui0:t−1 | i ∈ V.

Also from section 2.3.1, we know that yt =xi

t−Ni| i ∈ V

. The lemma follows

trivially from these two facts.

Proof of Theorem 12. From Lemmas 13, 14, and 15, we get that ξt as defined in

equation (3.3) is a sufficient information state for a networked MDP.

S4 S3

S2

S1

Controller

M43

M34

M23M42

M21 M12N4 N3

N2N1

P4 P3

P2 P1

Figure 3.1: A networked Markov decision process with action delays. The controlaction delay to subsystem Si is denoted by Pi.


3.1 Networked MDP with Action Delays

In this section, we extend our result to the case where the control action does not

take effect immediately. Consider a networked Markov decision process as shown in

Figure 3.1. The system dynamics are

xit+1 = f i

(xi

t, xjt−Mji

| j ∈ Pai, uit−Pi

, wit

),

for all i ∈ V. Here uit−Pi

is the control action applied to subsystem i at time t− Pi.

To obtain a sufficient information state for a networked MDP with action delays,

we convert this system into a networked MDP with no action delays. To do this, let

us define a new state xit = (xi

t, uit−Pi:t−1) for all i ∈ V. As before, if any Pi = 0, we

interpret the list uit−Pi:t−1 as empty and thus xi

t = xit. This new state is chosen such

that the state evolution of each subsystem at time t + 1 depends on the current state

and action at time t. Thus, a networked MDP with action delays can be reformulated

as a networked MDP with no action delays with system dynamics given as

xit+1 = f i

(xi

t, xjt−Mji

| j ∈ Pai, uit, w

it

),

for all i ∈ V. Using Theorem 12, we know that a sufficient information state for this

new system consists of past states xit−bi−Ni:t−Ni

and past control actions uit−di:t−1 for

all i ∈ V. Let us define a new band di as

di =

di if Pi = 0

bi + Ni + Pi otherwise(3.14)

Using this definition, it is easy to check that a sufficient information state for a net-

worked MDP with action delays consists of past states xit−bi−Ni:t−Ni

and past control

actions uit−di:t−1

for all i ∈ V. This gives us the following theorem.

Theorem 16. Consider a networked Markov decision process with action delays.


Then,

ξt =

uit−di:t−1

, xit−Ni−bi:t−Ni

| i ∈ V

.

is a sufficient information state for a networked MDP with action delays.

3.2 Discussion

From Theorem 12, we note that every networked MDP has a sufficient information

state ξt given by equation (3.3), which depends on only the finite history of the states

and control actions. Thus, from Definition 7 we have that, associated with every

networked MDP is a tuple (A, g) where At is the transition matrix given by

At+1(qt+1, qt, at) = Prob(ξt+1 = qt+1 | ξt = qt, ut = at

),

and gt is the cost function associated with this new MDP. The cost function is given

by equation (2.6). From, the Theorem 8, we note that an optimal controller for the

original POMDP can be found by considering the associated sufficient information

state MDP. An optimal controller can be found using dynamic programming [48, 20]

over the state space Q generated by ξt. This holds for both finite horizon as well as

infinite horizon average cost as well as infinite horizon discounted cost models. In the

next subsection we show that the previously known results on single systems with

delayed state [8] can be obtained as a special case of our main result.

3.2.1 Single System with Delayed State Observations and

Action Delays

We consider control of a single system with a delayed state measurement. This is

precisely the information pattern considered in [8], and we show that in this case the

above results imply those of [8]. We have dynamics

xt+1 = f(xt, ut, wt)


where, since the system is composed of exactly one subsystem, we have x1t = xt.

The controller must choose ut at time t, when it has access to u0, . . . , ut−1 and

x0, x1, . . . , xt−N1 . Then, from Definition 11 we have

d1 = N1 and b1 = 0

and so the optimal control action ut is a memoryless function of ut−N1 , . . . , ut−1

and xt−N1 . Thus, the optimal controller applied at time t is a function of the last

observed state and the previous N1 actions, which is exactly the result of [8].

A single system with both observation and action delays was analyzed in [39].

Consider a single system with both delayed state measurement of N1 steps and a

delay of P1 steps in control action. From Theorem 16, we know that the control

action at time t is a function of ut−N1−P1 , . . . , ut−1 and the state xt−N1 , which is

exactly the result obtained in [39].

S2S1

Controller

M12

M21

N2N1

Figure 3.2: A network of two interconnected subsystems with delays. Here the controlinput is only applied to subsystem 1.

3.3 Numerical Examples

In this section we consider two numerical examples where we compute the optimal

controller for networked Markov decision processes. In the first example, we study

linear scalar systems with delays. For a special class of such systems, one can compute

controllers using an approach based on the Youla parametrization, in combination

with convex optimization, as in [24]. We observe that for a certain class of systems, the

optimal controller has exactly the same amount of past history as given in Theorem 12.

This shows that bands computed in the main theorem are tight in the sense that there


are systems where using any less amount of information would yield sub-optimal

controllers. As a second example, we study controller design for two interacting

queues. Using the knowledge of the bands as computed from Theorem 12, we use

dynamic programming to explicitly compute the optimal controller. The knowledge

of the bands allows us to greatly simplify the computation of the optimal controller.

3.3.1 Linear Systems with Delays

As a first example, we compute an optimal controller for the special case of a linear

scalar system with delays. For simplicity, we consider a two system case as shown

in Figure 3.2. Note that the control action is only applied to subsystem 1. For this

system, the controller is only required to store bi + 1 values of the state of system i

and d1 values of the past inputs, where

b1 = max0, N2 + M12 −N1,

b2 = max0, N1 + M21 −N2,

d1 = maxN1, N2 −M21 − 1.

(3.15)

The system dynamics are given by

x1t+1 = f 1(x1

t , ut, x2t−M21

, w1t ),

x2t+1 = f 2(x2

t , x1t−M12

, w2t ).

(3.16)

The information available to the controller at time t is given as

yt = (a0:t−1, z10:t−N1

, z20:t−N2

)

The system under consideration has a continuous state space, and the results pre-

sented above may be extended to this scenario under appropriate technical assump-

tions on the probability measures. Specifically, we consider system dynamics which


are a special case of those in equation (3.16) given by

x1t+1 = x1

t + 0.25x2t−2 + ut + w1

t

x2t+1 = 0.25x1

t−2 + x2t + w2

t

The noise processes w1t and w2

t are zero mean, unit variance white Gaussian noise

processes. The initial states x10 and x2

0 are independent of each other and are normally

distributed with variance 10−5. The objective is to minimize the cost

J = E

((T−1∑

t=0

(‖xt‖2 + ‖ut‖

2)

)

+ ‖xT‖2

)

,

which is a standard quadratic cost. We will use a time horizon of T = 10. The

propagation and measurement delays are

M12 = 2, M21 = 2, and N1 = 0, N2 = 1

so that the controller receives the observations from subsystem 2 after a single time-

step delay. For this system, equation (3.15) gives the memory requirements of the

optimal controller as

b1 = 3, b2 = 1 and d1 = 0.

Therefore at each time t the optimal input ut is given by a memoryless function of

ymemt , that is of the data x1

t−3, x1t−2, x

1t−1, x

1t , x

2t−2, x

2t−1.

To compute the optimal controller for this problem, we use an approach based on

the Youla parametrization, in combination with convex optimization, as in [24]. A

similar approach is used in [49] to compute optimal decentralized controllers. The

optimal controller for this problem is

u0

u1

...

u9

= −F

x10

x11...

x19

−G

x20

x21...

x29


where F and G are given by

F ≈1

10

8.2

1.2 8.1

1.1 1.1 7.9

1.1 1 1 7.7

0.9 0.8 0.8 7.4

0.6 0.5 0.5 6.9

0.3 0.3 0.2 6.5

6.2

6

5

G ≈1

10

0

0 5.8

1.7 5.4

1.7 4.8

1.6 4

1.6 3

1.6 1.8

1.5 0.8

1.5 0.5

1.2 0

Hence we have

µ(ymemt ) = −

T−1∑

s=0

Ftsx1s −

T−1∑

s=0

Gtsx2s.

It is apparent from the above matrices that the control input at time t depends only on

the past history of x according to the memory limits b1, b2 and d1 as in Equations (3.1)

and (3.2).


M12M21

S2S1

Figure 3.3: A system of two interacting queues. Here the solid line represent jobs oftype R which enter system 1, and are then transported to system 2 after a delay ofM12. Similarly, the dashed line represents jobs of type B which enter system 2 andare transported to system 1 after a delay of M21. Of the two queues at each system,the top queue is the high-priority queue.

3.3.2 Controller Design for Finite State Systems

We consider a network of two interconnected queues as shown in Figure 3.3. Our

example is inspired by the model of interacting queues studied in [30]; however our

objective here is to illustrate the computation of an optimal controller for such sys-

tems. As opposed to previous works, we introduce delays between queues as well as

delays in receiving queue state information at a centralized controller. We however

assume that any control inputs have immediate effects.

Informally, the system description is as follows. Jobs of type R arrive at system 1

while jobs of type B arrive at system 2. The arrival process at each system is in-

dependent and identically distributed over time. Furthermore, we assume that the

arrival process at the two systems are independent of each other. At each system,

the server maintains two kinds of queues, the high priority queue and the low priority

queue. Jobs of type R are placed in the high priority queue at system 1, where they

are processed and are moved to system 2 after a delay of M12 time units. At system

2, these jobs are placed in the low priority queue. On the other hand, jobs of type B

enter the high priority queue at system 2 and after being processed at system 2, they

are moved to system 1 after a delay of M21 time units. At system 1, these jobs are

placed in the low priority queue. At each system, if the queue is full the incoming

jobs are dropped.


The server at each system has two modes of operation, a slow and a fast mode.

In the slow mode, in each time unit, the server serves one job from the high priority

queue (provided the queue is non-empty). In the fast mode, as long as the queues

are non-empty, the server serves one job from each of the high and the low priority

queues. After being processed, the high priority jobs are moved to the other system,

while the low priority jobs exit the system. A centralized controller receives delayed

information about the total number of jobs in each queue, and decides what mode

each server should operate at. At each time step, a cost depending on the number of

jobs in each queue and the mode of operation of each server is incurred.

To describe the above system mathematically, we let xiR(t) be the number of R

jobs in queue i at time t. Similarly, we let xiB(t) be the number of B jobs in queue i

at time t. Here both xiR(t) and xi

R(t) are in the set 0, 1, 2, . . . ,Q, where Q is the

queue length at each system. For simplicity, we assume that all the queues are of same

length. The control action is ui(t) ∈ 0, 1, for i = 1, 2. Here ui(t) = 0 represents the

slow mode of the server. The system dynamics are

x1R(t + 1) = maxminx1

R(t) + w1(t),Q − 1, 0

x1B(t + 1) = maxminx1

B(t) + 1x2B

(t−M21)>0,Q − u1(t), 0

x2B(t + 1) = maxminx2

B(t) + w2(t),Q − 1, 0

x2R(t + 1) = maxminx2

R(t) + 1x1R

(t−M12)>0,Q − u2(t), 0

where 1x>0 is the indicator function. Here wit, i = 1, 2 are the number of packets that

arrive at each queue at time t. We let the state of each system be the number of R

and B packets at each time, i.e., xi(t) =(xi

R(t), xiB(t)

)for i ∈ 1, 2. It is easy to

check that there exist functions f 1 and f 2 such that the state dynamics are given by

x1(t + 1) = f 1(x1(t), x2(t−M21), u

1(t), w1(t)),

x2(t + 1) = f 1(x2(t), x1(t−M12), u

2(t), w2(t))

Let gs(x1(t), x2(t)) be the cost associated with the state and ga(u

1(t), u2(t)) be


the cost associated with the actions. We assume that the state cost is

gs(x1(t), x2(t)) =

(x1

R(t) + x1B(t) + x2

R(t) + x2B(t)

)2.

The action cost is

ga(u1(t), u2(t)) =

(u1(t) + 1 + u2(t) + 1

)2,

where we assume that for ui(t) = 0, the cost incurred is 1 unit. The total cost at

time t is thus

g(x1(t), x2(t), a1(t), a2(t)

)= (1− α)gs(x

1(t), x2(t)) + αga(u1(t), u2(t)),

where α is the weighting factor. The objective is to minimize the infinite horizon

discounted cost

J = E

(∞∑

t=0

βtg(

x1(t), x2(t), u1(t), u2(t)))

,

= E

(∞∑

t=0

βt(

(1− α)gs + αga

)

)

,

= (1− α)Js + αJa

where β is the discount factor. Here Js and Ja are the infinite horizon discounted

costs associated with the state and action.

For purposes of numerical computation, we let Q = 1 in our specific example. The

state space consists of (0, 0), (0, 1), (1, 0), (1, 1), where the first element in the tuple

represents the number of R jobs. The arrival process at both systems is assumed to be

Bernoulli with the probability of arrival at the first system given by Prob(w1(t) = 1) =

0.1 and the probability of arrival at the second system given by Prob(w2(t) = 1) = 0.3.

The inter subsystem propagation delays and observation delays are chosen to be

M12 = 2, M21 = 1, and N1 = 2, N2 = 1


0 1 2 3 4 5 6 7 8 915

16

17

18

19

20

21

22

23

24

25

Js

Ja

Figure 3.4: Infinite horizon discounted action cost Ja (averaged over all initial states)vs. the infinite horizon discounted state cost Js. The curve is plotted by varying theweighting factor α.

Using equations (3.1) and (3.2), we find

b1 = 1, b2 = 2 and d1 = 2, d2 = 1.

For the discount factor β = 0.75, Figure 3.4, shows the tradeoff curve for Ja vs. Js.

This curve shows the trade off between the action cost and the state cost for different

values of the weighting factor α.

This section illustrates that the knowledge of the bands simplifies the computation

of the optimal controller. Without the knowledge of these bands, one would compute

the optimal controller by treating this networked MDP as a POMDP and using dy-

namic programming over the belief state. The knowledge of these bands allows us to

write this networked MDP as a fully observed MDP over the sufficient information

state and greatly simplifies the computation of the optimal controller.

As shown in Theorem 12, the sufficient information state for networked MDPs

depends only on the finite past history of the observations. This finite history or bands

depend only on the network structure and the associated delays. In the next chapter,


we look at a special case of networked MDPs over a finite time horizon. Based on the

ideas from Bayesian networks, we provide an alternate proof of Theorem 12. This

alternate proof provides an intuitive explanation for the bands given in Definition 11.

In particular, it shows that the finiteness of the bands occurs because given the finite

history of states and actions, the current state of the system is independent of the

remaining states and actions.

Chapter 4

A Bayesian Network Approach to

Network MDPs

In this chapter, we restrict our attention to networked MDPs over a finite time hori-

zon. For such a special class of networked MDPs, we provide an alternate proof of

Theorem 12 based on the ideas from Bayesian networks. We show that the finite

history of states and actions that was obtained in the previous chapter is exactly

the same as the information required to estimate the current state of the system.

This, along with the separation principle, provides an alternate proof and additional

insights into the finite memory of the controllers for networked MDPs. It shows that

the finiteness of the bands occurs because given the finite history of states and actions,

the current state of the system is independent of the remaining states and actions.

We begin by describing concepts from Bayesian networks.

4.1 Bayesian Networks

A Bayesian network [37], Nb = Gb,Pb consists of

• A directed acyclic graph Gb = (Vb, Eb), and

• A set of conditional probability distributions Pb.

45

CHAPTER 4. A BAYESIAN NETWORK APPROACH TO NETWORK MDPS46

Here the subscript b stands for Bayesian and is used to distinguish the Bayesian

network graph from the networked MDP graph G as defined in the previous section.

Associated with each vertex v ∈ Vb of the graph Gb, is a random variable Xv taking

values in a particular set. A directed edge e ∈ Eb between vertices describes the

conditional dependence between the random variables corresponding to the vertices.

If there is a directed edge from a vertex v1 to v2, we say that v2 is a child of v1

and that v1 is a parent of v2. The set of parent vertices of a vertex v is denoted by

parent(v).

The set of probability distributions Pb contains one distribution P(Xv|Xparent(v)

)

for every v ∈ Vb. The joint distribution of all the variables Xk, k = 1, . . . , n is given

as

Prob(X1, . . . Xn

)=

n∏

k=1

Prob(Xk | parents(Xk)

)

An example of a Bayesian network is shown in Figure 4.1. Here the graph Gb consists

of vertices A,B,C,D,E, F and edges A→ C,B → C,C → D,C → E,D → F.

The set of probabilities is given as

Pb = P (A), P (B), P (C|A,B), P (D|C), P (E|C), P (F |D).

Note that since the variables A and B have no parents, the probability set contains

their unconditional probabilities.

A B

C

D E

F

Figure 4.1: A Bayesian network with 6 variables


d-Separation. As mentioned before, the graph Gb encodes the conditional de-

pendencies between the variables. Conditional independence between variables is de-

termined by the property of d-separation. If two variables X and Y are d-separated

in the graph by a third variable Z, then the variables X and Y are conditionally

independent given the variable Z.

Definition 17. A path π in the graph Gb = Vb, Eb is said to be d-separated by a set

of nodes Z ∈ Vb if and only if one of the following holds

• π contains a chain i→ z → j such that i, j ∈ π and z ∈ Z,

• π contains a fork i← z → j such that i, j ∈ π and z ∈ Z and

• π contains an inverted fork (or a collider) i → z ← j such that i, j ∈ π and

neither z nor any of its children are in Z.

The concept of d-separation is closely tied to that of a Markov blanket. Before

we define the Markov blanket, we introduce some notation.

Remark: Consider a set of variables X = X1, . . . , Xn. Denote P(X) to be

the set consisting of all parents of variables in the set X, not including the variables

themselves. Similarly, we denote CH(X) (and PCH(X)) to be the set consisting of

all children (parents of children) of variables in the set X, not including the variables

themselves.

Definition 18 (Markov Blanket). The Markov blanket of a set of variables X =

X1, . . . , Xn

(denoted by MB(X)) is given as

MB(X) = P(X) ∪ CH(X) ∪ PCH(X) (4.1)

The following theorem (see [37] for the proof) states that the variables in the set

X are independent of the rest of the graph given its Markov blanket.

Theorem 19. Given a finite Bayesian network and two distinct variables X and

Y /∈ MB(X), we have

Prob(X|MB(X), Y

)= Prob

(X|MB(X)

)


The Markov blanket of the set of variables shields the variables from the rest of

the graph. Thus, the Markov blanket is the only knowledge required to predict the

value of the variables. Furthermore, if all the variables in a Markov blanket of X are

known, then X is d-separated from the rest of the graph [37].

4.2 Networked MDPs as Bayesian Networks

In this section, we model networked Markov decision processes as Bayesian networks

in a natural way. Consider a networked MDP given by a graph G = V , E, where

we let V = 1, . . . , n. As before, for each i ∈ V. we have xit ∈ X

i. For the remainder

of this chapter, we would consider the evolution of the networked MDP over a finite

horizon T . Associated with this networked MDP, we can construct a finite Bayesian

network Nb = Gb,Pb. The vertex set Vb is given as

Vb =vstate

i,t | i ∈ V, t = 0, 1, . . . , T ⋃

vactioni,t | i ∈ V, t = 0, 1, . . . , T − 1

.

Associated with a vertex vstatei,t is the random variable xi

t, taking values in the finite

set X i, that corresponds to the state of subsystem i at time t. Similarly, associated

with a vertex vactioni,t is the random variable ui

t, taking values in the finite set U i, that

corresponds to the control action applied to subsystem i at time t. The edge set Eb

consists of the following edges.

Eb =vstate

i,t → vstatei,t+1, v

statej,t−Mji

→ vstatei,t+1, v

actioni,t → vstate

i,t+1, vstatei,0:t−Ni

→ vactionk,t ,

vactioni,0:t−1 → vaction

k,t | j ∈ I i, i, k ∈ V, t ∈ N.

Here vstatei,0:t−Ni

→ vactionk,t is interpreted as a directed edge between vstate

i,τ → vactionk,t for

every τ = 0, . . . , t − Ni. An edge vstatej,t−Mji

→ vstatei,t+1 means that the random variable

xjt−Mji

affects the random variable xit+1. Similar interpretations exist for other edges

in the edge set Eb. The set of conditional probability densities Pb consists of all the


transition probabilities, that is

Pb =Ai

t | i ∈ V, t = 0, . . . , T∪Kt | t = 0, . . . , T − 1

For a finite time horizon T , let ST be the set of random variables given as

ST =xi

t | i ∈ V, t = 0, 1, . . . , T ⋃

uit | i ∈ V, t = 0, 1, . . . , T − 1

The joint probability density function of all the variables in the set ST can then be

written as

Prob (ST ) = A10:T A2

0:T . . . An0:T K0:T−1

S2S1

Controller

M12

M21

N2N1

Figure 4.2: A network of two interconnected subsystems with delays. Subsystem i isdenoted by Si, the network propagation delay from Si to Sj is denoted by Mij andthe measurement delay from Si to the controller is denoted Ni.

As an example, consider the networked system of Figure 4.2. The system dynamics

equations are given as

x1t+1 = f 1(x1

t , x2t−M21

, u1t , w

1t ),

x2t+1 = f 2(x2

t , x1t−M12

, u2t , w

2t ).

(4.2)

For the purpose of this example, we choose M12 = 2 and M21 = 1. Thus, the transition

probability matrices are given as

A1t

(

z1t , z

1t−1, z

2t−2, a

1t−1

)

= Prob(

x1t = z1

t | x1t−1 = z1

t−1, x2t−2 = z2

t−2, u1t−1 = a1

t−1

)

,

(4.3)


and

A2t

(

z2t , z

2t−1, z

1t−3, a

2t−1

)

= Prob(

x2t = z2

t | x2t−1 = z2

t−1, x1t−3 = z1

t−3, u2t−1 = a2

t−1

)

,

(4.4)

Associated with this networked control system is a Bayesian network as shown

in Figure 4.3. The directed acyclic graph Gb consists of a vertex for each state of

the two systems and two control actions applied at time t. A directed edge between

two vertices v1 and v2 exists if the variable corresponding to vertex v1 affects the

variable corresponding to vertex v2. For example, a directed edge exists between the

vertex corresponding to x2t−2 and the vertex corresponding to x1

t . Similarly, a directed

edge exists between the vertex corresponding to control action u2t−1 and the vertex

corresponding to x2t . The set of probability distributions Pb consists of the transition

probabilities A1t , A2

t and Kt for all t ≥ 0.

4.3 Alternate Proof of the Information State for

Networked MDPs

In this section, we provide an alternate proof of the finiteness of the information state

for networked MDPs. We start by making the following definition.

Definition 20. Define

hmemt =

(x1

t−N1−b1:t−N1, u1

t−d1:t−1, . . . , xnt−Nn−bn:t−Nn

, unt−dn:t−1

)(4.5)

to the finite history of observations at time t and denote

imemt =

(z1

t−N1−b1:t−N1, a1

t−d1:t−1, . . . , znt−Nn−bn:t−Nn

, ant−dn:t−1

)

to be a realization of hmemt . Further define the set Hmem

t as

Hmemt =

n∏

i=1

(X i)bi+1

×n∏

i=1

(U i)di .


x1 x2u1 u2

t

t− 1

t− 2

t− 3

t− 4

t− 5

Figure 4.3: The Bayesian network associated with the 2-subsystems networked MDPof Figure 4.2. Here the circle represents the state of the two subsystems and thesquare represents the control input. For this Bayesian network, we chose M21 = 1and M12 = 2. The edges from state variables to control inputs have been omitted forvisual clarity.

From the separation principle [12], we know that the optimal control action is a

function of the belief state. We define the set of belief states at time t as follows.

Definition 21. Let Mt be a set defined as

Mt =Λt : X (n) ×Ht → [0, 1] | Λt(zt, it) ≥ 0,

∑

zt

Λt(zt, it) = 1,

where we denote X (n) =∏n

i=1Xi to be the cartesian product of the state space corre-

sponding to all vertices.

Here, Λt(zt, it) is interpreted as the conditional probability density of the current

state of the system given the entire observation history at time t. That is

Λt(zt, it) = Prob (xt = zt | ht = it)

Let Ft : Ht → Mt be an operator that maps the entire observation history at

time t to an element in Mt. That is, the operator Ft maps the observation history

to a belief state. Furthermore, let Tt : Mt → A be the operator that maps the

belief state to a control action. From the separation principle [12], we know that the


optimal control K∗t , as function of the observation history it, is given as

K∗t = Tt Ft

That is, K∗t (at, it) = Tt (at, Λt(·, it)).

To prove the main theorem, we show that for networked MDPs, there exists an

optimal controller that depends only on imemt . Let P : Ht → H

memt be the projection

operator that projects the entire observation history to a truncated history as defined

in equation (4.5). The following theorem shows that there exists an operator Fmemt :

Hmemt →Mt such that

Ft = Fmemt P

Theorem 22. For a networked Markov decision process, there exists Λ∗0, . . . Λ

∗T such

that

Λt (zt, it) = Λ∗t (zt, i

memt ) ∀ t = 0, 1, . . . T. (4.6)

Thus, there exists an optimal controller K∗0 , . . . , K

∗T−1 such that

K∗t (at, it) = Tt (at, Λ

∗t (·, imem

t ))

= Kt (at, imemt ) ∀ t = 0, 1, . . . T − 1. (4.7)

Thus, bi’s are the bounds on the length of the observation history that an optimal

estimator needs to maintain beyond it current observation.

Before we present the proof of Theorem 22, we first prove a key lemma.

Lemma 23. Suppose there exists an optimal K∗j , j = t + 1, . . . , T − 1 such that

K∗j (aj, ij) = Kj

(aj, i

memj

)

for all aj. Then

K∗t (at, it) = Kt (at, i

memt )

for all at.


Proof. From the separation principle [12], we know that

K∗t (at, it) = T (at, Λt (·, it))

Thus, to prove the lemma it suffices to show that Λt (zt, it) = Λ∗t (at, i

memt ). At time

t, the controller knows it = zi0:t−Ni

, ai0:t−1 | i ∈ V. Let

Sut =

(x1

t−N1+1:t, . . . , xnt−Nn+1:t

)

be the states that are unknown at the controller at time t. Here the superscript u is

used to indicate that these states are unknown to the controller at time t. Note that

states of subsystem i are part of Sut if and only if Ni ≥ 1. This is because if Ni = 0,

then the current state of subsystem i is known to the controller. Let

Zut =

(z1

t−N1+1:t, . . . , znt−Nn+1:t

)

be a realization of Sut . Let Lt (Zu

t , it) be the joint conditional probability of the

variables in the set Sut given it. That is,

Lt (Zut , it) = Prob

(Su

t = Zut | ht = it

).

Define

L∗t (Zu

t , imemt ) = Prob

(Su

t = Zut | h

memt = imem

t

).

If we can show that there exists L∗t such that

Lt (Zut , it) = L∗

t (Zut , imem

t ) , (4.8)


then it follows that

Λt (zt, it) =∑

zit−Ni+1:t−1|i∈V

Lt (Zut , it)

=∑

zit−Ni+1:t−1|i∈V

L∗t (Zu

t , imemt )

= Λ∗t (zt, i

memt ) (4.9)

Thus, to prove the lemma it suffices to find an L∗ satisfying equation (4.8). To prove

the existence of an L∗, we show that the Markov blanket of the set Sut consists of the

variables imemt . Theorem 19 would then prove the existence of L∗.

Note that Sut contains xj

t−τjfor τj = 0, 1, . . . , Nj − 1 and j = 1, 2, . . . , n. From

equation (4.1), we know that the Markov blanket of Sut consists of parents, children

and parents of children of the variables in the set Sut . We focus on a single variable

xjt−τj

and find its parents, its children and all the parents of its children.

To find the parents of xjt−τj

, we look at the transition probability of this variable.

From equation (2.9), we note that xjt−τj

depends on

P(xj

t−τj

)=

xjt−τj−1, u

jt−τj−1, x

st−(τj+1+Msj)

| s ∈ Ij

, (4.10)

and hence these variables are the parents of xjt−τj

.

To find the children of xjt−τj

, consider the set Oj of outgoing vertices of subsystem

j and let p ∈ Oj. Consider Apt−t′ and note that this transition probability contains

xjt−t′−1−Mjp

. Thus, xjt−τj

would be a parent of xpt−t′ for all p ∈ Oj, if t− t′− 1−Mjp =

t− τj, which gives that t′ = τj − 1−Mjp.

Note that the children of xjt−τj

also consist of all the control variables that depend

on xjt−τj

. From the assumption in the lemma, we know that the K∗t+1:T−1 are only

a function of the finite past history of states given by imem. Thus, a directed edge

exists between xjt−τj

and ut−t′ for all t′ = τj −Nj − bj : τj −Nj. Thus, the children of


xjt−τj

consists of

CH(xj

t−τj

)=

xjt−τj+1, x

pt−τj+Mjp+1 | p ∈ O

j ⋃

ukt−τj+Nj :t−τj+Nj+bj

| k ∈ V

.

(4.11)

To find the parents of children of xjt−τj

, we find the parents of the variables given in

equation (4.11). From transition probability equation (2.9), we note that the parents

of xpt−τj+Mjp+1 include

xpt−τj+Mjp

, upt−τj+Mjp

, xrt−τj+Mjp−Mrp

| r ∈ Ip

To find the parents of ukt−τj+Nj :t−τj+Nj+bj

| k ∈ V, we note that from the assump-

tion in the lemma, these control inputs only depend on imemt . Thus, the parents of

ukt−τj+Nj :t−τj+Nj+bj

k ∈ V consist of

xit−τj+Nj−bi−Ni:t−τj+Nj+bj−Ni

, uit−τj+Nj−di:t−τj+Nj+bj−1 | i ∈ V

Thus we have

PCH(xj

t−τj

)=

xst−τj−Msj

, ujt−τj

, xpt−τj+Mjp

, upt−τj+Mjp

,

xrt−τj+Mjp−Mrp

| s ∈ Ij, r ∈ Ip, p ∈ Oj ⋃

xit−τj+Nj−bi−Ni:t−τj+Nj+bj−Ni

,

uit−τj+Nj−di:t−τj+Nj+bj−1 | i ∈ V

(4.12)

Let us denote the set of parents, the children and the parents of children of

xjt−Nj+1:t by Mj. From equations (4.10), (4.11), (4.12), we get that the set Mj


contains

Mj =

xjt−Nj :t+1, x

st−(Nj+Msj):t−Msj

, xpt−(Nj−1−Mjp):t+Mjp+1, x

rt−(Nj−1−Mjp+Mrp):t−(Mrp−Mjp),

xit−Ni−bi+1:t−Ni+bj+Nj

| s ∈ Ij, p ∈ Oj, r ∈ Ip, i ∈ V ⋃

ujt−Nj :t

, ukt+1:t+Nj+bj

, upt−(Nj−1−Mjp):t+Mjp

,

uit−(di−1):t+Nj+bj−1 | p ∈ O

j, k, i ∈ V.

Let us denoteM = ∪j∈VMj. Note that ukt−sk∈M if sk ≥ Nk or sk ≥ Nj−1−Mjk

for all j ∈ Ik or sk ≥ dk − 1. From the definition 11, this implies that

sk = maxNk, dk − 1, Nj −Mjk − 1 | j ∈ Ik

= dk

Similarly, xkt−qk∈ S if and only if xk

t−qk∈ M. This happens if one of the following

conditions holds.

1. qk ≥ Nk.

2. qk ≥ Nj + Mkj such that k ∈ Ij for some j ∈ V. This happens for all j ∈ Ok.

3. qk ≥ (Nj − 1 − Mjk) such that k ∈ Oj for some j ∈ V. That is if qk =

(Nj − 1−Mjk) for all j ∈ Ik.

4. For the last term, we need to find all j ∈ V such that for all p ∈ Oj, we

have k ∈ Ip. This happens for all j ∈ Ip, such that p ∈ Ok. Thus we have

qk ≥ Nj − 1−Mjp + Mkp for all p ∈ Ok and all j ∈ Ip.

5. qk ≥ bk + Nk − 1

Thus, we get that

qk = max

Nk, Ns + Mks, Nr − 1−Mrk,

Np − 1−Mps + Mks, bk + Nk − 1 | p ∈ Is, s ∈ Ok, r ∈ Ik


Using the definition of bk and dk, it is easy to verify that qk = bk + Nk. This proves

that the Markov blanket of the variables Sut consists of only imem

t . Thus, there exists

L∗t such that equation (4.8) is satisfied. The lemma then follows from equation (4.9).

Proof of Theorem 22. To prove the main theorem, we first show that at time T−1,

the belief state is only a function of imemT−1 . To see this, note that at time T − 1, the

set of unknown states at the controller SuT has no children. Thus, using a simplified

version of the argument given in the proof of lemma 23, it is easy to verify that there

exists Λ∗T−1 such that

ΛT−1 (aT−1, iT−1) = Λ∗T−1

(aT−1, i

memT−1

).

Thus, there exists an optimal controller K∗T−1 such that

K∗T−1 (aT−1, iT−1) = T

(aT−1, Λ

∗T−1

(·, imem

T−1

))

= KT−1

(aT−1, i

memT−1

)

The proof of the theorem then follows from the inductive argument using Lemma 23.

In previous chapters, we studied networked Markov decision processes with delays

between subsystems. We showed that for networked MDPs, a sufficient information

state is a function of a finite number of past system states and the past controller

inputs. The number of past states as well as past inputs depends only on the un-

derlying graph structure of the networked Markov decision process as well as the

associated delays. We also give explicit bounds on the number of past states and

inputs required to compute an optimal control action for networked MDPs with de-

lays. We also showed that this bound has interesting connections to Markov blanket

in Bayesian networks. This allows us to look at complex networked systems from the

view point of Bayesian networks and provides additional insights into how the delays

between subsystems affects the overall controller performance.


The results of previous chapters allow us to look at complex interconnected sys-

tems that have a centralized controller or a decision maker. In several systems of

interest, the presence of a centralized decision maker is infeasible. Even if one can

envision a centralized decision maker, it might be costly for every subsystem to trans-

mit its state to the decision maker. In the next chapter, we look at a stochastic game

model of complex interacting systems. In such models, each system makes optimal

decision in a decentralized manner. We study a new notion of equilibrium in such

systems that allows us to compute decentralized policies or strategies for systems with

large number of players.

Chapter 5

A Mean Field Approach to

Studying Large Systems

In several complex systems, a large number of agents interact with each other without

the presence of centralized authority. Even in systems where a centralized authority

may be present, it might be costly for each player or subsystem to transmit its state

to the centralized authority. Imagine a wireless network where a large number of

devices are performing power control to maximize their capacity. Even if there is a

central base station, it is costly for each device to continuously update its channel

state or queue backlog for the base station to perform the power control. Thus, in such

scenarios, each agent (or player) interacts with the other agents in a decentralized

manner to achieve its own objectives. A natural framework to study such systems is

that of stochastic games. Stochastic games [51] have been used to study interactions

between players in stochastic dynamic environments. However, typically such games

can be solved with a very small number of players since the computational complexity

involved in finding optimal equilibrium policies is very large [25]. This limits their

application to models with small dimensions.

In chapters 5 – 7, we study a mean field approach to understanding systems with

large number of interacting players [38, 35, 56, 1, 2, 43, 26, 23]. Mean field theory

has been used in statistical physics to deal with the combinatoric complexity of large

interactions. The basic idea is to treat other particles or agents as a single entity

59

CHAPTER 5. A MEAN FIELD APPROACH TO STUDYING LARGE SYSTEMS60

with some average behavior. Applied to engineering problems, this greatly simplifies

decision making by a single agent: a single agent can make its decision based on

the average behavior of other agents. The equilibrium consistency condition requires

that the average behavior of agents arise from the individual trajectories. Just as

in statistical physics, the mean field behavior allows us to decouple the interactions

between agents and enables us to come up with simple decision making policies.

In this and subsequent chapters, we developed a unified framework to study mean

field equilibrium behavior of large scale stochastic games. In particular, we prove

that under a set of simple assumptions on the model, a mean field equilibrium always

exists. Furthermore, as a simple consequence of this existence theorem,we show that

from the viewpoint of a single agent, a near optimal decision making policy is one that

reacts only to the average behavior of its environment. This result unifies previous

known results on the mean field equilibrium in large scale systems. In developing this

unified framework, we isolate and highlight the key modeling parameters which make

this mean field approach feasible. As a first step in studying the mean field approach,

we begin by defining our model for stochastic games.

5.1 Stochastic Game Model

In this section, we describe our stochastic game model. Compared to standard

stochastic games in the literature [51], in our model, every player has an individual

state. Players are coupled through their payoffs and state transitions. A stochastic

game has the following elements:

Time. The game is played in discrete time. We index time periods by t =

0, 1, 2, . . ..

Players. There are m players in the game; we use i to denote a particular player.

State. The state of player i at time t is denoted by xi,t ∈ X , where X ⊆ Zd is

a subset of the d-dimensional integer lattice. We use x−i,t to denote the state of all

players except player i at time t.

Action. The action taken by player i at time t is denoted by ai,t ∈ A, where

A ⊆ Rq is a subset of q-dimensional Euclidean space.


Transition Probabilities. The state of a player evolves in a Markov fashion. For-

mally, let ht = x0,a0, . . . ,xt−1,at−1 denote the history up to time t. Conditional

on ht, players’ states at time t are independent of each other. Player i′ state xi,t at

time t depends on the past history ht only through the state of player i at time t− 1,

xi,t−1; the states of other players at time t−1, x−i,t−1; and the action taken by player

i at time t− 1, ai,t−1. We represent the distribution of the next state as a transition

kernel P, where:

P(x′i | xi, ai,x−i) = Prob

(xi,t+1 = x′

i | xi,t = xi, ai,t = ai,x−i,t = x−i

). (5.1)

Note that the evolution of players’ states may be coupled: in general, the next state

of player i depends on the current state not only of player i, but also the current state

of players other than i.

Payoff. In a given time period, if the state of player i is xi, the state of other

players is x−i, and the action taken by player i is ai, then the single period payoff

to player i is π(xi, ai,x−i

)∈ R. Note that the players are coupled via their payoff

function, since the payoff to player i depends on the state of every other player.

Discount Factor. The players discount their future payoff by a discount factor

0 < β < 1. Thus a player i’s infinite horizon payoff is given by:

∞∑

t=0

βtπ(xi,t, ai,t,x−i,t

).

In the model described above, the players’ payoff function and transition kernel

depends on the states of all players. In a variety of games, this coupling between

players is independent of the identity of the players. The notion of anonymity captures

scenarios where the interaction between players is via aggregate information about

the state. Let f(m)−i,t(y) denote the fraction of players (excluding player i) that have

their state as y at time t, i.e.:

f(m)−i,t(y) =

1

m− 1

∑

j 6=i

1xj,t=y, (5.2)


where 1xj,t=y is the indicator function that the state of player j at time t is y. We

refer to f(m)−i,t as the population state at time t (from player i’s point of view).

Definition 24 (Anonymous Stochastic Game). A stochastic game is called an anony-

mous stochastic game if the payoff function π(xi,t, ai,t,x−i,t) and transition kernel

P(x′i,t | xi,t, ai,t,x−i,t) depend on x−i,t only through f

(m)−i,t. In an abuse of notation,

we write π(xi,t, ai,t,f

(m)−i,t

)for the payoff to player i, and P(x′

i,t | xi,t, ai,t,f(m)−i,t) for the

transition kernel for player i.

For the remainder of the chapters, we focus our attention on anonymous stochastic

games. For ease of notation, we often drop the subscript i and t to denote a generic

transition kernel and a generic payoff function, i.e., we denote a transition kernel

by P(· | x, a, f) and a generic payoff function by π(x, a, f), where f represents the

population state of players other than the player under consideration.

Our results require a topology on population states; we consider a topology in-

duced by the 1-p norm. Given p > 0, the 1-p-norm of a function f : X → R is given

by:

‖f‖1-p =∑

x∈X

‖x‖pp |f(x)|,

where ‖x‖p is the usual p-norm of a vector. When X is finite, then ‖f‖1-p induces

the same topology as the standard Euclidean norm. However, when X is infinite,

the 1-p-norm weights larger states higher than smaller states. In many applications,

other players at larger states have a greater impact on the payoff; in such settings,

continuity of the payoff in f in the 1-p-norm naturally controls for this effect.

Formally, let F be the set of all possible population states on X with finite 1-p

norm, i.e.:

F =f : X → [0, 1] | f(x) ≥ 0,

∑

x∈X

f(x) = 1, ‖f‖1-p <∞. (5.3)

In addition, we let F(m) denote the set of all population states in F over m−1 players,


i.e.:

F(m) =

f ∈ F : there exists x ∈ Xm−1 with f(y) =1

m− 1

∑

j

1xj=y

.

5.2 Markov Perfect Equilibrium (MPE)

In studying stochastic games, attention is typically focused on a smaller class of

Markov strategy spaces, where the action of a player at each time is a function of

only the current state of every player [29]. In the context of anonymous stochastic

games, a Markov strategy depends on the current state of the player as well as the

current population state. Because a player using such a strategy tracks the evolution

of the other players, we refer to such strategies in our context as cognizant strategies.

Definition 25. Let M be the set of cognizant strategies available to a player. That

is,

M =µ | µ : X × F→ A

. (5.4)

Consider an m-player anonymous stochastic game. At every time t, player i

chooses an action ai,t that depends on its current state and on the current population

state f(m)−i,t ∈ F(m). Letting µi ∈ M denote the cognizant strategy used by player i,

we have ai,t = µi(xi,t,f(m)−i,t). The next state of player i is randomly drawn according

to the kernel P:

xi,t+1 ∼ P(

·∣∣∣ xi,t, µi(xi,t,f

(m)−i,t),f

(m)−i,t

)

. (5.5)

We let µ denote the vector of strategies chosen by the players. We also let µ(m)

denote the strategy vector where every player has chosen strategy µ.

Let V (m)(x, f | µ′,µ(m−1)

)be the expected net present value for a player with

initial state x, and with initial population state f ∈ F(m), given that the player

follows a strategy µ′ and every other player follows the strategy µ. In particular, we


have

V (m)(x, f | µ′,µ(m−1)

),

E

[ ∞∑

t=0

βtπ(xi,t, ai,t,f

(m)−i,t

) ∣∣ xi,0 = x, f

(m)−i,0 = f ; µi = µ′,µ−i = µ(m−1)

]

. (5.6)

Note that state sequence xi,t and population state sequence f(m)−i,t evolve according to

the dynamics (5.5).

We focus our attention on a symmetric Markov perfect equilibrium (MPE), where

all players use the same cognizant strategy µ. In an abuse of notation, we write

V (m)(x, f | µ(m)

)to refer to the expected discounted value as given in equation (5.6)

when every player follows the same cognizant strategy µ.

Definition 26 (Markov Perfect Equilibrium). The vector of cognizant strategies

µ(m) ∈M is a symmetric Markov perfect equilibrium (MPE) if for all initial states x ∈

X and population states f ∈ F(m) we have

supµ′∈M

V (m)(x, f | µ′,µ(m−1)

)= V (m)

(x, f | µ(m)

).

Thus, a Markov perfect equilibrium is a profile of cognizant strategies that si-

multaneously maximize the expected discounted payoff for every player, given the

strategies of other players. It is a well known fact that computing a Markov perfect

equilibrium for a stochastic game is computationally challenging in general [25]. This

is because to find an optimal cognizant strategy, each player needs to track and fore-

cast the exact evolution of the entire population state. In certain scenarios, it might

be infeasible to exchange or learn this information at every step because of limited

communication capacity between players or limited cognitive ability. In the next

section, we describe a recently proposed scheme for approximating Markov perfect

equilibrium.


5.3 Mean Field Equilibrium (MFE)

In a game with a large number of players, we might expect that fluctuations of players’

states “average out” and hence the actual population state remains roughly constant

over time. Because the effect of other players on a single player’s payoff and transition

probabilities is only via the population state, it is intuitive that, as the number of

players increases, a single player has negligible effect on the outcome of the game.

Based on this intuition, a scheme for approximating MPE has been proposed via a

solution concept we call mean field equilibrium, or MFE [38, 35, 56, 1, 2, 43, 26, 23].

Mean field equilibrium is also referred to as “oblivious equilibrium” by [56] or as

“Nash certainty equivalence control” by [35].

In MFE, each player optimizes its payoff based on only the long-run average

population state. Thus, rather than keep track of the exact population state, a single

player’s immediate action depends only on his own current state. We call such players

oblivious, and refer to their strategies as oblivious strategies. Formally, we let MO

denote the set of (stationary, nonrandomized) oblivious strategies, defined as follows.

Definition 27. Let MO be the set of oblivious strategies available to a player. That

is,

MO =µ | µ : X → A

. (5.7)

Given a strategy µ ∈ MO an oblivious player i takes an action ai,t = µ(xi,t) at

time t; as before, the next state of the player is randomly distributed according to

the transition kernel P:

xi,t+1 ∼ P(· | xi,t, ai,t, f) where ai,t ∼ µ(xi,t). (5.8)

Note that because we are considering a mean field model, the player’s state evolves

according to transition kernel with population state f .

We define the oblivious value function V(x | µ, f

)to be the expected net present


value for any oblivious player with initial state x, when the long run average popula-

tion state is f , and the player uses an oblivious strategy µ. We have

V(x | µ, f

), E

[ ∞∑

t=0

βtπ(xi,t, ai,t, f

)∣∣∣ xi,0 = x; µ

]

. (5.9)

Note that the state sequence xi,t is determined by the strategy µ according to the

dynamics (5.8).

We define the optimal oblivious value function V ∗(x | f) as

V ∗(x | f) = supµ∈MO

V (x | µ, f).

Given a population state f , an oblivious player computes an optimal strategy by

maximizing their oblivious value function. Note that because an oblivious player does

not track the evolution of the population state, under reasonable assumptions their

optimal strategy is only a function of their current state—i.e., it must be oblivious

even if optimizing over cognizant strategies. We capture this optimization step via

the correspondence P defined next.

Definition 28. The correspondence P : F → MO maps a distribution f ∈ F to

the set of optimal oblivious strategies for a player. That is, µ ∈ P(f) if and only if

V(x | µ, f

)= V ∗(x | f) for all x, where V is the oblivious value function given by

equation (5.9).

Note that P maps a distribution to a stationary, nonrandomized oblivious strategy.

This is typically without loss of generality, since in most models of interest there

always exists such an optimal strategy. We later establish under our assumptions

that P(f) is nonempty.

Now suppose that the population state is f , and all players are oblivious and play

using a stationary strategy µ. We expect that the long run population state should

in fact be an invariant distribution of the Markov process with transition kernel (5.8).

We capture this relationship via the correspondence D, defined next.

Definition 29. The correspondence D : MO × F → F maps the oblivious strategy


µ and population state f to the set of invariant distributions D(µ, f) associated with

the dynamics (5.8).

Note that the image of the correspondence D is empty if the strategy does not

result in an invariant distribution. We later establish conditions under which D(µ) is

nonempty.

We can now define mean field equilibrium. If every agent conjectures that f is

the long run population state, then every agent would prefer to play an optimal

oblivious strategy µ. On the other hand, if every agent plays µ and the population

state is in fact f , then we should expect the long run population state of all players

to be an invariant distribution of (5.8). Mean field equilibrium requires a consistency

condition: the equilibrium population state f must in fact be an invariant distribution

of the dynamics (5.8) under the strategy µ and the same population state f .

Definition 30 (Mean Field Equilibrium). An oblivious strategy µ ∈MO and a dis-

tribution f ∈ F constitute a mean field equilibrium if µ ∈ P(f) and f ∈ D(µ, f).

The notion of mean field equilibrium provides a simple approach to understanding

behavior in large population stochastic dynamic games. However, this notion is not

very meaningful unless we can guarantee that a mean field equilibrium exists in a

wide variety of stochastic games. Even if a mean field equilibrium were to exist in

a particular game of interest, it is natural to wonder whether such an equilibrium

is a good approximation to Markov perfect equilibrium in games with finitely many

players. MFE is unlikely to be useful in practice without conditions that guarantee

it approximates equilibria in finite systems well. Below we address these two fun-

damental questions: the existence of MFE and whether it provides any meaningful

approximation to MPE.

As we shall show below, an important contribution of our thesis is to relate ap-

proximation to existence of MFE. The approximation theorem we provide requires

continuity assumptions on the model primitives; as we demonstrate later, these same

continuity conditions are required (together with convexity and compactness condi-

tions) to ensure an MFE actually exists. Thus we obtain the valuable insight that


approximation is essentially a corollary of existence. This is practically valuable: es-

tablishing that MFE is a good approximation is effectively a free byproduct, once the

conditions ensuring its existence have been verified.

We begin by studying the approximation result. We first define the appropriate

notion of approximation and show that under very mild assumptions a mean field

equilibrium (if it exists) approximates Markov perfect equilibrium as the number of

players in the game becomes large.

Chapter 6

MFE as an Approximation to MPE

As discussed in the previous chapter, a mean field equilibrium is of practical value only

if it approximates equilibria in finite systems well. In this chapter, we establish one of

our main results: under a parsimonious set of assumptions, a mean field equilibrium is

a good approximation to Markov perfect equilibrium as the number of players grows

large.

6.1 The Asymptotic Markov Equilibrium (AME)

Property

We begin by formalizing the approximation property of interest, referred to as the

asymptotic Markov equilibrium (AME) property. Intuitively, this property requires

that a mean field equilibrium strategy is approximately optimal even when compared

against Markov strategies, as the number of players grows large.

Definition 31 (Asymptotic Markov Equilibrium). A mean field equilibrium (µ, f)

possesses the asymptotic Markov equilibrium (AME) property if for all states x and

sequences of cognizant strategies µm ∈M, we have:

lim supm→∞

V (m)(x, f (m) | µm,µ(m−1)

)− V (m)

(x, f (m) | µ(m)

)≤ 0, (6.1)

69

CHAPTER 6. MFE AS AN APPROXIMATION TO MPE 70

almost surely, where the initial population state f (m) is derived by sampling each other

player’s initial state independently from the probability mass function f .

Note that V (m)(x, f (m) | µ′,µ(m−1)

)is the actual value function of a player as

defined in equation (5.6), when the player uses a cognizant strategy µ′ and every

other player plays an oblivious strategy µ. In particular, we have

V (m)(x, f (m) | µm,µ(m−1)

),

E

[∞∑

t=0

βtπ(xi,t, ai,t,f

(m)−i,t

)∣∣∣ xi,0 = x, f

(m)−i,0 = f (m); µi = µm,µ−i = µ(m−1)

]

,

where the state evolution of the players is given by:

xi,t+1 ∼ P(· | xi,t, µm(xi,t,f

(m)−i,t),f

(m)−i,t

)

xj,t+1 ∼ P(· | xj,t, µ(xj,t),f

(m)−i,t

)∀j 6= i.

Similarly, V (m)(x, f (m) | µ(m)

)is the actual value function of a player as defined

in equation (5.6) when every player is playing the oblivious strategy µ. AME requires

that the error when using the MFE strategy approaches zero almost surely with

respect to the randomness in the initial population state. This definition can be

shown to be stronger than the definition considered by [56], where AME is defined

only in expectation with respect to randomness in the initial population state.1

We emphasize that the AME property is essentially a continuity property in the

population state f . Under reasonable assumptions, we show that the time t popula-

tion state in the system with m players, f(m)−i,t, approaches f almost surely for all t as

m→∞. Therefore, informally, if the payoffs satisfy an appropriate continuity prop-

erty in f , we should expect the AME property to hold. This observation is significant,

because as noted above, continuity is also an essential prerequisite to existence. It is

for this reason that, under fairly general assumptions, the AME property is essentially

a corollary to existence.

1Under our assumptions on the model, convergence in expectation can be established via anapplication of the bounded convergence theorem. In particular, by Lemma 41 it follows that|V (m)(x, f | µ′, µ)| ≤ C(x, 0) <∞ for all f , µ′, and µ.


Before proceeding, we require some additional notation. Without loss of generality,

we can view the state Markov process in terms of the increments from the current

state. Specifically, if the current state is x and action a is taken, we can write:

xi,t+1 = xi,t + ξi,t, (6.2)

where we consider ξi,t to be a random increment that is distributed according to the

probability mass function Q(· | x, a, f), where

Q(z′ | x, a, f) = P(x + z′ | x, a, f).

Note that Q(z′ | x, a, f) is positive for only those z′ such that x + z′ ∈ X .

We make the following assumptions over model primitives; these ensure the model

is appropriately “continuous” in the limit.

Assumption 1 (Continuity). 1. Compact action set. The set of feasible actions

for a player, denoted by A, is compact.

2. Bounded increments. There exists M ≥ 0 such that, for all z with ‖z‖∞ > M ,

Q(z | x, a, f) = 0 for all x, a, and f .

3. Payoff and kernel continuity. The payoff π(x, a, f) is jointly continuous in a ∈ A

and f ∈ F for fixed x ∈ X (where F is endowed with the 1-p norm), and the

kernel P(x′ | x, a, f) is jointly continuous in a ∈ A and f ∈ F for each x, x′ ∈ X

(where F is endowed with the 1-p norm).2

4. Growth rate bound. There exist constants K and n ∈ Z+ such that

supa∈A,f∈F

|π(x, a, f)| ≤ K(1 + ‖x‖∞)n

for every x ∈ X , where ‖·‖∞ is the sup norm.

2Here we view P(x′ | x, a, f) as a real valued function of a and f , for fixed x, x′; note that sincewe have also assumed bounded increments, this notion of continuity is equivalent to assuming thatP(· | x, a, f) is jointly continuous in a and f with respect to the topology of weak convergence ondistributions over X .


The most consequential of these assumptions are that the model exhibits bounded

increments, and that the payoff growth rate can be bounded. These are not particu-

larly severe restrictions; for a wide range of economic models of interest, it is reason-

able to assume increments are bounded. Further, the polynomial growth rate bound

on the payoff is quite weak, and serves to exclude the possibility of strategies that

yield infinite expected discounted payoff.

Theorem 32 (AME). Let (µ, f) be a mean field equilibrium with f ∈ F, and suppose

Assumption 1 holds. Then the AME property holds for (µ, f).

The proof of the AME property exploits the fact that the 1-p-norm of f must

be finite (since f ∈ F) to show that∥∥∥f

(m)−i,t − f

∥∥∥

1-p→ 0 almost surely as m → ∞;

i.e., the population state of other players approaches f almost surely. Continuity of

the payoff π in f , together with the growth rate bounds in Assumption 1, yields the

desired result. The proof of the AME property is provided in the appendix.

In the next chapter we establish the existence of MFE. The existence results uses

the same continuity assumption (along with other additional assumptions) as required

for the AME property. This shows that the approximation result is a corollary of the

existence result.

Chapter 7

Existence of Mean Field

Equilibrium

The notion of mean field equilibrium allows to us approximate Markov perfect equi-

librium in large stochastic dynamic games. This notion is vacuous unless we can

guarantee that mean field equilibrium exists in a wide variety of games. In this

chapter, we study the existence of mean field equilibria. From Definition 30, we

observe that (µ, f) is a mean field equilibrium if and only if f is a fixed point of

Φ(f) = D(P(f), f), and µ ∈ P(f). Thus our approach is to find conditions un-

der which the correspondence Φ has a fixed point; in particular, we aim to apply

Kakutani’s fixed point theorem to Φ to find a MFE.

Kakutani’s fixed point theorem requires three essential pieces: (1) compactness of

the range of Φ; (2) convexity of both the domain of Φ, as well as Φ(f) for each f ; and

(3) appropriate continuity properties of the operator Φ. As emphasized in the last

chapter, a central technical observation is that the same continuity properties needed

to establish the AME property are essential to proving existence of a MFE.

We start with the following restatement of Kakutani’s theorem.

Theorem 33 (Kakutani). Suppose there exists a set FC ⊆ F such that:

1. FC is convex and compact (in the 1-p norm), with Φ(FC) ⊂ FC;

2. Φ(f) is convex and nonempty for every f ∈ FC; and

73

CHAPTER 7. EXISTENCE OF MEAN FIELD EQUILIBRIUM 74

3. Φ has a closed graph on FC.

Then there exists a mean field equilibrium (µ, f) with f ∈ FC.

In the remainder of this section, we find exogenous conditions on model primitives

to ensure these requirements are met. We tackle them in reverse order. We first show

that under Assumption 1, Φ has a closed graph. Next, we study conditions under

which Φ(f) can be guaranteed to be convex. Finally, we provide conditions on model

primitives under which there exists a compact, convex set FC with Φ(FC) ⊂ FC .

The conditions we provide are mild, and yet also suffice to guarantee that Φ(f) is

nonempty.

7.1 Closed Graph

In this section we establish that exactly the same the continuity assumptions embod-

ied in Assumption 1 also suffice to ensure that Φ has a closed graph. We begin with

the following lemma.

Lemma 34. For each f , P(f) is compact; further, the correspondence P is upper

hemicontinuous on F.

Proof. By Assumption 1, π(x, a, f) is jointly continuous in a and f . Lemma 42

establishes that the optimal oblivious value function V ∗(x | f) is continuous in f ,

and so as in the proof of that lemma, it follows that for a fixed state x, π(x, a, f) +

β∑

x′ V ∗(x′ | f)P(x′ | x, a, f) is finite and jointly continuous in a and f . Define the

set Px(f) ⊂ A as the set of actions that achieve the maximum on the right hand side

of (A.3); this is nonempty as A is compact (Assumption 1) and the right hand side

is continuous in a. By Berge’s maximum theorem, for each x the correspondence Px

is upper hemicontinuous with compact values [3].

By Lemma 42, µ ∈ P(f) if and only if µ(x) ∈ Px(f) for each x. Note that we have

endowed the set of strategies with the topology of pointwise convergence. The range

space of P is an infinite product of the compact action space A (Assumption 1) over

the countable state space. Hence by Tychonoff’s theorem [3], the range space of P is


compact. Further, since Px is compact-valued, it follows that P is compact-valued.

Since Px(f) is compact-valued and upper hemicontinuous, the Closed Graph Theorem

ensures that Px has a closed graph [3]. This in turn ensures that P has closed graph;

again by the Closed Graph Theorem, we conclude that P is upper hemicontinuous.

Proposition 35. Suppose that Assumption 1 holds. Then Φ has a closed graph on

F; i.e., the set (f, g) : g ∈ Φ(f) ⊂ F× F is closed (where F is endowed with the 1-p

norm).

Proof. Suppose fk → f in the 1-p norm, and that gk → g in the 1-p norm, where

gk ∈ Φ(fk) for all k. We must show that g ∈ Φ(f). For each k, let µk ∈ P(fk) be an

optimal oblivious strategy such that gk ∈ D(µk, fk). As in the proof of Lemma 34,

the range space of P is compact in the topology of pointwise convergence; therefore,

taking subsequences if necessary, we can assume without loss of generality that µk

converges to some strategy µ ∈MO pointwise. By upper hemicontinuity of P (Lemma

34), we have µ ∈ P(f).

By definition of D, it follows that for all x:

gk(x) =∑

x′

gk(x′)P(x|x′, µk(x

′), fk). (7.1)

Since P(x|x′, a, f) is jointly continuous in action and population state (Assumption

1), it follows that for all x and x′:

P(x|x′, µk(x′), fk)→ P(x|x′, µ(x′), f)

as k → ∞. Further, if gk → g in the 1-p norm, then in particular, gk(x) → g(x) for

all x. Finally, observe that for all a and f , we have P(x|x′, a, f) = 0 for all states x′

such that ‖x′ − x‖∞ > M , since increments are bounded (Assumption 1). Thus:

∑

x′

gk(x′)P(x|x′, µk(x

′), fk)→∑

x′

g(x′)P(x|x′, µ(x′), f)


as k →∞. Taking the limit as k →∞ on both sides of (7.1) yields:

g(x) =∑

x′

g(x′)P(x|x′, µ(x′), f), (7.2)

which establishes that g ∈ D(µ, f). Since we had µ ∈ P(f), we conclude g ∈ Φ(f),

as required.

7.2 Convexity

Next, we develop conditions to ensure that Φ(f) is nonempty and convex. We start

by considering a simple model, where the action set A is the simplex of randomized

actions on a base set of pure actions. Formally, we have the following definition.

Definition 36. An anonymous stochastic game has a finite action space if there exists

a finite set S such that the following three conditions hold:

1. A consists of all probability distributions over S: A = a ≥ 0 :∑

s a(s) = 1.

2. π(x, a, f) =∑

s a(s)π(x, s, f), where π(x, s, f) is the payoff evaluated at state

x, population state f , and pure action s.

3. P(x′ | x, a, f) =∑

s a(s)P(x′ | x, s, f), where P(x′ | x, s, f) is the kernel eval-

uated at states x′ and x, population state f , and pure action s.

Essentially, the preceding definition allows inclusion of randomized strategies in

our search for mean field equilibrium. This model inherits Nash’s original approach

to establishing existence of an equilibrium for static games, where randomization

induces convexity on the strategy space. We show next that in any game with finite

action spaces, the set Φ(f) is always convex.

Proposition 37. Suppose Assumption 1 holds. In any anonymous stochastic game

with a finite action space, Φ(f) is convex for all f ∈ F.


Proof. Fix f ∈ F, and let g1, g2 be elements of Φ(f). Let µ1, µ2 ∈ P(f) be strategies

such that gi ∈ D(µi, f), i = 1, 2. Then for i = 1, 2 and all x′ ∈ X , we have:

gi(x′) =

∑

x

gi(x′)P(x′ | x, µi(x), f).

Fix δ, 0 ≤ δ ≤ 1, and for each x, define g(x) by:

g(x) = δg1(x) + (1− δ)g2(x).

We must show g ∈ Φ(f). Define a new strategy µ as follows: for each x such that

g(x) > 0,

µ(x) =δg1(x)µ1(x) + (1− δ)g2(x)µ2(x)

g(x).

For each x such that g(x) = 0, let µ(x) = µ1(x).

We claim that µ ∈ P(f), i.e., µ is an optimal oblivious strategy given f ; and that

g ∈ D(µ, f), i.e., that g is an invariant distribution given strategy µ and population

state f . This suffices to establish that g ∈ Φ(f).

To establish the claim, first observe that under Definition 36, the right hand side

of (A.3) is linear in a. Thus any convex combination of two optimal actions is also

an optimal action. This establishes that for every x, µ(x) achieves the maximum on

the right hand side of (A.3); so we conclude µ ∈ P(f).

Let T = x : g(x) > 0. Then:

g(x′) = δg1(x′) + (1− δ)g2(x

′)

=∑

x

δg1(x)P(x′ | x, µ1(x), f) + (1− δ)g2(x)P(x′ | x, µ2(x), f)

=∑

x

∑

s

(δg1(x)µ1(x)(s) + (1− δ)g2(x)µ2(x)(s))P(x′ | x, s, f)

=∑

x∈T

∑

s

g(x)µ(x)(s)P(x′ | x, s, f).

The first equality is the definition of g(x′), and the second equality follows by ex-

panding the invariant distribution equations for g1 and g2. The third equality follows


by expanding the sum over pure actions s. Finally, in the last equality, we substitute

the definition of g(x), and we also observe that for x 6∈ T , g(x) = 0—and therefore,

g1(x) = g2(x) = 0. Since g(x) = 0 for x 6∈ T , it follows that:

∑

x 6∈T

∑

s

g(x)µ(x)(s)P(x′ | x, s, f) = 0.

It follows that:

g(x′) =∑

x

g(x)P(x′ | x, µ(x), f),

as required.

The preceding result ensures that if randomization is allowed over a set of finite

actions, then the map Φ is convex-valued. On the other hand, many relevant appli-

cations typically require existence of equilibria in pure strategies. For this purpose,

we present an alternate result, in which concavity assumptions on model primitives

guarantee convexity of Φ(f). Before proceeding we require some additional terminol-

ogy.

Let S ⊂ Rn. We say that a function g : S → R is nondecreasing if g(x′) ≥ g(x)

whenever x′ ≥ x (where we write x′ ≥ x if x′ is at least as large as x in every

component). Let Pθ be a family of probability distributions on X indexed by θ ∈

S. We say that Pθ is stochastically nondecreasing in the parameter θ, if for every

nondecreasing function u : X → R, and for θ1 ≥ θ2, there holds Eθ1 [u] ≥ Eθ2 [u]

wherever the expectations are both well defined. (When the preceding condition

holds, we say Pθ1 stochastically dominates Pθ2 .) We say that Pθ is stochastically

concave in the parameter θ, if for every nondecreasing function u : X → R, Eθ[u] =∑

x u(x)Pθ(x) is a concave function of θ wherever the expectation is well defined.

We have the following assumption.

Assumption 2. 1. The action set A is convex.

2. The payoff π(x, a, f) is nondecreasing in x for fixed a and f , and the kernel

P(· | x, a, f) is stochastically nondecreasing in x for fixed a and f .

3. The payoff is concave in a for fixed x and f , and the kernel is stochastically


concave in a for fixed x and f , with at least one of the two strictly concave in

a.1

In the following proposition we show the preceding conditions on model primitives

ensure the optimal oblivious strategy is unique, and therefore that Φ(f) is convex for

all f .

Proposition 38. Suppose Assumptions 1 and 2 hold. Then P(f) is a singleton, and

Φ(f) is convex for all f ∈ F.

Proof. From Lemma 43, the conditions of the proposition guarantee a unique optimal

solution in the right hand side of (A.3), for every x ∈ X . Thus under either of these

conditions the optimal strategy given f is unique, i.e., P(f) is a singleton. It is

straightforward to check that D(µ, f) is convex for each fixed µ and f(the set of

invariant distributions given µ and f are the solution to a linear system), so it follows

that Φ(f) is convex for each f .

7.3 Compactness

In this section, we provide conditions under which under which we can guarantee

the existence of a compact, convex, nonempty set FC such that Φ(FC) ⊂ FC . The

assumptions we make are closely related to those needed to ensure that Φ(f) is

nonempty. To see the relationship between these results, observe that in Lemma 42,

we showed that under Assumption 1 an optimal oblivious strategy always exists for

any f ∈ F. Thus to ensure that Φ(f) is nonempty, it suffices to show that there

exists at least one ergodic strategy in P(f)—i.e., at least one strategy that possesses

an invariant distribution. Our approach to demonstrating existence of an invariant

distribution is to use a Foster-Lyapunov argument. This same argument also allows

us to bound the moments of the invariant distribution—precisely what is needed to

find the desired set FC that is compact in the 1-p norm.

1Strict stochastic concavity of the kernel requires that the expectation against any nondecreasingfunction is strictly concave.


One simple condition under which Φ(f) is nonempty is that the state space is

finite; in this case any policy is ergodic, since any Markov chain on a finite state

space possesses at least one positive recurrent class. In this case the entire set F is

compact in the 1-p norm. Thus we have the following result.

Proposition 39. Suppose Assumption 1 holds, and that the state space is finite.

Then Φ(f) is nonempty for all f ∈ F, and F is compact in the 1-p norm.

We now turn our attention to the setting where the state space may be unbounded.

In this case, we must make additional assumptions to ensure the optimal strategy

does not allow the state to become transient, and to bound moments of the invariant

distribution of any optimal oblivious strategy.

We endow F with the stochastic dominance ordering (i.e., f ′ f if f ′ stochastically

dominates f). If X and Y are partially ordered sets with orders and ⊒, respectively,

we say that a function F : X × Y → R has decreasing differences in x and y if for all

x, x′, y, y′ such that x′ x and y′ ⊒ y, there holds:

F (x′, y′)− F (x, y′) ≤ F (x′, y)− F (x, y).

In other words, increasing the parameter y reduces the marginal return to higher

values of x.

Assumption 3. 1. The state space X = Z+.

2. The payoff function π(x, a, f) has decreasing differences in x ∈ X , a ∈ A, and

f ∈ F.

3. For all ∆ ∈ Z+, a ∈ A, and f ∈ F, as x→∞ there holds:

π(x + ∆, a, f)− π(x, a, f)→ 0.

4. Any action is costly: there exists an action a ∈ A such that a ≥ a for all a ∈ A,

and the payoff satisfies π(x, a, f) < π(x, a, f) for all x, f , and a 6= a.


5. The increment kernel Q(· | x, a, f) is stochastically nonincreasing in x ∈ X for

each a ∈ A and f ∈ F, and stochastically nonincreasing in f ∈ F for each

x ∈ X and a ∈ A.

6. Given any f ∈ F, the drift is eventually negative at a: there exists a state χ

such that for all x with x ≥ χ,

∑

z

zQ(z | x, a, f) < 0.

Under the preceding assumptions we have the following result.

Proposition 40. Suppose Assumptions 1, 2, and 3 hold. Then Φ(f) is nonempty for

all f ∈ F, and there exists a compact, convex, nonempty set FC such that φ(F) ⊂ FC.

The proof of this proposition is provided in the Appendix. From Lemmas 35,

38 and 40 and applying Kakutani’s theorem, the existence of mean field equilibrium

follows.

Chapter 8

Conclusions and Future Work

Complex networks are becoming more pervasive in our lives. Yet the design and

understanding of these networks is still a challenging task. In this thesis, we looked

at two particular challenges in the design of such networks, namely the reactive envi-

ronment in which these networks operate as well as the lack of complete information

available to any decision maker.

To understand the affect of delay in decision making, we model complex systems

as a networked of interconnected Markov decision processes. The subsystems in

this networked MDPs are connected to each other via delay lines. We considered

a scenario where a centralized decision maker receives delayed state feedback from

each subsystem. Our main theorem shows that the central decision maker can make

optimal decisions based only on a subset of past information available to it. In other

words, beyond a certain history, the past is irrelevant to future decision making. We

also explored a connection between networked control systems and Bayesian networks

where we show that the amount of information required to make optimal decisions in

a networked MDP is related to the concept of a Markov blanket in Bayesian networks.

These results allow us to compute optimal controllers for networked MDPS in presence

of delays.

To cope with the reactive environment present in complex networks, we use the

mean field equilibrium approach. This approach, that is motivated by statistical

physics, deals with the complexity of interactions in complex systems. As part of

82

CHAPTER 8. CONCLUSIONS AND FUTURE WORK 83

our research, we have developed a unified framework to study mean field equilibrium

behavior of large scale dynamical stochastic systems. In particular, we proved that

under a set of simple assumptions on the model, a mean field equilibrium always exists.

Furthermore, as a simple consequence of this existence theorem, we show that from the

viewpoint of a single agent, a near optimal decision making policy is one that reacts

only to the average behavior of its environment. This result unifies previous known

results on the mean field equilibrium in large scale dynamical systems. In developing

this unified framework, we isolate and highlight the key modeling parameters which

make this mean field approach feasible. The mean field approach provides a low

complexity solution to a single agent’s decision making problem.

Although the issues addressed in this thesis are important to the design and un-

derstanding of complex systems, several important questions still remain unanswered.

Below we highlight some important questions that are pertinent to the understanding

of complex systems.

• Large Scale Systems with Local Interactions: The mean field analysis

serves well when there are a large number of agents. However, in many large

complex networks, an agent may interact with only a finite subpopulation of

an infinite collection of agents. For example, consider a network of electric cars

that are all trying to charge their batteries from the power grid. An electric

vehicles recharging schedule would depend mostly on other electric vehicles in

its immediate neighborhood. In such scenarios, we can use mean field analysis

as a starting point to design algorithms for large scale systems with local in-

teractions. Quantifying the scale at which mean field analysis becomes useful

would also help us better understand these systems.

• Distributed Control in the Presence of Delays: The current model of

networked MDPs assumes that there is a single decision maker or controller that

receives delayed state information from every subsystem. In several systems,

each subsystem has its own controller. These controllers may receive delayed

state information from only a small number of subsystems. Understanding the

sufficient information for each controller is still a challenge.

CHAPTER 8. CONCLUSIONS AND FUTURE WORK 84

• Understanding the Price of Delay: Any networked system must address

decision making in the presence of delay. However, the effect of delay on overall

performance is very poorly understood. How does the optimal cost in a system

increase as more delay is incurred in receiving the information? Understanding

the price of delay would enable us to design approximate decision making rules

with a clear understanding of the gap to optimality. In scenarios where the delay

in receiving information can be managed or spread among different agents, this

price of delay would allow us to better understand the design of networked

systems with delays.

• Value of Information in Decision Making: In any complex system, agents

usually have partial information about the environment as well as about each

other. Imagine a transportation system where a particular vehicle needs to

decide a minimum delay route to its destination. The vehicle makes its deci-

sion based on partial information about the congestion in the transportation

network. How does the overall delay in the system decrease as the vehicle

receives more information about the congestion in the network? What informa-

tion would the vehicle like to receive in order to minimize its delay? Should it

be congestion on its nearest links, or the vehicular traffic on the most congested

link? Understanding the value of information would enable us to understand

the right kind of information required to make a near optimal decision.

These and several other questions are of immense importance in the design of

complex systems. We believe that the tools and methodologies developed in this

thesis would provide a stepping stone in answering some or all of these questions.

Appendix A

Proofs

A.1 Preliminary Lemmas

In this section, we prove some preliminary lemmas that are used for both the AME

proof as well as the proof of existence of a MFE. We begin with the following lemma.

Lemma 41. Suppose Assumption 1 holds. Let x0 = x. Let at ∈ A be any sequence

of (possibly history dependent) actions, and let ft ∈ F be any sequence of (possibly

history dependent) population states. Let xt be the state sequence generated, i.e.,

xt ∼ P(· | xt−1, at−1, ft−1).

Then for all T ≥ 0, there exists C(x, T ) <∞ such that:

E

[∞∑

t=T

βt|π(xt, at, ft)|∣∣∣ x0 = x

]

≤ C(x, T ).

Further, C(x, T )→ 0 as T →∞.

Proof. Observe that by Assumption 1, the increments are bounded. Thus starting

from state x, we have ‖xt‖∞ ≤ ‖x‖∞ + tM . Again by Assumption 1, |π(xt, at, ft)| ≤

85

APPENDIX A. PROOFS 86

K(1 + ‖xt‖∞)n. Therefore:

E

[∞∑

t=T

βt|π(xt, at, ft)| | x0 = x

]

≤ K∞∑

t=T

βt(1 + ‖x‖∞ + tM)n.

We define C(x, 0) as the right hand side above when T = 0:

C(x, 0) = K∞∑

t=0

βt(1 + ‖x‖∞ + tM)n.

Observe that C(x, 0) <∞.

We now reason as follows for T ≥ 1:

K∞∑

t=T

βt(1 + ‖x‖∞ + tM)n = KβT

∞∑

t=0

βt(1 + ‖x‖∞ + tM + TM)n

= KβT

∞∑

t=0

βt

n∑

j=0

(n

j

)

(1 + ‖x‖∞ + tM)j(TM)n−j

≤ KβT

∞∑

t=0

βt

n∑

j=0

(n

j

)

(1 + ‖x‖∞ + tM)n(TM)n

= KβT 2n(TM)n

∞∑

t=0

βt(1 + ‖x‖∞ + tM)n

= C(x, 0)βT (2MT )n.

Here the inequality holds because 1 + ‖x‖∞ + tM ≥ 1, M ≥ 0, and T ≥ 1. So for

T ≥ 1, define:

C(x, T ) = C(x, 0)βT (2MT )n. (A.1)

Then C(x, T )→ 0 as T →∞, as required.

We now show that the Bellman equation holds for the dynamic program solved by

a single agent given a population state f . Our proof involves the use of a weighted sup

norm, defined as follows. For each x ∈ X , let W (x) = (1 + ‖x‖∞)n. For a function


F : X → R, define:

‖F‖W -∞ = supx∈X

∣∣∣∣

F (x)

W (x)

∣∣∣∣.

This is the weighted sup norm with weight function W . We let B(X ) denote the set

of all functions F : X → R such that ‖F‖W -∞ <∞.

Let Tf denote the dynamic programming operator with population state f : given

a function F : X → R, we have:

(TfF )(x) = supa∈A

π(x, a, f) + β∑

x′∈X

F (x′)P(x′ | x, a, f)

.

We define T kf to be the composition of the mapping Tf with itself k times.

Lemma 42. Suppose Assumption 1 holds. For all f ∈ F, if F ∈ B(X ) then TfF ∈

B(X ). Further, there exist k, ρ independent of f with 0 < ρ < 1 such that Tf is a

k-stage ρ-contraction on B(X ); i.e., if F, F ′ ∈ B(X ), then for all f :

∥∥T k

f F − T kf F ′∥∥

W -∞≤ ρ ‖F − F ′‖W -∞ . (A.2)

In particular, value iteration converges to V ∗(·|f) ∈ B(X ) from any initial value

function in B(X ), and for all f ∈ F and x ∈ X , the Bellman equation holds:

V ∗(x | f) = supa∈A

π(x, a, f) + β∑

x′∈X

V ∗(x′ | f)P(x′ | x, a, f)

. (A.3)

Further, V ∗(x|f) is continuous in f .

Finally, there exists at least one optimal oblivious strategy among all (possibly

history-dependent, possibly randomized) strategies; i.e., P(f) is nonempty. An obliv-

ious strategy µ ∈ MO is optimal given f if and only if µ(x) achieves the maximum

on the right hand side of (A.3) for every x ∈ X .

Proof. We have the following three properties:

1. By the growth rate bound in Assumption 1 we have supa |π(x, a, f)|/W (x) ≤ K

for all x.


2. We have:

W (x) = supa∈A

∑

x′

P(x′ | x, a, f)W (x′) ≤ (1 + ‖x‖∞ + M)n,

since the increments are bounded (Assumption 1). Thus W (x)/W (x) ≤ (1 +

M)n for all x.

3. Finally, fix ρ such that 0 < ρ < 1 and let:

W k(x) = supµ∈MO

E[W (xk)|x0 = x, µ],

where the state evolves according to xt+1 ∼ P(· | xt, µ(xt), f). By bounded

increments in Assumption 1, we have:

βkW k(x) ≤ βk(1 + ‖x‖∞ + kM)n ≤ βk(1 + kM)nW (x).

By choosing k sufficiently large so that βk(1 + kM)n < ρ, we have:

βkW k(x) ≤ ρW (x).

Given (1)-(3), by standard arguments (see, e.g., [19]), it follows that Tf is a k-

stage ρ-contraction with respect to the weighted sup norm, value iteration converges to

V ∗(· | f), the Bellman equation holds, and any (stationary, nonrandomized) oblivious

strategy that maximizes the right hand side in (A.3) for each x ∈ X is optimal.

Observe that since V ∗(· | f) ∈ B(X ) for any f , it follows that V ∗(x | f) <∞ for all

x. In fact, by Lemma 41, |V ∗(x | f)| ≤ C(x, 0) for all x.

Next we show that V ∗(x | f) is continuous in f . Define Z(x) = 0 for all x, and let

V(ℓ)f = T ℓ

fZ. We first show that V(ℓ)f (x) is continuous in f . To see this, we proceed

by induction. The result is trivally true at ℓ = 0. Next, observe that π(x, a, f) is

jointly continuous in a and f for each fixed x by Assumption 1. Suppose V(ℓ)f (x) is

continuous in f for each x; then V(ℓ)f (x′)P(x′ | x, a, f) is jointly continuous in a and

f for each fixed x, x′. Since the kernel has bounded increments from Assumption 1,


we conclude that∑

x′ V(ℓ)f (x′)P(x′ | x, a, f) is jointly continuous in a and f for each

fixed x. It follows by Berge’s maximum theorem [3] that V(ℓ+1)f (x) is continuous in f .

Fix ǫ > 0. Since Tf is a k-stage ρ-contraction in the weighted sup norm for every

f , it follows that for all sufficiently large ℓ, for every f there holds:

|V (ℓ)f (x)− V ∗(x | f)| ≤ W (x)ǫ.

So now suppose that fn → f in the 1-p norm. Since V ℓf (x) is continuous in f , for all

sufficiently large n there holds:

|V (ℓ)fn

(x)− V(ℓ)f (x)| ≤ ǫ.

Thus using the triangle inequality, for all sufficiently large n we have:

|V ∗(x | f)− V ∗(x | fn)| ≤ (2W (x) + 1)ǫ.

Since ǫ was arbitrary it follows that the left hand side approaches zero as n→∞, as

required.

Finally, observe that by a similar argument as above,

∑

x′

V ∗(x′ | f)P(x′ | x, a, f)

is a continuous function of a for each fixed x and f ; since π(x, a, f) is also continuous

in a for each fixed f , the right hand side of (A.3) is continuous in a for each fixed

f . Since A is compact, it follows that there exists an optimal action at each state x,

and thus there exists an optimal strategy given f .

Lemma 43. Suppose Assumptions 1 and 2 hold. Then the right hand side of (A.3)

is nondecreasing in x and strictly concave in a.

Proof. Define Z(x) = 0 for all x, and let V(ℓ)f = T ℓ

fZ. Observe that if V(ℓ)f is

nondecreasing, then under the conditions of Proposition 38, it follows that V(ℓ+1)f will

be nondecreasing. Taking the limit as n → ∞, we conclude (from convergence of


value iteration) that V ∗(· | f) is nondecreasing, and thus the right hand side of (A.3)

is nondecreasing in x.

Since V ∗(· | f) is nondecreasing, π(x, a, f) is concave in a, and the kernel is

stochastically concave in a, with at least one of the last two strictly concave, it

follows that the right hand side of (A.3) is strictly concave in a.

A.2 Proof of AME

In this section, we provide the proof of Theorem 32. Throughout this section, we

suppose Assumption 1 holds. We begin by defining the following sets.

Definition 44. For every x ∈ X , define

Xx =

z ∈ X∣∣∣ P(x | z, a, f) > 0 for some a ∈ A and for some f ∈ F

. (A.4)

Also define Xx,t as

Xx,t =

z ∈ X∣∣∣ ‖z‖∞ ≤ ‖x‖∞ + tM

. (A.5)

Thus, Xx is the set of all initial states that can result in the final state as x.

Since the increments are bounded (Assumption 1), for every x ∈ X , the set Xx is

finite. The set Xx,t is a superset of all possible states that can be reached at time t

starting from state x (since the increments are uniformly bounded over action a and

distribution f); note that Xx,t is finite as well.

Lemma 45. Let (µ, f) be a mean field equilibrium. Consider an m-player game.

Let x(m)i,0 = x0 and suppose the initial state of every player (other than player i) is

independently sampled from the distribution f . That is, suppose x(m)j,0 ∼ f for all

j 6= i; let f (m) ∈ F(m) denote the initial population state. Let a(m)i,t be any sequence of

(possibly random, possibly history dependent) actions. Suppose players’ states evolve


as:

x(m)j,t+1 ∼ P

(· | x(m)

j,t , µ(x(m)j,t ), f

(m)−j,t

)∀ j = 1, 2, · · · ,m, j 6= i,

x(m)i,t+1 ∼ P

(· | x(m)

i,t , a(m)i,t , f

(m)−i,t

).

Then, for every initial state x0, for all times t,∥∥∥f

(m)−i,t − f

∥∥∥

1-p→ 0 almost surely as

m→∞.

Proof. Note that f ∈ F and hence ‖f‖1-p < ∞. Thus, given any ǫ > 0, there exists

a finite set Cǫ,f such that:

∑

x/∈Cǫ,f

‖x‖pp f(x) < ǫ. (A.6)

At t = 0, we have

f(m)−i,0(x) =

1

m− 1

m−1∑

j=1

1Xj,0=x,

where Xj,0 are i.i.d random variables distributed according to the distribution f .

Define:

Yj = ‖Xj,0‖pp 1Xj,0 6∈Cǫ,f.

Note that the Yj are i.i.d. random variables, with:

E[Yj] =∑

x 6∈Cǫ,f

‖x‖pp f(x).

Further, observe that:

∑

x 6∈Cǫ,f

‖x‖pp f(m)−i,0(x) =

1

m− 1

m−1∑

j=1

Yj.

Thus by the strong law of large numbers, almost surely as m→∞,

∑

x 6∈Cǫ,f

‖x‖pp f(m)−i,0(x)→

∑

x 6∈Cǫ,f

‖x‖pp f(x) < ǫ.


Now observe that:

∥∥∥f

(m)−i,0(x)− f

∥∥∥

1-p≤∑

x∈Cǫ,f

‖x‖pp |f(m)−i,0(x)− f(x)|+

∑

x/∈Cǫ,f

‖x‖pp f(m)−i,0(x) +

∑

x/∈Cǫ,f

‖x‖pp f(x).

Each of the second and third terms on the right hand side is almost surely less than

ǫ for sufficiently large m. For the first term, observe that |f (m)−i,0(x) − f(x)| → 0

almost surely, again by the strong law of large numbers (since f (m)(x) is the sample

average of m − 1 Bernoulli random variables with parameter f(x)). Thus the first

term approaches zero almost surely as m→∞ by the bounded convergence theorem.

Since ǫ was arbitrary, this proves that∥∥∥f

(m)−i,0 − f

∥∥∥

1-p→ 0 almost surely as m→∞.

We now use an induction argument; let us assume that,∥∥∥f

(m)−i,τ − f

∥∥∥

1-p→ 0 almost

surely as m→∞ for all times τ ≤ t. From the definition of f(m)−i,t+1 we have:

f(m)−i,t+1(y) =

1

m− 1

∑

j 6=i

1x

(m)j,t+1=y

,

where x(m)j,t+1 ∼ P

(· | x(m)

j,t , µ(x(m)j,t ), f

(m)−j,t

)for all j 6= i. Note that if two players have

the same initial state, then the population state from their viewpoint is identical.

That is, if x(m)j,t = x

(m)k,t , then f

(m)−j,t(y) = f

(m)−k,t(y) for all y ∈ X . We can thus redefine

the population state from the viewpoint of a player at a particular state. Let f(x,m)t

be the population state at time t from the viewpoint of a player at state x. Then, if

x(m)j,t = x

(m)k,t = x, then for all y ∈ X , f

(m)−j,t(y) = f

(m)−k,t(y) = f

(x,m)t (y). Without loss of

generality, we assume m > 1. Let η(m)−i,t(x) be the total number of players (excluding

player i) that have their state at time t as x, i.e., η(m)−i,t(x) = (m − 1)f

(m)−i,t(x). Note


that η(m)−i,t(x) = 0 if and only f

(m)−i,t(x) = 0. We can now write f

(m)−i,t+1(y) as:

f(m)−i,t+1(y) =

1

m− 1

∑

x∈X

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

=∑

x∈X

f(m)−i,t(x)

1

η(m)−i,t(x)

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

=∑

x∈Xy

f(m)−i,t(x)

1

η(m)−i,t(x)

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

(A.7)

where the last equality follows from Definition 44. Here, Y(m)j,x,t are random variables

that are independently drawn according to the transition kernel P(· | x, µ(x), f(x,m)t ).

Note that if η(m)−i,t(x) = 0, we interpret the term inside the parentheses as zero.

Let us now look at f(x,m)t . We have

f(x,m)t (z) = f

(m)−i,t(z) +

1

m− 11x

(m)i,t =z

−1

m− 11z=x.

Consider∥∥∥f

(x,m)t − f

∥∥∥

1-p. We have:

∥∥∥f

(x,m)t − f

∥∥∥

1-p=∑

z∈X

‖z‖pp

∣∣∣f

(x,m)t (z)− f(z)

∣∣∣

=∑

z∈X

‖z‖pp

∣∣∣∣f

(m)−i,t(z) +

1

m− 11x

(m)i,t =z

−1

m− 11z=x − f(z)

∣∣∣∣

≤∑

z∈X

‖z‖pp

∣∣∣f

(m)−i,t(z)− f(z)

∣∣∣+

1

m− 1

∑

z∈X

‖z‖pp 1x

(m)i,t =z

+1

m− 1

∑

z∈X

‖z‖pp 1z=x

=∥∥∥f

(m)−i,t − f

∥∥∥

1-p+

1

m− 1

∑

z∈X

‖z‖pp 1x

(m)i,t =z

+1

m− 1

∑

z∈X

‖z‖pp 1z=x

From the induction hypothesis, we have∥∥∥f

(m)−i,t − f

∥∥∥

1-p→ 0 almost surely as m→∞.


Note that at time t, x(m)i,t ∈ Xx0,t from equation (A.5), and Xx0,t is finite. Thus,

supm

∑

z∈X

‖z‖pp 1x

(m)i,t =z

<∞.

This implies that for all states x ∈ X ,∥∥∥f

(x,m)t − f

∥∥∥

1-p→ 0 almost surely as m →

∞. From Assumption 1, we know that the transition kernel is continuous in the

population state f (where F is endowed with the 1-p norm). Thus for every x ∈ X ,

we have almost surely:

P(· | x, µ(x), f(x,m)t )→ P(· | x, µ(x), f), (A.8)

as m→∞.

Next, we show that f(m)−i,t+1(y) → f(y) almost surely as m → ∞, for all y. We

leverage equation (A.7). Observe that the set of points x ∈ X where ‖x‖p ≤ 1 is

finite, since X is a subset of an integer lattice. From the induction hypothesis, as∑

x∈X ‖x‖pp|f

(m)−i,t(x)− f(x)| → 0 almost surely as m→∞, it follows that f

(m)−i,t(x)→

f(x) almost surely for all x ∈ X as x→∞.

Suppose that x ∈ Xy and f(x) > 0. Since f(m)−i,t(x) → f(x), it follows that

η(m)−i,t → ∞ as m → ∞, almost surely. Note that Y

(m)j,x,t are random variables that

are independently drawn according to the transition kernel P(·|x, µ(x), f(x,m)t ). From

equation (A.8), and Lemma 46, we get that for every x, y ∈ X , there holds

1

η(m)−i,t(x)

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

→ P(y|x, µ(x), f),

almost surely as m→∞.

On the other hand, suppose x ∈ Xy and f(x) = 0. Again, since f(m)−i,t(x) → f(x)

as x→∞, it follows that as m→∞, almost surely:

f(m)−i,t(x)

1

η(m)−i,t(x)

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

→ 0,


since the term in brackets is nonnegative and bounded. (Recall we interpret the term

in brackets as zero if f(m)−i,t(x) = 0.)

We conclude that, almost surely, as m→∞:

f(m)−i,t+1(y) =

∑

x∈Xy

f(m)−i,t(x)

1

η(m)−i,t(x)

η(m)−i,t(x)∑

j=1

1Y

(m)j,x,t=y

→

∑

x∈Xy

f(x)P(y|x, µ(x), f) = f(y).

To complete the proof, we need to show that∥∥∥f

(m)−i,t+1 − f

∥∥∥

1-p→ 0 almost surely

as m→∞. Since f(m)−i,t(x)→ f(x) almost surely, for all ǫ > 0 we have:

∑

x∈Cǫ,f

‖x‖ppf(m)−i,t(x)→

∑

x∈Cǫ,f

‖x‖ppf(x).

This together with the fact that ‖f (m)−i,t − f‖1-p → 0 implies that, almost surely:

lim supm→∞

∑

x∈Cǫ,f

‖x‖ppf(m)−i,t(x) < ǫ. (A.9)

Now at time t + 1, we have

∑

x/∈Cǫ,f

‖x‖pp f(m)−i,t+1 =

∑

x/∈Cǫ,f

d∑

ℓ=1

|xℓ|pf

(m)−i,t+1(x)

≤∑

x/∈Cǫ,f

d∑

ℓ=1

(|xℓ|+ M

)pf

(m)−i,t(x), (A.10)

where the equality follows because X is a subset of the d-dimensional integer lattice.

The last inequality follows from the fact that the increments are bounded (Assump-

tion 1). Without loss of generality, assume that |xℓ| ≥ 1 and that M ≥ 1. Then we


have:

(|xℓ|+ M

)p=

p∑

j=1

(p

j

)

|xℓ|jMp−j

≤

p∑

j=1

(p

j

)

|xℓ|pMp

= 2pMp|xℓ|p = K1|xℓ|

p,

where we let K1 = (2M)p. Substituting in equation (A.10), we have, almost surely,

lim supm→∞

∑

x/∈Cǫ,f

‖x‖pp f(m)−i,t+1 ≤

∑

x/∈Cǫ,f

d∑

ℓ=1

K1|xℓ|pf

(m)−i,t(x)

= K1

∑

x/∈Cǫ,f

‖x‖pp f(m)−i,t(x)

< K1ǫ,

where the last inequality follows from equation (A.9). Now observe that:

∥∥∥f

(m)−i,t+1 − f

∥∥∥

1-p≤∑

x∈Cǫ,f

‖x‖pp |f(m)−i,t+1(x)− f(x)|+

∑

x/∈Cǫ,f

‖x‖pp f(m)−i,t+1(x)

+∑

x/∈Cǫ,f

‖x‖pp f(x).

In taking a limsup on the left hand side, the second term on the right hand side

is almost surely less than K1ǫ. From the definition of Cǫ,f and equation (A.6), we

get that the third term on the right hand side is also less than ǫ. Finally, since for

every x |f (m)−i,t+1(x) − f(x)| → 0 almost surely as m → ∞, and Cǫ,f is finite, the

first term in the above equation approaches zero almost surely as m → ∞ by the

Bounded Convergence Theorem. Since ǫ was arbitrary, this proves the induction step

and hence the lemma.

The preceding proof uses the following refinement of the strong law of large num-

bers.


Lemma 46. Suppose 0 ≤ pk ≤ 1 for all k, and that pk → p as k → ∞. For each

k, let Y(k)1 , . . . , Y

(k)k be i.i.d. Bernoulli random variables with parameter pk. Then

almost surely:

limk→∞

1

k

k∑

i=1

Y(k)i = p.

Proof. Let ǫ > 0. By Hoeffding’s inequality, we have:

Prob

(∣∣∣∣∣

1

k

k∑

i=1

Y(k)i − pk

∣∣∣∣∣> ǫ

)

≤ 2e−kǫ2 ,

since 0 ≤ Y(k)i ≤ 1 for all i, k. Let ǫℓ = 1/ℓ; then by the Borel-Cantelli lemma, the

event on the left hand side in the preceding expression occurs for only finitely many

ℓ, almost surely. In other words, almost surely:

limk→∞

[

pk −1

k

k∑

i=1

Y(k)i

]

= 0.

The result follows.

Before we prove the AME property, we need some additional notation. Let (µ, f)

be a mean field equilibrium. Consider again an m player game and focus on player i.

Let x(m)i,0 = x0 and assume that player i uses a cognizant strategy µm. The initial

state of every other player j 6= i is independently drawn from the distribution f , that

is, x(m)j,0 ∼ f . Denote the initial distribution of all m− 1 players (excluding player i)

by f (m) ∈ F(m). The state evolution of player i is given by

x(m)i,t+1 ∼ P

(

· | x(m)i,t , a

(m)i,t ,f

(m)−i,t

)

, (A.11)

where a(m)i,t = µm

(x

(m)i,t ,f

(m)−i,t

)and f

(m)−i,t is the actual population distribution. Here

the superscript m on the state variable represents the fact that we are considering

an m player stochastic game. Let every other player j use the oblivious strategy µ


and thus their state evolution is given by

x(m)j,t+1 ∼ P

(

· | x(m)j,t , µ

(x

(m)j,t

), f

(m)−j,t

)

. (A.12)

Define V (m)(x, f (m) | µm,µ(m−1)

)to be the actual value function of player i, with its

initial state x, the initial distribution of the rest of the population as f (m) ∈ F(m),

when the player uses a cognizant strategy µm and every other player uses an oblivious

strategy µ. We have

V (m)(x, f (m) | µm,µ(m−1)

)=

E

[∞∑

t=0

βtπ(xi,t, ai,t,f

(m)−i,t

) ∣∣ xi,0 = x, f

(m)−i,0 = f (m); µi = µm,µ−i = µ(m−1)

]

. (A.13)

We define a new player that is coupled to player i in the m player stochastic games

defined above. We call this player the coupled player. Let x(m)i,t be the state of this

coupled player at time t. The subscript i and the superscript m reflect the fact that

this player is coupled to player i in an m player stochastic game. We assume that

the state evolution of this player is given by:

x(m)i,t+1 ∼ P

(·, x(m)

i,t , a(m)i,t , f

), (A.14)

where a(m)i,t = a

(m)i,t = µm

(x

(m)i,t ,f

(m)−i,t

). In other words, this coupled player takes

the same action as player i at every time t and this action depends on the actual

population state of m−1 players. However, note that the state evolution is dependent

only on the mean field distribution f . Let us define

V (m)(

x∣∣∣ f ; µm,µ(m−1)

)

=

E

[∞∑

t=0

βtπ(

x(m)i,t , a

(m)i,t , f

)

| x(m)i,0 = x0, a

(m)i,t = µm(xi,t,f

(m)−i,t); µ

(m−1)

]

. (A.15)

Thus, V (m)(x | f ; µm, µ) is the expected net present value of this coupled player, when

the player’s initial state is x, and the long run average population state is f . Observe


that

V (m)(x | f ; µm,µ(m−1)

)≤ sup

µ′∈M

V (m)(x | f ; µ′,µ(m−1))

= supµ′∈MO

V (m)(x | f ; µ′,µ(m−1))

= V ∗(x | f) = V (x | µ, f). (A.16)

Here, the first equality follows from Lemma 42, which implies that the supremum

over all cognizant strategies is the same as the supremum over oblivious strategies

(since the state evolution of other players does not affect the payoff of this coupled

player), and the last equality follows since µ ∈ P(f).

Lemma 47. Let (µ, f) be a mean field equilibrium and consider an m player game.

Let the initial state of player i be x(m)i,0 = x, and let f (m) ∈ F(m) be the initial population

state of m − 1 players whose initial state is sampled independently from the distri-

bution f . Assume that player i uses a cognizant strategy µm and every other player

uses the oblivious strategy µ. Their state evolutions are given by equation (A.11)

and (A.12). Also define a coupled player with initial state x(m)i,0 = x and let its state

evolution be given by equation (A.14).

Then, for all times t, and for every y ∈ X , we have

∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣→ 0,


Proof. The lemma is trivially true for t = 0. Let us assume that it holds for all

times τ = 0, 1, · · · , t− 1. Then, we have

Prob(

x(m)i,t = y

)

=∑

z∈Xy

Prob(

x(m)i,t−1 = z

)

P(

y∣∣∣ z, µm(z, f

(m)−i,t−1), f

(m)−i,t−1

)

Prob(x

(m)i,t = y

)=∑

z∈Xy

Prob(

x(m)i,t−1 = z

)

P(


(m)−i,t−1), f

)

.

Here we use the fact that the coupled player uses the same action as player i and the


state evolution of the coupled player is given by equation (A.14). Note that the sum-

mation is over all states in the finite set Xy, where Xy is defined as in equation (A.4).

From Lemma 45, we know that for all times t,∥∥∥f

(m)−i,t − f

∥∥∥


m→∞. From Assumption 1, we know that the transition kernel is jointly continuous

in the action a and distribution f (where the set of distributions F is endowed with

1-p norm). Since the action set is compact, this implies that for all y, z ∈ X ,

limm→∞

supa∈A

∣∣∣P(

y∣∣∣ z, a, f

(m)−i,t−1

)

−P(

y∣∣∣ z, a, f

)∣∣∣ = 0.

It follows that for every y, z ∈ X ,

limm→∞

∣∣∣P(


(m)−i,t−1), f

(m)−i,t−1

)

−P(


(m)−i,t−1), f

)∣∣∣ = 0

almost surely. From the induction hypothesis, we know that for every z ∈ X ,

∣∣∣Prob

(x

(m)i,t−1 = z

)− Prob

(x

(m)i,t−1 = z

)∣∣∣→ 0

almost surely as m→∞. This along with the finiteness of the set Xy, gives that for

every y ∈ X∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣→ 0

almost surely as m→∞. This proves the lemma.

Lemma 48. Let (µ, f) be a mean field equilibrium and consider an m player game.

Let the initial state of player i be x(m)i,0 = x, and let f (m) ∈ F(m) be the initial population

state of m − 1 players whose initial state is sampled independently from the distri-

bution f . Assume that player i uses a cognizant strategy µm and every other player

uses the oblivious strategy µ. Their state evolutions are given by equation (A.11)

and (A.12). Also define a coupled player with initial state x(m)i,0 = x and let its state

evolution be given by equation (A.14).

Then, for all times t, we have

lim supm→∞

E

[

π(

x(m)i,t , µm

(x

(m)i,t ,f

(m)−i,t

),f

(m)−i,t

)

− π(

x(m)i,t , µm

(x

(m)i,t ,f

(m)−i,t

), f)]

≤ 0,


almost surely.

Proof. Let us write a(m)i,t = µm

(x

(m)i,t ,f

(m)−i,t

). We have

∆(m)i,t = E

[

π(

x(m)i,t , a

(m)i,t ,f

(m)−i,t

)

− π(

x(m)i,t , a

(m)i,t , f

)]

= E

[

π(

x(m)i,t , a

(m)i,t ,f

(m)−i,t

)

− π(

x(m)i,t , a

(m)i,t , f

)]

+

E

[

π(

x(m)i,t , a

(m)i,t , f

)

− π(

x(m)i,t , a

(m)i,t , f

)]

, T(m)1,t + T

(m)2,t .

Consider the first term. We have

T(m)1,t =

∑

y∈X

Prob(x

(m)i,t = y

) (

π(

y, a(m)i,t ,f

(m)−i,t

)

− π(

y, a(m)i,t , f

))

≤∑

y∈X

Prob(x

(m)i,t = y

)supa∈A

∣∣∣π(

y, a, f(m)−i,t

)

− π (y, a, f)∣∣∣

=∑

y∈Xx,t

Prob(x

(m)i,t = y

)supa∈A

∣∣∣π(

y, a, f(m)−i,t

)

− π (y, a, f)∣∣∣ ,

where the last equality follows from the fact that x(m)i,0 = x and from equation (A.5).

From Assumption 1, we know that the payoff is jointly continuous in action a and

distribution f (with the set of distributions F endowed with 1-p norm) and the set A

is compact. Thus, for every y ∈ X , we have

supa∈A

∣∣∣π(

y, a, f(m)−i,t

)

− π (y, a, f)∣∣∣→ 0,

almost surely as m → ∞. This along with the fact that Xx,t is finite shows that

lim supm→∞ T(m)1,t ≤ 0 almost surely.


Now consider the second term. We have

T(m)2,t = E

[

π(

x(m)i,t , a

(m)i,t , f

)

−(

x(m)i,t , a

(m)i,t , f

)]

=∑

y∈X

Prob(x

(m)i,t = y

)π(y, a

(m)i,t , f

)−∑

y∈X

Prob(x

(m)i,t = y

)π(y, a

(m)i,t , f

)

=∑

y∈X

(

Prob(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

))

π(y, a

(m)i,t , f

)

≤∑

y∈X

∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣

∣∣∣π(y, a

(m)i,t , f

)∣∣∣

≤∑

y∈X

∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣ sup

a∈A

∣∣π(y, a, f)

∣∣

=∑

y∈Xx,t

∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣ sup

a∈A

∣∣π(y, a, f)

∣∣ ,

where the last equality follows from the fact that x(m)i,0 = x

(m)i,0 = x and from Defini-

tion 44. From Lemma 47, we know that for every y ∈ X ,

∣∣∣Prob

(x

(m)i,t = y

)− Prob

(x

(m)i,t = y

)∣∣∣→ 0

almost surely as m→∞. Since Xx,t is finite for every fixed x ∈ X and every time t,

this implies that lim supm→∞ T(m)2,t ≤ 0 almost surely. This proves the lemma.

Before we proceed further, we need one additional piece of notation. Once again

let (µ, f) be a mean field equilibrium and consider an oblivious player. Let xt be the

state of this oblivious player at time t. We assume that x0 = x and the since the

player used the oblivious strategy µ, the state evolution of this player is given by

xt+1 ∼ P(· | xt, at, f

)(A.17)

where at = µ(xt). Define V(x | µ, f

)to be the oblivious value function for this player


starting from state x. That is,

V(x | µ, f

), E

[ ∞∑

t=0

βtπ(xi,t, ai,t, f

)∣∣∣ xi,0 = x; µ

]

. (A.18)

Also, consider an m player game and focus on player i. We represent the state of

player i at time t by x(m)i,t . As before, the superscript m on the state variable represents

the fact that we are considering an m player stochastic game. Let x(m)i,0 = x and let

player i also use the oblivious strategy µ. The initial state of every other player

j 6= i is drawn independently from the distribution f , that is, x(m)j,0 ∼ f . Denote the

initial distribution of all m− 1 players (excluding player i) by f (m) ∈ F(m). The state

evolution of player i is then given by

x(m)i,t+1 ∼ P

(

· | x(m)i,t , a

(m)i,t ,f

(m)−i,t

)

, (A.19)

where a(m)i,t = µ

(x

(m)i,t

). Note that even though the player uses an oblivious strategy,

its state evolution is affected by the actual population state. Let every other player j

also use the oblivious strategy µ and let their state evolution be given by

x(m)j,t+1 ∼ P

(

·∣∣∣ x

(m)j,t , µ

(x

(m)j,t

), f

(m)−j,t

)

. (A.20)

Define V (m)(x, f (m) | µ(m)

)to be the actual value function of the player, when the

initial state of the player is x, the initial population distribution is f (m) and every

player uses the oblivious strategy µ. That is.

V (m)(x, f | µ(m)

)=

E

[∞∑

t=0

βtπ(xi,t, ai,t,f

(m)−i,t

) ∣∣ xi,0 = x, f

(m)−i,0 = f ; µi = µ,µ−i = µ(m)

]

. (A.21)

Lemma 49. Let (µ, f) be a mean field equilibrium and consider an m player stochastic

game. Let x(m)i,0 = x, and let f (m) ∈ F(m) be the initial population state of m − 1

players whose initial state is sampled independently from f . Assume that every player

uses the oblivious strategy µ and their state evolutions are given by equations (A.19)


and (A.20). Also, consider an oblivious player with x0 = x and let its state evolution

be given by equation (A.17).

Then, for every time t and for all x ∈ X , we have

∣∣∣Prob(xt = y)− Prob(x

(m)i,t = y)

∣∣∣→ 0, (A.22)


Proof. The lemma is trivially true for t = 0. Let us assume that it holds for all

times τ = 0, 1, · · · , t− 1. Then, we have

Prob (xt = y) =∑

z∈Xy

Prob (xt−1 = z)P(

y∣∣∣ z, µ(z), f

)

Prob(x

(m)i,t = y

)=∑

z∈Xy

Prob(

x(m)i,t−1 = z

)

P(

y∣∣∣ z, µ(z),f

(m)−i,t

)

.

Note that the summation above is over all states in a finite set Xy (as defined in

Defintion 44).

From Lemma 45, we know that for all times t,∥∥∥f

(m)−i,t − f

∥∥∥


m→∞. From Assumption 1, we know that the transition kernel is continuous in the

distribution (where the set of distributions F is endowed with the 1-p norm). From

the induction hypothesis, we know that∣∣∣Prob

(xt−1 = z

)− Prob

(x

(m)−i,t−1 = z

)∣∣∣ → 0.

This along with the finiteness of the set Xy, gives that for every x ∈ X

∣∣∣Prob (xt = x)− Prob

(x

(m)i,t = x

)∣∣∣→ 0

almost surely as m→∞. This proves the lemma.

Lemma 50. Let (µ, f) be a mean field equilibrium and consider an m player stochastic

game. Let x(m)i,0 = x, and let f (m) ∈ F(m) be the initial population state of m − 1

players whose initial state is sampled independently from f . Assume that every player

uses the oblivious strategy µ and their state evolutions are given by equations (A.19)

and (A.20). Also, consider an oblivious player with x0 = x and let its state evolution

be given by equation (A.17).


Then for all times t, we have

E

[

π(xt, µ(xt), f

)− π

(x

(m)i,t , µ(x

(m)i,t ),f

(m)−i,t

)]

→ 0,


Proof. Define ∆(m)i,t as

∆(m)i,t = E

[

π(xt, µ(xt), f

)− π

(x

(m)i,t , µ(x

(m)i,t ),f

(m)−i,t

)]

= E

[

π(xt, µ(xt), f

)− π

(xt, µ(xt),f

(m)−i,t)

]

+ E

[

π(xt, µ(xt),f

(m)−i,t)− π

(x

(m)i,t , µ(x

(m)i,t ),f

(m)−i,t

)]

, T(m)1,t + T

(m)2,t .

Note that from Lemma 45, we have that∥∥∥f

(m)−i,t − f

∥∥∥

1-p→ 0 almost surely as m→

∞. From Assumption 1, we know that the payoff is continuous in the distribution,

where the set of distributions F is endowed with 1-p norm. Thus, for every y and a,

we have∣∣∣π(y, a, f)− π(y, a, f

(m)−i,t)

∣∣∣→ 0, (A.23)

as m→∞. Consider the first term. We have:

T(m)1,t =

∑

y∈X

Prob (xt = y)∣∣∣π(y, µ(y), f)− π(y, µ(y),f

(m)−i,t)

∣∣∣

=∑

y∈Xx,t

Prob (xt = y)∣∣∣π(y, µ(y), f)− π(y, µ(y),f

(m)−i,t)

∣∣∣ ,

where the last equality follows from the fact that x0 = x and from Definition 44.

Since Xx,t is a finite set for every initial state x ∈ X and every time t, we get that

T(m)1,t → 0 almost surely as m→∞.


Consider now the second term. We have:

T(m)2,t = E

[

π(xt, µ(xt),f

(m)−i,t)− π

(x

(m)i,t , µ(x

(m)i,t ),f

(m)−i,t

)]

=∑

y∈X

Prob(xt = y

)π(y, µ(y),f

(m)−i,t)−

∑

y∈X

Prob(x

(m)i,t = y

)π(y, µ(y),f

(m)−i,t

)

=∑

y∈Xt

(

Prob(xt = y

)− Prob

(x

(m)i,t = y

))

π(y, µ(y),f

(m)−i,t).

From Lemma 49, equation (A.23), and the finiteness of Xx,t, we get that

lim supm→∞

T(m)2,t ≤ 0

almost surely. This proves the lemma.

Proof.[Proof of Theorem 32] Let us define

∆V (m)(x, f (m)) , V (m)(x, f (m) | µm,µ(m−1)

)− V (m)

(x, f (m) | µ(m)

).

Then we need to show that for all x, lim supm→∞ ∆V (m)(x, f (m)) ≤ 0 almost surely.

We can write

∆V (m)(x, f (m)) = V (m)(x, f (m) | µm,µ(m−1)

)− V (x | µ, f) + V (x | µ, f)

− V (m)(x, f (m) | µ(m)

)

≤ V (m)(x, f (m) | µm,µ(m−1)

)− V (m)

(x | f ; µm,µ(m−1)

)+ V (x | µ, f)

− V (m)(x, f (m) | µ(m)

)

, T(m)1 + T

(m)2 .

Here the inequality follows from equation (A.16). Consider the term T(m)1 . We have

T(m)1 = V (m)

(x, f (m) | µm,µ(m−1)

)− V (m)

(x | f ; µm,µ(m−1)

)

= E

[∞∑

t=0

βt(

π(x

(m)i,t , a

(m)i,t ,f

(m)−i,t

)− π

(x

(m)i,t , a

(m)i,t , f

))]

,


where the last equality follows from equations (A.13) and (A.15). Note that xi,0 =

xi,0 = x and ai,t = ai,t = µm

(xi,t,f

(m)−i,t

)and the state transitions of players are given

by equations (A.11), (A.12), and (A.14). From Lemma 48, we have

lim supm→∞

E

[T−1∑

t=0

βt(

π(x

(m)i,t , a

(m)i,t ,f

(m)−i,t

)− π

(x

(m)i,t , a

(m)i,t , f

))]

≤ 0,

almost surely for any finite time T . From Lemma 41, we have, almost surely

E

[∞∑

t=T

βt(

π(x

(m)i,t , a

(m)i,t ,f

(m)−i,t

)− π

(x

(m)i,t , a

(m)i,t , f

))]

≤ 2C(x, T ),

which goes to zero as T →∞. This proves that lim supm→∞ T(m)1 ≤ 0 almost surely.

Similar analysis (with an application of Lemma 50) shows that lim supm→∞ T(m)2 ≤ 0

almost surely, yielding the result.

A.3 Compactness: Proof

In this section, we provide the proof of Proposition 40. Throughout this subsection

we suppose Assumptions 1, 2, and 3 are in effect.

Lemma 51. Given x′ ≥ x, a ∈ A, and f ∈ F, there exists a probability space with

random variables ξ′ ∼ Q(· | x′, a, f), ξ ∼ Q(· | x, a, f), such that ξ′ ≤ ξ almost surely,

and x′ + ξ′ ≥ x + ξ almost surely.

Proof. The proof uses a standard coupling argument. Let U be a uniform ran-

dom variable on [0, 1]. Let F (resp., F ′) be the cumulative distribution function

of Q(· | x, a, f) (resp., Q(· | x′, a, f)), and let G be the cumulative distribution

function of P(· | x, a, f) (resp., P(· | x′, a, f)). By Assumption 2, P(· | x, a, f) is

stochastically nondecreasing in x, and by Assumption 3, Q(· | x, a, f) is stochasti-

cally decreasing in x. Thus for all z, F (z) ≤ F ′(z), but for all y, G(y) ≥ G′(y);

further, G(y) = F (y − x) (and G′(y) = F ′(y − x′)). Let ξ = infz : F (z) ≥ U,

and let ξ′ = infz : F ′(z) ≥ U. Then ξ ≥ ξ′. Rewriting the definitions, we also


have x + ξ = infy : F (y − x) ≥ U, and x′ + ξ′ = infy : F ′(y − x′) ≥ U, i.e.,

x + ξ =∈ y : G(y) ≥ U, and x′ + ξ′ = infy : G′(y) ≥ U. Thus it follows that

x′ + ξ′ ≥ x + ξ, as required.

Lemma 52. Fix ∆ ∈ Z, ∆ ≥ 0. Then as x→∞,

supf∈F

V ∗(x + ∆ | f)− V ∗(x | f))→ 0.

Proof. Fix f ∈ F, and fix x. Let x0 = x, and let x′0 = x + ∆. Let µ be an

optimal oblivious strategy given f , i.e., µ ∈ P(f); such a strategy exists by Lemma

42. Let x′t and a′

t denote the state and action sequence realized under the kernel

P(·|x, a, f), respectively, when a′t = µ(x′

t) and starting from x′0; and let xt denote

the state sequence realized using the action sequence a′t, starting from x0.

We use a coupling argument to study the difference in V (x+∆ | f) and V (x | f).

It follows from Lemma 51 that there exists a probability space with random variables

ξ0, ξ′0 such that ξ0 ∼ Q(·|x0, a

′0, f) and ξ′0 ∼ Q(·|x′

0, a′0, f), ξ0 ≥ ξ′0 almost surely,

and yet x0 + ξ0 ≤ x′0 + ξ′0 almost surely; this ensures that ξ0 − ξ′0 ≤ ∆. Proceeding

inductively, there exists a joint probability measure under which:

0 ≤ x′t − xt ≤ ∆,

for all t ≥ 0.

We now have the following sequence of inequalities:

V ∗(x + ∆ | f)− V ∗(x | f)

≤ E

[∑

t

βt(π(x′t, a

′t, f)− π(xt, a

′t, f))

∣∣∣ x′

0 = x + ∆, x0 = x, a′t = µ(x′

t)

]

≤ E

[∑

t

βt supδ:0≤δ≤∆

(π(xt + δ, a′t, f)− π(xt, a

′t, f))

∣∣∣ x′

0 = x + ∆, x0 = x, a′t = µ(x′

t)

]

≤ E

[∑

t

βt( supδ:0≤δ≤∆

(π(xt + δ, a, f)− π(xt, a, f))∣∣∣ x′

0 = x + ∆, x0 = x, a′t = µ(x′

t)

]

,


where f denotes the smallest distribution in F—i.e., the one that places all its mass

on state 0. The first inequality follows because the payoff received under the action

sequence a′t starting from x cannot be larger than V ∗(x | f). The second inequality

follows by taking a supremum over the difference in state, and the fact that (almost

surely) 0 ≤ x′t−xt ≤ ∆ for all t. The third inequality follows because π has decreasing

differences in x, a, and f , and because δ ≥ 0 (Assumption 3).

Now recall that increments are bounded (Assumption 3). Thus in time t, the

maximum distance the state could have moved from initial state x is tM . Thus if

x0 = x, then:

supδ:0≤δ≤∆

(π(xt + δ, a, f)−π(xt, a, f)) ≤ supǫ,δ:0≤δ≤∆,|ǫ|≤tM

(π(x+ ǫ+ δ, a, f)−π(x+ ǫ, a, f)).

Let Ax,t denote the right hand side of the preceding equation; note that this is a deter-

ministic quantity. Since the supremum is over a finite set, it follows from Assumption

3 that Ax,t → 0 as x→∞, for all fixed t.

Finally, observe that since π(x + δ, a, f) − π(x, a, f) → 0 as x → ∞, it follows

that:

sup0≤δ≤∆,y∈Z+

(π(y + δ, a, f)− π(y, a, f)) <∞.

We denote the left hand side of the preceding inequality by D.

Combining our arguments, we have:

V ∗(x + ∆, f)− V ∗(x, f) ≤T∑

t=0

βtAx,t +βT D

1− β. (A.24)

First taking the limit on the right hand side as x→∞, then taking the limit on the

right hand side as T → ∞ shows that V ∗(x + ∆, f) − V ∗(x, f) → 0 as x → ∞ for

every fixed f . Since the right hand side in (A.24) is independent of f , it follows that

convergence to zero is uniform in f , as required.

Lemma 53. Let µf be the unique optimal oblivious strategy given f , cf. Lemma 42


and Proposition 38. Then there exists ǫ > 0 and x such that for all x ≥ x,

supf

∑

z

zQ(z | x, µf (x), f) < −ǫ.

Proof. We first show that as x→∞,

supf‖µf (x)− a‖∞ → 0.

Suppose the preceding statement fails; then there exists r > 0 and a sequence

fn, xn such that ‖µfn(xn) − a‖∞ ≥ r for all n. We first use this fact to bound

π(xn, µfn(xn), fn) away from π(xn, a, fn).

Fix R > 0, and let Γ(R) be the optimal objective value of the following optimiza-

tion problem:

maximize π(0, a, f)

subject to ‖a− a‖∞ ≥ R

a ∈ A.

where f denotes the smallest distribution in F—i.e., the one that places all its mass

on state 0. By Assumption 3, it follows that Γ(R) ≤ π(0, a, f). We claim that in fact,

Γ(R) < π(0, a, f). Suppose not; then Γ(R) = π(0, a, f). Further, observe that the

objective function is continuous in a and the feasible region is compact, so at least one

feasible solution exists, say a∗. But then at a∗, π(0, a∗, f) = π(0, a, f), contradicting

Assumption 3.

So now we have:

π(xn, µfn(xn), fn)− π(xn, a, fn) ≤ π(0, µfn

(xn), f)− π(0, a, f)

≤ Γ(r)− π(0, a, f) < 0.

The first line follows by decreasing differences (Assumption 3), and the second line

by the definition of Γ(r). Importantly, note the bound on the right hand side is


independent of n.

On the other hand, we have:

∑

x′

V ∗(x′ | fn)P(x′ | xn, a, fn)−∑

x′

V ∗(x′ | fn)P(x′ | xn, µfn(xn), fn)

≥ V ∗(xn −M | fn)− V ∗(xn + M | fn)

≥ − supf

(V ∗(xn + M | f)− V ∗(xn −M | f)).

Here the first inequality follows because V ∗(x | f) is nondecreasing in x (Lemma 43),

and because increments are bounded (Assumption 3). By Lemma 52, it follows that

the right hand side above approaches zero as n→∞.

Combining our observations, for all sufficiently large n we have:

π(xn, µfn(xn), fn) + β

∑

x′

V ∗(x′ | fn)P(x′ | xn, µfn(xn), fn) <

π(xn, a, fn) + β∑

x′

V ∗(x′ | fn)P(x′ | xn, a, fn),

contradicting the fact that µfnis an optimal oblivious strategy (since Bellman’s op-

timality condition fails; see Lemma 42). We conclude that supf ‖µf (x)− a‖∞ → 0 as

x→∞.

Next, observe that∑

z zQ(z | x, a, f) is continuous in a, as the kernel P is con-

tinuous in a (Assumption 1) and increments are bounded (Assumption 3). Further,

we know that∑

z zQ(z | x, a, f) < 0 for all sufficiently large x, by Assumption

3. Since the increment kernel is stochastically nonincreasing in x, it follows that∑

z zQ(z | x, a, f) is nonincreasing in x for fixed a. Further, since the increment ker-

nel is stochastically nonincreasing in f , it follows that for any f ,∑

z zQ(z | x, a, f) ≤∑

z zQ(z | x, a, f). Combining these observations, we conclude there exists ǫ > 0

and δ > 0 such that if ‖a − a‖∞ < δ, then for all f ,∑

z zQ(z | x, a, f) < −ǫ for all

sufficiently large x. The desired result now follows since supf ‖µf (x) − a‖∞ → 0 as

x→∞.

Lemma 54. For every f , Φ(f) is nonempty.


Proof. As described in the discussion of Section 7.3, it suffices to show that the

state Markov chain induced by an optimal oblivious strategy possesses at least one

invariant distribution—i.e., that D(µ, f) is nonempty, where µ is the unique optimal

oblivious strategy given f .

We use a Foster-Lyapunov argument. Let U(x) = x. Then x ∈ X : U(x) ≤ K

is finite for all K, and by Lemma 53, for x ≥ x,

∑

x′

U(x′)P(x′ | x, µ(x), f) ≤ U(x)− ǫ.

Since increments are bounded (Assumption 3), it is trivial that:

sup0≤x≤x

(∑

x′

U(x′)P(x′ | x, µ(x), f)− U(x)

)

<∞.

It follows by the Foster-Lyapunov criterion that the resulting chain is positive recur-

rent, as required [44].

Lemma 55. For every η ∈ Z+,

supf

supφ∈Φ(f)

∑

x

xηφ(x) <∞.

Proof. We again use a Foster-Lyapunov argument. We proceed by induction; the

claim is clearly true if η = 0. So assume the claim is true up to η − 1; in particular,

define:

αk = supf

supφ∈Φ(f)

∑

x

xkφ(x)

for k = 0, . . . , η−1. Fix f , and let µ ∈ P(f) be the unique optimal oblivious strategy

given f . The preceding lemma establishes that the Markov chain induced by µ is


positive recurrent. Let U(x) = xη+1. Then:

∑

x′

U(x′)P(x′| x, µ(x), f) =∑

z

(x + z)η+1Q(z | x, µ(x), f)

=∑

z

η+1∑

k=0

(η + 1

k

)

xkzη+1−kQ(z | x, µ(x), f)

= U(x) + (η + 1)xη∑

z

zQ(z | x, µ(x), f)

+∑

z

η−1∑

k=0

(η + 1

k

)

xkzη+1−kQ(z | x, µ(x), f).

Define g(x) as:

g(x) =

η−1∑

k=0

(η + 1

k

)

Mη+1−kxk.

By the inductive hypothesis,

γ , supf

supφ∈Φ(f)

∑

x

g(x)φ(x) <∞.

Further, by Lemma 53, for all x ≥ x, we have:

∑

z

Q(z | x, µ(x), f) < −ǫ.

Define h(x) as:

h(x) =

−(η + 1)Mxη, if x ≤ x;

ǫ(η + 1)xη, if x > x.

It follows that:

∑

x′

U(x′)P(x′ | x, µ(x), f)− U(x) ≤ −h(x) + g(x),


so by the Foster-Lyapunov criterion [44] it follows that:

∑

x

h(x)φ(x) ≤∑

x

g(x)φ(x) ≤ γ.

Rearranging terms, we conclude that:

∑

x>x

xηφ(x) ≤γ

η + 1+ Mxη.

Thus:∑

x

xηφ(x) ≤γ

η + 1+ (M + 1)xη.

(Recall that the sum is only over nonnegative x.) Since the right hand side is finite

and independent of f and φ, the result follows.

Proof.[Proof of Proposition 40] We have already established that Φ(f) is nonempty

in Lemma 54. Define B as:

B = supf

supφ∈Φ(f)

∑

x

xp+1φ(x) <∞,

where the inequality is the result of Lemma 55.

We define the set FC as:

FC =

f ∈ F :∑

x

xp+1f(x) ≤ B

.

By definition, Φ(F) ⊂ FC . It is clear that FC is nonempty and convex. It remains to

be shown that FC is compact in the 1-p-norm. It is straightforward to check that FC

is complete; we show that FC is totally bounded, thus establishing compactness.

Fix ǫ > 0. Choose Kǫ so that B/Kǫ < ǫ. Then for all f ∈ FC :

∑

x≥Kǫ

xpf(x) ≤B

Kǫ

< ǫ. (A.25)


Let Sǫ be the projection of FC into the first Kǫ components; i.e.,

Sǫ = g ∈ RKǫ : ∃ f ∈ FC with g(x) = f(x)∀ x ≤ Kǫ.

It is straightforward to check that Sǫ is a compact subset of RKǫ ; so let f1, . . . , fℓ ∈ Sǫ

be a ǫ-cover of Sǫ (i.e., Sǫ is covered by the balls around f1, . . . , fℓ of radius ǫ in the

1-p-norm). Then it follows that f1, . . . , fℓ is a 2ǫ-cover of FC , since (A.25) bounds the

tail of any f ∈ FC by ǫ. This establishes that FC is totally bounded in the 1-p-norm,

as required.

Bibliography

[1] V. Abhishek, S. Adlakha, R. Johari, and G. Weintraub. Oblivious equilibrium

for general stochastic games with many players. Proceedings of the Allerton

Conference on Communication, Control and Computing, pages 892–896, 2007.

[2] S. Adlakha, R. Johari, G. Weintraub, and A. Goldsmith. Oblivious equilibrium

for large-scale stochastic games with unbounded costs. Proceedings of the IEEE

Conference on Decision and Control, 2008.

[3] C.D. Aliprantis and K.C. Border. Infinite dimensional analysis: a hitchhiker’s

guide. Springer Verlag, 2006.

[4] E. Altman. Applications of Markov decision processes in communication net-

works. Handbook of Markov Decision Processes: Methods and Applications, page

489, 2002.

[5] E. Altman and T. Basar. Optimal rate control for high speed telecommunication

networks. Proceedings of the IEEE Conference on Decision and Control, pages

1389–1394, 1995.

[6] E. Altman, T. Basar, and R. Srikant. Congestion control as a stochastic control

problem with action delays. Automatica, 12:1937–1950, 1999.

[7] E. Altman, V. Kambley, and A. Silva. Stochastic games with one step delay

sharing information pattern with application to power control. Proceeding of the

International Conference on Game Theory for Networks, pages 124–129, May

2009.

116

BIBLIOGRAPHY 117

[8] E. Altman and P. Nain. Closed-loop control with delayed information. Perfor-

mance Evaluation Review, 20:193–204, 1992.

[9] E. Altman and S. Stidham. Optimality of monotonic policies for two-action

Markovian decision processes, with applications to control of queues with delayed

information. Queueing Systems, 21(3):267–291, 1995.

[10] A. C. Antoulas and D. C. Sorensen. Approximation of large-scale dynamical sys-

tems: An overview. International Journal of Applied Mathematics and Computer

Sciences, 11(5):1093–1121, 2001.

[11] D. Artiges. Optimal routing into two heterogeneous service stations with delayed

information. IEEE Transactions on Automatic Control, 40(7):1234–1236, 1995.

[12] K. J. Astrom. Optimal control of Markov processes with incomplete state esti-

mation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.

[13] J. L. Bander and C. C. White III. Markov decision processes with noise-corrupted

and delayed state observations. Journal of Operational Research Society, 50:660–

668, 1999.

[14] T. Basar and N. Bansal. The theory of teams: a selective annotated bibliography.

Lecture Notes in Contol and Information Sciences, 119:186–201, 1989.

[15] T. Basar and J. B Cruz. Concepts and methods in multi-person coordination

and control. Optimization and Control of Dynamic Operational Research Models,

pages 351–387, 1982.

[16] Michael Basin, Jesus Rodriguez-Gonzalez, and Rodolfo Martinez-Zuniga. Op-

timal control for linear systems with time delay in control input based on the

duality principle. Proceedings of the American Control Conference, pages 2144–

2148, 2003.

[17] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.

BIBLIOGRAPHY 118

[18] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of

decentralized control of Markov decision processes. Mathematics of Operations

Research, 27(4):819–840, 2002.

[19] D. P. Bertsekas. Dynamic Programming and Optimal Control (Vol. 2). Athena

Scientific, Nashua, New Hampshire, 2001.

[20] D.P. Bertsekas. Dynamic Programming and optimal control, volume 1. Athena

Scientific, 1995.

[21] D.P. Bertsekas. Dynamic Programming and optimal control, volume 2. Athena

Scientific, 1995.

[22] D. J. Bertsimas and G. V. Ryzin. Stochastic and dynamic vehicle routing in

the Euclidean plane with multiple capacitated vehicles. Operations Research,

41(1):60–76, 1993.

[23] Aaron Bodoh-Creed. The simple behavior of large mechanisms. Under submis-

sion, 2010.

[24] S. P. Boyd and C. H. Barratt. Linear controller design. Prentice Hall, 1991.

[25] U. Doraszelski and A. Pakes. A framework for applied dynamic analysis in IO.

Handbook of Industrial Organization, Volume 3, 2007.

[26] Darrell Duffie, Semyon Malamud, and Gustavo Manso. Information percolation

with equilibrium search dynamics. Econometrica, 77(5):1513–1574, 2009.

[27] R. Ericson and A. Pakes. Markov-perfect industry dynamics: A framework for

empirical work. Review of Economic Studies, 62(1):53–82, 1995.

[28] D. Famolari, N. Mandayam, D. Goodman, and V. Shah. A new framework for

power control in wireless data networks: Games, utility and pricing. In Proceed-

ings of the Allerton Conference on Communication, Control and Computing,

volume 36, pages 546–555. Springer, 1998.

[29] D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.

BIBLIOGRAPHY 119

[30] B. Hajek. Optimal control of two interacting service stations. IEEE Transactions

on Automatic Control, 29(6):491–499, 1984.

[31] Y. C. Ho and K. C. Chu. Team decision theory and infromation structures in

optimal control problems – Part I. IEEE Transactions on Automatic Control,

17:15–22, 1972.

[32] Y.C. Ho. Team decision theory and information structures. Proceedings of the

IEEE, 68(6):644–654, 1980.

[33] K. Hsu and H. I. Marcus. Decentralized control of finite state Markov processes.

Proceedings of the IEEE Conference on Decision and Control including the Sym-

posium on Adaptive Processes, 19:143–148, 1980.

[34] K. Hsu and H. I. Marcus. Decentralized control of finite state Markov processes.

IEEE Transactions on Automatic Control, 2:426–431, 1982.

[35] M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG

problems with nonuniform agents: Individual-mass behavior and decentralized

ǫ-Nash equilibria. IEEE Transactions on Automatic Control, 52(9):1560–1571,

2007.

[36] M. Huang, R. P. Malhame, and P. E. Caines. Nash equilibria for large-population

linear stochastic systems of weakly coupled agents. Analysis, Control and Opti-

mization of Complex Dynamical Systems, pages 215–252, 2005.

[37] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.

[38] B. Jovanovic and R.W. Rosenthal. Anonymous sequential games. Journal of

Mathematical Economics, 17:77–87, 1988.

[39] K. V. Katsikopoulos and S. E. Engelbrecht. Markov decision processes with de-

lays and asynchronous cost collection. IEEE Transactions on Automatic Control,

48(4):568–574, 2003.

BIBLIOGRAPHY 120

[40] P. R. Kumar and P. Varaiya. Stochastic Systems: Estimation, Identification and

Adaptive Control. Prentice Hall, 1986.

[41] J. Kuri and A. Kumar. Optimal control of arrivals to queues with delayed queue

length information. IEEE Transactions on Automatic Control, 40(8):1444–1450,

1995.

[42] B. Kurtaran and R. Sivan. Linear-Quadratic-Gaussian control with one-step-

delay sharing pattern. IEEE Transactions on Automatic Control, 19(5):571–574,

1974.

[43] J. M. Lasry and P. L Lions. Mean field games. Japanese Journal of Mathematics,

2(1):229–260, 2007.

[44] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-

Verlag, 1993.

[45] George E. Monahan. A survey of partially observable Markov decision processes:

Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

[46] A. Pakes and P. McGuire. Computing Markov-perfect Nash equilibria: Numer-

ical implications of a dynamic differentiated product model. RAND Journal of

Economics, 25(4):555–589, 1994.

[47] A. Pakes and P. McGuire. Stochastic algorithms, symmetric Markov perfect

equilibrium, and the curse of dimensionality. Econometrica, 69(5):1261–1281,

2001.

[48] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-

gramming. Wiley, 1994.

[49] M. Rotkowitz and S. Lall. A characterization of convex problems in decentralized

control. IEEE Transactions on Automatic Control, 51(2):274–286, 2006.

[50] N. R. Sandell and M. Athans. Solution of some nonclassical LQG stochastic

decision problems. IEEE Transactions on Automatic Control, 19:108–116, 1974.

BIBLIOGRAPHY 121

[51] L. S. Shapley. Stochastic games. Proceeding of the National Academy of Sciences,

39:1095–1100, 1953.

[52] D. D. Siljak. Decentralized Control of Complex Systems. Academic Press, 1991.

[53] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable

Markov processes over a finite horizon. Operations Research, 21(5):1071–1088,

1973.

[54] P. Varaiya and J. Walrand. On delayed sharing patterns. IEEE Transactions on

Automatic Control, 23:443–445, 1978.

[55] P. G. Voulgaris. Optimal control of systems with delayed observation sharing

patterns via input-output methods. Proceedings of the IEEE Conference on

Decision and Control, pages 2311–2316, 2000.

[56] G. Y. Weintraub, C. L. Benkard, and B. Van Roy. Markov perfect industry

dynamics with many firms. Econometrica, 76(6):1375–1411, 2008.

stacks.stanford.edukx619tt4623/thesis_adlakha... · I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation

Documents