Stochastic Agent-Based Simulations of Social Networks
Garrett Bernstein and Kyle O'Brien
MIT Lincoln Laboratory; 244 Wood Street; Lexington, MA 02421
(garrett.bernstein, kyle.obrien)@ll.mit.edu
Keywords: Social networks, agent-based, mixed-membership,
activity model, observational model, human mobility
Abstract
The rapidly growing field of network analytics requires datasets
for use in evaluation. Real world data often lack truth and
simulated data lack narrative fidelity or statistical generality.
This paper presents a novel, mixed-membership, agent-based
simulation model to generate activity data with narrative power
while providing statistical diversity through random draws. The
model generalizes to a variety of network activity types such as
Internet and cellular communications, human mobility, and social
network interactions. The simulated actions over all agents can then
drive an application-specific observational model to render
measurements as one would collect in real-world experiments. We
apply this framework to human mobility and demonstrate its utility
in generating high-fidelity traffic data for network analytics.
1. INTRODUCTION
Understanding phenomena in real world networks is a prominent field
of research in many areas. There are a wide variety of inferential
tasks on phenomena in communication, social, and biological
networks; for example, email traffic between employees of a company
[4], vehicle traffic between physical locations [17], collaborations
between scientists [14], and protein-protein interactions [20]. These
studies range from clustering nodes into discrete communities to
anomaly detection to inferring attributes on individual nodes to
searching for specific activity embedded in background population
clutter. The analytical approaches taken to study networks require
vast amounts of complex, truthed data for algorithmic verification
and there is currently a dearth of sufficient network data. We
explore the causes of this data problem and introduce our solution:
a novel, two-tiered model that addresses drawbacks in current
options. The first tier is an agent-based, mixed-membership activity
model that is easy to parameterize and abstractly generates agents'
actions over time. The second tier is an application-specific
observational model that supplies researchers with the simulated
sensor data necessary to conduct experiments.

1 This work is sponsored by the Assistant Secretary of Defense for
Research & Engineering under Air Force Contract #FA8721-05-C-0002.
Opinions, interpretations, conclusions and recommendations are those
of the author and are not necessarily endorsed by the United States
Government.
Researchers face a trilemma of inadequate data from real world
datasets, statistical simulation models, and agent-based simulation
models. Large-scale real world data sets are expensive to collect,
and high-fidelity ground truth for them is difficult to obtain.
Statistical models, such as Erdős-Rényi, Chung-Lu, and blockmodels,
have parameters that are easy to specify and allow for simple
replication of large-scale data sets. What is often missing,
however, is the ability to encode narratives into the data because
there is no sense of individual agents, just interactions between
nodes. Hand-crafted agent-based models address this problem by
allowing for narratives in the sense of specific actions taken by an
agent throughout time. Those networks, however, may not result in
the desired aggregate statistical behavior and are usually difficult
to adapt to other applications.
Additionally, generating network data is only half of the modeling
problem. In the real world, data sets are not delivered as clean
networks with nodes and edges. Instead, algorithms must process them
in the form of noisy sensor observations. Therefore, simply using an
activity model is not enough to effectively simulate data for
network analytics. Instead, the simulation must be augmented with an
observational model for the particular application we wish to study.
Once the observable sensor data has been simulated, we can then feed
it into the desired network analytics and construct a network. This
flow can be seen in Figure 1, where the top half depicts the data
synthesis aspect, with parameters describing a population's
behavior, the activity model generating the population's actions
based on the parameters, and the observational model transforming
the actions into observable sensor data. The bottom half depicts the
network analysis problem, with networks being constructed from
observed sensor data and then algorithms inferring desired
properties and parameters of the network.
This paper is organized as follows: In Section 2 we give a brief
background of previous work on simulation models and discuss their
advantages and disadvantages. In Section 3 we introduce our activity
model that employs the most desirable aspects of statistical and
agent-based models. This model uses high-level population parameters
to drive an agent-based narrative, enabling it to create rich
network datasets while also allowing for generation of numerous,
statistically similar datasets for Monte Carlo purposes in analyzing
network algorithms. In Section 4 we further the utility of the
activity model by introducing the observational model that takes
abstract network interactions and transforms them into the simulated
output of a real sensor, allowing for realistic experimentation
methods. We conclude in Section 5 and discuss future work.

arXiv:1309.1747v1 [cs.SI] 6 Sep 2013

Figure 1: Network analytics workflow. Synthesis (top half):
Parameters, Activity Model, Observational Model. Analysis (bottom
half): Network Construction, Algorithms/Analytics, Parameters.
2. BACKGROUND
Network data come from two sources: collected real world data and
simulated data. Real world data sets are exactly what the
inferential tasks will be run on in deployment and thus can claim
the highest fidelity. In practice, however, collection of this type
of data faces many hurdles. Privacy of personal data can both cause
regulatory issues and hinder avenues of potential research, such as
[18] needing to follow experimental oversight regulations on cell
phone GPS tracking. The desired data can take a long time to collect
and only result in one data set, which is insufficient for Monte
Carlo purposes. For example, [22] required two years to observe a
social network of only thirty-four people. Possibly most importantly
for algorithm development, collecting sufficiently representative
and comprehensive data on a large scale is a daunting task. [19]
explores the prediction of clustering in protein interactions
collected by [8], but that data set represents only a snapshot of
the proteome averaged over all phases of the cell cycle.
Simulation models can be a powerful alternative for obtaining the
requisite experimental data and can be broken into two categories.
First, statistical models employ high-level statistics to describe
the aggregate behavior of a population. This allows researchers to
closely match the simulated population to the desired real world
population, but generally leads to network interactions that do not
have narrative fidelity. The simplest, such as Erdős-Rényi or
Chung-Lu, are easy to parameterize and can quickly provide many
large iterations but are only able to create populations with
homogeneous behavior [7] [1]. Slightly more complicated models, such
as RMAT [10] and Blockmodels [9], address that issue by generating
populations with power-law degree distributions and with block
community structures, respectively, properties which are prevalent
in many real world populations. These models, however, only specify
interactions between nodes and thus fail to encode a narrative of
specific actions over time that accurately reflects the constraints
and behavior of the target data.
The second type of simulation model, agent-based, fixes the lack of
narrative by building the network from the ground up. Instantiating
individual agents and directly simulating their behavior ensures
that the resulting network obeys the individual constraints imposed
on agents' activities. Achieving a high-fidelity narrative comes at
a cost, however, as agent-based models struggle with some aspects
that come naturally to statistical models. A simulated vehicle
motion dataset made available by the National Geospatial-Intelligence
Agency (NGA) is a one-off, hand-crafted, agent-based model that
simulates over 4000 people driving between over 5000 locations
throughout Baghdad, Iraq over a 48-hour time period. Because the
network was carefully handcrafted it reflects real world behavior
relatively well, but it took 2 man-years to create and only one
instance of it exists, so it cannot be used for Monte Carlo
experiments.
Agent-based models tend to be very specific in their application
focus and require highly tuned parameters and an intricate
understanding of the application space. [12] introduces a powerful
agent-based simulation model which provides the necessary data to
successfully study the spread of disease through a city in the
context of bioterrorism. The model employs pertinent data sources,
such as census data, school districts, drug purchases, and emergency
room visits. The attention to detail ensures a realistic simulation
model but necessitates expert knowledge and copious amounts of
target data, thus making it difficult to transfer the model to other
desirable applications or even for someone not fully experienced
with the model to re-parameterize it.
3. ACTIVITY MODEL
As mentioned in the Introduction, the development of network
inference algorithms suffers from a lack of truthed, high-fidelity
data. Real data is hard to collect and truthing it often runs into
privacy concerns. Simulated data can be generated but popular
generation methods lack statistical fidelity or narrative flow. In
this section we introduce a mixed-membership, agent-based model that
aims to provide both desirable aggregate statistics and realistic
agent interactions.
Instead of directly simulating nodes and edges, our approach to
simulating network data recognizes the fact that most network data
are created by individuals taking actions over time. This could be
users clicking on websites, people sending emails, or vehicles
driving to locations. We leverage this insight by directly
simulating agents' actions over time.
Just simulating agents' actions over time would be difficult both to
parameterize and to make diverse across agents. To add richness to
the data we introduce the concept of roles, which we define as the
agent's intention in executing an action. In this way, prior to
selecting an action the agent first decides what role it will adopt
and then chooses an action based on that role.
The model draws from the widely used concept of mixed membership, in
which data are treated as a mixture of classes. Latent Dirichlet
Allocation [3] treats documents as a mixture of thematic topics.
Optical Character Recognition [2] can be implemented to treat a
written character as a mixture of potential digits. Mixed-Membership
Stochastic Blockmodels [9] treat networks as a mixture of community
memberships. We extend these concepts so that instead of agents
taking on only one role throughout the entire simulation they can
instead have a mixture of intentions at any given time.
To add further fidelity to the data we enable agents to dynamically
change their mixture of roles over the total time of the simulation.
The mixture of roles can easily be made cyclical to reflect diurnal
patterns of behavior.
3.1. Mathematical Description
The essence of the activity model can be broken into three parts for
each event of choosing an action: drawing the time at which the
agent's event occurs, drawing the role the agent adopts during the
event, and drawing the action the agent actually executes as a
result of the event.
The plate model in Figure 2 represents the activity model that will
be described in detail in this section. Plate models are a
convenient way to depict algorithms with replicated variables.
Shaded boxes denote input parameters, circles denote random
variables, arrows denote variable dependencies, and plates denote
repeated variables, with the variable in the bottom corner of the
plate denoting the number of replications. We let N be the number of
agents, R be the number of roles, A be the number of possible
actions, T be the number of timespans into which we discretize the
total time of the simulation (e.g. day, evening, night), and $E_i$
be the number of events for the $i$th agent.

The algorithm depicted by the plate model in Figure 2 is written out
in Algorithm 1.
Choosing the event's time
Choosing the times of each agent's events first requires deciding
how many events will occur for the agent. We draw the number of
events $E_i \sim \mathrm{Poisson}(\tau/\lambda)$, where the input
$\tau$ is the total time of the simulation and the input $\lambda$
is the average amount of time the agent waits before initiating
another event. The times of the $E_i$ events are then drawn
uniformly over the entire time of the simulation.
To lend fidelity to agent behavior we allow their parameterizations
to take on different values in the input T timespans (e.g.
propensity towards work during the day, restaurants in the evening,
and home at night). We create the indicator $I^{(t)}_{ij}$, which
specifies during which of the T timespans the $j$th event for the
$i$th agent occurs.

Figure 2: Activity model. Plate diagram over agents and events,
$i \in \{1:N\}$; $j \in \{1:E_i\}$. N = # agents; R = # roles;
A = # actions; T = # timespan discretizations; $E_i$ = # events for
agent $i$.

Algorithm 1: Activity model algorithm
Data: User defines $\tau$, $\lambda$, X, $\gamma$, P, G
Result: A list of actions $a_i$ for each agent $i$ and the time of
each action
foreach agent $i \in \{1:N\}$ do
    // $\oslash$ denotes element-wise division
    $H_i \sim \mathrm{Multinomial}(G \oslash G\mathbf{1}^{(A)}, P)$
    $E_i \sim \mathrm{Poisson}(\tau/\lambda)$
    for event $j \in \{1:E_i\}$ do
        $I^{(t)}_{ij} \sim \mathrm{Uniform}(0, \tau)$
        $\pi_{ij} \sim \mathrm{Dirichlet}(I^{(t)}_{ij} X)$
        $I^{(z)}_{ij} \sim \mathrm{Multinomial}(\pi_{ij})$
        $I^{(g)}_{ij} \sim \mathrm{Bernoulli}(\gamma)$
        $Y_{ij} = I^{(g)}_{ij} H_i + (1 - I^{(g)}_{ij})(\neg H_i)$
        $\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$
        $a_{ij} \sim \mathrm{Multinomial}(\theta_{ij})$
    end
end
In actual implementation we employ Poisson count-time duality to
determine the number and timing of events by having each agent draw
sequential waiting times between actions from
$\mathrm{Exponential}(\lambda)$. This allows us to build action
durations into the wait times and ensures a realistic narrative in
which an agent's actions cannot overlap.
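The waiting-time formulation can be sketched in Python as follows.
This is a minimal illustration under stated assumptions, not the
implementation used for the results in this paper: the function
name, the use of minutes as the time unit, and the fixed per-action
`duration` argument are all hypothetical choices made for the
example.

```python
import random

def draw_event_times(total_time, mean_wait, rng, duration=0.0):
    """Draw event start times by accumulating exponential waiting times.

    total_time: length of the simulation (same units as mean_wait).
    mean_wait:  average wait before the agent initiates another event.
    duration:   assumed fixed length of each action (hypothetical), added
                to the wait so that consecutive actions cannot overlap.
    """
    times = []
    t = rng.expovariate(1.0 / mean_wait)  # expovariate takes a rate = 1 / mean
    while t < total_time:
        times.append(t)
        t += duration + rng.expovariate(1.0 / mean_wait)
    return times

rng = random.Random(0)
# One simulated day in minutes with an average wait of two hours.
event_times = draw_event_times(total_time=1440.0, mean_wait=120.0, rng=rng)
print(len(event_times), [round(t) for t in event_times[:3]])
```

With `duration=0.0` the number of events is Poisson distributed with
mean `total_time / mean_wait`, which is the count-time duality the
text relies on; a positive `duration` enforces a minimum gap between
event starts.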
Choosing the agent's role
Once the agent has picked a time for each of its events it then
chooses roles for those actions. We wish to avoid inputting a
specific distribution over roles for each agent as this will lead to
bias in Monte Carlo experiments. Instead we input X, the Dirichlet
concentration parameters, which specify the propensity of the
population toward roles at each timespan. To obtain the actual
distribution over roles for an event we draw
$\pi_{ij} \sim \mathrm{Dirichlet}(I^{(t)}_{ij} X)$. The actual role
for that action is then drawn as the indicator
$I^{(z)}_{ij} \sim \mathrm{Multinomial}(\pi_{ij}, 1)$.
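A minimal Python sketch of the role draw is given below, assuming
the concentration parameters are stored as one row per timespan and
one column per role. The helper names and the numeric values in `X`
are illustrative assumptions, not parameters taken from this paper.
The Dirichlet sample is obtained by normalizing independent Gamma
draws.

```python
import random

def draw_dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def draw_role(timespan, X, rng):
    """Draw pi ~ Dirichlet(X[timespan]), then one role index from pi."""
    pi = draw_dirichlet(X[timespan], rng)
    role = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    return role, pi

# Hypothetical concentrations: rows are timespans (day, evening, night),
# columns are roles (Home, Work, Public). Values are illustrative only.
X = [
    [1.0, 20.0, 2.0],   # day: strong propensity towards Work
    [5.0, 1.0, 10.0],   # evening: Public locations favored
    [30.0, 0.5, 0.5],   # night: mostly Home
]
rng = random.Random(0)
role, pi = draw_role(timespan=0, X=X, rng=rng)
```

Because a fresh distribution is drawn for every event, two agents
sharing the same `X` still behave differently, while larger
concentration values pull every draw closer to the population-level
propensities.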
Choosing the agent's action
Once the agent has drawn a role for its $j$th event it then must
draw an actual action to execute. Each action belongs to one of the
roles, as specified by the user input G, and can only be executed by
an agent in that role. Again, to avoid bias, instead of inputting
the distribution over all actions we will draw the distribution as
$\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$. $Y_{ij}$
is the set of Dirichlet concentration parameters that specify the
propensity of an agent towards every action given the role of the
event.

$Y_{ij}$ is constructed in two conceptual parts: whether or not this
is a normal event, and which actions are the normal ones. At the
beginning of the simulation, every agent must determine which
actions it deems normal and which actions are abnormal. The number
of each type of action the agent deems normal is specified by the
user input P. In splitting the actions this way the agent will have
a routine over time but still be allowed to deviate from the norm.
The agent draws its normal actions from
$H_i \sim \mathrm{Multinomial}(G \oslash G\mathbf{1}^{(A)}, P)$,
where $\oslash$ denotes element-wise division. This multinomial draw
is set up to uniformly draw normal actions from all possible actions
of each type.
To determine whether each event is normal or abnormal the agent
draws the indicator $I^{(g)}_{ij} \sim \mathrm{Bernoulli}(\gamma)$.
Then the agent constructs its Dirichlet propensity towards all
actions for that event by
$Y_{ij} = I^{(g)}_{ij} H_i + (1 - I^{(g)}_{ij})(\neg H_i)$, where
$\neg$ denotes negation. From that the agent draws its probability
distribution over all actions as
$\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$. Finally,
the agent draws the actual action executed for that event as the
indicator $I^{(a)}_{ij} \sim \mathrm{Multinomial}(\theta_{ij}, 1)$.
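The normal/abnormal split can be sketched in Python as below. This
is an illustration under explicit assumptions rather than the
authors' code: negating the normal set is restricted to actions of
the event's own role, actions with zero concentration are simply
dropped from the draw, and all names and numeric values
(`p_normal`, the entries of `P`) are hypothetical.

```python
import random

def draw_normal_actions(actions_by_role, P, rng):
    """For each role, uniformly pick P[role] of its actions as the normal set."""
    return {role: set(rng.sample(actions, P[role]))
            for role, actions in actions_by_role.items()}

def draw_action(role, actions_by_role, normal, p_normal, rng):
    """Draw one action for an event given its role.

    p_normal is the Bernoulli parameter for 'this event is normal'.
    Assumption of this sketch: the abnormal pool is the role's own actions
    that are not in the normal set.
    """
    is_normal = rng.random() < p_normal
    pool = [a for a in actions_by_role[role] if (a in normal[role]) == is_normal]
    if not pool:                       # e.g. every action of the role is normal
        pool = list(actions_by_role[role])
    # Symmetric Dirichlet(1, ..., 1) over the pool via normalized Gamma draws;
    # random.choices accepts unnormalized weights.
    weights = [rng.gammavariate(1.0, 1.0) for _ in pool]
    return rng.choices(pool, weights=weights, k=1)[0], is_normal

# Toy setup: 25 Home, 5 Work, and 10 Public locations.
actions_by_role = {
    "Home": list(range(0, 25)),
    "Work": list(range(25, 30)),
    "Public": list(range(30, 40)),
}
P = {"Home": 1, "Work": 1, "Public": 3}   # illustrative values
rng = random.Random(0)
normal = draw_normal_actions(actions_by_role, P, rng)
action, is_normal = draw_action("Public", actions_by_role, normal, 0.9, rng)
```

Setting `p_normal` close to 1 yields a strongly routine agent, while
values near 0.5 let the agent deviate from its routine about half of
the time.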
3.2. Activity Model Results
We present two types of results to show the capabilities of the
activity model. First we display the actions of an individual agent
under different parameter settings to show the richness of the
model. Second, we tune the parameters to a target data set to
demonstrate the ability to simulate data with specific desired
behavior and thus successfully provide sufficient data for network
analytic experiments.
So far we have discussed the activity model in an
application-agnostic manner. To ground the discussion of the results
we will place the activity model in the human mobility modality,
specifically people driving throughout a city. This application and
the motivation for network analysis of human mobility will be more
fully discussed in Section 4, but for now it suffices to say that
agents are people, an action is a person driving from their current
location to a destination location, and a role is the person's
intention in driving to the destination. For example, a person may
assume a work role and decide to drive to their company's building.
Richness of parameter settings
We show the spectrum of the attainable richness of data with varying
parameter settings using a toy human mobility example. We simulate
100 agents taking on three possible roles of Home, Work, and Public,
with 25, 5, and 10 locations of each type, respectively. A public
location may be a park or sports arena. Figure 3 provides
spatio-temporal plots of a single agent's actions over a 24-hour
time period to show the effects of varying parameters on behavior.
These plots display the progression of an agent's movements over the
duration of the simulation. Time proceeds along the positive x-axis
in minutes since midnight. Each integer on the positive y-axis
represents a location and the locations are separated vertically
with dashed lines by category. A solid horizontal line indicates the
agent staying at that single location over that span of time. A
diagonal line indicates an event occurring in which the agent
chooses and then travels to a new location.
We create three different parameter settings: deterministic, random,
and realistic. Figure 3a depicts parameters that cause an agent to
deterministically adopt the three roles over specified time spans
and choose to execute specified actions for each role. Figure 3b
depicts parameters that cause an agent to randomly choose a role at
any given point in time and to then randomly choose an action to
execute given that role. Figure 3c depicts parameters in the middle
ground where an agent has a normal lifestyle but still retains the
ability to deviate from the norm.
Comparison to target dataset
Simulated data is only useful if it faithfully reflects the target
data. Publicly available real-world network data is difficult to
employ for this purpose given privacy and security concerns, so we
will compare our model to the NGA dataset. As discussed in Section
2, the dataset is a one-off, hand-crafted, agent-based model that
simulates over 4000 people driving throughout Baghdad, Iraq over a
48-hour time period. The model is not perfect at simulating
realistic movements within a city but it is close enough for our
purposes to show that we can achieve a high-fidelity comparison to a
specific target.
Figure 3: Spatio-temporal plots depicting an agent's actions over a
24-hour time period in a toy example. The x-axis is minutes since
midnight. The y-axis is divided into three roles (Public, Home,
Work), with different y values representing different actions and
the marker shape also denoting to which role the action belongs.
Panels: (a) Deterministic parameter settings; (b) Random parameter
settings; (c) Realistic parameter settings.
First we parameterized the activity model to match the NGA data as
closely as possible. One strength of our model lies in using
relatively few, easy-to-set parameters. We directly matched simple
parameters such as the number of agents, N, the types of roles, R,
the number of locations, A, and the location categories, G. With
trivial analysis we were able to determine two input parameters not
directly accessible in the target data: how many of each action type
an agent deems normal, P, and agent propensities towards roles, X.
After using these parameters to simulate an instance of the data we
compared our output to the NGA dataset. In our data agents chose to
belong to 0.30 fewer locations of each type on average and visited
locations of each type 1.98 fewer times on average. Matching agents'
actions so closely to the target allows the simulated data to stand
in for the NGA dataset in desired Monte Carlo experiments. The
current activity model does not account for community structure in
the population so we omit any comparison in that domain, though an
expansion of the model to include community membership and affinity
towards shared locations is certainly feasible.
The simulated agents' individual activity also matched that of the
NGA agents, as can be seen in the two spatio-temporal plots in
Figure 4. Over many simulations the agents produce behavior
statistically similar but not identical to the target agents, as is
desirable for Monte Carlo experiments.
In addition to fidelity, running time is an important consideration
in simulating data so that Monte Carlo experiments with large
numbers of trials are feasible. To simulate NGA data with 4623
agents and 5444 locations over 24 hours on a standard Mac desktop
currently takes approximately 25 minutes on average to draw the
actions via the activity model and approximately 35 minutes on
average to generate the tracks via the observational model that will
be discussed in Section 4. Both models are parallelizable with
either Matlab's Parallel Toolbox or grid computing so we envision
being able to significantly speed up the implementation from its
current state.

Figure 4: Spatio-temporal plots comparing a simulated agent's and an
NGA agent's actions over a 24-hour time period. Panels: (a) NGA
Agent; (b) Simulated Agent. The x-axis is time since midnight in
minutes; the y-axis is action type (Home, Work, Worship, Market,
Shop, Barber, Restaurant, CoffeeHouse, Internet, Park, Soccer,
Other).
4. APPLICATION: HUMAN MOBILITY
The previous section introduced a generalized agent-based activity
model and then applied the model to a human mobility application by
tuning the parameters to simulate the NGA data set. That simulated
data contains agents' abstract actions but researchers may also want
to study the sensor observations resulting from the execution of
those actions.
Therefore, we introduce an observational model for human mobility
applications that generates tracks of vehicle movement within a
city. The observational model is flexible enough to create
simulations for any place in the world for which open-source map
data is available. To demonstrate this flexibility and assess the
fidelity of the results, we will tailor parameters of the model to
match properties of the NGA data set and then show a comparison of
the results.
4.1. Motivation
Tracking movement from location to location lends itself naturally
to network-related data and many interesting inference and
prediction applications. For example, in wide-area, aerial
surveillance systems, vehicle tracks can be used to forensically
reconstruct clandestine terrorist networks. Greater detection can be
achieved using network topology modeling to separate the foreground
network embedded in a much larger background network [17]. Location
data recorded by mobile phones has been used to track the movement
of large groups of students on a university campus to learn more
about human behavior and social network formation [18]. Location
data from mobile phone activity can also be used to measure
spatiotemporal changes in population to infer land use and
supplement urban planning and zoning regulations [11].
Collecting large-scale, persistent data to study human mobility,
however, is especially difficult. Aerial surveillance systems are
limited by practical issues such as flight endurance limitations and
line-of-sight occlusions. Even in the absence of these issues,
accurate vehicle tracking is very difficult. Furthermore, ground
truth is required to assess the accuracy of the observed network,
but activity of interest rarely includes ground truth, or else
requires pre-planned instrumentation such as GPS sensors. While cell
phones equipped with GPS sensors are ubiquitous, such data is
difficult to obtain because of legal and privacy concerns.
Simulating large-scale, persistent data sets to study human mobility
would greatly benefit the development of algorithms for applications
where such data is limited.
4.2. Observational Model
The observational model consists of four parts. First, we adapt the
activity model to our human mobility application by specifying
agents, actions, and roles in terms of human mobility. Next, we
obtain map data and define the road network on which we will create
vehicle tracks. Then we label nodes in the network according to the
type of physical location they represent. The final step is to
create vehicle tracks by defining paths between intended locations
and modeling sensor observations of that motion.
Our observation model can be made arbitrarily complex to replicate
specific sensor observability and noise characteristics. It is also
possible to simulate observational data for any conceivable type of
sensor. However, detailed sensor modeling is beyond the scope of
this paper and we assume that algorithms of interest will work at
the track level. That is, the algorithms will work with processed
detections associated over time to produce only location and time
data for each vehicle. Also, since we are interested in constructing
network relationships and not behavioral phenomena, we do not
simulate low-level micro-behavior of traffic, such as stop lights,
multiple lanes, collision avoidance, and traffic congestion. These
microscopic traffic phenomena can be studied in depth with
simulators such as SUMO (Simulation of Urban Mobility) [13].
Adapting the Activity Model
We create an observational model for the modality of human mobility,
specifically people driving throughout a city. To adapt the activity
model to this application we let agents be people within a city.
Then each agent performs the action of driving to a specific type of
location dictated by the agent's role at that time.
Creating the Road Network
We can define a road network for almost any city of interest using
open-source mapping data. One source of data is the OpenStreetMap
(OSM) wiki [5], which contains publicly available maps from around
the world. We are interested in two OSM data elements: nodes and
ways. Nodes are singular geospatial points. They are grouped
together into ordered lists called ways that define linear features
such as roads or closed polygons representing areas or perimeters.
Nodes and ways can also carry attributes, such as residential areas
and highway types.
We represent the road network with a graph where vertices are points
along every road. An edge exists between two points if they are
adjacent on the same road or if they coincide to indicate an
intersection. The graph is stored as a weighted adjacency matrix
whose weights correspond to the physical length of the node-to-node
road segments. Thus, we also have an encoding of the physical
distances between adjacent nodes. This is illustrated in Figure 5.
Figure 5: Illustration of road network representation. Dots are
nodes along the road and circles are intersections. A link exists
for nodes along a common road and for nodes that are intersections.
Distances are labeled between nodes. These relationships are stored
in a weighted adjacency matrix (sparsity plot of the adjacency
matrix: nz = 235520).
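The construction of the weighted graph from nodes and ways can be
sketched in Python as follows. This is a simplified illustration,
not the code used for the results here: it assumes ways that meet at
an intersection share a node id, treats every road as two-way,
stores the sparse adjacency matrix as a dictionary of dictionaries,
and uses made-up coordinates.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def build_road_graph(nodes, ways):
    """Return {node_id: {neighbor_id: length_m}} from OSM-style nodes and ways.

    nodes: {node_id: (lat, lon)}; ways: lists of node ids ordered along a road.
    Ways that share a node id meet there, which models an intersection.
    """
    graph = {node_id: {} for node_id in nodes}
    for way in ways:
        for u, v in zip(way, way[1:]):
            d = haversine_m(nodes[u], nodes[v])
            graph[u][v] = d
            graph[v][u] = d   # roads are treated as two-way in this sketch
    return graph

# Four illustrative nodes on two roads; node 2 is the intersection.
nodes = {1: (33.3100, 44.3600), 2: (33.3100, 44.3610),
         3: (33.3100, 44.3620), 4: (33.3110, 44.3610)}
ways = [[1, 2, 3], [2, 4]]
graph = build_road_graph(nodes, ways)
```

A dictionary of dictionaries keeps only the nonzero entries, which
suits road networks where each node has only a handful of neighbors.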
Physical Location Assignment
Next, we label the nodes in the road network according to the types
of physical locations they represent. There are two approaches to
this labeling depending on the types of map data that are available.
For the first case, node labeling is done directly. This is possible
with vectorized data where a one-to-one labeling is made for each
node. For example, each node and way in the OSM data format can be
attributed to indicate characteristics like businesses or highway
types.

For the other case, node labeling is done indirectly. Either a
vectorized set of points defines a polygonal area encompassing
multiple nodes, or rasterized data specifies a grid whose individual
regions may cover several nodes at once. Labeling can be done by
assigning the same attributes to all nodes within a region.
Depending on the application, labels may not be available for each
type of location we would like to simulate. In this case we can
infer node types based on related categories. For example, land
usage data is available from the National Land Cover Database (NLCD)
published by the U.S. Geological Survey [21]. The NLCD data comes in
raster format and enumerates each pixel with Census Feature Class
Codes (CFCC) that describe different levels of urban development as
well as housing and commercial zones. Therefore we can label each
node in the road network with a CFCC and then map that label to a
related location type. For example, in areas of high urban
development, we could label nodes as high-density apartments or
small businesses. Because a CFCC could encapsulate multiple types of
roles we define a probability vector $\pi_c$ for each CFCC, which is
a $1 \times R$-dimensional vector. Then for each node we draw from
$\mathrm{Multinomial}(\pi_c)$ to determine which role that node
adopts for the simulation. In this way we will end up with
neighborhoods of generally similar locations but still maintain
diversity.
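The probabilistic labeling step can be sketched in Python as below.
The land-use code names and the probability values are hypothetical
placeholders chosen for illustration; they are not actual CFCC
values or parameters from this paper.

```python
import random

ROLES = ["Home", "Work", "Public"]

# Hypothetical land-use codes with one role-probability vector per code.
# Values are illustrative only.
PI = {
    "residential": [0.85, 0.05, 0.10],
    "commercial":  [0.10, 0.60, 0.30],
}

def label_nodes(node_codes, pi_by_code, rng):
    """Draw one role per node from the probability vector of its land-use code."""
    return {node: rng.choices(ROLES, weights=pi_by_code[code], k=1)[0]
            for node, code in node_codes.items()}

rng = random.Random(0)
# 500 nodes in a residential area followed by 500 nodes in a commercial area.
node_codes = {n: ("residential" if n < 500 else "commercial") for n in range(1000)}
labels = label_nodes(node_codes, PI, rng)
```

Most nodes in the residential area end up labeled Home, yet a few
become Work or Public locations, which gives neighborhoods that are
similar in character without being uniform.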
Creating Vehicle Routes
After defining the road network and labeling the nodes of interest,
we can create the final vehicle tracks. To do this we find a route
between any two nodes by traversing the graph of the road network
using a pathfinding algorithm. While any number of algorithms can be
used we assume that people always take the shortest path. In this
paper we use Dijkstra's shortest path algorithm [6]. However, other
(possibly non-optimal) algorithms can easily be used instead, such
as the breadth-first-search algorithm, A* [15], or, for extremely
large networks, contraction hierarchies [16].
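For reference, a compact Python sketch of Dijkstra's algorithm on a
weighted graph is shown below. The graph format (a dictionary mapping
each node to its neighbors and segment lengths) and the toy network
are assumptions made for this example, not the data structures of
the original implementation.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on {node: {neighbor: weight}} with non-negative weights.

    Returns (total_length, [source, ..., target]) or (inf, []) if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        if u == target:
            break                         # target distance is now final
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Toy road network with segment lengths in meters.
graph = {
    "A": {"B": 100.0, "C": 250.0},
    "B": {"A": 100.0, "C": 120.0, "D": 400.0},
    "C": {"A": 250.0, "B": 120.0, "D": 150.0},
    "D": {"B": 400.0, "C": 150.0},
}
length, path = shortest_path(graph, "A", "D")
print(length, path)  # 370.0 ['A', 'B', 'C', 'D']
```

The returned node sequence is the input to the track-generation step
that turns a route into timed position observations.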
The output of the pathfinding algorithm is a sequence of nodes.
However, to produce a proper track containing vehicle positions over
regular time intervals, we need to assume basic vehicle motion
parameters and sensor observation parameters. In terms of vehicle
motion, we sample a distribution of expected vehicle speed to assign
to each track. In terms of sensor observations, we assume zero-mean,
Gaussian-distributed noise on the vehicle position and that
observations are reported at a constant frame rate. The particular
frame rate for each track is drawn from a distribution of expected
frame rates in case the sensor reports measurements asynchronously.
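One way to turn a node sequence into a track is sketched below. It assumes planar node coordinates in meters, constant speed along the route, and independent noise on each coordinate. The function name, the waypoints, and the parameter values are hypothetical; in the simulation, speed and frame interval would be drawn per track from their respective distributions rather than fixed.

```python
import math
import random


def make_track(waypoints, speed, frame_interval, noise_sigma, rng):
    """Return noisy (t, x, y) observations along a polyline of two or more waypoints."""
    # Cumulative distance travelled at each waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    track = []
    seg = 0  # index of the segment containing the current position
    k = 0    # frame counter
    while True:
        t = k * frame_interval
        s = speed * t  # distance travelled by time t
        if s > total:
            break
        while seg < len(cum) - 2 and cum[seg + 1] < s:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        frac = 0.0 if seg_len == 0 else (s - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = waypoints[seg], waypoints[seg + 1]
        # Linear interpolation plus zero-mean Gaussian position noise.
        x = x0 + frac * (x1 - x0) + rng.gauss(0.0, noise_sigma)
        y = y0 + frac * (y1 - y0) + rng.gauss(0.0, noise_sigma)
        track.append((t, x, y))
        k += 1
    return track


rng = random.Random(1)
# Hypothetical route: 100 m east, then 100 m north.
waypoints = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
track = make_track(waypoints, speed=8.0, frame_interval=0.5, noise_sigma=2.0, rng=rng)
```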
4.3. Results
A model is only useful if it simulates data faithful to the target.
In this section we show the results of the activity and observation
models with parameters set to match the NGA data set as closely as
possible and show a comparison of simulated vehicle tracks to those
from the NGA data set.
The full road network for Baghdad, built from OSM map data, is shown
in Figure 6. It comprises 14,087 ways and 68,548 nodes and covers
approximately 10 km by 10 km.
Figure 6: Full road network for Baghdad constructed with OSM data
on latitude and longitude axes.
For comparison purposes, we used the location labels from the NGA
data set to label the nodes in our road network. For each node in
the NGA data having a location category, we chose the
nearest-neighbor node on our road network to have the same location
category.
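A brute-force sketch of this nearest-neighbor label transfer follows. It assumes planar coordinates and uses hypothetical node identifiers and category names; for a network of tens of thousands of nodes a spatial index such as a k-d tree would be the more practical choice.

```python
import math


def transfer_labels(labeled_points, road_nodes):
    """Assign each labeled point's category to its nearest road-network node.

    labeled_points: list of ((x, y), category)
    road_nodes: dict mapping node id -> (x, y)
    """
    labels = {}
    for (px, py), category in labeled_points:
        nearest = min(
            road_nodes,
            key=lambda n: math.hypot(road_nodes[n][0] - px, road_nodes[n][1] - py),
        )
        # If two labeled points share a nearest node, the later one wins.
        labels[nearest] = category
    return labels


# Hypothetical road nodes and labeled source locations.
road_nodes = {1: (0.0, 0.0), 2: (10.0, 0.0), 3: (0.0, 10.0)}
labeled_points = [((1.0, 1.0), "residence"), ((9.0, 1.0), "market")]
node_labels = transfer_labels(labeled_points, road_nodes)
```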
To produce our track results, we had to adjust only two parameters
of the observation model: the distribution of vehicle velocity and
the distribution of observation frame rate. To determine these
distributions, we generated a histogram for each with 100 bins and
normalized it to create a proper probability mass function. Then,
during each phase of creating a track, these two distributions were
sampled to determine the respective parameters for that track.
Figure 7 shows the distributions of track velocity and track length
for our simulation compared to the NGA data after running the entire
simulation for all agents. As expected, the velocity distribution
matches very closely. An unanticipated result is the similarity of
the track length distributions, which are influenced by the similar
labeling of the road network locations and the shortest-path
assumption of the pathfinding algorithm.
Figure 8 shows measurement density heatmaps for the NGA data and our
simulation. As a result of matching the observation frame rate, the
overall density of our simulation closely resembles that of NGA.
Another notable similarity is that the traffic behavior in our
simulation resembles NGA data very well. Major highways have the
heaviest traffic while secondary roads are proportionately less
dense.
[Figure 7 panels: IDA Velocity Distribution and Simulation Velocity
Distribution (probability vs. velocity in m/s); IDA Track Length
Distribution and Simulation Track Length Distribution (probability
vs. length in km)]
Figure 7: Comparison of NGA and simulated tracks: (a) velocity
distribution and (b) track length distribution.
[Figure 8 panels: IDA Observation Density; Sim Observation Density]
Figure 8: Comparison of NGA and simulation observation density
heatmaps. Density level is on logarithmic scale.
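A log-scaled observation density of this kind can be computed with a simple grid count, as in the sketch below. The cell size, the base-10 logarithm, and the sample observations are assumptions for illustration; the paper does not specify how its heatmaps were binned.

```python
import math
from collections import Counter


def log_density(observations, cell_size):
    """Count observations per square grid cell and return log10 of each count.

    Cells with no observations are simply absent from the result.
    """
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in observations
    )
    return {cell: math.log10(n) for cell, n in counts.items()}


# Hypothetical observations: a busy highway cell, a quieter road, one stray point.
observations = [(5.0, 5.0)] * 100 + [(15.0, 5.0)] * 10 + [(25.0, 25.0)]
density = log_density(observations, cell_size=10.0)
```

On a logarithmic scale, the tenfold difference between the busy and quiet cells becomes a single step, which keeps lightly travelled roads visible next to major highways.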
5. CONCLUSION
This paper presents a novel, mixed-membership, agent-based
simulation model to generate network activity data for a broad range
of research applications. The model combines the best of real world
data, statistical simulation models, and agent-based simulation
models by being easy to implement, having narrative power, and
providing statistical diversity through random draws. We apply this
framework to study human mobility and demonstrate the model's
utility in generating high-fidelity traffic data for network
analytics. We also adapted the model to a high-fidelity NGA data set
and showed that the model can replicate its important properties.
While we were able to closely match the average agent activity to
the target NGA data set, it is clear that the activity model as is
may not be rich enough to satisfy all research needs. Agents'
propensity towards locations is currently defined at the population
level, so there is no direct sense of community, an important aspect
of many network analyses. To extend the model, we will introduce the
concept of lifestyles, which will group agents together by similar
affinities to locations. Additionally, locations in the NGA data set
are of only one type each, but we foresee the power of allowing an
activity to be executed while an agent is in various different
roles. To do so, we will apply the mixed-membership concept to
actions in terms of roles and allow that to vary over time to create
even greater richness. We also wish to further explore the
parameterization of the activity model by adapting it to
applications other than human mobility and using the resulting
sensor output. To do so, we will develop an observational model for
a different application, such as interactions on social networks
like Facebook, which will also demonstrate our model's utility to
different types of researchers.
REFERENCES
[1] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. pages 171-180, 2000.
[2] C. Bishop. Pattern recognition and machine learning, chapter 9: Mixture Models and EM. Springer, 2006.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March 2003.
[4] W. Cohen. Enron email dataset, 2009. http://www.cs.cmu.edu/~enron/.
[5] OpenStreetMap contributors. National land cover database 2006, September 2012. http://www.openstreetmap.org/.
[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms (2nd ed.), chapter 24.3: Dijkstra's algorithm. MIT Press and McGraw-Hill, 2001.
[7] P. Erdős and A. Rényi. On random graphs I. Publicationes Mathematicae, 6:290-297, 1959.
[8] A. Gavin et al. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440:631-636, 2006.
[9] E. Airoldi et al. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981-2014, June 2008.
[10] J. Leskovec et al. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In Knowledge Discovery in Databases: PKDD 2005.
[11] J. Toole et al. Inferring land use from mobile phone activity. In Proc. of the ACM SIGKDD Intl. Wkshp. on Urban Computing.
[12] K. Carley et al. Biowar: Scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 36(2):252-265, 2006.
[13] M. Behrisch et al. Sumo - simulation of urban mobility: An overview. In SIMUL 2011, 3rd Intl. Conf. on Advances in System Sim., pages 63-68, Barcelona, Spain, October 2011.
[14] M. Girvan et al. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99:7821, 2002.
[15] P.E. Hart et al. A formal basis for the heuristic determination of minimum cost paths. volume 4, pages 100-107, July 1968.
[16] R. Geisberger et al. Contraction hierarchies: faster and simpler hierarchical routing in road networks. In Proc. of the 7th Intl. Conf. on Experimental Algorithms.
[17] S. Smith et al. Network discovery using wide-area surveillance data. In Information Fusion, 2011 Proc. of the 14th Intl. Conf. on, pages 1-8, July 2011.
[18] W. Dong et al. Modeling the co-evolution of behaviors and social relationships using mobile phone data. In Proc. of the 10th Intl. Conf. on Mobile and Ubiquitous Multimedia, pages 134-143, 2011.
[19] Z. Xie et al. Proteome survey reveals modularity of the yeast cell machinery. Bioinformatics, 27:159-166, 2011.
[20] B. Junker and F. Schreiber. Analysis of biological networks, 2008. Wiley-Interscience.
[21] U.S. Geological Survey. National land cover database 2006 (nlcd2006), June 2012. http://www.mrlc.gov/nlcd06_data.php.
[22] W. Zachary. An information flow model for conflict and fission in small groups. J. of Anthropological Res., 33:452-473, 1977.