  • Stochastic Agent-Based Simulations of Social Networks

    Garrett Bernstein and Kyle O'Brien

    MIT Lincoln Laboratory; 244 Wood Street; Lexington, MA 02421
    (garrett.bernstein, kyle.obrien)@ll.mit.edu

    Keywords: Social networks, agent-based, mixed-membership, activity model, observational model, human mobility

    Abstract

    The rapidly growing field of network analytics requires data sets for use in evaluation. Real world data often lack truth and simulated data lack narrative fidelity or statistical generality. This paper presents a novel, mixed-membership, agent-based simulation model to generate activity data with narrative power while providing statistical diversity through random draws. The model generalizes to a variety of network activity types such as Internet and cellular communications, human mobility, and social network interactions. The simulated actions over all agents can then drive an application-specific observational model to render measurements as one would collect in real-world experiments. We apply this framework to human mobility and demonstrate its utility in generating high-fidelity traffic data for network analytics.¹

    1. INTRODUCTION

    Understanding phenomena in real world networks is a prominent field of research in many areas. There are a wide variety of inferential tasks on phenomena in communication, social, and biological networks; for example, email traffic between employees of a company [4], vehicle traffic between physical locations [17], collaborations between scientists [14], and protein-protein interactions [20]. These studies range from clustering nodes into discrete communities to anomaly detection to inferring attributes on individual nodes to searching for specific activity embedded in background population clutter. The analytical approaches taken to study networks require vast amounts of complex, truthed data for algorithmic verification, and there is currently a dearth of sufficient network data. We explore the causes of this data problem and introduce our solution: a novel, two-tiered model that addresses drawbacks in current options. The first tier is an agent-based, mixed-membership activity model that is easy to parametrize and abstractly generates agents' actions over time. The second tier is an application-specific observational model that supplies researchers with the simulated sensor data necessary to conduct experiments.

    ¹ This work is sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

    Researchers face a trilemma of inadequate data from real world data sets, statistical simulation models, and agent-based simulation models. Large-scale real world data sets are expensive to collect, and high-fidelity ground truth for them is difficult to obtain. Statistical models, such as Erdős-Rényi, Chung-Lu, and blockmodels, have parameters that are easy to specify and allow for simple replication of large-scale data sets. What is often missing, however, is the ability to encode narratives into the data because there is no sense of individual agents, just interactions between nodes. Hand-crafted agent-based models address this problem by allowing for narratives in the sense of specific actions taken by an agent throughout time. Those networks, however, may not result in the desired aggregate statistical behavior and are usually difficult to adapt to other applications.

    Additionally, generating network data is only half of the modeling problem. In the real world, data sets are not delivered as clean networks with nodes and edges. Instead, algorithms must process them in the form of noisy sensor observations. Therefore, simply using an activity model is not enough to effectively simulate data for network analytics. Instead, the simulation must be augmented with an observational model for the particular application we wish to study. Once the observable sensor data has been simulated, we can then feed it into the desired network analytics and construct a network. This flow can be seen in Figure 1, where the top half depicts the data synthesis aspect, with parameters describing a population's behavior, the activity model generating the population's actions based on the parameters, and the observational model transforming the actions into observable sensor data. The bottom half depicts the network analysis problem, with networks being constructed from observed sensor data and then algorithms inferring desired properties and parameters of the network.

    This paper is organized as follows: In Section 2 we give a brief background of previous work on simulation models and discuss their advantages and disadvantages. In Section 3 we introduce our activity model that employs the most desirable aspects of statistical and agent-based models. This model uses high-level population parameters to drive an agent-based narrative, enabling it to create rich network datasets while also allowing for generation of numerous, statistically similar datasets for Monte Carlo purposes in analyzing network algorithms.

    arXiv:1309.1747v1 [cs.SI] 6 Sep 2013

  • [Figure 1 diagram. Synthesis (top half): Parameters, Activity Model, Observational Model. Analysis (bottom half): Network Construction, Algorithms/Analytics, Parameters.]

    Figure 1: Network analytics workflow

    In Section 4 we extend the utility of the activity model by introducing the observational model, which takes abstract network interactions and transforms them into the simulated output of a real sensor, allowing for realistic experimentation methods. We conclude in Section 5 and discuss future work.

    2. BACKGROUND

    Network data come from two sources: collected real world data and simulated data. Real world data sets are exactly what the inferential tasks will be run on in deployment and thus can claim the highest fidelity. In practice, however, collection of this type of data faces many hurdles. Privacy of personal data can both cause regulatory issues and hinder avenues of potential research; for example, [18] needed to follow experimental oversight regulations on cell phone GPS tracking. The desired data can take a long time to collect and only result in one data set, which is insufficient for Monte Carlo purposes. For example, [22] required two years to observe a social network of only thirty-four people. Possibly most importantly for algorithm development, collecting sufficiently representative and comprehensive data on a large scale is a daunting task. [19] explores the prediction of clustering in protein interactions collected by [8], but that data set represents only a snapshot of the proteome averaged over all phases of the cell cycle.

    Simulation models can be a powerful alternative for obtaining the requisite experimental data and can be broken into two categories. First, statistical models employ high-level statistics to describe the aggregate behavior of a population. This allows researchers to closely match the simulated population to the desired real world population, but generally leads to network interactions that do not have narrative fidelity. The simplest, such as Erdős-Rényi or Chung-Lu, are easy to parameterize and can quickly provide many large iterations but are only able to create populations with homogeneous behavior [7] [1]. Slightly more complicated models, such as RMAT [10] and Blockmodels [9], address that issue by generating populations with power-law degree distributions and with block community structures, respectively, properties which are prevalent in many real world populations. These models, however, only specify interactions between nodes and thus fail to encode a narrative of specific actions over time that accurately reflects the constraints and behavior of the target data.

    The second type of simulation model, agent-based, fixes the lack of narrative by building the network from the ground up. Instantiating individual agents and directly simulating their behavior ensures that the resulting network obeys the individual constraints imposed on agents' activities. Achieving a high-fidelity narrative comes at a cost, however, as agent-based models struggle with some aspects that come naturally to statistical models. A simulated vehicle motion dataset made available by the National Geospatial-Intelligence Agency (NGA) is a one-off, hand-crafted, agent-based model that simulates over 4000 people driving between over 5000 locations throughout Baghdad, Iraq over a 48-hour time period. Because the network was carefully handcrafted it reflects real world behavior relatively well, but it took two man-years to create and only one instance of it exists, so it cannot be used for Monte Carlo experiments.

    Agent-based models tend to be very specific in their application focus and require highly tuned parameters and intricate understanding of the application space. [12] introduces a powerful agent-based simulation model which provides the necessary data to successfully study the spread of disease through a city in the context of bioterrorism. The model employs pertinent data sources, such as data from the census, school districts, drug purchases, and emergency room visits. The attention to detail ensures a realistic simulation model but necessitates expert knowledge and copious amounts of target data, thus making it difficult to transfer the model to other desirable applications or even for someone not fully experienced with the model to re-parameterize it.

    3. ACTIVITY MODEL

    As mentioned in the Introduction, the development of network inference algorithms suffers from a lack of truthed, high-fidelity data. Real data is hard to collect and truthing it often runs into privacy concerns. Simulated data can be generated but popular generation methods lack statistical fidelity or narrative flow. In this section we introduce a mixed-membership, agent-based model that aims to provide both desirable aggregate statistics and realistic agent interactions.

    Instead of directly simulating nodes and edges, our approach to simulating network data recognizes the fact that most network data are created by individuals taking actions over time. This could be users clicking on websites, people sending emails, or vehicles driving to locations. We leverage this insight by directly simulating agents' actions over time.

  • Just simulating agents' actions over time would make it difficult both to parameterize the model and to allow agents to be diverse in their nature. To add richness to the data we introduce the concept of roles, which we define as the agent's intention in executing an action. In this way, prior to selecting an action the agent first decides what role it will adopt and then chooses an action based on that role.

    The model draws from the widely-used concept of mixed-membership, in which data are treated as a mixture of classes. Latent Dirichlet Allocation [3] treats documents as a mixture of thematic topics. Optical Character Recognition [2] can be implemented to treat a written character as a mixture of potential digits. Mixed-Membership Stochastic Blockmodels [9] treat networks as a mixture of community membership. We extend these concepts so that instead of agents taking on only one role throughout the entire simulation they can instead have a mixture of intentions at any given time.

    To add further fidelity to the data we enable agents to dynamically change their mixture of roles over the total time of the simulation. The mixture of roles can easily be made cyclical to reflect diurnal patterns of behavior.

    3.1. Mathematical Description

    The essence of the activity model can be broken into three parts for each event of choosing an action: drawing the time at which the agent's event occurs, drawing the role the agent adopts during the event, and drawing the action the agent actually executes as a result of the event.

    The plate model in Figure 2 represents the activity model that will be described in detail in this section. Plate models are a convenient way to depict algorithms with replicated variables. Shaded boxes denote input parameters, circles denote random variables, arrows denote variable dependencies, and plates denote repeated variables, with the variable in the bottom corner of the plate denoting the number of replications. We let $N$ be the number of agents, $R$ be the number of roles, $A$ be the number of possible actions, $T$ be the number of timespans into which we discretize the total time of the simulation (e.g. day, evening, night), and $E_i$ be the number of events for the $i$th agent.

    The algorithm depicted by the plate model in Figure 2 is written out in Algorithm 1.

    Choosing the event's time

    Choosing the times of each agent's events first requires deciding how many events will occur for the agent. We draw the number of events $E_i \sim \mathrm{Poisson}(\tau/\lambda)$, where the input $\tau$ is the total time of the simulation and the input $\lambda$ is the average amount of time the agent waits before initiating another event. The times of the $E_i$ events are then drawn uniformly over the entire time of the simulation.

    To lend fidelity to agent behavior we allow their parameterizations to take on different values in the input $T$ timespans (e.g. propensity towards work during the day, restaurants in the evening, and home at night). We create the indicator $I^{(t)}_{ij}$, which specifies during which of the $T$ timespans the $j$th event for the $i$th agent occurs.

    [Figure 2 plate diagram. Plates: $i \in \{1 : N\}$ and $j \in \{1 : E_i\}$. Dimensions as labeled: $X$ is $(R, T)$; $G$, $H_i$, and $Y_{ij}$ are $(R, A)$; $P$ and $I^{(z)}_{ij}$ are $(R)$; $I^{(t)}_{ij}$ is $(T)$; $I^{(a)}_{ij}$ is $(A)$. $N$ = # agents; $R$ = # roles; $A$ = # actions; $T$ = # timespan discretizations; $E_i$ = # events for agent $i$.]

    Figure 2: Activity model

    Algorithm 1: Activity model algorithm

    Data: User defines $\tau$, $\lambda$, $X$, $\phi$, $P$, $G$
    Result: A list of actions $a_i$ for each agent $i$ and the time of each action

    foreach agent $i \in \{1 : N\}$ do
        ($\oslash$ denotes element-wise division)
        $H_i \sim \mathrm{Multinomial}(G \oslash G\mathbf{1}^{(A)}, P)$
        $E_i \sim \mathrm{Poisson}(\tau/\lambda)$
        for event $j \in \{1 : E_i\}$ do
            $I^{(t)}_{ij} \sim \mathrm{Uniform}(0, \tau)$
            $\pi_{ij} \sim \mathrm{Dirichlet}(I^{(t)}_{ij} X)$
            $I^{(z)}_{ij} \sim \mathrm{Multinomial}(\pi_{ij})$
            $I^{(g)}_{ij} \sim \mathrm{Bernoulli}(\phi)$
            $Y_{ij} = I^{(g)}_{ij} H_i + (1 - I^{(g)}_{ij})(\neg H_i)$
            $\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$
            $a_{ij} \sim \mathrm{Multinomial}(\theta_{ij})$
        end
    end

    In actual implementation we employ Poisson count-time duality to determine the number and timing of events by having each agent draw sequential waiting times between actions from $\mathrm{Exponential}(\lambda)$, with $\lambda$ the average waiting time. This allows us to build action durations into the wait times and ensures a realistic narrative in which an agent's actions cannot overlap.
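
    As an illustration only, the waiting-time scheme can be sketched in Python. This is not the authors' implementation (the paper mentions Matlab); the function names, the fixed `duration` argument, and the `boundaries` list of timespan edges are hypothetical choices made for the sketch.

```python
import random


def draw_event_times(total_time, mean_wait, duration=0.0, rng=None):
    """Return increasing event start times in [0, total_time).

    Waiting times between events are exponential with mean `mean_wait`.
    `duration` is a hypothetical fixed time that each action occupies; the
    next wait starts only after it elapses, so actions never overlap.
    """
    rng = rng or random.Random()
    times = []
    # random.expovariate takes a rate, i.e. 1 / mean.
    clock = rng.expovariate(1.0 / mean_wait)
    while clock < total_time:
        times.append(clock)
        # The action runs for `duration`, then the agent waits again.
        clock += duration + rng.expovariate(1.0 / mean_wait)
    return times


def timespan_index(t, boundaries):
    """Map time t to the index of the timespan that contains it.

    `boundaries` holds the right edge of every timespan in increasing order,
    so the returned index plays the role of the indicator I^(t).
    """
    for k, edge in enumerate(boundaries):
        if t < edge:
            return k
    return len(boundaries) - 1
```

    With `duration=0.0` the number of events returned over `total_time` follows the Poisson count described above; a positive duration trades some events for non-overlapping actions.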

  • Choosing the agent's role

    Once the agent has picked a time for each of its events it then chooses roles for those actions. We wish to avoid inputting a specific distribution over roles for each agent as this will lead to bias in Monte Carlo experiments. Instead we input $X$, the Dirichlet concentration parameters, which specify the propensity of the population toward roles at each timespan. To obtain the actual distribution over roles for an event we draw $\pi_{ij} \sim \mathrm{Dirichlet}(I^{(t)}_{ij} X)$. The actual role for that action is then drawn as the indicator $I^{(z)}_{ij} \sim \mathrm{Multinomial}(\pi_{ij}, 1)$.
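
    The two-stage role draw can be sketched as follows, under stated assumptions: `X` is stored as a nested list with `X[r][t]` the concentration toward role `r` in timespan `t`, and the Dirichlet sample is obtained by the standard construction of normalizing independent Gamma draws. All names are hypothetical.

```python
import random


def draw_dirichlet(concentration, rng):
    """Sample a Dirichlet vector by normalizing independent Gamma draws.

    Entries with zero concentration receive probability zero. At least one
    entry must be positive.
    """
    gammas = [rng.gammavariate(c, 1.0) if c > 0 else 0.0 for c in concentration]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_categorical(probs, rng):
    """Return the index selected by a single-trial multinomial draw."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def draw_role(X, timespan, rng):
    """Draw the role for one event that falls in the given timespan."""
    # Picking the column for this timespan is what the product I^(t) X selects.
    column = [row[timespan] for row in X]
    pi = draw_dirichlet(column, rng)          # per-event distribution over roles
    return draw_categorical(pi, rng), pi
```

    Because `pi` is redrawn for every event, two agents sharing the same `X` still produce different role sequences, which is the point of supplying concentrations rather than a fixed distribution.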

    Choosing the agent's action

    Once the agent has drawn a role for its $j$th event it then must draw an actual action to execute. Each action belongs to one of the roles, as specified by the user input $G$, and can only be executed by an agent in that role. Again, to avoid bias, instead of inputting the distribution over all actions we draw the distribution as $\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$. $Y_{ij}$ is the set of Dirichlet concentration parameters that specify the propensity of an agent towards every action given the role of the event.

    $Y_{ij}$ is constructed in two conceptual parts: whether or not this is a normal event, and which actions are the normal ones. At the beginning of the simulation, every agent must determine which actions it deems normal and which it deems abnormal. The number of actions of each type the agent deems normal is specified by the user input $P$. In splitting the actions this way the agent will have a routine over time but still be allowed to deviate from the norm. The agent draws its normal actions from $H_i \sim \mathrm{Multinomial}(G \oslash G\mathbf{1}^{(A)}, P)$, where $\oslash$ denotes element-wise division. This multinomial draw is set up to uniformly draw normal actions from all possible actions of each type.
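
    A minimal sketch of selecting an agent's normal actions is given below. It assumes `G` is a 0/1 role-by-action membership matrix and, as a simplification of the multinomial draw, samples `P[r]` distinct actions uniformly from each role; the helper names and the example sizes are hypothetical.

```python
import random


def build_membership(sizes):
    """Build a 0/1 matrix G where G[r][a] = 1 if action a belongs to role r.

    `sizes[r]` is the number of actions in role r; actions are numbered
    consecutively by role.
    """
    total = sum(sizes)
    G, start = [], 0
    for n in sizes:
        G.append([1 if start <= a < start + n else 0 for a in range(total)])
        start += n
    return G


def row_normalize(G):
    """Divide each row of G by its row sum, i.e. G element-wise over G 1."""
    return [[g / sum(row) for g in row] for row in G]


def draw_normal_actions(G, P, rng):
    """Mark P[r] actions of each role r as normal, chosen uniformly.

    Assumption: actions are sampled without replacement, so P[r] must not
    exceed the number of actions in role r.
    """
    H = [[0] * len(row) for row in G]
    for r, row in enumerate(G):
        members = [a for a, flag in enumerate(row) if flag]
        for a in rng.sample(members, P[r]):
            H[r][a] = 1
    return H
```

    For example, `build_membership([25, 5, 10])` gives a three-role matrix over 40 actions, and each row of `row_normalize` of that matrix is the uniform distribution over that role's actions.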

    To determine whether each event is normal or abnormal the agent draws the indicator $I^{(g)}_{ij} \sim \mathrm{Bernoulli}(\phi)$, where $\phi$ is the probability that an event is normal. Then the agent constructs its Dirichlet propensity towards all actions for that event by $Y_{ij} = I^{(g)}_{ij} H_i + (1 - I^{(g)}_{ij})(\neg H_i)$, where $\neg$ denotes negation. From that the agent draws its probability distribution over all actions as $\theta_{ij} \sim \mathrm{Dirichlet}(I^{(z)}_{ij} Y_{ij})$. Finally, the agent draws the actual action executed for that event as the indicator $I^{(a)}_{ij} \sim \mathrm{Multinomial}(\theta_{ij}, 1)$.
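
    The per-event action draw might look like the sketch below. Two choices here are assumptions rather than statements from the text: the negation of $H_i$ is restricted to actions of the event's own role (since an action can only be executed in its role), and when the selected set is empty the draw falls back to all actions of the role. Names such as `phi` and `draw_action` are hypothetical.

```python
import random


def draw_dirichlet(concentration, rng):
    """Sample a Dirichlet vector by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(c, 1.0) if c > 0 else 0.0 for c in concentration]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_action(H, G, role, phi, rng):
    """Draw one action for an event in the given role.

    H[r][a] = 1 marks the agent's normal actions, G[r][a] = 1 marks role
    membership, and phi is the probability that the event is normal.
    Returns (action index, whether the event was normal).
    """
    is_normal = rng.random() < phi            # indicator I^(g)
    if is_normal:
        y_row = list(H[role])                 # normal actions of this role
    else:
        # Assumption: abnormal actions are the role's non-normal actions.
        y_row = [g - h for g, h in zip(G[role], H[role])]
    if not any(y_row):
        y_row = list(G[role])                 # assumption: fall back to the role
    theta = draw_dirichlet(y_row, rng)        # distribution over actions
    action = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    return action, is_normal
```

    Setting `phi` close to 1 yields an agent that almost always keeps to its routine, while smaller values let it deviate more often.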

    3.2. Activity Model Results

    We present two types of results to show the capabilities of the activity model. First we display the actions of an individual agent under different parameter settings to show the richness of the model. Second, we tune the parameters to a target data set to demonstrate the ability to simulate data with specific desired behavior and thus successfully provide sufficient data for network analytic experiments.

    So far we have discussed the activity model in an application-agnostic manner. To ground the discussion of the results we will place the activity model in the human mobility modality, specifically people driving throughout a city. This application and the motivation for network analysis of human mobility will be more fully discussed in Section 4, but for now it suffices to say that agents are people, an action is a person driving from their current location to a destination location, and a role is the person's intention to drive to the destination. For example, a person may assume a work role and decide to drive to their company's building.

    Richness of parameter settings

    We show the spectrum of the attainable richness of data with varying parameter settings using a toy human mobility example. We simulate 100 agents taking on three possible roles of Home, Work, and Public, with 25, 5, and 10 locations of each type, respectively. A public location may be a park or sports arena. Figure 3 provides spatio-temporal plots of a single agent's actions over a 24-hour time period to show the effects of varying parameters on behavior. These plots display the progression of an agent's movements over the duration of the simulation. Time proceeds along the positive x-axis in minutes since midnight. Each integer on the positive y-axis represents a location and the locations are separated vertically with dashed lines by category. A solid horizontal line indicates the agent staying at that single location over that span of time. A diagonal line indicates an event occurring in which the agent chooses and then travels to a new location.

    We create three different parameter settings: deterministic, random, and realistic. Figure 3a depicts parameters that cause an agent to deterministically adopt the three roles over specified time spans and choose to execute specified actions for each role. Figure 3b depicts parameters that cause an agent to randomly choose a role at any given point in time and to then randomly choose an action to execute given that role. Figure 3c depicts parameters in the middle ground, where an agent has a normal lifestyle but still retains the ability to deviate from the norm.

    Comparison to target dataset

    Simulated data is only useful if it faithfully reflects the target data. Publicly available real-world network data is difficult to employ for this purpose given privacy and security concerns, so we will compare our model to the NGA dataset. As discussed in Section 2, the dataset is a one-off, hand-crafted, agent-based model that simulates over 4000 people driving throughout Baghdad, Iraq over a 48-hour time period. The model is not perfect at simulating realistic movements within a city but it is close enough for our purposes to show that we can achieve a high-fidelity comparison to a specific target.

  • [Figure 3, three panels: (a) "Agent's Deterministic Actions Over Time", deterministic parameter settings; (b) "Agent's Random Actions Over Time", random parameter settings; (c) "Agent's Realistic Actions Over Time", realistic parameter settings. In each panel the x-axis is time since midnight (m), from 0 to 1400, and the y-axis is action type (Public, Home, Work).]

    Figure 3: Spatio-temporal plots depicting an agent's actions over a 24-hour time period in a toy example. The x-axis is minutes since midnight. The y-axis is divided into three roles, with different y values representing different actions and the marker shape also denoting to which role the action belongs.

    First we parameterized the activity model to match the NGA data as closely as possible. One strength of our model lies in using relatively few, easy-to-set parameters. We directly matched simple parameters such as the number of agents, $N$, the types of roles, $R$, the number of locations, $A$, and the location categories, $G$. With trivial analysis we were able to determine two input parameters not directly accessible in the target data: how many actions of each type an agent deems normal, $P$, and agent propensities towards roles, $X$.

    After using these parameters to simulate an instance of the data we compared our output to the NGA dataset. In our data agents chose to belong to 0.30 fewer locations of each type on average and visited locations of each type 1.98 fewer times on average. Matching agents' actions so closely to the target allows the simulated data to stand in for the NGA dataset in desired Monte Carlo experiments. The current activity model does not account for community structure in the population so we omit any comparison in that domain, though an expansion of the model to include community membership and affinity towards shared locations is certainly feasible.

    The simulated agents' individual activity also matched that of the NGA agents, as can be seen in the two spatio-temporal plots in Figure 4. Over many simulations the agents produce behavior statistically similar but not identical to the target agents, as is desirable for Monte Carlo experiments.

    [Figure 4, two panels: (a) NGA Agent, "IDA Agent's Actions Over Time"; (b) Simulated Agent, "Simulated Agent's Actions Over Time". In each panel the x-axis is time since midnight (m), from 0 to 1400, and the y-axis is action type (Home, Work, Worship, Market, Shop, Barber, Restaurant, CoffeeHouse, Internet, Park, Soccer, Other).]

    Figure 4: Spatio-temporal plots comparing a simulated agent's and an NGA agent's actions over a 24-hour time period.

    In addition to fidelity, running time is an important consideration in simulating data so that Monte Carlo experiments with large numbers of trials are feasible. To simulate NGA data with 4623 agents and 5444 locations over 24 hours on a standard Mac desktop currently takes approximately 25 minutes on average to draw the actions via the activity model and approximately 35 minutes on average to generate the tracks via the observational model that will be discussed in Section 4. Both models are parallelizable with either Matlab's Parallel Toolbox or grid computing, so we envision being able to significantly speed up the implementation from its current state.

    4. APPLICATION: HUMAN MOBILITY

    The previous section introduced a generalized agent-based activity model and then applied the model to a human mobility application by tuning the parameters to simulate the NGA data set. That simulated data contains agents' abstract actions but researchers may also want to study the sensor observations resulting from the execution of those actions.

    Therefore, we introduce an observational model for human mobility applications that generates tracks of vehicle movement within a city. The observational model is flexible enough to create simulations for any place in the world for which open-source map data is available. To demonstrate this flexibility and assess the fidelity of the results, we will tailor parameters of the model to match properties of the NGA data set and then show a comparison of the results.

  • 4.1. Motivation

    Tracking movement from location to location lends itself naturally to network-related data and many interesting inference and prediction applications. For example, in wide-area, aerial surveillance systems, vehicle tracks can be used to forensically reconstruct clandestine terrorist networks. Greater detection can be achieved using network topology modeling to separate the foreground network embedded in a much larger background network [17]. Location data recorded by mobile phones has been used to track the movement of large groups of students on a university campus to learn more about human behavior and social network formation [18]. Location data from mobile phone activity can also be used to measure spatiotemporal changes in population to infer land use and supplement urban planning and zoning regulations [11].

    Collecting large-scale, persistent data to study human mobility, however, is especially difficult. Aerial surveillance systems are limited by practical issues such as flight endurance limitations and line-of-sight occlusions. Even in the absence of these issues, accurate vehicle tracking is very difficult. Furthermore, ground truth is required to assess the accuracy of the observed network, but activity of interest rarely includes ground truth or requires pre-planned instrumentation such as GPS sensors. While cell phones equipped with GPS sensors are ubiquitous, such data is difficult to obtain because of legal and privacy concerns. Simulating large-scale, persistent data sets to study human mobility would greatly benefit the development of algorithms for applications where such data is limited.

    4.2. Observational Model

    The observational model consists of four parts. First, we adapt the activity model to our human mobility application by specifying agents, actions, and roles in terms of human mobility. Next, we obtain map data and define the road network on which we will create vehicle tracks. Then we label nodes in the network according to the type of physical location they represent. The final step is to create vehicle tracks by defining paths between intended locations and modeling sensor observations of that motion.

    Our observation model can be made arbitrarily complex to replicate specific sensor observability and noise characteristics. It is also possible to simulate observational data for any conceivable type of sensor. However, detailed sensor modeling is beyond the scope of this paper and we assume that algorithms of interest will work at the track level. That is, the algorithms will work with processed detections associated over time to produce only location and time data for each vehicle. Also, since we are interested in constructing network relationships and not behavioral phenomena, we do not simulate low-level micro-behavior of traffic, such as stop lights, multiple lanes, collision avoidance, and traffic congestion. These microscopic traffic phenomena can be studied in depth with simulators such as SUMO (Simulation of Urban Mobility) [13].

    Adapting the Activity Model

    We create an observational model for the modality of human mobility, specifically people driving throughout a city. To adapt the activity model to this application we let agents be people within a city. Each agent then performs the action of driving to a specific type of location dictated by the agent's role at that time.

    Creating the Road Network

    We can define a road network for almost any city of interest using open-source mapping data. One source of data is the OpenStreetMap (OSM) wiki [5], which contains publicly available maps from around the world. We are interested in two OSM data elements: nodes and ways. Nodes are singular geospatial points. They are grouped together into ordered lists called ways that define linear features such as roads or closed polygons representing areas or perimeters. Nodes and ways can also carry attributes, such as residential areas and highway types.

    We represent the road network with a graph whose vertices are points along every road. An edge exists between two points if they are adjacent on the same road or if they coincide to indicate an intersection. The graph is stored as a weighted adjacency matrix whose weights correspond to the physical length of the node-to-node road segments. Thus, we also have an encoding of the physical distances between adjacent nodes. This is illustrated in Figure 5.

    Figure 5: Illustration of road network representation. Dots are nodes along the road and circles are intersections. A link exists for nodes along a common road and for nodes that are intersections. Distances are labeled between nodes. These relationships are stored in a weighted adjacency matrix. (Panel: Adjacency Matrix, nz = 235520.)
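    The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy coordinates, the use of a nested dictionary as a sparse stand-in for the weighted adjacency matrix, and the assumption that every road is two-way are all ours. Intersections are modeled by ways sharing a node id.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def build_road_graph(nodes, ways):
    """Return a sparse weighted adjacency structure {u: {v: metres}}.

    nodes: {node_id: (lat, lon)}; ways: list of ordered node-id lists.
    Ways that share a node id are joined there, which models an intersection.
    """
    adj = {n: {} for way in ways for n in way}
    for way in ways:
        # Consecutive nodes on the same way are adjacent on the same road.
        for u, v in zip(way, way[1:]):
            d = haversine_m(nodes[u], nodes[v])
            adj[u][v] = d
            adj[v][u] = d  # assumption: every road is two-way
    return adj

# Hypothetical toy data: two roads crossing at node 2.
nodes = {1: (33.300, 44.400), 2: (33.300, 44.401), 3: (33.300, 44.402),
         4: (33.299, 44.401), 5: (33.301, 44.401)}
ways = [[1, 2, 3], [4, 2, 5]]
graph = build_road_graph(nodes, ways)
```

    Here node 2 ends up with four neighbors because it appears in both ways, which is how an intersection shows up in the adjacency structure.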

    Physical Location Assignment

    Next, we label the nodes in the road network according to the types of physical locations they represent. There are two approaches to this labeling depending on the types of map data that are available.

    In the first case, node labeling is done directly. This is possible with vectorized data where a one-to-one labeling is made for each node. For example, each node and way in the OSM data format can be attributed to indicate characteristics like businesses or highway types.

    In the other case, node labeling is done indirectly. Either a vectorized set of points defines a polygonal area encompassing multiple nodes, or rasterized data specifies a grid whose individual regions may cover several nodes at once. Labeling can be done by assigning the same attributes to all nodes within a region.
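    As a rough sketch of the indirect, raster-based case, the snippet below gives every node the code of the grid cell that contains it. The function name, the grid layout (row 0 at the bottom edge), the planar coordinates, and the category strings are illustrative assumptions, not details taken from the paper.

```python
def label_nodes_from_raster(nodes, raster, origin, cell_size):
    """Give every node the code of the raster cell that contains it.

    nodes: {node_id: (x, y)}; raster: list of rows of codes, with row 0 at
    the bottom edge (y = origin[1]); cell_size is in the same units as x, y.
    Nodes that fall outside the grid are labeled None.
    """
    x0, y0 = origin
    labels = {}
    for node, (x, y) in nodes.items():
        col = int((x - x0) // cell_size)
        row = int((y - y0) // cell_size)
        inside = 0 <= row < len(raster) and 0 <= col < len(raster[0])
        labels[node] = raster[row][col] if inside else None
    return labels

# Hypothetical 2 x 2 raster with 30 m cells.
raster = [["residential", "commercial"],
          ["residential", "open_space"]]
nodes = {1: (10.0, 10.0), 2: (40.0, 10.0), 3: (45.0, 55.0), 4: (-5.0, 10.0)}
labels = label_nodes_from_raster(nodes, raster, origin=(0.0, 0.0), cell_size=30.0)
```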

    Depending on the application, labels may not be available for each type of location we would like to simulate. In this case we can infer node types based on related categories. For example, land usage data is available from the National Land Cover Database (NLCD) published by the U.S. Geological Survey [21]. The NLCD data comes in raster format and enumerates each pixel with Census Feature Class Codes (CFCC) that describe different levels of urban development as well as housing and commercial zones. Therefore we can label each node in the road network with a CFCC and then map that label to a related location type. For example, in areas of high urban development, we could label nodes as high-density apartments or small businesses. Because a CFCC could encapsulate multiple types of roles, we define a probability vector $\pi_c$ for each CFCC $c$, which is a $1 \times R$-dimensional vector. Then for each node we draw from $\mathrm{Multinomial}(\pi_c)$ to determine which role that node adopts for the simulation. In this way we end up with neighborhoods of generally similar locations but still maintain diversity.
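    The per-node draw can be illustrated with a short sketch. A single multinomial trial is a categorical draw, so the standard library's weighted choice suffices. The role names, the land-use codes, and the probability values are hypothetical placeholders, not values from the paper.

```python
import random

ROLES = ["home", "work", "shop"]  # hypothetical role names

# Hypothetical probability vectors pi_c, one per land-use code c.
PI = {
    "high_density": [0.6, 0.3, 0.1],
    "commercial": [0.1, 0.5, 0.4],
}

def assign_roles(node_codes, pi, roles, rng):
    """Draw one role per node from the categorical distribution pi[code]."""
    return {node: rng.choices(roles, weights=pi[code], k=1)[0]
            for node, code in node_codes.items()}

rng = random.Random(0)  # fixed seed so the draw is repeatable
node_codes = {n: "high_density" for n in range(1000)}
labels = assign_roles(node_codes, PI, ROLES, rng)
```

    Because neighboring nodes tend to share a code, they share the same probability vector, which yields broadly similar neighborhoods while individual nodes still vary.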

    Creating Vehicle Routes

    After defining the road network and labeling the nodes of interest, we can create the final vehicle tracks. To do this we find a route between any two nodes by traversing the graph of the road network using a pathfinding algorithm. While any number of algorithms can be used, we assume that people always take the shortest path. In this paper we use Dijkstra's shortest path algorithm [6]. However, other (possibly non-optimal) algorithms can easily be used instead, such as the breadth-first-search algorithm, A* [15], or, for extremely large networks, contraction hierarchies [16].
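    For reference, a compact version of Dijkstra's algorithm over a nested-dictionary graph is sketched below. The graph representation, the function name, and the toy edge lengths are our assumptions; the paper does not describe its implementation.

```python
import heapq

def shortest_path(adj, source, target):
    """Dijkstra's algorithm on {u: {v: weight}}; returns (distance, node list)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale queue entry
        visited.add(u)
        if u == target:
            break
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []  # target unreachable
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical road segments with lengths in metres.
adj = {"A": {"B": 100.0, "C": 250.0},
       "B": {"A": 100.0, "C": 120.0},
       "C": {"A": 250.0, "B": 120.0, "D": 80.0},
       "D": {"C": 80.0}}
length, route = shortest_path(adj, "A", "D")
```

    In this toy graph the direct segment from A to C is longer than the detour through B, so the returned route passes through B.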

    The output of the pathfinding algorithm is a sequence of nodes. However, to produce a proper track containing vehicle positions over regular time intervals, we need to assume basic vehicle motion parameters and sensor observation parameters. In terms of vehicle motion, we sample a distribution of expected vehicle speed to assign to each track. In terms of sensor observations, we assume a zero-mean, Gaussian-distributed noise on the vehicle position and that observations are reported at a constant frame rate. The particular frame rate for each track is drawn from a distribution of expected frame rates in case the sensor reports measurements asynchronously.
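    One way to turn a node sequence into a track under these assumptions is sketched below: the vehicle moves along the polyline at a constant speed, positions are reported at a constant frame rate, and independent zero-mean Gaussian noise is added to each coordinate. The function name, the planar metre coordinates, and the parameter values are illustrative assumptions; in the paper the speed and frame rate are drawn per track from distributions.

```python
import math
import random

def make_track(waypoints, speed, frame_rate, noise_std, rng):
    """Sample noisy (t, x, y) observations along a polyline.

    waypoints: at least two (x, y) points in metres; speed in m/s;
    frame_rate in Hz; noise_std: standard deviation in metres of the
    zero-mean Gaussian noise added to each coordinate.
    """
    # Cumulative arc length at each waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    dt = 1.0 / frame_rate
    track, seg, k = [], 0, 0
    while k * dt * speed <= total:
        s = k * dt * speed  # distance travelled at frame k
        while seg < len(cum) - 2 and s > cum[seg + 1]:
            seg += 1  # advance to the segment containing s
        seg_len = cum[seg + 1] - cum[seg]
        frac = (s - cum[seg]) / seg_len if seg_len > 0 else 0.0
        (x0, y0), (x1, y1) = waypoints[seg], waypoints[seg + 1]
        x = x0 + frac * (x1 - x0) + rng.gauss(0.0, noise_std)
        y = y0 + frac * (y1 - y0) + rng.gauss(0.0, noise_std)
        track.append((k * dt, x, y))
        k += 1
    return track

# Hypothetical route with a right-angle turn, observed at 2 Hz.
waypoints = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
track = make_track(waypoints, speed=10.0, frame_rate=2.0,
                   noise_std=3.0, rng=random.Random(1))
```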

    4.3. Results

    A model is only useful if it simulates data faithful to the target. In this section we show the results of the activity and observation models with parameters set to match the NGA data set as closely as possible and show a comparison of simulated vehicle tracks to those from the NGA data set.

    The full road network for Baghdad using OSM map data is shown in Figure 6. It consists of 14,087 ways and 68,548 nodes, and covers approximately 10 km by 10 km.

    Figure 6: Full road network for Baghdad constructed with OSM data on latitude and longitude axes.

    For comparison purposes, we used the location labels from the NGA data set to label the nodes in our road network. For each node in the NGA data having a location category, we chose the nearest-neighbor node on our road network to have the same location category.
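    The nearest-neighbor label transfer can be sketched with a brute-force search. The function name, the planar coordinates, and the category strings are hypothetical; at city scale a spatial index would likely be preferable to the linear scan shown here.

```python
import math

def transfer_labels(labeled_points, road_nodes):
    """Copy each labeled point's category to its nearest road node.

    labeled_points: [((x, y), category), ...]; road_nodes: {node_id: (x, y)}.
    """
    labels = {}
    for (px, py), category in labeled_points:
        nearest = min(road_nodes,
                      key=lambda n: math.hypot(road_nodes[n][0] - px,
                                               road_nodes[n][1] - py))
        labels[nearest] = category  # a later point overwrites an earlier one
    return labels

# Hypothetical road nodes and labeled reference points, in metres.
road_nodes = {1: (0.0, 0.0), 2: (100.0, 0.0), 3: (100.0, 100.0)}
labeled = [((5.0, -3.0), "residence"), ((98.0, 90.0), "market")]
node_labels = transfer_labels(labeled, road_nodes)
```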

    To produce our track results, we had to adjust only two parameters of the observation model: the distribution of vehicle velocity and the distribution of observation frame rate. To determine these distributions, we generated a 100-bin histogram for each and normalized it to create a proper probability mass function. Then, each time a track was created, these two distributions were sampled to determine the respective parameters for that track.
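    The histogram-and-sample step might look like the following sketch. The equal-width binning, the choice to return bin centers as sampled values, and the uniformly distributed stand-in data are our assumptions; the paper only states that 100 bins were used and that the histogram was normalized.

```python
import random

def empirical_pmf(samples, n_bins=100):
    """Histogram samples into equal-width bins; return (bin centers, probabilities)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against identical samples
    counts = [0] * n_bins
    for s in samples:
        idx = min(int((s - lo) / width), n_bins - 1)  # maximum goes in the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    probs = [c / len(samples) for c in counts]  # normalize counts to sum to 1
    return centers, probs

def sample_pmf(centers, probs, rng):
    """Draw one bin center with probability equal to that bin's mass."""
    return rng.choices(centers, weights=probs, k=1)[0]

rng = random.Random(0)
observed_speeds = [rng.uniform(4.0, 10.0) for _ in range(5000)]  # stand-in data
centers, probs = empirical_pmf(observed_speeds)
track_speed = sample_pmf(centers, probs, rng)  # one draw per simulated track
```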

    Figure 7 shows the distributions of track velocity and track length for our simulation compared to the NGA data after running the entire simulation for all agents. As expected, the velocity distribution matches very closely. An unanticipated result is the similarity of the track length distributions, which are influenced by the similar labeling of the road network locations and the shortest-path assumption of the pathfinding algorithm.

    Figure 8 shows measurement density heatmaps for the NGA data and our simulation. As a result of matching the observation frame rate, the overall density of our simulation closely resembles that of NGA. Another notable similarity is that the traffic behavior in our simulation resembles the NGA data very well. Major highways have the heaviest traffic while secondary roads are proportionately less dense.

    Figure 7: Comparison of NGA and simulated tracks: (a) velocity distribution and (b) track length distribution. (Panels: IDA Velocity Distribution and Simulation Velocity Distribution, Probability versus Velocity (m/s); IDA Track Length Distribution and Simulation Track Length Distribution, Probability versus Length (km).)

    Figure 8: Comparison of NGA and simulation observation density heatmaps. Density level is on logarithmic scale. (Panels: IDA Observation Density; Sim Observation Density.)

    5. CONCLUSION

    This paper presents a novel, mixed-membership, agent-based simulation model to generate network activity data for a broad range of research applications. The model combines the best of real world data, statistical simulation models, and agent-based simulation models by being easy to implement, having narrative power, and providing statistical diversity through random draws. We apply this framework to study human mobility and demonstrate the model's utility in generating high fidelity traffic data for network analytics. We also adapted the model to a high-fidelity NGA data set and showed that the model can replicate its important properties.

    While we were able to closely match the average agent activity to the target NGA dataset, it is clear that the activity model as is may not be rich enough to satisfy all research needs. Agents' propensity towards locations is currently defined on the population level, so there is no direct sense of community, an important aspect of many network analyses. To extend the model, we will introduce the concept of lifestyles, which will group agents together by similar affinities to locations. Additionally, locations in the NGA dataset are of only one type each, but we foresee the power of allowing an activity to be executed while an agent is in various different roles. To do so we will apply the mixed-membership concept to actions in terms of roles and allow that to vary over time to create even greater richness. We also wish to further explore the parameterization of the activity model by adapting it to applications other than human mobility and to be able to use the subsequent sensor output. To do so we will develop an observational model for a different application, such as interactions on social networks like Facebook, which will also demonstrate our model's utility to different types of researchers.

    REFERENCES

    [1] W. Aiello, F. Chung, and L. Linyuan. A random graph model for massive graphs. pages 171-180, 2000.

    [2] C. Bishop. Pattern recognition and machine learning, chapter 9: Mixture Models and EM. Springer, 2006.

    [3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March 2003.

    [4] W. Cohen. Enron email dataset, 2009. http://www.cs.cmu.edu/~enron/.

    [5] OpenStreetMap contributors. National land cover database 2006, September 2012. http://www.openstreetmap.org/.

    [6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms (2nd ed.), chapter 24.3: Dijkstra's algorithm. MIT Press and McGraw-Hill, 2001.

    [7] P. Erdős and A. Rényi. On random graphs I. Publicationes Mathematicae, 6:290-297, 1959.

    [8] A. Gavin et al. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440:631-636, 2006.

    [9] E. Airoldi et al. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981-2014, June 2008.

    [10] J. Leskovec et al. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In Knowledge Discovery in Databases: PKDD 2005.

    [11] J. Toole et al. Inferring land use from mobile phone activity. In Proc. of the ACM SIGKDD Intl. Wkshp. on Urban Computing.

    [12] K. Carley et al. Biowar: Scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 36(2):252-265, 2006.

    [13] M. Behrisch et al. Sumo - simulation of urban mobility: An overview. In SIMUL 2011, 3rd Intl. Conf. on Advances in System Sim., pages 63-68, Barcelona, Spain, October 2011.

    [14] M. Girvan et al. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99:7821, 2002.

    [15] P.E. Hart et al. A formal basis for the heuristic determination of minimum cost paths. volume 4, pages 100-107, July 1968.

    [16] R. Geisberger et al. Contraction hierarchies: faster and simpler hierarchical routing in road networks. In Proc. of the 7th Intl. Conf. on Experimental Algorithms.

    [17] S. Smith et al. Network discovery using wide-area surveillance data. In Information Fusion, 2011 Proc. of the 14th Intl. Conf. on, pages 1-8, July 2011.

    [18] W. Dong et al. Modeling the co-evolution of behaviors and social relationships using mobile phone data. In Proc. of the 10th Intl. Conf. on Mobile and Ubiquitous Multimedia, pages 134-143, 2011.

    [19] Z. Xie et al. Proteome survey reveals modularity of the yeast cell machinery. Bioinformatics, 27:159-166, 2011.

    [20] B. Junker and F. Schreiber. Analysis of biological networks, 2008. Wiley-Interscience.

    [21] U.S. Geological Survey. National land cover database 2006 (nlcd2006), June 2012. http://www.mrlc.gov/nlcd06_data.php.

    [22] W. Zachary. An information flow model for conflict and fission in small groups. J. of Anthropological Res., 33:452-473, 1977.
