
Visual-Analytics Evaluation

Generating Synthetic Syndromic-Surveillance Data for Evaluating Visual-Analytics Techniques

Ross Maciejewski, Ryan Hafen, Stephen Rudolph, George Tebbetts, William S. Cleveland, and David S. Ebert ■ Purdue University

Shaun J. Grannis ■ Indiana University

Visual analytics (VA) is often referred to as a means for dealing with complex, large information sources that require human judgment to detect the expected and discover the unexpected.1 However, the lack of readily available data sources compounds the problem of testing and evaluating tools that fit these criteria. To effectively evaluate VA techniques, standard test data sets must be created that require both high- and low-level analysis, during which analysts sift through noise and other confounding factors to find and communicate unexpected results.

An ideal data source for testing and evaluating VA techniques is public-health data. This data has a low signal-to-noise ratio, making events difficult to detect and detection prone to false positives. Because VA is said to enhance knowledge discovery, such data characteristics are useful in analyzing what level of event detectability the VA tools can enhance.

Syndromic-surveillance data contains patient locations, derived syndrome classifications for high-level analysis, and text-field collected patient complaints for low-level analysis. High-level analysis includes searching for signals in the aggregated time series data from the hospital, whereas low-level analysis consists of looking at patient records to verify the detected outbreaks. Furthermore, organizations must communicate to other agencies any unexpected events that analysts find in the data and potentially put quarantines into place. Unfortunately, privacy concerns, the US Health Insurance Portability and Accountability Act guidelines, and a lack of electronic record availability in many areas encumber such data.

To circumvent these concerns, we developed a novel system that lets users generate nonaggregated synthetic data records from emergency departments (EDs), using derived signal components from the Indiana Public Health Emergency Surveillance System (Phess).2 ED data primarily comprises chief complaints (free text) the ED nurse takes prior to the patient seeing a doctor. Phess classifies these notes into eight categories (respiratory, gastrointestinal, hemorrhagic, rash, fever, neurological, botulinic, and other) and uses them as syndromic indicators to detect public-health emergencies before such an event is confirmed by diagnoses or overt activity. Our system synthesizes the daily, weekly, and seasonal syndromic trends seen in Indiana EDs and lets users inject outbreaks into the data. This practice creates a data set in which analysts can be asked to solve a problem with a known solution, allowing for standard evaluations among various techniques. Data generated includes synthetic patient location and demographic information (age and gender), along with the ED chief complaint and chief-complaint classification.

This system generates synthetic syndromic-surveillance data for evaluating visualization and visual-analytics techniques. Modeling data from emergency departments, the system generates two years of patient data, into which system users can inject spatiotemporal disease outbreak signals. The result is a data set with known seasonal trends and irregular outbreak patterns.



The following are key features of our synthetic data generation tool:

■ user-defined geographical placement of emergency departments,
■ adjustable population probability density control for patient distribution (that is, the distribution of a patient's home address in terms of latitude and longitude),
■ adjustable demographic probability density controls for age and gender, and
■ user-defined spatiotemporal outbreak controls.

We've used our system to create synthetic data sets that closely follow real data trends seen in Phess. Sample synthetic data sets are available for download at https://engineering.purdue.edu/PURVAC/SyntheticData. These data sets comprise 32 hospitals spread across Indiana and various outbreak injection scenarios for robust testing and analysis of tools.

Synthetic-Data Creation

ED data has great potential for use in developing and evaluating novel algorithms and methods for analyzing multivariate, spatiotemporal patterns and relationships. Such data contains home addresses, age and gender, and syndrome. The visit date provides a temporal component coupled with spatial locations of patients, requiring a wide range of analysis across syndromes, demographics, social networks, and other factors. This data also contains text-field descriptors and derived classifications, allowing for low- and high-level analysis. Furthermore, analysts can couple this data with news feeds, pharmacy sales, and other data streams to create a rich environment for VA evaluation. Unfortunately, when considering the release of sensitive data sets (such as healthcare data) to the public domain, agencies face competing objectives in terms of data quality and breadth versus privacy. Agencies want to provide data to users with as much individual-level data as possible while also guarding data providers' confidentiality.

Our methods for synthetic-data creation include using seasonal decomposition of time series by loess (locally weighted regression), an analysis of population distribution using multiple kernel density estimation models, and an analysis of the age and gender of the populations and their correlation to chief complaints. (See the "Related Work in Synthetic-Data Generation" sidebar for prior work in this field.)

Related Work in Synthetic-Data Generation

Common methods for releasing emergency department data include aggregating, suppressing, and swapping data, as well as adding random noise to the data.1 Analysts could apply any of these methods to generate syndromic-surveillance data, for example, releasing the number of patients seen per county rather than patient addresses or longitudinal pairs, or reporting patient ages by category (under 18, 19–25, and so on) rather than individual ages. Unfortunately, these methods compromise the data by distorting the relationships between variables and reducing the analysis specificity. Furthermore, data aggregation can still violate privacy owing to the intersection of overlapped data.

To overcome such issues, Donald Rubin proposed that, instead of data reduction as a means of privacy preservation, agencies could release multiply imputed, synthetic data sets.2 Jerome Reiter further analyzed the benefits and limitations of this fully synthetic data approach.3 His studies found that fully synthetic data is a viable solution to these issues, as long as analysts can estimate the models for data imputation reasonably well from the synthetic data. The work of Thomas Lotze and his colleagues has characterized several methods for simulating syndromic-surveillance data.4 Their work is comparable to ours, but the methodology and user interface we present use a different approach to syndromic surveillance.

References
1. L. Willenborg and T. De Waal, Elements of Statistical Disclosure Control, Springer, 2001.
2. D.B. Rubin, "Discussion: Statistical Disclosure Limitation," J. Official Statistics, vol. 9, no. 2, 1993, pp. 461–468.
3. J.P. Reiter, "Releasing Multiply Imputed, Synthetic Public Use Microdata: An Illustration and Empirical Study," J. Royal Statistical Soc.: Series A, vol. 168, part 1, 2005, pp. 185–205.
4. T. Lotze, G. Shmueli, and I. Yahav, "Simulating and Evaluating Biosurveillance Datasets," to be published in Biosurveillance: A Health Protection Priority, T. Kass-Hout and X. Zhang, eds., Taylor & Francis, 2009.



Time Series Simulation

The first step in our synthetic-data creation is choosing a model for the temporal component of the data. The goal here is to generate a time series of the number of patients a given ED sees daily. The time series generated should closely reflect the historical Phess data while maintaining privacy about hospital business practices. The temporal pattern of ED visits can be difficult to characterize because of variations in yearly periodicities. Although the time series clearly exhibits yearly periodicity, the onset, duration, and magnitude of peaks and troughs vary from year to year.

Previous work has shown that if the data to be simulated has a long historical basis from which to draw, seasonal models with Arima (Autoregressive Integrated Moving Average) can cope with yearly quasiperiodicity.3 However, many syndromic-surveillance systems, including Phess, contain only a few years (or even months) of electronic data, making such an approach difficult. Tom Burr and his colleagues also addressed the quasiperiodicity problem by using a Bayesian hierarchical model with a scalable Gaussian function to model yearly peaks, allowing the peak's location and duration to vary from year to year.4 This approach constrains the shape of peaks to Gaussian. Our approach focuses on applying a nonparametric method, seasonal decomposition of time series by loess (STL).5 This method allows for flexible modeling of the differing onsets and shapes of seasonal peaks and also lets us account for other components of variation.

A time series can be viewed as the sum of multiple components, and STL is a method that separates a time series into these components using the local smoothing method loess.6 We decompose our daily patient count data into a day-of-week component, an intraseasonal component that models seasonal fluctuations, and an interannual component that models long-term effects, such as hospital growth:

$\sqrt{Y_{it}} = T_{it} + S_{it} + D_{it} + r_{it}$,

where for the ith hospital on the tth day, Y_it is the original series, T_it is the interannual component, S_it is the intraseasonal component, D_it is the day-of-week effect, and r_it is the remainder.

We extract components on the square-root scale of the original series to remove the dependence of a signal's variance on its mean. We first extract the day-of-week seasonal component, D_it, directly employing the method Robert Cleveland and his colleagues described.5 We use degree-0 smoothing for the day-of-week component with a seasonal window of 39 days. After removing the day-of-week component from the data, we use loess smoothing to extract the interannual component, T_it, using a window of 1,000 days. Finally, we apply loess smoothing to the data with the day-of-week and interannual components removed, thereby obtaining the intraseasonal component, S_it, using a window of 125 days.
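For readers who want to experiment with this kind of decomposition, the sketch below is a loose, simplified analogue in Python rather than the authors' STL implementation: it assumes numpy and statsmodels are available, approximates the day-of-week term with per-weekday means, and uses lowess smoothing with fractions derived from the 1,000-day and 125-day windows described above. All function and variable names are ours.

```python
# A loose sketch of the sequential decomposition described above (not the
# authors' STL code): square-root counts are split into day-of-week,
# interannual, and intraseasonal components plus a remainder.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def decompose_sqrt_counts(counts, interannual_days=1000, intraseasonal_days=125):
    y = np.sqrt(np.asarray(counts, dtype=float))   # stabilize the variance
    t = np.arange(len(y))

    # Day-of-week effect: simple per-weekday means, centered at zero
    # (a simplification of the paper's degree-0, 39-day seasonal smoothing).
    dow = np.zeros_like(y)
    for d in range(7):
        idx = t % 7 == d
        dow[idx] = y[idx].mean()
    dow -= dow.mean()

    # Interannual component: very long loess span on the dow-adjusted series.
    frac_inter = min(1.0, interannual_days / len(y))
    inter = lowess(y - dow, t, frac=frac_inter, return_sorted=False)

    # Intraseasonal component: shorter loess span on what is left.
    frac_intra = min(1.0, intraseasonal_days / len(y))
    intra = lowess(y - dow - inter, t, frac=frac_intra, return_sorted=False)

    remainder = y - dow - inter - intra
    return dow, inter, intra, remainder

# Example with two years of made-up daily counts.
rng = np.random.default_rng(0)
counts = rng.poisson(120 + 30 * np.sin(np.arange(730) * 2 * np.pi / 365))
dow, inter, intra, rem = decompose_sqrt_counts(counts)
print(rem.std())   # roughly constant, thanks to the square-root transform
```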

After removing the day-of-week, interannual, and intraseasonal components from the time series, the remainder is adequately modeled as independent, identically distributed Gaussian white noise. Figure 1 shows a decomposition of square-root daily counts for one hospital. In Figure 1, the remainder, intraseasonal, interannual, and day-of-week terms sum to the total square-root daily counts. Furthermore, the day-of-week and intraseasonal components are centered around a mean of zero, so sharing them with others wouldn't reveal information about the hospital they came from. Because the square-root transformation removed the dependence of the variance on the mean, these two components can be added to any specified interannual trend component to obtain a viable syndromic time series. This serves as the basis for our temporal simulation method.

The goal is now to synthetically recreate the patient count series without disclosing information about the ED. As the Phess system comprises a large range of EDs, we performed the STL decomposition on 32 hospitals, allowing us to capture slight variations across the state. For each decomposition, we store the fitted daily intraseasonal and day-of-week component values. The interannual term contains information about the ED's growth and average number of patients. As such, we don't use the interannual term from our historical data, replacing it with a user-specified mean for the average number of patients a given ED sees per day.

Figure 1. Decomposition of a total daily count time series. Removing the day-of-week, interannual, and intraseasonal components from this time series from one Indiana emergency department leaves independent, identically distributed Gaussian white noise.


The simulation's randomness comes from the remainder term, which for each hospital was Gaussian with a variance close to 0.28. The variation in this remainder term across hospitals is small: the mean variance is 0.28, and the first and third quartiles are 0.270 and 0.308. This tight distribution results from the square-root transformation, which we found did an excellent job of stabilizing the variance.

To create a synthetic hospital, the user inputs the desired mean total daily cases the hospital sees, M, from which the fixed interannual component T is calculated as

$T = \sqrt{M - \frac{1}{n}\sum_{j=1}^{n}\left(S_{ij} + D_{ij}\right)^{2} - 0.28}$,

where S_ij and D_ij are the stored intraseasonal and day-of-week values from the chosen historical decomposition over its n days.

With this, we obtain the following equation for generating the synthetic time series data:

$Z_{it} = \left(T + S_{it} + D_{it} + r_{it}\right)^{2}$,

where Z_it is the synthetically generated series and r_it is randomly generated white noise with mean 0 and variance 0.28. We randomly choose the day-of-week (D_it) and intraseasonal (S_it) components from one of the 32 historical decomposed ED signals.
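A minimal sketch of this generation step follows, assuming the expression for T reconstructed above and assuming the fitted S and D components from one historical decomposition are available as mean-centered arrays; the function name and the toy components are ours, not part of the system.

```python
# Minimal sketch of synthetic total-count generation, assuming day-of-week (D)
# and intraseasonal (S) components from one historical hospital are available
# as equal-length arrays centered at zero.
import numpy as np

def synthesize_daily_totals(M, S, D, remainder_var=0.28, seed=None):
    """M is the desired mean number of daily cases for the synthetic ED."""
    S = np.asarray(S, dtype=float)
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)

    # Fixed interannual level chosen so the synthetic series averages about M.
    T = np.sqrt(M - np.mean((S + D) ** 2) - remainder_var)

    # White-noise remainder with the empirically observed variance.
    r = rng.normal(0.0, np.sqrt(remainder_var), size=len(S))

    # Back-transform from the square-root scale to daily counts.
    z = (T + S + D + r) ** 2
    return np.rint(z).astype(int)

# Toy example: weak weekly and seasonal patterns, mean of 150 patients per day.
days = np.arange(730)
D_toy = 0.3 * np.cos(2 * np.pi * days / 7)
S_toy = 0.8 * np.sin(2 * np.pi * days / 365)
print(synthesize_daily_totals(150, S_toy, D_toy, seed=1)[:7])
```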

This approach supplies us with only the total number of patients the ED sees each day. The next step is to classify these patients into their appropriate syndromic category (gastrointestinal, constitutional, respiratory, rash, neurological, hemorrhagic, botulinic, and other). To this end, we apply a multinomial distribution to categorize patients. The parameters of the multinomial distribution are based on a daily average of cases seen in a given category across the set of 32 Phess hospitals. This preserves the differing seasonalities that occur from one classification to another in syndromic-surveillance data.
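The multinomial step might look roughly like the following sketch; the category proportions shown here are illustrative placeholders, not the Phess averages.

```python
# Sketch of splitting a day's total visits into syndrome categories with a
# multinomial draw; the probabilities below are placeholders, not Phess values.
import numpy as np

CATEGORIES = ["respiratory", "gastrointestinal", "constitutional", "rash",
              "neurological", "hemorrhagic", "botulinic", "other"]
# Hypothetical daily category proportions (in practice, averages over the
# 32 hospitals, which also vary by season).
P = np.array([0.18, 0.12, 0.10, 0.03, 0.02, 0.01, 0.001, 0.539])

def classify_daily_total(total, probs=P, seed=None):
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(total, probs / probs.sum())
    return dict(zip(CATEGORIES, counts))

print(classify_daily_total(150, seed=2))
```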

An alternative to this approach of first generating the total counts and then classifying them would be to simulate the data from each classification and then add them together to get the total. We didn't employ this method because the STL decomposition doesn't deal well with low counts, and some classifications, such as botulinic, have many 0-count days.

Figure 2 compares a real syndromic time series to our synthetically generated one, showing the time series of the total daily counts and daily counts of respiratory and gastrointestinal classifications. The simulated data set uses the fitted intraseasonal and day-of-week effects from the real data and a constant mean, which can be set to the overall mean of the real data's total counts. Because we obtained the simulated data's respiratory and gastrointestinal components using multinomial distribution parameters averaged over all of the hospitals, we can see a slight difference between the actual and simulated signals. The yearly peaks correlate well, indicating that we've preserved the original data's key features.

Figure 2. Real versus simulated syndromic time series. The daily counts, broken down by syndrome classification, show a slight difference.



Spatial-Distribution Estimation

The second step in our synthetic-data creation is to assign patients an appropriate geospatial location. Previous work in syndromic-surveillance data has focused specifically on generating time series data for analysis.3,4,7 Our system generates both temporal and spatial components for the data. To simulate the patient's location, we first analyze the patient distribution from the 32 Phess EDs, employing a kernel density estimation.8 The following equation defines the multivariate kernel density estimation:

$\hat{f}(x) = \frac{1}{N h^{d}}\sum_{i=1}^{N} K\!\left(\frac{x - X_{i}}{h}\right)$.

We chose to employ the Epanechnikov kernel:

$K(u) = \frac{3}{4}\left(1 - u^{2}\right)\,\mathbf{1}\left(\left|u\right| \le 1\right)$.

In these equations, h represents the multidimensional smoothing parameter, N is the total number of samples, d is the data dimensionality, and the indicator function 1(|u| ≤ 1) evaluates to 1 if the inequality is true and to 0 in all other cases. We scale the smoothing parameter by letting the kernel width vary on the basis of the distance from X_i to the kth nearest neighbor in the set comprising the remaining N − 1 points:

$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\left(d_{i,k}\,h\right)^{d}}\, K\!\left(\frac{x - X_{i}}{d_{i,k}\,h}\right)$.

Here, the window width of the kernel placed on the point X_i is proportional to d_i,k (where d_i,k is the distance from the ith sample to the kth nearest neighbor), so that data points in regions with sparse data will have flatter kernels. Such a method lets us have various kernels that can better fit dense populations (urban) and sparse populations (rural).
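A small sketch of this variable-bandwidth estimator follows (our code, not the authors'); it assumes numpy, uses a 2D product form of the Epanechnikov kernel, and computes the kth-nearest-neighbor distances by brute force for clarity.

```python
# Sketch of the variable-bandwidth kernel density estimate described above:
# each sample's bandwidth is scaled by the distance to its kth nearest
# neighbor, using a product Epanechnikov kernel in two dimensions.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def knn_distances(points, k):
    # Distance from each point to its kth nearest neighbor (brute force).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist.sort(axis=1)
    return dist[:, k]          # column 0 is the point itself

def adaptive_kde(x, samples, h=1.0, k=10):
    """Estimate the density at location x (length-2 array) from 2D samples."""
    samples = np.asarray(samples, dtype=float)
    d_ik = knn_distances(samples, k)               # per-sample scale factors
    widths = h * d_ik                              # local bandwidths
    u = (x - samples) / widths[:, None]            # scaled offsets
    kernel_vals = epanechnikov(u).prod(axis=1)     # product kernel in 2D
    return np.mean(kernel_vals / widths ** 2)      # (h * d_ik)^d with d = 2

# Example: patient home locations clustered around two towns (made-up data).
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal([0, 0], 1.0, (300, 2)),
                 rng.normal([6, 2], 0.5, (100, 2))])
print(adaptive_kde(np.array([0.0, 0.0]), pts, h=1.0, k=10))
print(adaptive_kde(np.array([6.0, 2.0]), pts, h=1.0, k=10))
```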

Figure 3 shows the density distribution for four Indiana hospitals. Such distributions would be extremely difficult to model parametrically because the density distributions follow various population groupings across the state, accounting for nearby towns and other seemingly random population centers. As such, to model patient locations, we created an interface that lets users define custom probability models centered around their synthetic hospital.

When a user creates a hospital, the user can define the density window's width and height. The system then lays a 100 × 100 grid onto the width and height and assigns a probability density to each square of the grid. This in effect aggregates patients across regions of the state proportionally to the height and width. If the user employs the default width and height of 100 miles by 100 miles, the system will aggregate the patient location to a 1-mile block. It then uniformly distributes subjects within their associated grid cell. We chose these parameters on the basis of observations from all Phess EDs that patients were most likely to visit EDs within 50 miles of their primary residence. To manipulate the probability density on the grid, the user can create polygons and increase or decrease the probability density in the region the polygon covers. The system normalizes the final probability density to integrate to 1 and uses it to randomly generate patient locations.
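Sampling locations from such a grid could be sketched as follows, assuming a normalized 100 × 100 weight grid; the hot-spot example grid and all names are ours.

```python
# Sketch of drawing patient home locations from a normalized 100x100
# probability grid: pick a cell in proportion to its weight, then place the
# point uniformly inside that cell. Grid size and extent follow the text.
import numpy as np

def sample_locations(grid, width_miles, height_miles, n, seed=None):
    grid = np.asarray(grid, dtype=float)
    probs = (grid / grid.sum()).ravel()            # normalize to integrate to 1
    rng = np.random.default_rng(seed)
    cells = rng.choice(probs.size, size=n, p=probs)
    rows, cols = np.unravel_index(cells, grid.shape)
    cell_w = width_miles / grid.shape[1]
    cell_h = height_miles / grid.shape[0]
    # Uniform jitter within the chosen cell (offsets in miles from the corner).
    xs = (cols + rng.random(n)) * cell_w
    ys = (rows + rng.random(n)) * cell_h
    return np.column_stack([xs, ys])

# Default 100-mile x 100-mile window with a single hand-raised hot spot.
grid = np.ones((100, 100))
grid[40:60, 40:60] = 25.0                          # denser "urban" block
print(sample_locations(grid, 100, 100, 5, seed=4))
```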

Our system also lets users import data on a 100 × 100 grid representing the density they wish to use. The user specifies the grid's width and height (in miles), and the system normalizes the values in the grid to integrate to 1. We also supply users with sample density distributions as well as a default Gaussian distribution. However, the sample density distributions are based on data from Indiana and, as such, are influenced by the surrounding geography. The sample distributions are based on our kernel density estimation of the hospital's population distribution.

Figure 3. Kernel density estimations. The distributions of these patient locations for four Indiana Public Health Emergency Surveillance System (Phess) hospitals would be difficult to model parametrically.


But in doing a kernel density estimation, our system doesn't take into consideration the boundary conditions of geographical features, such as bodies of water. Thus, the heavy influence of geographical features makes it important to let users create and manipulate their own distributions to obtain better simulations. Figure 4 illustrates the use of our system in generating a probability distribution using an imported kernel density estimation.

Age and Gender Distribution Estimation

The third step in our synthesis is to assign the patient's age and gender. In a method similar to that of assigning the patient's location, we use user-defined probability functions for generating a patient's age and gender. Figure 5a represents the patient age distribution averaged across all 32 Phess EDs used in our data synthesis. This age distribution is relatively constant across all Phess EDs, except for specialty hospitals such as children's hospitals and disease-specific hospitals. Users can use this as the default probability model for patient age distribution, or they can modify the distribution through an interactive widget.

The plot shows relative probability, meaning that the area under the curve will always integrate to one.


Figure 4. Probability distributions. Our system generated the following distributions: (a) the default density distribution when creating a new hospital, (b) an imported patient density distribution based on a kernel density estimation of patient visits, (c) removing the probability that a patient will come from Lake Michigan, and (d) one day’s worth of simulated patients overlaid on the population density distribution.

Figure 5. Age distribution. The red lines indicate a user-defined probability function, and the blue lines indicate the synthesized data's distribution. (a) The default probability function is based on data from the Indiana Phess system. (b) Users can customize the distribution by adding control points or manipulating the curve. The resulting data curves in (c) and (d) show that the resulting simulated distributions match the input model distributions.


Figure 5 shows both a default and a user-defined probability function and the resulting output data curves. Figure 5a is the default control mechanism used in our synthetic-data-creation system. As Figure 5c shows, users can add control points to the plot and manipulate the curve by dragging the points around the screen. For defining a patient's gender, users simply input the desired percentage breakdown. Default values are set at 50 percent male and 50 percent female and are adjustable. Figures 5b and 5d show the resulting data curves for age distribution.
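A hedged sketch of how age and gender might be drawn from these controls: the age curve is discretized from a handful of control points and normalized, and gender follows the percentage split. The control-point values below are placeholders, not Phess data.

```python
# Sketch of sampling patient age and gender from user-editable controls: the
# age curve is a set of control points interpolated to a density, and gender
# is a simple percentage split.
import numpy as np

def sample_ages(control_ages, control_weights, n, seed=None):
    rng = np.random.default_rng(seed)
    ages = np.arange(0, 101)
    density = np.interp(ages, control_ages, control_weights)   # piecewise linear
    density = np.clip(density, 0, None)
    density /= density.sum()                                    # relative -> proper
    return rng.choice(ages, size=n, p=density)

def sample_genders(n, pct_male=0.5, seed=None):
    rng = np.random.default_rng(seed)
    return np.where(rng.random(n) < pct_male, "M", "F")

# Hypothetical curve: many young-adult and elderly visits, fewer in between.
ages = sample_ages([0, 25, 45, 70, 100], [0.5, 1.0, 0.4, 0.9, 0.2], 10, seed=5)
genders = sample_genders(10, pct_male=0.5, seed=5)
print(list(zip(ages.tolist(), genders.tolist())))
```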

Correlating Chief Complaints

The final step in our synthesis is to map the patient's syndrome to an actual chief-complaint text. Given the patient's age, gender, and chief-complaint classification, we randomly assign a chief complaint that matches the chief-complaint classification generated in the time series simulation. The system also correlates this chief-complaint classification to gender and age to ensure that appropriate complaints are generated. For example, assigning a chief complaint of "vaginal bleeding" to an 8-year-old male classified as hemorrhagic wouldn't make sense.

The chief complaints we use here represent a set of the most common chief complaints for each category, correlated with age and gender across the entire Phess. We use the most common complaints both to simulate real-world noise in the data and to protect any potentially private information found in unique chief complaints. Often the chief complaints contain typographical errors, variations on the same complaint ("fever, cough" versus "cough, fever"), and abbreviations (SOB for shortness of breath). We maintain these errors to better simulate real-world data.
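One plausible way to implement this matching step is sketched below; the complaint pool, age ranges, and gender constraints shown are invented examples, not the Phess complaint set.

```python
# Sketch of mapping a generated patient to a chief-complaint string: draw
# from a pool of common complaints for the syndrome category, filtered by
# age and gender constraints. The pool below is a tiny illustrative stand-in.
import random

# (category, complaint, min_age, max_age, allowed genders)
COMPLAINT_POOL = [
    ("respiratory",      "cough, fever",     0, 120, {"M", "F"}),
    ("respiratory",      "SOB",             18, 120, {"M", "F"}),
    ("gastrointestinal", "n/v/d",            0, 120, {"M", "F"}),
    ("hemorrhagic",      "vaginal bleeding", 12,  60, {"F"}),
    ("hemorrhagic",      "nosebleed",        0, 120, {"M", "F"}),
]

def assign_complaint(category, age, gender, rng=random):
    candidates = [c for (cat, c, lo, hi, sexes) in COMPLAINT_POOL
                  if cat == category and lo <= age <= hi and gender in sexes]
    return rng.choice(candidates) if candidates else "unspecified"

print(assign_complaint("hemorrhagic", 8, "M"))   # never "vaginal bleeding"
print(assign_complaint("respiratory", 34, "F"))
```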

Synthetic Outbreak Injection

Once we had simulated the baseline data, we developed disease injection models so that users can create known outbreaks with which to test their detection tools and algorithms. Previous work in synthetic outbreak generation falls into two main categories: creating a multivariate mathematical model to produce the signal and defining a series of parameters to enable the generation of a controlled feature set. In terms of creating complex mathematical models based on population distributions, highway travel, and spread vectors, the IBM Eclipse STEM (Spatiotemporal Epidemiological Modeler) project is doing significant work.9 The project focuses on helping scientists and public-health officials create and use models of emerging infectious diseases. It uses built-in geographic information system (GIS) data for almost every country in the world, including travel routes for modeling disease spread. The difficulty of this type of approach is that these detailed models often require a great deal of handcrafting and fine-tuning.

Instead, we adopted the approach of parameterizing clusters using simpler mathematical models, as Chris Cassa has done.10 In Cassa's work, users can create clusters for their supplied baseline data by selecting the cluster's center relative to the hospital, temporal duration, magnitude, and curve type. The cluster types are based on three mathematical distribution models: random, linear, and exponential. Our extensions to this work include generating baseline patient data and arbitrarily shaped clusters (as opposed to only circular ones) and using seasonal data shifts to approximate appropriate cluster size. Furthermore, we let the user create epidemiological distribution curves for the outbreak. This epidemiological distribution curve lets users set the probability that the disease outbreak will affect a certain age and gender more, by providing them with a linked probability distribution control for these parameters.

To create a cluster using our system, the user draws a polygon on the map. The user then selects the date that the cluster begins, the duration of the cluster in days, and the type of cluster (respiratory, gastrointestinal, and so on) corresponding to one of the eight classification categories in our system. To determine the appropriate number of patients in the cluster, our system calculates the noise profile of the model over the particular space-time period in which the outbreak is occurring. That is, if the user selects an outbreak with a duration of five days, the system calculates the mean and standard deviation of the baseline data on the basis of the related syndrome counts over these five days. The system then uses the standard deviation over the time period to determine the total number of points that will occur in this cluster. This is scalable through the user's control of cluster size, which ranges from 0.5 to 5 times the standard deviation. This lets the user create clusters with varying degrees of detection difficulty. By using the standard deviation as a means of determining the cluster size, we can account for variability among seasonal trends and create detectable clusters in any season without the user needing a priori knowledge of data trends.
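Under one reading of this description, cluster sizing could be sketched as follows (the function name and toy baseline are ours): the standard deviation of the relevant syndrome counts over the outbreak window is scaled by the user's multiplier to give the total number of injected cases.

```python
# Sketch of sizing an injected cluster from the baseline noise profile over
# the chosen space-time window; the 0.5-5x multiplier follows the text.
import numpy as np

def cluster_size(baseline_counts, start_day, duration, multiplier):
    # Standard deviation of the relevant syndrome counts over the outbreak
    # window; the total injected cases scale with that deviation.
    window = np.asarray(baseline_counts[start_day:start_day + duration], float)
    sigma = window.std()
    return max(1, int(round(multiplier * sigma)))

# Toy baseline: winter-peaking respiratory counts, outbreak injected in summer.
days = np.arange(730)
baseline = np.random.default_rng(6).poisson(20 + 10 * np.cos(2 * np.pi * days / 365))
print(cluster_size(baseline, start_day=550, duration=5, multiplier=0.5))
```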

Once the cluster size is determined, the number of cases the system generates for each day of the cluster is based on the user-defined distribution models. We provide users with a baseline epidemiological distribution curve based on a log-normal plot, as Figure 6a shows.


Figure 6b shows a user-defined epidemiological distribution, and Figures 6c and 6d show the resulting output patient counts. We chose log-normal as the default distribution because of the work of Colin Goodall and his colleagues, which showed that many natural disease outbreaks follow this type of distribution (as opposed to Cassa's exponential and linear distributions).11

The system then assigns the created patients an age and a gender on the basis of user-specified probability controls for the model. In this way, we move toward the more complicated mathematical approaches the Eclipse project is developing while still maintaining a degree of simplicity. Through the user-generated probability functions, the system can now manufacture disease clusters that affect only certain age ranges and certain genders, allowing for greater model specificity. Finally, the system uniformly distributes patient locations inside the polygon. Figure 7 provides a time series view of an injected cluster of varying magnitudes.
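The sketch below combines the two remaining pieces: distributing the injected cases over the outbreak's duration with a log-normal-shaped curve and placing each case uniformly inside the drawn polygon. It assumes numpy and matplotlib are available; the curve parameters and the example polygon are ours.

```python
# Sketch of spreading a cluster's cases over its duration with a log-normal
# shaped epidemic curve and placing each case uniformly inside a polygon
# (rejection sampling against the polygon's bounding box).
import numpy as np
from matplotlib.path import Path

def epi_curve(total_cases, duration, mu=1.0, sigma=0.5):
    days = np.arange(1, duration + 1, dtype=float)
    w = np.exp(-(np.log(days) - mu) ** 2 / (2 * sigma ** 2)) / days  # log-normal shape
    w /= w.sum()
    return np.random.default_rng(7).multinomial(total_cases, w)

def points_in_polygon(polygon, n, seed=None):
    rng = np.random.default_rng(seed)
    path = Path(polygon)
    lo, hi = np.min(polygon, axis=0), np.max(polygon, axis=0)
    pts = []
    while len(pts) < n:
        p = lo + rng.random(2) * (hi - lo)
        if path.contains_point(p):
            pts.append(p)
    return np.array(pts)

triangle = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
daily_cases = epi_curve(total_cases=40, duration=10)
print(daily_cases)
print(points_in_polygon(triangle, int(daily_cases[0]), seed=8))
```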

Scenario Creation

Using the methods we've described, a user can now generate any number of EDs for locations across the US, given some inherent knowledge of the patient distributions or using the defaults provided. Once the system generates the baseline data, a user can then consider what stories and scenarios to create and couple the synthesized data from our system with other systems, creating a more robust stream of data for evaluation.

Previous work in creating synthetic data for testing analytic tools includes the Threat Stream Generator, which focuses on creating a particular scenario and expressing it as data in the form of news articles, security agency reports, geographically referenced locations, population movement during catastrophic events, and a variety of other data, all linked together to tell a complete event story.12 In terms of disease outbreaks, the Eclipse STEM project focuses on creating data that will simulate large-scale epidemics across populations.9

These systems let users create complex scenarios that cover a broader set of the population or tell a complex story, and they create various data sources to corroborate and unify a given scenario. In contrast, our synthetic-data-generation system focuses on a distinct segment of the population: individuals who visit EDs. This is a well-defined, often-studied population for which data is typically not publicly available.

Figure 6. Outbreak curves. Comparing (a) an outbreak epidemiology curve based on a log-normal distribution and (b) its resulting plot of the actual number of patients to (c) a user-created outbreak curve and (d) its resulting plot of patients verifies that the distribution input controls are valid.


By creating a system in which users can generate baseline syndromic-surveillance data, we've enabled other systems to add stories to create more robust data sets for VA application training. Another comparable system for generating disease outbreaks is Project Mimic, which provides methods for describing and simulating biosurveillance data, creating realistic mimics of their own data sets.7

In Figure 8, the user created a disease outbreak cluster representing the release of wastewater into Lake Michigan. In Figure 8a, the user has drawn a cluster in which people living close to the lake are appearing with gastrointestinal illnesses a few days after the Fourth of July, a prime swimming holiday; in Figure 8b, the system has generated the corresponding epidemic curve. Furthermore, the user adjusted the age distribution of this cluster so that the syndrome is affecting only the younger portion of the population, as Figure 8c shows.

Figure 8. Our interface for generating syndromic outbreaks. (a) We injected a spatial cluster drawn near the Lake Michigan shore, representing gastrointestinal illnesses directly following the Fourth of July holiday. (b) The system automatically creates the corresponding epidemic curve. (c) We then modify the correlated age curve to simulate an outbreak that is affecting primarily children and young adults.

In another sample scenario, a chemical agent is being developed, and people living near the lab are getting sick, producing clusters of disease outbreaks. A user could correlate this data with a series of news stories and reports from the Threat Stream Generator and produce another segment of the population to analyze. Furthermore, in creating complex outbreaks using the Eclipse STEM software, a user could add these outbreaks to the baseline data our system generated and work out population distributions to see how such outbreaks would affect ED visits.

Only our system lets users create viable spatiotemporal data for analysis and testing of any spatiotemporal techniques. A user can generate baseline data and couple it with known outbreaks, allowing a variety of testing from algorithm validation to expert evaluation of system tools. Multiple communities can benefit from the release of a standard set of test data to evaluate various spatiotemporal algorithms. Moreover, the generated data set isn't trivial. The 32-hospital data scenario in Figure 8 comprises more than 1 million patient records, averaging approximately 2,000 records per day for the two-year span. Overall, researchers can use data our system generates for a wide variety of testing and evaluation applications, from large data visualizations to information visualization topics to spatiotemporal cluster detection algorithms to VA scenarios.

It's also important to characterize ways in which researchers can use this data to evaluate VA tools. In syndromic surveillance, algorithms exist to detect spatial, temporal, and spatiotemporal outbreaks with a given level of specificity. The first area that VA can improve is the time it takes for a decision maker to classify an event as a false positive or as a true outbreak. An applicable scenario is one in which several false-positive outbreaks occur along with a true outbreak. These cases would be identifiable only by looking further into the patient case reports. A user with a suite of VA tools could more readily identify the true alert.

The second area in which to evaluate VA is analyzing outbreaks where the signal-to-noise ratio leaves the outbreak undetected by conventional syndromic-surveillance algorithms. Can VA tools be created that would provide a user with enough insight into the data to find such outbreaks?

Finally, such data can be useful in the strictly analytical sense. Researchers can test and validate new syndromic-detection algorithms against simple outbreak baselines or large, complex-shaped outbreaks to determine which algorithms perform best in various analytic situations.

Figure 7. A time series view of an emergency department's respiratory-syndrome counts. The curves represent the baseline data, baseline data injected with a cluster of magnitude 0.5σ, and baseline data injected with a cluster of magnitude 2σ.



Previous work in creating VA tools to analyze such data can be found in the work of Ross Maciejewski and his colleagues.13 However, that work focuses solely on methods for analyzing multivariate syndromic-surveillance data. Their system presents a suite of tools for analyzing data similar to what we've presented. Further work is needed to test their tools with synthetic data (such as the data presented in this work) to begin to evaluate the usefulness of such tools.

Although the seasonal trends seen in EDs within the Phess system are likely to be similar across the Midwest and perhaps the entire US, these trends probably wouldn't translate to other climatological regions, owing to variations among healthcare systems.

The next phase of this work is to take the synthetic data such a system generates and incorporate it into current syndromic-surveillance systems to begin an evaluation of these tools. Scenarios created under our system can range from small, low-signal-strength outbreaks to large, easily detectable clusters. Researchers can use such scenarios to test where algorithms fail in detecting outbreaks and where VA can begin helping users notice clusters that are difficult to detect.

Future work includes developing more-realistic outbreak injection methods and improving the generation of the baseline population distribution to account for local geographic features. We also hope to incorporate work on space-time dependencies to model repeat visits among patients, because our current methods fail to handle this particular case. Furthermore, our output doesn't account for spatiotemporal dependencies among syndrome, age, and gender. Our analysis of the data seems to indicate that seasonal trends don't exhibit a spatiotemporal component; however, they do have dependencies on the patient's age and gender. Future work must be done to better capture these correlations.

Acknowledgments

We thank the Indiana State Department of Health for providing the data. The US Department of Homeland Security Regional Visualization and Analytics Center (RVAC) Center of Excellence and the US National Science Foundation (grants 0081581, 0121288, and 0328984) funded this research.

References
1. J.J. Thomas and K.A. Cook, eds., Illuminating the Path: The R&D Agenda for Visual Analytics, IEEE Press, 2005.
2. S.J. Grannis et al., "The Indiana Public Health Emergency Surveillance System: Ongoing Progress, Early Findings, and Future Directions," Proc. Am. Medical Informatics Assoc. Ann. Symp., Am. Medical Informatics Assoc., 2006, pp. 304–308.
3. B. Reis and K. Mandl, "Time Series Modeling for Syndromic Surveillance," BMC Medical Informatics and Decision Making, vol. 3, 2003, article 2; www.biomedcentral.com/1472-6947/3/2.
4. T. Burr et al., "Accounting for Seasonal Patterns in Syndromic Surveillance Data for Outbreak Detection," BMC Medical Informatics and Decision Making, vol. 6, 2006, article 40; www.biomedcentral.com/1472-6947/6/40.
5. R.B. Cleveland et al., "STL: A Seasonal-Trend Decomposition Procedure Based on Loess," J. Official Statistics, vol. 6, no. 1, 1990, pp. 3–73.
6. W.S. Cleveland and S.J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," J. Am. Statistical Assoc., vol. 83, no. 403, 1988, pp. 596–610.
7. T. Lotze, G. Shmueli, and I. Yahav, Simulating Multivariate Syndromic Time Series and Outbreak Signatures, research paper RHS-06-054, Robert H. Smith School of Business, Univ. of Maryland, May 2007; http://ssrn.com/abstract=990020.
8. B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.
9. D.A. Ford, J.H. Kaufman, and I. Eiron, "An Extensible Spatial and Temporal Epidemiological Modelling System," Int'l J. Health Geographics, vol. 5, no. 1, 2006, p. 4.
10. C.A. Cassa, "Spatial Outbreak Detection Analysis Tool: A System to Create Sets of Semi-synthetic Geo-spatial Clusters," master's thesis, Dept. of Electrical Eng. and Computer Science, Massachusetts Inst. Technology, 2004.
11. C.R. Goodall et al., "A System for Simulation: Introducing Outbreaks into Time Series Data," Advances in Disease Surveillance, vol. 2, 2007, p. 199; www.isdsjournal.org/article/viewArticle/933.
12. M.A. Whiting, J. Haack, and C. Varley, "Creating Realistic, Scenario-Based Synthetic Data for Test and Evaluation of Information Analytics Software," Proc. Conf. Beyond Time and Errors (BELIV 08), ACM Press, 2008, pp. 1–9.
13. R. Maciejewski et al., "Understanding Syndromic Hotspots: A Visual Analytics Approach," Proc. IEEE Symp. Visual Analytics Science and Technology (VAST 08), IEEE CS Press, 2008, pp. 35–42.

Ross Maciejewski is a PhD student in electrical and computer engineering at Purdue University. His research interests include nonphotorealistic rendering, volume rendering, and visual analytics. Maciejewski has an MS in electrical and computer engineering from Purdue University. Contact him at [email protected].

Ryan Hafen is a PhD student in statistics at Purdue University. His research interests include exploratory data analysis and visualization, massive data, computational statistics, time series, modeling, and nonparametric statistics. Hafen has a master's in statistics from the University of Utah. Contact him at [email protected].

Stephen Rudolph is a master's student in electrical and computer engineering at Purdue University. His research interests include casual information visualization and visual analytics. Rudolph has a BS in computer systems engineering from Arizona State University. Contact him at [email protected].

George Tebbetts is an undergraduate student in computer science and mathematics at Purdue University. His research interests include visual analytics and computer graphics. Contact him at [email protected].

William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and courtesy professor of computer science at Purdue University. His research interests include statistics, machine learning, and data visualization. Cleveland has a PhD in statistics from Yale University. He's the author of The Elements of Graphing Data (Hobart Press, 1994) and Visualizing Data (Hobart Press, 1993). Contact him at [email protected].

Shaun J. Grannis is an assistant professor of family medicine at Indiana University and a medical informatics research scientist at the Regenstrief Institute. His research interests include developing, implementing, and studying technology to overcome the challenges of integrating data from distributed systems for use in healthcare delivery and research. Grannis has an MD from Michigan State University. Contact him at [email protected].

David S. Ebert is a professor in the School of Electrical and Computer Engineering at Purdue University, a University Faculty Scholar, director of the Purdue University Rendering and Perceptualization Lab, and director of the Purdue University Regional Visualization and Analytics Center. His research interests include novel visualization techniques, visual analytics, volume rendering, information visualization, perceptually based visualization, illustrative visualization, and procedural abstraction of complex, massive data. Ebert has a PhD in computer science from Ohio State University and is a fellow of the IEEE and a member of the IEEE Computer Society's Publications Board. Contact him at [email protected].

For further information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
