Systematic Approach for Validating Traffic Simulation Models

Daiheng Ni, John D. Leonard II, Angshuman Guin, and Billy M. Williams

Transportation Research Record: Journal of the Transportation Research Board, No. 1876, TRB, National Research Council, Washington, D.C., 2004, pp. 20–31.

D. Ni and A. Guin, School of Civil and Environmental Engineering, Georgia Institute of Technology, 790 Atlantic Drive, Atlanta, GA 30332-0355. J. D. Leonard II, State Road and Tollway Authority, 47 Trinity Avenue SW, Suite 522, Atlanta, GA 30334. B. M. Williams, Department of Civil Engineering, Campus Box 7908, Room 424B Mann Hall, North Carolina State University, Raleigh, NC 27695-7908.

Modeling processes and model testing processes are discussed as parts of the model life cycle, and the tasks of these processes and their relations are highlighted. Of particular interest is the model validation process, which ensures that the model closely simulates what the real system does. A collection of validation techniques is presented to facilitate a systematic check of model performance from various perspectives. Under the qualitative category, a few graphical techniques are presented to help a visual examination of the differences between the simulation and the observation. Under the quantitative category, several statistical measures are discussed to quantify the goodness of fit; to achieve a higher level of confidence about model performance, a simultaneous statistical inference technique is proposed that tests both model accuracy and precision. As an illustrative example, these validation techniques are comprehensively applied to test an enhanced macroscopic simulation model, KWaves, in a systematic manner.

Simulation has become an increasingly important means of solving real-world problems. To apply a simulation model successfully, the "correctness" or "credibility" of the model is crucial, and some testing processes must be performed to ensure the quality of the model. In the traffic simulation community, the number of simulation models and software packages is growing fast, each of which possesses some merits. Generally, validation of these models is largely conducted through the use of customized procedures. There has been a consensus in the community that a set of shared rules and procedures needs to be established to guide model calibration and validation. A number of efforts have been identified in this direction (1–5). Benekohal (1) suggested a procedure for validating microscopic models and applied this procedure to CARSIM. The basic idea of the procedure is a two-stage, two-level testing procedure in which validations at the conceptual stage and the computerized/operational stage are carried out at two levels: microscopic and macroscopic. Similar to Benekohal's approach, Rao and Owen (2) also proposed a two-stage framework consisting of conceptual validation and operational validation. The operational validation involves two levels of statistical tests: a two-sample t-test and a two-dimensional two-sample Kolmogorov-Smirnov test. Rakha et al. (3) and Hellinga (4) gave a rigorous definition of the terminology of model testing and proposed a framework for the systematic verification, validation, and calibration of traffic simulation models. In particular, they delineated the responsibilities that should be assumed by the parties to model development and specified detailed steps that signify the beginning of the next testing process. Hourdakis et al. (5) aimed at microscopic models and adopted a goodness-of-fit measure called Theil's inequality coefficient, on which a three-stage calibration process was developed.

Considering that model testing involves many stages, such as verification, calibration, validation, and accreditation, it is difficult to cover all the details of these stages in a single paper while still providing enough detail on how to conduct these tests. This paper focuses on techniques for performing model validation, a critical testing process that compares the model output with real-world system behavior.

The paper is organized in the following sequence. A general discussion of a model life cycle is first presented, in which modeling processes and testing processes are defined and their relations are illustrated. Next, issues arising in validating traffic simulation models are identified and validation techniques are summarized. In particular, a simultaneous inference technique is proposed to conduct a quantitative evaluation of model performance. Then, an illustrative example is provided to demonstrate the application of these validation techniques. Finally, the paper is summarized and some pertinent conclusions are identified.

MODEL LIFE CYCLE

The fundamental problem that concerns everyone, including the model developers, the model users, and the decision makers who indirectly use the model, is the "correctness" or "credibility" of the model, which is addressed throughout the model life cycle. There are generally four stages in a model life cycle. The first stage concerns problem definition, which involves observing the real system and the interactions among its various components, collecting data on its behavior, and determining the inputs (variables that are readily available and come at relatively cheap cost) and outputs (variables that are of primary interest but usually hard to obtain) of the system. The second stage concerns model conceptualization, which involves making several pertinent assumptions on the components and the structure of the system, establishing the association between model inputs and outputs, and developing some theory to explain the system behavior. The association, the theory, or both can be deterministic or stochastic, with the former always giving the same responses to the same input while the responses in the latter case vary. The third stage concerns model implementation. This is the computerized representation of the conceptual model. It involves choosing an appropriate development platform, developing various algorithms, selecting proper programming languages, and coding the algorithms to realize the conceptual ideas. The fourth stage is model application.


The models are usually used to help solve real-world problems or serve as a tool to assist decision making. However, rarely do models solve all the problems in a single round. Bugs and limitations that have gone unnoticed during previous testing processes are discovered during the application process and necessitate one or several more rounds or even a total overhaul.

Accompanying the modeling processes are a series of testing processes that ensure the accuracy and correctness of the model. Conceptual validation is the comparison of the real system to the conceptual model. This involves determining whether the assumptions, association, and theory underlying the conceptual model are correct and whether the abstraction and representation of the real-world problems are "reasonable" for the intended purpose of the model. Model verification is concerned with building the model correctly. It assures that the conceptual model is reflected accurately in the computerized representation. Model calibration is an iterative process of comparing the model to actual system behavior and using the discrepancies between the two, and the insights gained, to improve the model. More specifically, and to make a distinction with validation, calibration is more concerned with estimating and improving the values of various constants and parameters in the model structure, and this process is repeated until model accuracy is judged to be acceptable. Model validation is concerned with building the correct model. The goal of the validation process is to produce a model that represents true system behavior closely enough for the model to be used as a substitute for the actual system for the purpose of experimenting with the system. Model accreditation determines whether a model satisfies a specified accreditation criterion according to a specified process, and whether it is reliable enough to be used by managers and other decision makers.


The above modeling and testing processes as well as their relations are summarized in Figure 1. The dashed lines indicate possible modification of previous stages.

Considering that issues of model calibration have been covered comprehensively by Hourdakis et al. (5), this paper focuses on model validation, a testing process that involves various techniques to compare the model output with the real system behavior. More specifically, the following discussion is made with validating traffic simulation models in mind, though it can be adapted to other models as well.

VALIDATION OF TRAFFIC SIMULATION MODELS

Issues of Validating Traffic Simulation Models

Traffic simulation model validation is not about selecting the best among several alternatives or about testing the goodness of fit between two random samples. It is about comparing two processes, the simulated and the observed, and checking how well one approximates the other. To apply various techniques to make the comparison, one has to first define measures of performance (MOPs), the target variables on which the model assessment is based. In the traffic simulation community, MOPs that are often employed include flow, density, speed, queue length, waiting time, delay, throughput, and so forth.

After the target variables have been determined, one needs to use proper techniques to show how the simulated values compare with the observed values. This can be done qualitatively, by presenting the two in the same figure so that discrepancies are readily discernible, and also quantitatively, by employing mathematical proof of correctness.

[Figure 1 depicts the model life cycle: the modeling processes (problem definition, model conceptualization, model implementation, and model application) accompanied by the testing processes (conceptual validation, model verification, model calibration, model validation, and model accreditation).]

FIGURE 1 Illustration of model life cycle.


In quantitative traffic model validation, there are at least three salient problems. First, in some cases the timing, or ordering, of the samples is very important. For example, traffic density on an urban freeway is typically high around 8:00 a.m., but a peak at night is rarely seen. This implies a frequent concern with testing the goodness of fit (GOF) between two time series processes, that is, the simulation and the observation over time. Second, in some cases the samples are correlated. For example, the traffic density at the next step is a result, at least in part, of the traffic density at previous steps. Correlation is usually a problem in performing statistical tests, which typically require random samples. Third, a traffic model can be a deterministic one in which no randomness is involved. Unlike stochastic models, in which statistical analysis is often employed to study the average behavior of the model, our interest here is to check whether the model makes proper responses at the right times. The preceding problems show that regular statistical tests may not be sufficient and a quantitative evaluation procedure has to be carefully devised to deal with these problems. The following section presents various validation techniques that are available.

Techniques of Validating Traffic Simulation Models

The idea of model validation largely stems from the software engineering community, and numerous techniques have been proposed. Balci (6) compiled a taxonomy that classifies these techniques into four categories: informal, static, dynamic, and formal. Considering that most of the techniques come from the software engineering discipline and address both verification and validation, Sargent (7) narrowed them down to a 2 × 2 matrix specifically for (operational) validation:

                        Observable System          Nonobservable System
Qualitative approach    A. Comparison using        A. Exploration of
                           graphical displays         model behavior
                        B. Exploration of          B. Comparison with
                           model behavior             other models
Quantitative approach   Comparison using           Comparison with other
                        statistical tests          models using statistical
                        and procedures             tests and procedures

Since the traffic system is generally an observable system, the first column is of interest. In this section, a list of frequently used techniques is presented and their relative merits are briefly discussed. In addition, a quantitative technique is proposed that is relatively new to the traffic simulation community but has proven to be very useful in addressing the previously mentioned problems.

Qualitative Techniques

Qualitative techniques, also known as subjective, visual, or informal techniques, are typically performed on the basis of visual comparison of the predicted and observed data in various graphs and plots. This is a generally accepted and fairly reliable means to evaluate model performance and identify problems. However, the downside of this approach is also obvious: its result is likewise qualitative and fuzzy. This is the reason it is necessary to employ quantitative techniques to provide complementary information.


Qualitative techniques generally include, but are not limited to, the following:

• Series plot, where values of the target variable are plotted against their observation number (e.g., time series or space series).

• Contour plot, where a curve links all the points in x-y space having the same z value in an x-y-z coordinate system. For example, a density contour may visualize congested regions in the time-space domain if the density for congestion is properly defined.

• Surface plot, where data points are graphed in a three-dimensional space. This plot contains the most detailed information and can be reduced to the previous two plots by cutting the surface.

• Diagonal plot, where observed values are plotted against the simulated values and an ideal fit would be a 45° line. Sometimes a transformation might be necessary to stretch or squeeze data points so that they are aligned evenly along the line.

• Histogram, where the frequency of errors is displayed and a favorable outcome generally shows a bell shape with most errors centered around 0. All the above techniques fall into the category of "comparison using graphical displays" in the matrix above.

• Animation, where a graphical interface plays back what happened during simulation; this falls into the category of "exploration of model behavior" in the matrix above. Animation can be used to examine the "reasonableness" of the model behavior, and it may also provide useful information if one makes a side-by-side comparison with real-world video.

Quantitative Techniques

Quantitative techniques, also known as objective, numerical, or formal techniques, quantify the difference between the simulated and the observed. Depending on how much confidence is expected, quantitative validation can be performed with the use of statistical measures and statistical inference.

Statistical Measures   Suppose there are two processes X (the simulated) and Y (the observed): X_1, X_2, . . . , X_n and Y_1, Y_2, . . . , Y_n, where n is the sample size. Let the residuals Z be the paired differences between the two processes: Z_i = Y_i − X_i, i = 1, 2, . . . , n. Many statistical measures are available to quantify the difference. The following is a frequently employed subset.

• Mean error (ME): ME is computed as

ME = (1/n) Σ_{i=1}^{n} Z_i

Obviously, a drawback of this measure is that positive and negative errors can cancel each other out and result in a small ME while the simulated substantially differs from the observed, so this measure is not a very good indicator of overall fit and should be used with caution.

• Mean absolute error (MAE): MAE is computed as

MAE = (1/n) Σ_{i=1}^{n} |Z_i|

This fixes the problem of ME by flipping negative errors to the positive side. However, MAE might "de-emphasize" outliers as compared with the root-mean-square error.

Page 4: Systematic Approach for Validating Traffic Simulation Models

• Mean squared error (MSE) and root-mean-square error (RMSE): MSE is computed as

MSE = (1/n) Σ_{i=1}^{n} Z_i²

and RMSE is computed as

RMSE = sqrt( (1/n) Σ_{i=1}^{n} Z_i² )

Generally, MSE and RMSE are frequently used goodness-of-fit measures.

All the preceding statistics measure error in absolute terms. However, in some cases the relative error might be more informative because it is more readily interpretable. The following are some examples:

• Mean percentage error (MPE):

MPE = (1/n) Σ_{i=1}^{n} [(Y_i − X_i)/Y_i] × 100

• Mean absolute percentage error (MAPE):

MAPE = (1/n) Σ_{i=1}^{n} [|Y_i − X_i|/Y_i] × 100

• Root-mean-square percentage error (RMSPE):

RMSPE = sqrt( (1/n) Σ_{i=1}^{n} [(X_i − Y_i)/Y_i]² )

All the preceding statistics are about the mean of the residuals. It is sometimes desirable to show how spread out the residuals are, and the measures often employed are variance (VAR) and standard deviation (SD), a measure of the scatter or variability about the mean in a series of data points:

VAR = (1/n) Σ_{i=1}^{n} (Z_i − Z̄)²   and   SD = sqrt(VAR),   where Z̄ = (1/n) Σ_{i=1}^{n} Z_i
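As a minimal sketch (not part of the original paper), the measures above can be computed for two paired series in a few lines of Python; the function name gof_measures and the demo numbers are illustrative only:

```python
import math

def gof_measures(observed, simulated):
    """Sketch of the goodness-of-fit measures in the text.

    Residuals follow the convention Z_i = Y_i - X_i, with Y = observed and
    X = simulated; the percentage measures assume nonzero observed values.
    """
    n = len(observed)
    z = [y - x for x, y in zip(simulated, observed)]
    me = sum(z) / n                                      # mean error
    mae = sum(abs(zi) for zi in z) / n                   # mean absolute error
    mse = sum(zi ** 2 for zi in z) / n                   # mean squared error
    mpe = (100.0 / n) * sum(zi / y for zi, y in zip(z, observed))
    mape = (100.0 / n) * sum(abs(zi) / y for zi, y in zip(z, observed))
    rmspe = math.sqrt(sum((zi / y) ** 2 for zi, y in zip(z, observed)) / n)
    var = sum((zi - me) ** 2 for zi in z) / n            # 1/n form, as in the text
    return {"ME": me, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "MPE": mpe, "MAPE": mape, "RMSPE": rmspe,
            "VAR": var, "SD": math.sqrt(var)}

# Hypothetical 4-interval density comparison: ME cancels to exactly 0 even
# though the simulation misses two intervals, while MAE does not.
m = gof_measures(observed=[50.0, 40.0, 60.0, 55.0],
                 simulated=[48.0, 44.0, 60.0, 53.0])
print(m["ME"], m["MAE"])   # 0.0 2.0
```

The demo illustrates the cautionary note on ME above: offsetting errors of +2, −4, 0, and +2 average to zero, while MAE and RMSE still flag the misfit.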

Statistical Inference   Although the statistical measures provide some information on the GOF between the simulated and the observed, researchers are often faced with such questions as "How much confidence should be given to these measures?" Those measures characterize performance only on average. Very often, one needs a higher level of confidence when interpreting simulation results, and this has to be achieved by statistical inference.

In response to the three problems posed at the beginning of this section, the remainder of this section and the following section discuss a statistical inference technique used to address these issues. The first problem pertains to the ordering of data points, which means the sequence of simulated results has to match the observed results. This problem can be solved by pairing the two groups of data. For example, rather than working on the raw data as two groups, the


residuals are worked on after pairing the simulated and the observed, such as the Z_i's above. In this way, the ordering of data is automatically accounted for. The second problem is the correlation/autocorrelation in the raw data as well as in the residuals. Correlation/autocorrelation is usually a problem in performing statistical tests because they typically require random/independent samples. To circumvent the problem of autocorrelation, the effective sample size has to be reduced, and the reduction in the number of independent observations has implications for hypothesis tests. For example, if a series has a sample size of 100 and a first-order autocorrelation (dependence on lag 1 only) of 0.50, the effective sample size after first-order adjustment would be 33 (8) (i.e., two-thirds of the raw data is wasted), and this does not guarantee that autocorrelation has been eliminated from the resulting data. Fortunately, a batch means technique, which is discussed in the next section, can reasonably deal with autocorrelation and still make full use of the raw data.
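The lag-1 adjustment cited above can be sketched as follows. The formula n(1 − ρ)/(1 + ρ) is the common first-order form and is an assumption here (the source cites reference (8) for the numbers, not the formula), though it reproduces the 100 → 33 example in the text:

```python
def effective_sample_size(n, rho1):
    """First-order (lag-1) autocorrelation adjustment of the sample size.

    Assumes the common AR(1)-style correction n' = n * (1 - rho) / (1 + rho).
    """
    return int(n * (1.0 - rho1) / (1.0 + rho1))

# 100 correlated observations with lag-1 autocorrelation 0.50 carry only
# about 33 independent observations' worth of information:
print(effective_sample_size(100, 0.50))   # 33
```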

The last problem has to do with the meaning of goodness of fit. GOF may have different meanings in different contexts, and the nuance may not be easily captured. For example, in model validation, GOF often means how closely the model output approximates the observation, whereas in statistics, GOF typically means whether two sets of observations could reasonably have come from the same distribution. What makes things even more complicated is that model validation often involves statistical tests, so the two meanings of GOF are easily mixed. Therefore, one must be clear about which one is needed. Traffic simulation model validation frequently calls for the former meaning of GOF. The next section discusses a simultaneous inference technique, which involves statistical tests on both model accuracy and precision, to ensure that our assessment of model performance is objective and informative.

SIMULTANEOUS INFERENCE TECHNIQUE

Generally, the objectives of quantitative model validation are the following:

• Objective A: Test whether the model is capable of replicating the real system with sufficient accuracy (i.e., the simulation is unbiased). This translates to testing whether the mean of the modeling error is statistically different from zero.

• Objective B: Keep in mind that the passing of the test in A is only a necessary condition for a good model because large positive and negative errors can cancel each other and still yield a zero mean. Hence it is important to check whether the variance of the modeling error is reasonably small (i.e., whether the model is capable of replicating the real system with sufficient precision).

• Objective C: With Tests A and B combined, one is fairly confident about drawing a conclusion. However, something has to be done to address the problem of correlation/autocorrelation before performing the above tests. Otherwise their validity can be undermined substantially.

With these considerations in mind, simultaneous statistical tests are devised based on the following two hypotheses:

• Hypothesis 1: The prediction is unbiased. This is intended to address Objective A.

• Hypothesis 2: The modeling error is reasonably small. This is intended to address Objective B.

Page 5: Systematic Approach for Validating Traffic Simulation Models

The batch means estimator for variance σ2 is VB:

Thus

and they are nearly independent. Therefore,

The batch means confidence interval is constructed as

It can be shown that, as m → ∞ while b is held constant, the cover-age of the confidence interval → 1 − α. On the other hand, to mini-mize the mean squared error of VB as an estimator of σ2, b >> m isdesirable. Therefore, when choosing m and b, there are two com-peting criteria and one has to properly trade off between the twobased on his or her goals.

It is interesting to note that there is also an overlapping batchmeans technique where two consecutive batches overlap. Compar-ing with regular batch means technique, overlapping batch meanstechnique has larger b for the same n and m. In this way, more degreesof freedom are obtained at some cost of correlation, and this trans-lates to a shorter confidence interval. Because this technique is notour main interest, it will not be discussed in detail here.

Simultaneous Statistical Inference Technique

On the basis of the analysis at the beginning of this section, thesimultaneous hypotheses are formally made as follows:

• Hypothesis 1: H 10 : µZ = 0

• Hypothesis 2: H 20 : σ2

Z ≤ �

where µ and σ2Z are the mean and variance of process Z and � is

a preselected number that is reasonably small depending on the specific problem. Since Z might be a non-IID process, thebatch means technique is applied and the result is a new process

with batch size m and number of batches b.After applying the batch means technique, we are assumed to haveobtained a (nearly) IID stationary process Z

–and the simultaneous

hypotheses are redefined as follows:

• Hypothesis 1: H 10 . µZ

_ = 0• Hypothesis 2: H 2

0 . σ2Z_ ≤ �

To test Hypothesis 1, a regular Student’s t-test will suffice. The teststatistic t0 is computed:

Z Z Z Zm m bm= { }1 2, , . . . ,

µ α∈ ±−

Z t V nnb

B

21,

ˆ

Z

n

V

Z

V nt b

n

B

n

B

= − ≈ −( )

µσ

σ

µ2

21

ˆ ˆ

Vb

bB ≈ −( )

−σ χ2 2 1

1

Z

nNn − ≈ ( )µ

σ20 1,

Vm

bZ ZB im n

i

b

=−

−( )=∑1

2

1

To address Objective C, the above tests are going to be performed byusing the batch means technique (9), which is particularly designed tohandle correlation/autocorrelation in samples. To facilitate subsequentdiscussion, the batch means technique is presented below.

Batch Means Technique

Suppose there are two processes as above where X is the steady-statesimulation output and Y is the observation in the field. Their residu-als are computed as Zi = Xi − Yi, i = 1, 2, . . . , n. Notice that Z mightbe a nonidentically independently distributed (non-IID) process. Forexample, it can be correlated so that the samples are not indepen-dent, or it can be nonstationary because its variability increases as X and Y get large. Generally, there is no uniform treatment to turn anonstationary process stationary. However, if the process exhibitssome special pattern, a log transformation might be helpful to serveour purpose. Considering that traffic flow measurements such asdensity, flow, speed, and queue length often take positive values anda log transformation happens to lead to a nice feature of percentageerror measurement at a predetermined confidence level, this trans-formation deserves special note and hence serves as the basis of thefollowing discussion. If, luckily, process Z is already stationarybefore transformation, the general idea about the following proce-dure still applies, except that no transformation is needed and themeasurement of error is in absolute terms rather than percentage.

Let Zi = ln Xi − ln Yi, i = 1, 2, . . . , n, and assume this makes process Z stationary. Next, let's address the problem of nonindependence by batching the data:

    batch 1: Z1, Z2, . . . , Zm
    batch 2: Z_{m+1}, Z_{m+2}, . . . , Z_{2m}
    . . .
    batch b: Z_{(b−1)m+1}, Z_{(b−1)m+2}, . . . , Z_{bm}

where m is the batch size and b is the number of batches. Obviously, n = m × b.

For each batch, its batch mean is used as the "representative" of the batch:

    Z̄1,m = (1/m) Σ_{j=1}^{m} Zj
    Z̄2,m = (1/m) Σ_{j=1}^{m} Z_{m+j}
    . . .
    Z̄b,m = (1/m) Σ_{j=1}^{m} Z_{(b−1)m+j}

If m is large enough, the batch means are approximately normally distributed, that is,

    Z̄1,m, Z̄2,m, . . . , Z̄b,m ≈ IID N(E[Z̄1,m], VAR(Z̄1,m))

Here, independence is roughly achieved because most Zi's in one batch are nearly independent of most Zi's in another batch for large m. Identical distribution is also achieved by the central limit theorem, considering that the process is in steady state by assumption and that the Zi's are stationary after transformation.

Next, one estimates the mean of process Z, µ = E[Z], and constructs a batch means confidence interval for it. The batch means estimator for µ is the grand mean Z̄n:

    Z̄n = (1/b) Σ_{i=1}^{b} Z̄i,m
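As a concrete sketch, the batching, grand mean, and variance estimator V̂B described above can be computed in a few lines (a minimal Python illustration; the function names and the toy residual series in the test below are hypothetical, not from the paper):

```python
import math

def batch_means(z, b):
    """Split the residual series z into b contiguous batches and return the batch means."""
    m = len(z) // b  # batch size; assumes n = m * b exactly
    return [sum(z[i * m:(i + 1) * m]) / m for i in range(b)]

def batch_means_stats(z, b):
    """Grand mean Zbar_n, variance estimator V_B, and t statistic t0 = Zbar_n / sqrt(V_B / b)."""
    zbars = batch_means(z, b)
    grand = sum(zbars) / b                                   # grand mean of the batch means
    v_b = sum((zb - grand) ** 2 for zb in zbars) / (b - 1)   # sample variance of the batch means
    t0 = grand / math.sqrt(v_b / b)
    return grand, v_b, t0
```

With b of roughly 30 or more batches, t0 would then be compared with the critical value t_{α/2, b−1} as described for Hypothesis 1.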

24 Transportation Research Record 1876

Page 6: Systematic Approach for Validating Traffic Simulation Models

Ni, Leonard, Guin, and Williams 25

FIGURE 2 Test site (not to scale).

where Z̄n is the grand mean of the batch means and V̂B is the batch means estimator for the variance VAR(Z̄i,m), as previously defined. If t0 > t_{α/2, b−1}, the null hypothesis H1₀ is rejected. Otherwise, there is a lack of evidence that E[Z̄i,m] is statistically different from 0.

For Hypothesis 2, because

    V̂B ≈ ε χ²(b − 1)/(b − 1)

under null hypothesis H2₀, a chi-squared test serves the need. The test statistic χ0² is computed:

    χ0² = (b − 1) V̂B / ε

which is χ²(b − 1) distributed. If χ0² > χ²_{α, b−1}, H2₀ is rejected. Otherwise, there is a lack of evidence that VAR(Z̄i,m) is statistically greater than ε.

In order to interpret the above tests, let's plug the Xi's and Yi's into the above test statistics. Suppose a log transformation on the Xi's and Yi's (i.e., Zi = ln Xi − ln Yi) has been made. From the definition of batch means,

    Z̄i,m = (1/m) Σ_{j=1}^{m} [ln X_{(i−1)m+j} − ln Y_{(i−1)m+j}]
         = ln [ (Π_{j=1}^{m} X_{(i−1)m+j})^{1/m} / (Π_{j=1}^{m} Y_{(i−1)m+j})^{1/m} ]
         = ln (X̂i,m / Ŷi,m)

where X̂i,m denotes the geometric mean of the ith batch of the simulation output and Ŷi,m denotes the geometric mean of the ith batch of the field observation. It follows that

    E[ln(X̂i,m / Ŷi,m)] = E[Z̄i,m]    and    E[X̂i,m / Ŷi,m] = e^{E[Z̄i,m]}
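The identity relating the batch mean of log residuals to the ratio of batch geometric means is easy to verify numerically (a sketch with a hypothetical three-sample batch; the values are illustrative only):

```python
import math

def geometric_mean(vals):
    """Geometric mean: the m-th root of the product, computed via logs."""
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

# Hypothetical densities for one batch (m = 3)
x = [10.0, 20.0, 40.0]   # simulated
y = [12.0, 18.0, 44.0]   # observed

z_bar = sum(math.log(a) - math.log(b) for a, b in zip(x, y)) / len(x)
ratio = geometric_mean(x) / geometric_mean(y)

# The batch mean of the log residuals equals ln of the ratio of geometric means
assert abs(z_bar - math.log(ratio)) < 1e-12
```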

Notice that the above equations are derived from the geometric means of the simulated and observed batches, so interpretation of the simultaneous statistical test has to be based on batches. The t-test implies that, if t0 > t_{α/2, b−1}, there is strong evidence that E[Z̄i,m] is statistically different from 0 (i.e., the expected value of the geometric mean of the simulated is statistically different from that of the observed). Otherwise, the opposite is accepted. The chi-squared test can be translated to a confidence interval. Since one can reasonably assume that the Z̄i,m's are normally distributed, the confidence interval for the ratio of the geometric mean of the simulated to that of the observed can be constructed as e^{Z̄n ± t_{α/2, b−1} √ε} with a predetermined small number ε. This actually gives us a type of percentage error at confidence level (1 − α).
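As a sketch, this percentage-error interval e^{Z̄n ± t_{α/2,b−1}√ε} − 1 can be evaluated directly. The inputs below are taken from the summary later in this section; because those values are rounded, the result differs slightly from the interval reported there:

```python
import math

# Inputs from the application summary (Zbar_n, t_{0.025,31}, and epsilon)
grand_mean = 0.004135
t_crit = 2.04227
eps = 0.001395

half_width = t_crit * math.sqrt(eps)
lo = math.exp(grand_mean - half_width) - 1.0   # lower percentage-error bound
hi = math.exp(grand_mean + half_width) - 1.0   # upper percentage-error bound
```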

APPLICATION OF VALIDATION TECHNIQUES

This section serves as an illustration of the application of the previously mentioned validation techniques. An enhanced KWaves model is used to demonstrate the application. The original KWaves model was proposed by Newell (10–12) and is equivalent to the conceptual model shown in Figure 1. Newell did not conduct any test of the effectiveness of his model; this conceptual validation was done by Son (13) and Hurdle and Son (14). To be able to apply this model, Leonard (15) coded the original model in software GTWaves, which is equivalent to the computerized model. Later on, this model was enhanced and enabled to handle network traffic, which corresponds to a modification of the conceptual model as indicated by the horizontal dashed arrow in Figure 1. The enhanced model was totally reimplemented on the basis of Java and XML technologies, resulting in a new computerized model that serves as the illustrative example on which the application of the above validation techniques is going to be demonstrated.

This example simulates kinematic waves at the freeway corridor of GA-400 (a toll road but "freeway by design" as far as our test site is concerned) in Atlanta, Georgia. Starting near Kimball Bridge Road to the north and ending near Pitts Road to the south, the test site includes four on-ramps and four off-ramps, as shown in Figure 2. Observation stations (video cameras) are deployed along the study site approximately every 1/3 mi, but only a few strategic stations are shown. Both the main line and the ramps are monitored. A station consists of a few detectors, with each detector collecting data on a single lane. Numbers starting with "400" in the figure correspond to real observation stations, and the other numbers show illustrative locations of merges and diverges (i.e., they are not observation stations).

Data are collected from the field every 20 s and, to smooth out local variation while still achieving fine resolution, are merged into 5-min bins, which are typically adopted in traffic studies. Usable information in the data set includes vehicle counts and time mean speeds. Although density and occupancy are provided in the raw data, analysis shows that they are generally inaccurate and inappropriate for use in the test. On the other hand, space mean speed is unavailable, so density has to be synthesized from flow and time mean speed based on the fundamental relationship between speed, flow, and density. Considering that this is a freeway scenario where speed variance is relatively small and that this simulation is performed at the macroscopic level, this synthesis approach is acceptable.
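The synthesis step amounts to applying the fundamental relation k = q/v to each 5-min bin (a minimal sketch; the function name and unit choices are illustrative, not from the paper):

```python
def synthesize_density(flow_vphpl, speed_kph):
    """Density k (veh/km/ln) from the fundamental relation k = q / v.

    flow_vphpl: flow in vehicles per hour per lane
    speed_kph:  time mean speed in km/h, used here as a stand-in for
                space mean speed (acceptable when speed variance is small)
    """
    if speed_kph <= 0:
        raise ValueError("speed must be positive")
    return flow_vphpl / speed_kph
```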

Three categories of input data are involved in the simulation: link geometry data (link length and number of lanes), traffic characteristics data (capacity, free flow speed, and jam density), and origin–destination (O-D) flows. Link length and number of lanes are measured in the field; capacity, free flow speed, and jam density are obtained from the speed–flow plot of observations at each link; O-D flows are generated from link flows by using the RPE O-D estimator proposed by Nihan and Davis (16). The computerized model outputs various measures of effectiveness (MOEs), including traffic density, which is the aforementioned target variable on which model validation is going to be performed.

Data used for the test were collected on Friday, October 11, 2002, from 05:50:00 to 19:20:00. Results of the comparison between simulation output and field observation are presented as follows. Figure 3 shows a subset of density-versus-time curves for links of the test site. These are actually time-series plots that show the variation of density over time. The solid lines are observed density and the dashed lines are simulated/predicted density. The y-axis is density in vehicles per kilometer per lane (veh/km/ln) and the x-axis is time of day in hh:mm:ss format. These plots provide detailed information on how the model simulation approximates the real system in a continuous temporal domain but discrete spatial domain.

Figure 4 shows density contours for the traffic condition on the main line based on a density level of 28 veh/km/ln, which is chosen as the delineator of congested and uncongested regions. The x-axis represents location nodes running from north (left) to south (right) and the y-axis is time of day. Bear in mind that this density contour presents only limited information, for the following reasons. First, it involves only the freeway main line, which is usually the main interest of traffic engineers. Second, it shows a binary choice plot of congested or uncongested regions; much of the depth information is hidden. Despite these limitations, this figure provides a perfect opportunity to view the formation and dissipation of queues and to compare the extents and shapes of the predicted and observed congestion areas. The figure shows the predicted density contour in dashed lines and the observed contour in solid lines. The enclosed areas can be interpreted as the congested regions. Of course, one can reasonably view the contour lines as the trajectories of queues. The figure shows a morning peak and an afternoon peak. The morning peak appears approximately between 7:00 and 8:30 and extends from the downstream end backward up to Node 6103. At about 6:50, the morning queue starts to build up, passing Node 4001128. About 40 min later, the queue reaches Node 6103. The queue then starts to dissipate and reaches somewhere downstream of Node 5103 at 8:15. In the next 15 min, the queue swiftly shrinks to Node 4001128 and disappears before 9:05. The afternoon peak is much smaller in scale. It lasts from 17:15 to 17:55 and spans three links between Nodes 5102 and 6104. There are some slight discrepancies between the simulated and the observed contours. For example, the simulated morning peak builds up a little earlier and shrinks a little later than the observed one. On the other hand, the simulated afternoon peak fails to capture a minor queue. Given these, the density contour shows good agreement between the simulation and the observation.

The density surface is a continuous three-dimensional graph of density, which gives the most detailed information about density in the time–space domain. In Figure 5, the x-axis is location/nodes, the y-axis is time of day, and the z-axis is density in vehicles per kilometer per lane. Visual comparison of the two surfaces shows that they are very similar to each other except for the sharp spike of the observed surface. Again, the formation and dissipation of the queues are easy to read, and all findings in the previous paragraph apply here.

Figure 6 is a scatter plot of the simulated density against the observed density. Sometimes the data points may exhibit a funnel shape, with smaller variability when density is low and higher variability as density gets large, or the data points may be clustered at one end of the line. To stretch out the figure, a transformation is often made. This figure shows densities after natural log transformation, so the "ln" leading the coordinate captions means "natural log" while the "ln" in the unit of density means "lane." The purpose of this plot is to pull together all data points, regardless of time and space, and see whether the predicted density is consistently close to the observed density. The ideal fit would be a 45° line that runs from the bottom left to the upper right. The figure shows such a pattern, with data points densely clustered along the 45° line.

Figure 7 shows the histogram of residuals (i.e., the plot of the frequency of prediction/modeling error). In this example, there are 2,592 samples and a group interval of 1 veh/km/ln is used. Normally, the residuals should be distributed in a bell shape (i.e., more points around 0 and the rest balanced on both sides). The figure confirms such a pattern.

To apply the simultaneous statistical inference technique, values of the batch size (m) and number of batches (b) have to be carefully selected. If the sample population is limited, one needs to trade off between two competing criteria: m needs to go to infinity while holding b constant to get a true (1 − α) coverage of the confidence interval, and b needs to be as large as possible to minimize the mean squared error of V̂B as an estimator of σ². Empirical studies recommend that, for better coverage of the confidence interval, one needs at least about 30 batches. In this case, there is a total of 2,592 samples. If the number of batches is set at 32, the batch size will be 81. This corresponds to dividing the samples for each link into two batches, with each containing a peak (morning or afternoon). Results of the simultaneous statistical tests are summarized as follows:

Batch size (m): 81
Number of batches (b): 32
Total samples (n): 2,592
Significance level (α): 0.05
Grand mean (Z̄n): 0.004135
Variance (V̂B): 0.002025
t-test statistic (t0): 0.519792
Critical value (t0.025,31): 2.04227
t-test result: Fail to reject H0, i.e., E[Z̄i,m] = 0
Small number (ε): 0.001395
Critical χ² value (χ²0.05,31): 45.0
95% C.I. for percentage error: (−0.068126, 0.082017) × 100%

Interpretation of the results can be as follows. If the predicted density is compared with the observed density based on batches of the specified size, the difference of the two will be zero in approximately 95 out of every 100 trials (i.e., the mean of the prediction errors is not statistically different from 0). And if it is accepted that ε is reasonably small, the variance of the ratio of the simulated value to the observed value will be less than or equal to this number in approximately 95 out of every 100 trials. This also translates to a 95% confidence interval of (−0.068126, 0.082017) × 100% for the percentage modeling error.

As a point of interest, traditional statistical tests were also performed on the basis of the observed density, Yi, and the simulated density, Xi. Basic statistical measures show that the RMSE of the residuals is 4.37 and the RMSPE is ±11%, which provides not only poorer coverage but also only 50% confidence compared with our approach. The result of a two-sample paired t-test supports the same conclusion as our approach and provides a 95% confidence interval of (−0.1140838, 0.2291890). Considering that the mean of the observed density is about 16.25, this confidence interval translates to a percentage interval of roughly (−0.007, 0.014) × 100%. A two-sample Kolmogorov–Smirnov goodness-of-fit test shows that the distributions of the observed and simulated densities are not statistically different. However, one must be very careful with these statistical test results. The two-sample t-test pools all samples together, so the sample size increases dramatically; that is why a much tighter confidence interval results.
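For reference, the two traditional measures mentioned here can be sketched as follows (hypothetical helper names; sim and obs are equal-length sequences of simulated and observed densities):

```python
import math

def rmse(sim, obs):
    """Root mean squared error of the residuals."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(sim))

def rmspe(sim, obs):
    """Root mean squared percentage error, relative to the observed values."""
    return math.sqrt(sum(((s - o) / o) ** 2 for s, o in zip(sim, obs)) / len(sim))
```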

FIGURE 3 Series plots (partial): (a) Link 1, (b) Link 2, (c) Link 3, and (d) Link 4.


FIGURE 4 Density contour.

FIGURE 5 Density surfaces: (a) observed. (continued)


FIGURE 5 (continued) Density surfaces: (b) predicted.

FIGURE 6 Diagonal plot (values are log transformed).


On the other hand, this test fails to address the problems of nonindependent and nonstationary samples, which may significantly undermine the power of statistical inference. Unfortunately, it turns out that these problems do exist in our data. As for the goodness-of-fit test, it tests only the cumulative distribution functions (cdf) of the observed density and the simulated density, regardless of the timing or ordering of each sample. Therefore, it is not the type of goodness of fit that must be considered.
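The ordering blindness is easy to demonstrate: two series with identical empirical distributions, and hence indistinguishable to a K-S test, can still be a terrible point-by-point fit (a toy illustration with made-up values):

```python
obs = [1.0, 2.0, 3.0, 4.0]   # observed series
sim = [4.0, 3.0, 2.0, 1.0]   # simulated series: same values, reversed in time

# Identical empirical distributions: a two-sample K-S statistic would be 0
assert sorted(obs) == sorted(sim)

# Yet the time-matched residuals are large
rmse = (sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs)) ** 0.5
assert rmse > 2.0   # sqrt(5), far from a good fit
```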

CONCLUSION

Although there are over 100 traffic simulation models, papers on model calibration and validation are relatively few. This paper segments a model life cycle into four stages and fits the various testing processes into that picture. To maintain focus, this paper discusses only techniques for model validation, a testing process that compares the model output with the real system behavior.

Validation of a traffic simulation model is typically characterized by the following considerations. First, the timing of samples might be important, as in the case of a time-series process. Second, the model output might be correlated, which affects the way in which statistical inference is made. Third, a good fit might be based on how closely the simulated process approximates the observed one, rather than on the GOF of their distributions. These considerations should drive the selection of validation techniques.

Model validation generally includes qualitative and quantitative techniques. This paper compiles a list of qualitative validation techniques, and the relative merits of these techniques are briefly discussed. Quantitative validation is typically performed statistically. For example, one can measure model fit by some statistical measures and/or by making statistical inference. This paper presents a list of the commonly used statistical measures of GOF, together with their formulae and cautions when applying them. In addition, a simultaneous statistical inference technique is presented that tests both the accuracy (preferably zero mean) and the precision (preferably small variance) of the model output.


Correlation/autocorrelation is a typical problem of simulation output that can be addressed by the batch means technique. By batching the simulation outputs and making statistical tests based on the batch means, the information contained in the simulation output can be fully utilized while autocorrelation is eliminated at the same time.

To illustrate the application of these techniques, results of validating KWaves, an enhanced model based on Newell's simplified theory of kinematic waves, are presented. These techniques help examine the model performance from various aspects in a systematic manner and thus are very useful in model validation. Also, a brief comparison is made between the test results of traditional approaches and the approach proposed here, and the advantages of the latter approach are demonstrated.

It should be pointed out that the validation techniques presented in this paper are not meant to be a complete list of those available. The endeavor of the authors was to contribute some ideas as well as to promote the establishment of a commonly accepted practice of model testing in the traffic simulation community.

ACKNOWLEDGMENT

The authors thank David Goldsman at the School of Industrial and Systems Engineering, Georgia Institute of Technology, for his many insightful suggestions in formulating the statistical inference techniques. Special thanks also are extended to the anonymous referees for their comments that helped improve the quality of this paper.

REFERENCES

1. Benekohal, R. F. Procedure for Validation of Microscopic Traffic Flow Simulation Models. In Transportation Research Record 1320, TRB, National Research Council, Washington, D.C., 1991, pp. 190–202.

2. Rao, L., and L. Owen. Development and Application of a Validation Framework for Traffic Simulation Models. Proc., IEEE Winter Simulation Conference, 1998, pp. 1079–1086.

FIGURE 7 Histogram of modeling error.


3. Rakha, H., B. Hellinga, M. Van Aerde, and W. Perez. Systematic Verification, Validation, and Calibration of Traffic Simulation Models. Presented at 75th Annual Meeting of the Transportation Research Board, Washington, D.C., 1996.

4. Hellinga, B. Requirements for the Validation and Calibration of Traffic Simulation Models. Proc., Canadian Society for Civil Engineering 1998 Annual Conference, Vol. IVb, 1998, pp. 211–222.

5. Hourdakis, J., P. G. Michalopoulos, and J. Kottommannil. Practical Procedure for Calibrating Microscopic Traffic Simulation Models. In Transportation Research Record: Journal of the Transportation Research Board, No. 1852, TRB, National Research Council, Washington, D.C., 2003, pp. 130–139.

6. Balci, O. Verification, Validation and Accreditation of Simulation Models. Proc., IEEE Winter Simulation Conference, 1997, pp. 135–141.

7. Sargent, R. G. Verification and Validation of Simulation Models. Proc., IEEE Winter Simulation Conference, 1998, Vol. 1, pp. 121–130.

8. Dawdy, D. R., and N. C. Matalas. Statistical and Probability Analysis of Hydrologic Data. Part III: Analysis of Variance, Covariance and Time Series. In Handbook of Applied Hydrology, a Compendium of Water-Resources Technology (V. T. Chow, ed.), McGraw-Hill, New York, 1964, pp. 8.68–8.90.


9. Goldsman, D., and G. Tokol. Output Analysis Procedures for Computer Simulations. Proc., IEEE Winter Simulation Conference, 2000, pp. 39–45.

10. Newell, G. F. A Simplified Theory of Kinematic Waves in Highway Traffic. Part I: General Theory. Transportation Research B, Vol. 27, No. 4, 1993, pp. 281–287.

11. Newell, G. F. A Simplified Theory of Kinematic Waves in Highway Traffic. Part II: Queueing at Freeway Bottlenecks. Transportation Research B, Vol. 27, No. 4, 1993, pp. 289–303.

12. Newell, G. F. A Simplified Theory of Kinematic Waves in Highway Traffic. Part III: Multi-destination Flows. Transportation Research B, Vol. 27, No. 4, 1993, pp. 305–313.

13. Son, B. A Study of G. F. Newell's Simplified Theory of Kinematic Waves in Highway Traffic. Ph.D. dissertation. University of Toronto, 1996.

14. Hurdle, V. F., and B. Son. Road Test of a Freeway Model. Transportation Research A, Vol. 34, No. 7, 2000, pp. 537–564.

15. Leonard, J. D. Computer Implementation of a Simplified Theory of Kinematic Waves. Report FHWA/CA/TO-98-01. California Department of Transportation, Sacramento, 1997.

16. Nihan, N. L., and G. A. Davis. Recursive Estimation of Origin–Destination Matrices from Input/Output Counts. Transportation Research B, Vol. 21, No. 2, 1987, pp. 149–163.