Performance Evaluation © Dr. Ayman Abdel-Hamid, CS5014, Fall 2006 1 CS5014 Research Methods in CS Dr. Ayman Abdel-Hamid Computer Science Department Virginia.

Performance Evaluation

© Dr. Ayman Abdel-Hamid, CS5014, Fall 2006

1

CS5014

Research Methods in CS

Dr. Ayman Abdel-Hamid

Computer Science Department

Virginia Tech




2

Outline

Performance Evaluation

•Introduction

•Common Mistakes

Some of the material is based on Dr. Cliff Shaffer’s Notes for CS5014. Department of Computer Science, Virginia Tech



3

Examples 1/2

• Evaluate design alternatives

• Compare two or more computers, programs, algorithms

Speed, memory, usability

• Determine optimum value of a parameter (tuning, optimization)

• Locate bottlenecks

• Characterize load

• Prediction of performance on future loads

• Determine number and size of components required



4

Examples 2/2

•Which is the best sorting algorithm?

•What factors affect data structure visualizations?

•Code-tune a program

•Which interface design is better?

•What are the best parameter values for a biological model?



5

The Art of Performance Evaluation

•Throughput in Transactions per Second

System Workload1 Workload2

A 20 10

B 10 20

How Does system A compare to system B?



6

Evaluation Issues

•System

• Hardware, software, network: Clear bounding for the “system" under study

•Techniques

• Measurement, simulation, analytical modeling

•Metrics

• Response time, transactions per second

•Workload: The requests a user gives to the system

•Statistical techniques

•Experimental design

• Maximize information, minimize number of experiments



7

Common Mistakes 1/5

•No goals

Each model is special purpose

Performance problems are vague when first presented

• Biased goals: “OUR system is better than THEIRS”

• Unsystematic approach

• Analysis without understanding

People want guidance, not models

• Incorrect performance metrics

Want correct metrics, not easy ones

• Unrepresentative workload



8

Common Mistakes 2/5

•Wrong evaluation technique

Easy to become married to one approach

• Overlooking important parameters

• Ignoring significant factors

Parameters that are varied in the study are called factors.

There's no use comparing what can't be changed

• Inappropriate experimental design

• Inappropriate level of detail



9

Common Mistakes 3/5

•No analysis

• Erroneous analysis

• No sensitivity analysis

Measure the effect of changing a parameter

• Ignoring errors in input

• Improper treatment of outliers

Outliers are values that are too high or too low compared to a majority of values

Some should be ignored (can't happen)

Some should be retained (key special cases)



10

Common Mistakes 4/5

•Assuming no change in the future

• Ignoring variability

Mean is of low significance in the face of high variability

• Too complex analysis

Complex models are “interesting” and so get published and studied

Real world use is simpler

Decision makers prefer simpler models



11

Common Mistakes 5/5

•Improper presentation of results

•The proper metric to measure the performance of an analyst is the number of analyses that helped decision makers (not the number of analyses performed)

• Ignoring social aspects

• Omitting assumptions and limitations



12

The Error of one-sided Hypothesis

•Consider the hypothesis “X performs better than Y".

•The danger is the following chain of reasoning:

Could this hypothesis be true?

I have evidence that the hypothesis might be true.

Therefore it is true.

What got ignored is any evidence that the hypothesis might not be (or is not) true.



13

A Systematic Approach

1. State goals and define the system boundaries

2. List services and outcomes

3. Select metrics

4. List parameters (System and Workload)

5. Select factors to study

6. Select evaluation technique

7. Select workload

8. Design experiments

9. Analyze and interpret data

10. Present results. Start over, if necessary!



14

Technique Selection

•Choices: Analytical Modeling, Simulation, and Measurement

“Until validated, all evaluation results are suspect."

•Validate one of these approaches by comparing against another. Measurement results are just as susceptible to experimental errors and biases as the other two techniques.

•Criteria for technique selection

Stage of analysis

Time required

Tools

Accuracy

Trade-off evaluation

Cost

Saleability



15

Common Performance Metrics 1/3

•Response Time

Interval between a user’s request and the system response

Simplistic definition assuming request and responses are instantaneous

Definition 1: Time between the moment the user finishes the request and the moment the system starts its response

Definition 2: Time between the moment the user finishes the request and the moment the system completes its response



16


•Throughput: The rate (requests per unit of time) at which requests can be serviced by the system.

Throughput generally increases as the load initially increases.

Eventually it stops increasing, and might then decrease.

Nominal capacity is maximum achievable throughput under ideal workload conditions (bandwidth for computer networks)

Usable capacity is maximum throughput achievable without violating a limit on response time.

Efficiency is ratio of usable to nominal capacity.



17


•Utilization

Fraction of time the resource is busy servicing requests

Ratio of busy time to total elapsed time over a given period

•Bottleneck: Component with highest utilization

Improving this component often gives highest payoff

•Reliability

Probability of errors

Mean time between errors

•Availability

Uptime and downtime

Mean uptime (Mean Time to Failure)

•Cost/performance ratio



18

Workloads

•A workload is the set of requests made by users of the system under study.

A test workload is any workload used in performance studies

A real workload is one observed on a real system. It cannot be repeated.

A synthetic workload is a reproduction of a real workload to be applied to the tested system

•Examples (for CPU performance)

Addition instruction

Instruction mixes

Kernels

Synthetic programs

Application benchmarks



19

Selecting Workloads

1. Determine the services for the SUT (System Under Test)

• View the system as a service provider

• Component Under Study (CUS)?

2. Select the desired level of detail

3. Confirm that the workload is representative

4. Is the workload still valid?

• A real-world workload is not repeatable

• Most workloads are models of real service requests



20

Selecting Workloads-Level of Detail

• Select the desired level of detail

Most frequent request

Frequency of request types

Time-stamped sequence of requests (trace)

Average resource demand

Might be necessary to specify complete probability distribution

Distribution of resource demands



21

Selecting Workloads-Representativeness

• A test workload representative of real application

• Test workload and real application should match in the following

Arrival rate

Resource demands

Resource usage profile



22

Selecting Workloads-Other Factors

• Loading Level

A workload might exercise a system:

To its full capacity (best case)

Beyond its capacity (worst case)

At the load level observed in real workload (typical case)

• Impact of external components

• Repeatability



23

Some Workload Characterization Techniques 1/3

• Averaging

• Uses arithmetic mean

• Alternatives?

• Specifying Dispersion

• Variance

• Markov Models

• Clustering



24


• Markov Models

Assume the next request depends only on the last request

Next system state depends only on current system state

Transition matrices

Ex: typical distribution for some system is about 4 small packets followed by a single large packet

Random chance: probability of small is always .8, large is always .2.

Markov model: small follows small with probability .75, large follows small with probability .25. In contrast, small always follows large.
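The packet example can be sketched in a few lines of Python. This is an illustration only: the dictionary layout and the function names `stationary_small_fraction` and `generate` are assumptions made here, while the transition probabilities are the ones quoted above. The sketch checks that the Markov model still yields 80% small packets in the long run, while never producing two large packets in a row.

```python
import random

# Transition probabilities from the example above.
# Outer key: current packet size; inner key: next packet size.
TRANSITIONS = {
    "small": {"small": 0.75, "large": 0.25},
    "large": {"small": 1.0, "large": 0.0},
}


def stationary_small_fraction(transitions):
    """Long-run fraction of small packets for this two-state chain.

    Solves pi_small = pi_small * P(s->s) + pi_large * P(l->s)
    together with pi_small + pi_large = 1.
    """
    p_sl = transitions["small"]["large"]
    p_ls = transitions["large"]["small"]
    return p_ls / (p_ls + p_sl)


def generate(transitions, length, seed=0):
    """Generate a sequence of packet sizes by walking the chain."""
    rng = random.Random(seed)
    state = "small"
    sequence = []
    for _ in range(length):
        sequence.append(state)
        p_small = transitions[state]["small"]
        state = "small" if rng.random() < p_small else "large"
    return sequence


seq = generate(TRANSITIONS, 100_000)
print("stationary P(small):", stationary_small_fraction(TRANSITIONS))
print("simulated  P(small):", seq.count("small") / len(seq))
# Unlike independent random choice, large never follows large here.
print("large->large pairs:", sum(a == b == "large" for a, b in zip(seq, seq[1:])))
```

Both models produce the same overall mix of packet sizes; only the Markov model reproduces the ordering constraint, which is why a transition matrix carries more information than a single probability.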



25


• Clustering

To separate a population into groups with similar characteristics.

minimize the within-group variance while maximizing the between-group variance

Select representatives to simplify further processing



26

Statistics: Basic Concepts 1/2

• Independent Events: Knowing that one event has occurred does not in any way change our estimate of the probability of the other event.

• Random Variable: takes one of a specified set of values with a specified probability.



27

Statistics: Basic Concepts 2/2

For independent variables, covariance is zero. Why?
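One way to answer the question, sketched with the usual definitions (the notation E[.] for expectation and the means written as mu_x and mu_y are assumed here):

```latex
\begin{align*}
\mathrm{Cov}(x, y) &= E\big[(x - \mu_x)(y - \mu_y)\big] \\
                   &= E[xy] - \mu_x \mu_y \\
                   &= E[x]\,E[y] - \mu_x \mu_y
                      \qquad \text{(independence gives } E[xy] = E[x]\,E[y]\text{)} \\
                   &= 0
\end{align*}
```

The converse does not hold in general: zero covariance does not by itself imply independence.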



28

Mean, Median, and Mode

(highest probability xi)

Probability Density Function (pdf) is the derivative of the CDF. P(x1 < x ≤ x2) = F(x2) – F(x1)



29

Summarizing Data by a Single Number 1/2

• A single number gives key characteristic of data set

• Indices of central tendencies: mean, median, and mode

Sample mean

Sample median

Sort observations in increasing order and take the observation in the middle of the series (if the number of observations is even, take the mean of the middle two values)

Sample mode

Plot a histogram and specify the midpoint where the histogram peaks, or category occurring most frequently

Mean is affected more by outliers than median or mode



30

Summarizing Data by a Single Number 2/2

• Mean and Median always exist and are unique

• Mode may not exist, or might not be unique

• Examples

Unimodal, symmetrical pdf

Bimodal, symmetrical pdf

Uniform density function

Skewed pdf



31

Choosing Mean, Median, or Mode

• You can't take the median or mean of categorical data

Use mode.

• Is the total of interest?

Use mean

(Ex: Total CPU time for five database queries vs. number of windows open on the screen for each query.)

Probably use mean for the times, median for windows.

• Are the data skewed?

Use median



32

Common Misuses of Means

• Using means of significantly different values

Mean of 10 and 1000 msec is 505 msec

• Using mean without regard to skewness of distribution

Observations      Sum   Mean   Typical
10  9 11 10 10     50     10        10
 5  5  5  4 31     50     10         5

• Multiplying means to get the mean of a product

Mean of a product of two random variables is equal to the product of the means, IF the two random variables are independent
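The skewness example can be checked with Python's standard `statistics` module. This is a small sketch; the variable names `row_1` and `row_2` are labels chosen here for the two sets of observations in the table.

```python
import statistics

# The two sets of observations from the skewness table above.
row_1 = [10, 9, 11, 10, 10]
row_2 = [5, 5, 5, 4, 31]

for name, data in (("row 1", row_1), ("row 2", row_2)):
    print(
        name,
        "sum =", sum(data),
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```

Both rows share the same sum and mean, but only for the first row is the mean a typical value. For the skewed second row the median and mode (5) describe the data better than the mean (10).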



33

Summarizing Variability

• Mean, median, mode attempt to provide a single “characteristic" value for the population.

But a single value might not be meaningful

People generally prefer systems with low variability

• There are different ways to measure variability (Indices of dispersion)

Range (max - min)

poor, tends to be unbounded, unstable over a range of observations, susceptible to outliers

Variance or standard deviation

10 and 90 percentiles

Quartiles (box plots)



34

Variance and Standard Deviation

• Why division by n-1?

Only n-1 of the n differences are independent

• Variance is expressed in units that are the square of the unit of the observations.

• Standard deviation is in the same units as the mean

• Better to use coefficient of variation, since it takes the scale of measurement unit out of variability consideration
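A minimal sketch of these three indices, assuming the usual definitions: sample variance with division by n-1, standard deviation as its square root, and coefficient of variation (C.O.V.) as standard deviation divided by mean. The function names are chosen here for illustration, and the data are the two observation sets from the skewness example in these notes.

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def coefficient_of_variation(data):
    """Standard deviation divided by the mean (unitless)."""
    mean = sum(data) / len(data)
    return math.sqrt(sample_variance(data)) / mean


for data in ([10, 9, 11, 10, 10], [5, 5, 5, 4, 31]):
    var = sample_variance(data)
    print(data, "variance =", var,
          "std dev =", round(math.sqrt(var), 3),
          "C.O.V. =", round(coefficient_of_variation(data), 3))
```

Both data sets have a mean of 10, yet the coefficient of variation differs by more than an order of magnitude, which is exactly the information a mean alone hides.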



35

Percentiles

• Can be computed for any variable, even one without bounds

• 0.9-quantile is the 90-percentile

• Percentiles at multiples of 10% are deciles

• Quartiles divide data into 4 parts at 25, 50, and 75%

25% of observations are less than or equal to first quartile

50% of observations are less than or equal to second quartile

α-quantiles can be estimated by sorting the observations and taking the [(n-1)α+1]th element in the ordered set
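The sorting rule can be sketched as follows. Two things are assumptions here: the position is rounded to the nearest integer (the notes do not say how a fractional position is handled), and the sample data are the seven processor-time differences used in a later example of these notes.

```python
def quantile(data, alpha):
    """Estimate the alpha-quantile as the [(n - 1) * alpha + 1]-th sorted element.

    The 1-based position is rounded to the nearest integer (halves round up),
    which is an assumption of this sketch.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = int((n - 1) * alpha + 1 + 0.5)
    return ordered[position - 1]


differences = [1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4]
print("median (0.5-quantile):", quantile(differences, 0.5))
print("90-percentile:        ", quantile(differences, 0.9))
```

Note that the whole data set has to be sorted first, which is one reason percentiles cost more to compute than a mean or variance.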



36

Selecting Index of Dispersion

• Is variable bounded?

Specify range

• If no bounds, is distribution unimodal, symmetric?

Use coefficient of variation

• Is distribution nonsymmetric?

Use percentiles

• Use of average traffic for network design?

Usually designed to carry 95 to 99 percentile of observed load (range or percentile)

• Finding a percentile requires several passes through the data



37

Comparing Systems Using Sample Data

• Most CS research work involves samples from some “population."

Performance experiments on workloads for programs, systems, and networks

• A fundamental goal of CS experimentation is to determine the mean (μ) and variance of some population characteristic (a parameter).

• We can only measure the mean characteristic for the sample, not the population (a statistic)

The parameter value (population mean) is fixed.

The statistic (sample mean) is a random variable.

The values for the statistic come from some sampling distribution.



38

Hypothesis Testing

• One fundamental experimental question: What is the parameter value?

• Another fundamental experimental question: Are two populations different?

Does a subpopulation perform better?

Is one algorithm better than another?

• Null Hypothesis: Two populations have the same mean

• Rejection of the Null Hypothesis: Do so if you are “confident" that the means are different

• Significant Difference: Two means are different with a certain reference confidence (probability)



39

Confidence Intervals



40

Central Limit Theorem



41

Normal Distribution

The sum of a large number of independent observations from any distribution tends to have a normal distribution

•Normal variate denoted as N(μ, σ), where μ is the mean and σ is the standard deviation

•A normal distribution with zero mean and unit variance is called a unit normal or standard normal distribution, denoted N(0,1)

•An α-quantile of a unit normal variate z ~ N(0,1) is denoted by z_α

•If a random variable x has a N(μ, σ) distribution, then (x - μ)/σ has a N(0,1) distribution



42

Confidence Interval Example



43

Confidence Intervals for Small Samples

•The sampling distribution only approximates a normal curve when n gets big enough, over about 30.

•Before that, the “tails" of the distribution are too big, so the estimated confidence interval is too small.

•You can calculate the confidence intervals more precisely, but you need a separate table for every n value.

•Usually there is a table to look it up in, or better yet, the statistical software will calculate it for you

•The 100(1-α)% confidence interval is given by

$$\bar{x} \mp t_{[1-\alpha/2;\, n-1]} \, s/\sqrt{n}$$



44

Small Sample CI Example

•8 observations

-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, and 0.09

•Mean = 0

•Sample standard deviation = 0.138

•t[0.95;7] = 1.895

•90% Confidence interval is

$$0 \mp 1.895 \times 0.138/\sqrt{8} = 0 \mp 0.0926 = (-0.0926,\ 0.0926)$$
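The calculation can be reproduced with a short script. It is a sketch under one simplification: the t value is passed in as read from a table (1.895, as quoted above) rather than computed, and the function name `t_confidence_interval` is chosen here for illustration.

```python
import math
import statistics


def t_confidence_interval(data, t_value):
    """Return (low, high) for mean -/+ t * s / sqrt(n).

    t_value must be the [1 - alpha/2; n - 1] quantile of the t distribution,
    looked up in a table or obtained from statistical software.
    """
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    half_width = t_value * s / math.sqrt(n)
    return mean - half_width, mean + half_width


errors = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09]
low, high = t_confidence_interval(errors, t_value=1.895)  # t[0.95; 7]
print(f"90% confidence interval: ({low:.4f}, {high:.4f})")
print("interval includes zero:", low <= 0.0 <= high)
```

Because the script does not round intermediate values, its half-width can differ from the slide's 0.0926 in the third decimal place.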



45

Testing Against Zero

•Are the differences in processor times for two algorithms significant?

Testing if mean (differences) is zero. Check if zero is within confidence interval at desired confidence.

Note that the number of samples is likely to be small, so take the t value from the table for the correct number of degrees of freedom



46

Testing Against Zero Example

Difference in processor times of two different implementations of the same algorithm was measured on seven similar workloads: {1.5, 2.6, -1.8, 1.3, -0.5, 1.7, and 2.4}. Can we say with 99% confidence that one implementation is superior to the other?

•Sample size = 7

•Mean = 7.2/7 = 1.03

•Sample variance = 2.57

•Sample standard deviation = 1.60

•100(1-α) = 99, α = 0.01, 1-α/2 = 0.995

•t[0.995,6] = 3.707

•Confidence Interval is

$$1.03 \mp 3.707 \times 1.60/\sqrt{7} = (-1.21,\ 3.27)$$

•Interval includes zero, so we cannot say with 99% confidence that one implementation is superior



47

Paired Observations

•Need to compare two systems on very similar workloads

If there are more than 2 systems, or the workloads are significantly different, analysis of experimental design techniques should be used

•Conduct n experiments on each of the two systems, with a one-to-one correspondence between the ith test on the 2 systems, so that observations are paired

•Calculate difference in performance

•Construct a confidence interval for the difference

If the interval includes zero, the systems are not significantly different



48

Paired Observations Example

Six similar workloads used on 2 systems

{(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}

Performance difference {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}

•Sample size = 6

•Mean = -0.32

•Sample variance = 81.62

•Standard deviation of the mean difference = sqrt(81.62/6) = 3.69

•100(1-α) = 90, α = 0.1, 1-α/2 = 0.95

•t[0.95,5] = 2.015

•Confidence Interval is

$$-0.32 \mp 2.015 \times 3.69 = (-7.75,\ 7.11)$$

•Interval includes zero, so the two systems are not significantly different
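The paired comparison can be sketched in Python using the six workload pairs and the table value t[0.95,5] = 2.015 from this example. The function name `paired_interval` is an assumption made for illustration.

```python
import math
import statistics

# (system 1, system 2) measurements on six similar workloads.
pairs = [(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)]
T_VALUE = 2.015  # t[0.95; 5], read from a t table as in the example


def paired_interval(pairs, t_value):
    """Confidence interval for the mean of the paired differences."""
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    mean = statistics.mean(diffs)
    # Standard deviation of the mean difference: sqrt(sample variance / n).
    std_err = math.sqrt(statistics.variance(diffs) / n)
    return mean - t_value * std_err, mean + t_value * std_err


low, high = paired_interval(pairs, T_VALUE)
print(f"90% confidence interval for the difference: ({low:.2f}, {high:.2f})")
print("systems significantly different:", not (low <= 0.0 <= high))
```

Pairing matters: the interval is built from the six differences, not from the two sets of six measurements treated as independent samples.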



49

Unpaired Observations

•Two samples of different sizes

•No correspondence between ith observations in two samples

•Make an estimate of the variance and effective number of degrees of freedom (classical t-test)

Compute sample means

Compute sample standard deviations

Compute mean difference

Compute standard deviation of mean difference

Compute effective number of degrees of freedom

Compute confidence interval for mean difference



50

What Confidence Interval to Use?

•Depends on the loss if wrong

•Sometimes 50% confidence is enough, or 99% confidence is not enough



51

Regression Models

•A regression model predicts a random variable as a function of other variables.

Response variable: the estimated variable

Predictor variables, predictors, factors: variables used to predict the response

All must be quantitative variables to do calculations

•What is a good model?

Minimize the distance between predicted value and observed values



52

Linear Regression Model 1/2



53

Linear Regression Model 2/2

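Error for the ith observation:

$$e_i = y_i - \hat{y}_i$$

Sum of squared errors:

$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

subject to the errors summing to zero:

$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$$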



54

Estimation of Model Parameters 1/2



55

Estimation of Model Parameters 2/2

[Figure: CPU time in msec (y-axis) versus number of disk I/Os (x-axis), two plotted series]
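As a sketch of parameter estimation, the standard least-squares estimates b1 = (Σxy - n·x̄·ȳ)/(Σx² - n·x̄²) and b0 = ȳ - b1·x̄ can be applied to the seven (disk I/O, CPU time) observations tabulated in the multiple regression example of these notes. The closed-form expressions are the textbook ones and are assumed here, since the slide's own formulas are not reproduced; the function names are illustrative.

```python
disk_ios = [14, 16, 27, 42, 39, 50, 83]
cpu_time = [2, 5, 7, 9, 10, 13, 20]


def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar * x_bar
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y, b0, b1):
    """Fraction of the variation of y explained by the regression."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return (sst - sse) / sst


b0, b1 = fit_line(disk_ios, cpu_time)
print(f"CPU time = {b0:.4f} + {b1:.4f} * (disk I/Os)")
print(f"R^2 = {r_squared(disk_ios, cpu_time, b0, b1):.2f}")
```

The resulting coefficient of determination agrees with the statement later in these notes that disk I/O alone accounts for about 97% of the variation.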



56

Allocation of Variation 1/2

•The purpose of a model is to predict the response with minimum variability.

No parameters: Use mean of response variable

One parameter linear regression model: How good?

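$$\text{Error} = e_i = y_i - \bar{y}$$

$$\text{Variance of errors without regression} = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \text{variance of } y$$

$$\text{SSE without regression} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$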



57

Allocation of Variation 2/2



58

Confidence Intervals for Regression Parameters

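•b0 and b1 are estimates from a single sample of size n

•b0 and b1 are “statistics”

•The computed values of b0 and b1 are taken as the mean values

•Obtain standard deviations for b0 and b1

$$s_{b_0} = s_e \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum x^2 - n\bar{x}^2} \right]^{1/2}$$

$$s_{b_1} = \frac{s_e}{\left[ \sum x^2 - n\bar{x}^2 \right]^{1/2}}$$

•Confidence intervals are

$$b_0 \mp t\, s_{b_0}, \qquad b_1 \mp t\, s_{b_1}, \qquad t = t_{[1-\alpha/2;\, n-2]}$$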



59

CI for Regression Parameters Example
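A sketch of such a calculation, assuming the seven (disk I/O, CPU time) observations from the multiple regression example in these notes, a 90% confidence level, and the table value t[0.95;5] = 2.015. The standard deviation of errors is taken as s_e = sqrt(SSE/(n-2)), which is the standard definition for simple linear regression and is an assumption here; the function name is illustrative.

```python
import math

disk_ios = [14, 16, 27, 42, 39, 50, 83]
cpu_time = [2, 5, 7, 9, 10, 13, 20]
T_VALUE = 2.015  # t[0.95; n - 2 = 5], read from a t table


def regression_intervals(x, y, t_value):
    """Confidence intervals for b0 and b1 of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                      # std. deviation of errors
    s_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)    # std. deviation of b0
    s_b1 = s_e / math.sqrt(sxx)                         # std. deviation of b1
    return {
        "b0": (b0 - t_value * s_b0, b0 + t_value * s_b0),
        "b1": (b1 - t_value * s_b1, b1 + t_value * s_b1),
    }


for name, (low, high) in regression_intervals(disk_ios, cpu_time, T_VALUE).items():
    verdict = "includes zero" if low <= 0.0 <= high else "excludes zero"
    print(f"{name}: ({low:.4f}, {high:.4f}) {verdict}")
```

For this data the interval for b0 includes zero while the interval for b1 does not, so the intercept is not significantly different from zero but the slope is.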



60

Verifying Regression Assumptions 1/2

•Assumption: Linear relationship between x and y

If you plot your two variables, you might find many possible results:

1. No visible relationship

2. Linear trend line

3. Two or more lines (piecewise)

4. Outlier(s)

5. Non-linear trend curve

Only the second case can use a linear model

The third case could use multiple linear models



61

Verifying Regression Assumptions 2/2

•Assumption: Independent Errors

Make a scatter plot of errors versus the predicted response value

If you see a trend, you have a problem: the errors depend on the predictor value

•Assumption: The predictor variable x is nonstochastic and it is measured without any error.



62

Limits to Simple Linear Regression

•Only one predictor is used

•The predictor variable must be quantitative (not categorical)

•The relationship between response and predictor must be linear

•The errors must be normally distributed

•Visual tests:

Plot x vs. y: Anything odd?

Plot residuals: Anything odd?

Residual plot especially helpful if there is a meaningful order to the measurements



63

Other Regression Models

•Multiple Linear Regression

More than one predictor variable is used

•Categorical Predictors

Predictor variables not quantitative but represent categories

CPU type or disk type

•Curvilinear Regression

Nonlinear relationship between response and predictors

•Transformations

Errors are not normally distributed or variance is not homogeneous



64

Multiple Linear Regression 1/2

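In matrix notation, y = Xb + e:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$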



65

Multiple Linear Regression 2/2

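Parameter estimation:

$$b = (X^T X)^{-1} X^T y$$

Allocation of variation:

$$\mathrm{SST} = \mathrm{SSY} - \mathrm{SS0}$$

$$\mathrm{SSE} = y^T y - b^T X^T y$$

Coefficient of determination:

$$R^2 = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$$

Degrees of freedom of SSE: $n - k - 1$

Standard deviation of errors:

$$s_e = \sqrt{\frac{\mathrm{SSE}}{n - k - 1}}$$

Standard deviation of parameters:

$$s_{b_j} = s_e \sqrt{c_{jj}}$$

where $c_{jj}$ is the jth diagonal term of $C = (X^T X)^{-1}$

Confidence intervals computed using $t_{[1-\alpha/2;\, n-k-1]}$

Prediction: mean of $m$ future observations at predictor values $x_p$

Standard deviation of predictions:

$$s_{\hat{y}_p} = s_e \left( \frac{1}{m} + x_p^T (X^T X)^{-1} x_p \right)^{1/2}$$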



66

Multiple Linear Regression Example 1/3

Observations of disk I/Os, memory sizes, and CPU time for 7 programs

CPU Time Disk I/O Memory Size

2 14 70

5 16 75

7 27 144

9 42 190

10 39 210

13 50 235

20 83 400



67


•Find a linear equation to estimate CPU time

CPU time = b0+b1(Disk I/O)+b2(Memory Size)

•b = (-0.1614,0.1182,0.0265)T

•We can compute the error terms from this (difference between the regression formula and the actual observations)

•Compute R^2 = SSR/SST = 0.97 (regression explains 97% of variation of y)

•Coefficient of Multiple Correlation = R = 0.97^0.5 = 0.99

•Standard deviation of errors = Square Root(SSE/(n-3)) = 1.2
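The parameter vector can be reproduced by solving the normal equations (X^T X) b = X^T y for the seven observations in the table. The sketch below uses plain Python with a small Gaussian-elimination helper so that it has no dependencies; the helper names `solve` and `fit` are assumptions made here, and a linear-algebra library would normally be used instead.

```python
import math

disk_ios = [14, 16, 27, 42, 39, 50, 83]
memory = [70, 75, 144, 190, 210, 235, 400]
cpu_time = [2, 5, 7, 9, 10, 13, 20]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit(predictor_rows, y):
    """Least-squares parameters from the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(row) for row in predictor_rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


b = fit(zip(disk_ios, memory), cpu_time)
predicted = [b[0] + b[1] * d + b[2] * m for d, m in zip(disk_ios, memory)]
y_bar = sum(cpu_time) / len(cpu_time)
sse = sum((y - p) ** 2 for y, p in zip(cpu_time, predicted))
sst = sum((y - y_bar) ** 2 for y in cpu_time)
print("b =", [round(v, 4) for v in b])
print("R^2 =", round((sst - sse) / sst, 2))
print("s_e =", round(math.sqrt(sse / (len(cpu_time) - 3)), 2))
```

Forming X^T X explicitly is fine for a three-parameter example like this one; for larger or badly conditioned problems a QR-based least-squares routine is the more robust choice.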



68


•The 90% confidence intervals for the parameters are (-2.11, 1.79), (-0.29, 0.53), and (-0.06, 0.11), respectively.

What does this mean?

•What is the predicted CPU time for 100 disk I/Os and a memory size of 550?

y=-0.1614+0.1182*(100)+0.0265*(550)=26.2375

Standard deviation of predicted observation = 3.3435

90% confidence interval using a t-value of 2.132 = (19.1096,33.3363)

Standard deviation for the mean of a large number of observations = 3.1385

90% confidence interval = (19.5467,32.9292)



69

Analysis of Variance (ANOVA) 1/5



70




71


•Assuming errors are independent and normally distributed. All are identically distributed (same mean and variance)

•y’s are normally distributed since x’s are nonstochastic (measured without errors)

•Sum of squares of normal variates has a chi-square distribution

•Given two sums of squares SSi and SSj with vi and vj degrees of freedom. The ratio (SSi / vi) / (SSj / vj) has an F distribution

•The hypothesis that SSi is less than or equal to SSj is rejected at significance level α if the ratio is greater than the 1-α quantile of the F-variate

•Compare ratio with F[1-α; vi, vj]



72


•MSR = SSR/k = mean square of the regression

Any sum of squares divided by its degrees of freedom gives the corresponding mean square

•MSE = SSE/(n-k-1)

•MSR/MSE has an F[k,n-k-1] distribution

F distribution with k numerator degrees of freedom and n-k-1 denominator degrees of freedom

•If MSR/MSE is greater than value read from F-table, predictor variables are assumed to explain a significant fraction of the response variation (reject null hypothesis)

•This procedure is known as the F-test



73


•In simple linear regression

•One predictor variable

•F-test reduces to testing b1=0

•If confidence interval does not include zero, parameter is nonzero

•Regression explains a significant part of the response variation

•F-test is not required



74

ANOVA Example 1/2

ANOVA calculations for disk-memory-CPU example

Computations performed column by column from the left



75

ANOVA Example 2/2

ANOVA calculations for disk-memory-CPU example

•Regression passes F-test

Computed ratio > value from F-Table

None of the regression parameters are significantly different than zero?

•We have very high confidence that the regression model has predictive ability



76

Problem of Multicollinearity

•Dilemma: None of our parameters are significant, yet the model is!!

•The problem is that the two predictors (memory and disk I/O) are correlated (R = .9947).

•Next we test if the two parameters each give significant regression on their own.

We already did this for the Disk I/O regression model, and found that it alone accounted for about 97% of the variance. We get the same result for memory size.

•Conclusion: Each predictor alone gives as much predictive power as the two together!

•Moral: Adding more predictors is not necessarily better in terms of predictive ability (aside from cost considerations).



77

Curvilinear Regression 1/3

•Verify the validity of the linear regression model

Do a scatter plot of response vs. predictor to see if it is linear

•Often you can convert to a linear model and use the standard linear regression

Take the log when the curve looks like y = bx^a

ln y = ln b + a ln x

Other forms that can be linearized:

y = a + b/x

y = x/(a+bx)



78


•Example: Amdahl’s law says I/O rate is proportional to the processor speed

I/O rate = α (CPU rate)^b1

Taking logs we get

log (I/O rate) = log α + b1 log (MIPS rate)

b0 = log α

Using standard linear regression, we find that the regression explains 84% of the variation.

No MIPS I/O Rate

1 19.63 288.6

2 5.45 117.30

3 2.63 64.60

4 8.24 356.40

5 14.00 373.20

6 9.87 281.10

7 11.27 149.60

8 10.13 120.60

9 1.01 31.10

10 1.26 23.70



79


Parameter   Mean    Standard Deviation   Confidence Interval
b0          1.423   0.119                (1.20, 1.64)
b1          0.888   0.135                (0.64, 1.14)
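The transformed fit can be sketched as follows, using the ten (MIPS, I/O rate) observations from the table. Base-10 logarithms are an assumption of this sketch; the notes only say "log", but base 10 is the choice that reproduces the tabulated b0. The function name `fit_line` is illustrative.

```python
import math

mips = [19.63, 5.45, 2.63, 8.24, 14.00, 9.87, 11.27, 10.13, 1.01, 1.26]
io_rate = [288.6, 117.30, 64.60, 356.40, 373.20, 281.10, 149.60, 120.60, 31.10, 23.70]

# Transform both variables so that the power law becomes a straight line.
x = [math.log10(v) for v in mips]
y = [math.log10(v) for v in io_rate]


def fit_line(x, y):
    """Least-squares (b0, b1, R^2) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, (b1 * sxy) / syy


b0, b1, r2 = fit_line(x, y)
print(f"log(I/O rate) = {b0:.3f} + {b1:.3f} * log(MIPS rate)")
print(f"R^2 = {r2:.2f}")
print(f"I/O rate is roughly {10 ** b0:.1f} * (MIPS rate)^{b1:.2f}")
```

Since the confidence interval for b1 in the table includes 1, the data are consistent with the I/O rate being directly proportional to the MIPS rate.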



80

Common Mistakes with Regression

•Verify that the relationship is linear. Look at a scatter diagram

•Don't worry about absolute value of parameters. They are totally dependent on an arbitrary decision regarding what dimensions to use

CPU time (sec) = 0.01 (disk I/Os)+0.001(Memory in KB)

•Always specify confidence intervals for parameters and coefficient of determination

•Test for correlation between predictor variables, and eliminate redundancy. Test to see what subset of the possible predictors is “best“ depending on cost vs. performance

•Don't make predictions too far out of the measured range



81

Experimental Design

“The goal of a proper experimental design is to obtain the maximum information with the minimum number of experiments“

•Determine the effects of various factors on performance

•Does a factor have significant effect?

•An experimental design consists of specifying

The number of experiments

The factor/level combinations of each experiment

The number of replications of each experiment



82

Potential Pitfalls in Experimentation

•All measured values are random variables. Must account for experimental error

•Control for important parameters

Example: User experience in a user interface study

•Must be able to allocate variation to the different factors

•There is a limit to the number of experiments you can perform

Some designs give more information per experiment.

•One-factor-at-a-time studies do not capture factor interactions.



83

Types of Experimental Design 1/3

•Simple designs

One factor is varied at a time

Inefficient

Does not capture interactions

Not recommended

Given k factors, with the ith factor having n_i levels, need

$$n = 1 + \sum_{i=1}^{k} (n_i - 1)$$

experiments



84


•Full factorial design

Perform experiment (s) for every combination of factors at every level

Many experiments required

Captures interactions

If too expensive, can reduce number of levels (ex: 2^k design, k factors used at 2 levels), number of factors, or take another approach

Given k factors, with the ith factor having n_i levels, need

$$n = \prod_{i=1}^{k} n_i$$

experiments
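The two experiment counts can be compared with a short sketch. The factor names and levels below are hypothetical, chosen only to illustrate the formulas; the function names are likewise illustrative.

```python
from itertools import product
from math import prod

# Hypothetical factors and levels, for illustration only.
factors = {
    "cpu": ["A", "B", "C"],
    "memory_mb": [512, 1024],
    "workload": ["editing", "compile", "query", "batch"],
}


def simple_design_count(levels):
    """One baseline run plus (n_i - 1) extra runs per factor."""
    return 1 + sum(n - 1 for n in levels)


def full_factorial_count(levels):
    """One run for every combination of levels."""
    return prod(levels)


def full_factorial_runs(factors):
    """Enumerate every factor/level combination as a dict."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*(factors[n] for n in names))]


levels = [len(v) for v in factors.values()]
print("simple design:  ", simple_design_count(levels), "experiments")
print("full factorial: ", full_factorial_count(levels), "experiments")
print("first run:", full_factorial_runs(factors)[0])
```

The simple design grows with the sum of the level counts and the full factorial with their product, which is why reducing every factor to two levels is a common first step.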



85


•Fractional Factorial Design

Only run experiments for a carefully selected subset of the full factorial combinations



86

2^k Factorial Designs

•A 2^k experimental design is used to determine the effect of k factors, each of which has two alternatives or levels.

Used to prune out the less important factors for further study

Pick the highest and lowest levels for each factor

Assumes that a factor's effect is unidirectional

Performance either continuously decreases or continuously increases as the factor is increased from minimum to maximum



87

2^2 Factorial Designs 1/2



88

2^2 Factorial Designs 2/2



89

Allocation of Variation



90

General 2^k Factorial Designs 1/2

k main effects

$\binom{k}{2} = \frac{k!}{2!\,(k-2)!}$ two-factor interactions

$\binom{k}{3} = \frac{k!}{3!\,(k-3)!}$ three-factor interactions



91

General 2^k Factorial Designs 2/2



92

2^k r Factorial Designs

•Experiments are observations of random variables

•With only one observation, can't estimate error

•If we repeat each experiment r times, we get 2^k r observations.

•With a 2 factor model, we now have:

y = q_0 + q_A x_A + q_B x_B + q_AB x_A x_B + e

e is the experimental error

Use sign table, and put in y column the sample means of r measurements at the given factor levels



93

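2^k r Factorial Designs Example

$$e_{ij} = y_{ij} - \hat{y}_i$$

$$\mathrm{SSE} = \sum_{i=1}^{2^2} \sum_{j=1}^{r} e_{ij}^2$$

$$\mathrm{SST} = 2^2 r q_A^2 + 2^2 r q_B^2 + 2^2 r q_{AB}^2 + \sum_{i,j} e_{ij}^2$$

$$\mathrm{SST} = \mathrm{SSA} + \mathrm{SSB} + \mathrm{SSAB} + \mathrm{SSE}$$

Factor A explains 79%, B explains 15.4%, AB explains 4%, and error explains about 1.5%.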



94

Confidence Intervals for Effects

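Variance of errors computed from SSE:

$$s_e^2 = \frac{\mathrm{SSE}}{2^2 (r-1)}$$

Degrees of freedom = $2^k (r-1)$, since the error terms should add up to zero

$$s_{q_0} = s_{q_A} = s_{q_B} = s_{q_{AB}} = \frac{s_e}{\sqrt{2^2 r}}$$

Confidence intervals for effects:

$$q_i \mp t_{[1-\alpha/2;\, 2^2(r-1)]}\, s_{q_i}$$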

Confidence intervals for previous example

q0 (39.08, 42.91)

qA (19.58, 23.41)

qB (7.58,11.41)

qAB (3.08,6.91)



95

2^(k-p) Fractional Factorial Designs



96

Selecting The Experiments

•The sum of the squares of each column is 2^(k-p)

q_A = (-y_1 + y_2 - y_3 + y_4 - y_5 + y_6 - y_7 + y_8)/8



97

2^(k-p) Fractional Factorial Designs Example



98

Assumptions

•Big assumption: These experiments only provide so much information

•The effects due to interaction are summed into the values calculated for the separate variables

• The experiments are masking these “confounding" interactions

The effects whose influence cannot be separated are said to be confounded

• If the interaction effects are small, it is OK



99

One-Factor Experiments 1/2



100

One-Factor Experiments 2/2

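r observations for each of the a alternatives

Total of ar observations arranged in an r × a matrix

Observations belonging to the jth alternative form a column vector

$$\text{Grand mean: } \bar{y}_{..} = \frac{1}{ar} \sum_{i=1}^{r} \sum_{j=1}^{a} y_{ij}$$

$$\text{Column mean: } \bar{y}_{.j} = \frac{1}{r} \sum_{i=1}^{r} y_{ij}$$

$$\text{Effect of the jth alternative: } \alpha_j = \bar{y}_{.j} - \bar{y}_{..}$$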



101

One-Factor Experiments Example

Effects of 3 processors, observed 5 times each

•First alternative requires 13.3 bytes less than an average processor

•Third alternative requires 37.7 bytes more than an average processor



102

Errors and Variation

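$$\mathrm{SSY} = \sum_{i,j} y_{ij}^2 = 144^2 + 120^2 + \cdots = 633{,}639$$

$$\mathrm{SS0} = ar\mu^2 = 3 \times 5 \times (187.7)^2 = 528{,}281.7$$

$$\mathrm{SSA} = r \sum_{j} \alpha_j^2 = 10{,}992.1$$

$$\mathrm{SST} = \mathrm{SSY} - \mathrm{SS0} = 105{,}357.3$$

$$\mathrm{SSE} = \mathrm{SST} - \mathrm{SSA} = 94{,}365.2$$

$$\text{Percentage of variation explained} = \frac{10{,}992.13}{105{,}357.3} = 10.4\%$$

89.6% of variation is due to experimental errors

For each observation, the error is the difference between the observation and the sum of the mean and alternative effect.

Mean is 187.7, R's effect is -13.3, so the first error term is 144 - (187.7 - 13.3) = -30.4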



103

ANOVA



104

Two-Factor Full Factorial Design 1/2

•Two parameters carefully controlled and varied to study effect on performance

Compare several processors using several workloads

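$$y_{ij} = \mu + \alpha_j + \beta_i + e_{ij}$$

$y_{ij}$ = observation with factor A at level j and factor B at level i

$\mu$ = mean response

$\alpha_j$ = effect of factor A at level j

$\beta_i$ = effect of factor B at level i

$e_{ij}$ = error term

$$\sum_j \alpha_j = 0, \qquad \sum_i \beta_i = 0$$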



105

Two-Factor Full Factorial Design 2/2

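Factors A and B have a and b levels and require ab experiments

Total of ab observations arranged in a b × a matrix

Columns correspond to levels of A and rows correspond to levels of B

$$\text{Grand mean: } \mu = \bar{y}_{..} = \frac{1}{ab} \sum_{i} \sum_{j} y_{ij}$$

$$\alpha_j = \bar{y}_{.j} - \bar{y}_{..}, \qquad \beta_i = \bar{y}_{i.} - \bar{y}_{..}$$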



106

Two-Factor Full Factorial Example

Effects of 3 different cache choices, 5 different workloads

For each row (or column), compute mean of observations in that row (or column)

Difference between a row (or column) mean and overall mean gives row (or column) effect



107

Estimating Experimental Errors

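$$\text{Estimated response: } \hat{y}_{ij} = \mu + \alpha_j + \beta_i$$

$$e_{ij} = y_{ij} - \hat{y}_{ij}$$

Variance of errors is computed from SSE:

$$\mathrm{SSE} = \sum_{i=1}^{b} \sum_{j=1}^{a} e_{ij}^2 = (3.5)^2 + (0.2)^2 + \cdots + (2.4)^2 = 236.8$$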



108

Allocation of Variance

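$$\sum_{i,j} y_{ij}^2 = ab\mu^2 + b\sum_j \alpha_j^2 + a\sum_i \beta_i^2 + \sum_{i,j} e_{ij}^2$$

$$\mathrm{SSY} = \mathrm{SS0} + \mathrm{SSA} + \mathrm{SSB} + \mathrm{SSE}$$

$$\mathrm{SSY} = 91{,}595$$

$$\mathrm{SS0} = 3 \times 5 \times (72.2)^2 = 78{,}192.59$$

$$\mathrm{SSA} = 12{,}857.20$$

$$\mathrm{SSB} = 308.40$$

$$\mathrm{SST} = \mathrm{SSY} - \mathrm{SS0} = 13{,}402.41$$

$$\mathrm{SSE} = \mathrm{SST} - \mathrm{SSA} - \mathrm{SSB} = 236.80$$

$$\text{Percentage of variation explained by the caches} = \frac{12{,}857.20}{13{,}402.41} = 95.9\%$$

Percentage of variation due to workloads = 2.3%

Unexplained variation (due to errors) = 1.8%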
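The arithmetic of this allocation can be checked in a few lines, starting from the four sums of squares quoted in the example (SSY, SS0, SSA, SSB). The variable names are chosen here for illustration.

```python
# Sums of squares quoted in the cache/workload example.
ssy = 91_595.0      # sum of squared observations
ss0 = 78_192.59     # a * b * (grand mean)^2
ssa = 12_857.20     # variation due to caches
ssb = 308.40        # variation due to workloads

sst = ssy - ss0           # total variation
sse = sst - ssa - ssb     # unexplained variation (errors)

print(f"SST = {sst:,.2f}")
print(f"SSE = {sse:,.2f}")
for label, ss in (("caches", ssa), ("workloads", ssb), ("errors", sse)):
    print(f"{label:>9}: {100 * ss / sst:.1f}% of variation")
```

The three percentages add up to 100% by construction, since SST is partitioned exactly into SSA, SSB, and SSE.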



109

ANOVA



110

Precautions



111

Quantile-Quantile Plots

•In order to determine distribution of data

Plot observed quantiles versus theoretical quantiles

•Assume y_i is the observed q_i-th quantile

•Using the theoretical distribution, the q_i-th quantile x_i is computed and a point is plotted at (x_i, y_i)

•To determine x_i, need to invert the CDF: q_i = F(x_i), so x_i = F^-1(q_i)

For N(0,1), use the approximation x_i = 4.91[q_i^0.14 - (1-q_i)^0.14]

For N(μ,σ), scale to μ + σ x_i before plotting



112

Quantile-Quantile Plots Example

•Modeling errors for eight predictions of a model were found to be

{-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09}

[Figure: Q-Q plot of the eight modeling errors. x-axis: Normal Quantile; y-axis: Residual Quantile]
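The points of such a plot can be computed with a few lines of code. This sketch uses the N(0,1) approximation from the previous slide and the eight modeling errors listed above. The plotting position $q_i = (i - 0.5)/n$ is an assumption (a common convention that the slides do not state), and the function names are illustrative.

```python
def normal_quantile(q):
    """Approximate inverse CDF of N(0,1): 4.91[q^0.14 - (1-q)^0.14]."""
    return 4.91 * (q ** 0.14 - (1.0 - q) ** 0.14)

def qq_points(observed):
    """Pair each sorted observation y_i with the theoretical quantile x_i."""
    ys = sorted(observed)
    n = len(ys)
    points = []
    for i, y in enumerate(ys, start=1):
        q = (i - 0.5) / n  # assumed plotting position, not given on the slides
        points.append((normal_quantile(q), y))
    return points

errors = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09]
points = qq_points(errors)
for x, y in points:
    print(f"{x:7.3f} {y:6.2f}")
```

If the points fall close to a straight line, the normal distribution is a plausible model for the errors.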



113

Q-Q Plot For Two-Factor Example

[Figure: Q-Q plot of the residuals for the two-factor example. x-axis: Normal quantile; y-axis: Residual quantile]



114

Introduction to Simulation

•If the system is not available, a simulation model provides an easy way to predict its performance or to compare several alternatives

•If system is available, a simulation model may be preferred over measurements (Why?)

•Simulation models might fail!

Produce no useful results

Produce misleading results



115

Common Mistakes in Simulation

•Inappropriate level of detail

•Improper implementation language

•Unverified models

•Invalid models (may not represent the real system correctly)

•Improperly handled initial conditions

•Too short simulations

•Poor random-number generators

•Improper selection of seeds



116

Terminology 1/3

•State variables

variables whose values define the state of the system

e.g., length of job queue in a CPU scheduling simulation

•Event

a change in the system state

•Continuous-time and discrete-time models

Continuous-time: system state is defined at all times (e.g., CPU scheduling)

Discrete-time: system state is defined only at particular instants of time



117

Terminology 2/3

•Continuous-state and discrete-state models

Nature of state variables

Time spent by students studying a certain subject can take an infinite number of values → continuous-state model

Queue length in a CPU scheduling simulation can only assume integer values → discrete-state model

A discrete-state model is also called a discrete-event model

Continuity of time does not imply continuity of state

•Deterministic and probabilistic models

Output of a model can be predicted with certainty → deterministic

Different results on repetition for the same set of input parameters → probabilistic



118

Terminology 3/3

•Static and dynamic models

System state changes with time?

•Open and closed models

Input is external to the model and independent of it → open

No external input → closed

•Stable and unstable models

Steady state reached that is independent of time → stable



119

Programming Language Selection

•Simulation language

•General-purpose language

•Extension of a general-purpose language

Extensions as a collection of routines to handle tasks commonly required in simulations

•Simulation package

Drawback: inflexibility



120

Types of Simulation 1/3

•Emulation

A simulation using hardware or firmware

e.g., Terminal emulator, processor emulator.

•Monte Carlo Simulation

Static simulation or one without a time axis

Model probabilistic phenomena that do not change characteristics with time

Evaluate non-probabilistic expressions using probabilistic models
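As an illustration of evaluating a non-probabilistic expression with a probabilistic model, the sketch below estimates a definite integral by averaging the integrand at uniformly random points. The integrand, the interval, the sample count, and the seed are all choices made for this example, not values taken from the slides.

```python
import math
import random

def mc_integral(f, lo, hi, samples, seed=1):
    """Estimate the integral of f over [lo, hi] as (hi - lo) * average of f(U)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(f(rng.uniform(lo, hi)) for _ in range(samples))
    return (hi - lo) * total / samples

# Integral of exp(-x^2) from 0 to 2; there is no time axis in this model
estimate = mc_integral(lambda x: math.exp(-x * x), 0.0, 2.0, 100_000)
exact = math.sqrt(math.pi) / 2 * math.erf(2.0)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```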



121


Types of Simulation 2/3

•Trace-driven simulation

Use a trace as input

A trace is a time-ordered record of events on a real system

e.g., A trace of page reference patterns of key programs can be input to compare different memory management schemes

Advantages: (among others) credibility, easy validation, and accurate workload

Disadvantages: (among others) complexity and being a single point of validation
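The page-reference example can be sketched as a small trace-driven simulation that feeds one trace to two memory management schemes. The trace below is a made-up reference string rather than a recording from a real system, and the choice of FIFO and LRU as the schemes to compare is an assumption for illustration.

```python
from collections import OrderedDict, deque

def fifo_faults(trace, frames):
    """Count page faults when the oldest loaded page is evicted first."""
    resident, order, faults = set(), deque(), 0
    for page in trace:
        if page not in resident:
            faults += 1
            if len(resident) == frames:
                resident.discard(order.popleft())
            resident.add(page)
            order.append(page)
    return faults

def lru_faults(trace, frames):
    """Count page faults when the least recently used page is evicted."""
    resident, faults = OrderedDict(), 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) == frames:
                resident.popitem(last=False)  # drop least recently used
            resident[page] = True
    return faults

# Hypothetical time-ordered page reference trace
trace = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
fifo = fifo_faults(trace, 3)
lru = lru_faults(trace, 3)
print(f"FIFO faults: {fifo}, LRU faults: {lru}")
```

Because both schemes see exactly the same trace, the comparison is not affected by workload randomness, which is one reason trace-driven results are considered credible.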



122


Types of Simulation 3/3

•Discrete-event simulation

Components

Event scheduler

Simulation clock and a time-advancing mechanism (unit-time versus event-driven approaches)

System state variables

Event routines

Input routines, report routines, initialization routines, and trace routines

Dynamic memory management
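A minimal sketch of these components for a single-server FIFO queue follows. It uses the event-driven time advance (the clock jumps to the next scheduled event) and a heap as the event scheduler. The arrival times, the constant service time, and all names are assumptions made for the example.

```python
import heapq

def simulate(arrivals, service_time):
    """Event-driven simulation of a single-server FIFO queue."""
    events = [(t, "arrival") for t in arrivals]  # event scheduler (min-heap on time)
    heapq.heapify(events)
    clock = 0.0      # simulation clock
    queue_len = 0    # state variable: jobs waiting
    busy = False     # state variable: server status
    completed = 0
    log = []         # trace routine output: (time, event, queue length)
    while events:
        clock, kind = heapq.heappop(events)      # advance clock to next event
        if kind == "arrival":                    # arrival event routine
            if busy:
                queue_len += 1
            else:
                busy = True
                heapq.heappush(events, (clock + service_time, "departure"))
        else:                                    # departure event routine
            completed += 1
            if queue_len > 0:
                queue_len -= 1
                heapq.heappush(events, (clock + service_time, "departure"))
            else:
                busy = False
        log.append((clock, kind, queue_len))
    return clock, completed, log

end_time, completed, log = simulate([0.0, 1.0, 2.5, 7.0], service_time=2.0)
for entry in log:
    print(entry)
```

A unit-time approach would instead increment the clock by a fixed step and check for due events at every step, which is simpler but wastes work when events are sparse.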



123

Analysis of Simulation Results

•Model Verification Techniques

•Model Validation Techniques

•Transient Removal

•Stopping Criteria



124

Model Verification Techniques

•Top-down modular design

•Anti-bugging

•Structured walk-through

•Deterministic models

•Run simplified cases

•Trace: time-ordered list of events and associated variables

•On-Line graphic displays

•Continuity test

•Degeneracy tests

•Consistency tests

•Seed independence



125

Model Validation Techniques

•Validate three key aspects of the model

Assumptions

Input parameter values and distributions

Output values and conclusions

•Each of the three aspects may be subjected to a validity test by comparing it with results obtained from three possible sources

Expert intuition

Real system measurements

Theoretical results



126

Transient Removal 1/5

•Steady-state performance is of interest

•What constitutes the transient state?

•Methods for transient removal are heuristic

Long runs

Proper initialization

Truncation

Initial data deletion

Moving average of independent replications

Batch means



127


Transient Removal 2/5

•Truncation

Variability during steady state less than that during transient state

Measure variability in terms of range (min and max of observations)

Observations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9

Given a sample of n observations

Ignore first l observations

Calculate min and max of n-l observations

Repeat for l = 1, 2, … until the (l+1)th observation is neither the min nor the max of the remaining observations
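The truncation steps can be sketched as below, applied to the observations listed on this slide. The function name is illustrative, and starting the search at l = 1 is an assumption about how the loop begins.

```python
def truncation_length(obs):
    """Return the smallest l such that obs[l] (the (l+1)th observation)
    is neither the min nor the max of the remaining n - l observations."""
    n = len(obs)
    for l in range(1, n - 1):
        rest = obs[l:]  # observations left after ignoring the first l
        if min(rest) < obs[l] < max(rest):
            return l
    return None  # no steady state detected by this heuristic

observations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                10, 9, 10, 11, 10, 9, 10, 11, 10, 9]
transient_length = truncation_length(observations)
print(transient_length)
```

For this sample the remaining range settles at (9, 11) once the first nine observations are ignored, and the tenth observation (10) lies strictly inside it.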



128


Transient Removal 3/5

•Initial data deletion

Average does not change much as observations are deleted

Randomness in observations causes averages to change slightly even during steady state

Average across replications (a replication is a complete run with the same input parameter values but a different seed)

Get a mean trajectory across replications

Get overall mean

Delete the first l observations (init l to 1), and get overall mean for n-l values

Compute relative change in overall mean

Determine the knee where the relative change graph stabilizes
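The steps above can be sketched as follows. The two replications are hypothetical data, and the relative change is taken here as (mean after deleting l observations - overall mean) / overall mean, which is an assumed definition since the slide does not spell it out.

```python
def mean_trajectory(replications):
    """Average the j-th observation across all replications."""
    m = len(replications)
    return [sum(column) / m for column in zip(*replications)]

def relative_changes(trajectory):
    """Relative change in the overall mean after deleting the first l values."""
    n = len(trajectory)
    overall = sum(trajectory) / n
    changes = []
    for l in range(1, n):
        trimmed_mean = sum(trajectory[l:]) / (n - l)
        changes.append((trimmed_mean - overall) / overall)
    return changes

# Hypothetical replications: same parameters, different seeds
replications = [[1.0, 2.0, 3.0, 4.0, 4.0, 4.0],
                [3.0, 4.0, 5.0, 6.0, 6.0, 6.0]]
trajectory = mean_trajectory(replications)
changes = relative_changes(trajectory)
print(trajectory)
print(changes)
```

In this toy data the relative change stops growing at l = 3, so the knee suggests a transient of three observations.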



129


Transient Removal 4/5

•Moving average of independent replications

Similar to initial data deletion but computes mean over a moving time interval window instead of the overall mean

Get a mean trajectory across replications

Plot a trajectory of moving average of successive 2k+1 values (init k to 1)

Repeat with k = 2, 3, … until the plot is sufficiently smooth

Find the knee of the plot
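A sketch of the moving-average step, under the assumption that the window is centered on each point, so the first and last k points of the trajectory have no moving average. The mean trajectory shown is hypothetical.

```python
def moving_average(trajectory, k):
    """Centered moving average over windows of 2k + 1 successive values."""
    n = len(trajectory)
    window = 2 * k + 1
    return [sum(trajectory[j - k:j + k + 1]) / window for j in range(k, n - k)]

# Hypothetical mean trajectory across replications
trajectory = [2.0, 3.0, 4.0, 5.0, 5.0, 5.0]
smoothed = moving_average(trajectory, k=1)
print(smoothed)
```

Increasing k widens the window and smooths the plot further, at the cost of losing k points at each end.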



130


Transient Removal 5/5

•Batch means

Run a very long simulation and later divide into several parts of equal duration (batch)

Study the variance of batch means as a function of batch size

A long run of N observations divided into m batches of size n each (start with n=2 for example)

For each batch, compute a batch mean

Compute overall mean

Compute variance of batch means

Increase n and repeat

Plot variance as a function of n
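The loop over batch sizes can be sketched as below. The observations are hypothetical, leftover observations that do not fill a whole batch are simply dropped (an assumption), and the sample variance with divisor m - 1 is used.

```python
from statistics import mean, variance

def variance_of_batch_means(observations, n):
    """Split the run into batches of size n and return the variance of their means."""
    m = len(observations) // n  # number of complete batches
    batch_means = [mean(observations[k * n:(k + 1) * n]) for k in range(m)]
    return variance(batch_means)

# Hypothetical long run (shortened here to keep the example readable)
observations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
curve = {n: variance_of_batch_means(observations, n) for n in (2, 4)}
print(curve)
```

The resulting dictionary maps each batch size n to the variance of the batch means, which is the curve the slide asks to plot.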



131

Stopping Criteria: Variance Estimation 1/3

•Simulation could be run until confidence interval for the mean response narrows to a desired width

•Can obtain variance of sample mean = variance of observations/n

Valid if observations are independent

Observations in most simulations are not independent

•To correctly compute the variance of the mean of correlated observations

Independent replications

Batch means



132


Stopping Criteria: Variance Estimation 2/3

•Independent replications

Means of independent replications are independent though observations in a single replication are correlated

Conduct $m$ replications of size $n + n_0$ ($n_0$ is the length of the transient phase)

Compute a mean for each replication

Compute an overall mean for all replications

Calculate the variance of replicate means

Calculate confidence interval for mean response

Will discard $m n_0$ initial observations
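The procedure above can be sketched as follows. The replications are hypothetical, and the interval uses a normal quantile from Python's `statistics.NormalDist`, which is an assumption made here; with only a few replications a t quantile would give a wider and arguably more appropriate interval.

```python
from statistics import NormalDist, mean, variance

def replication_ci(replications, n0, confidence=0.90):
    """Confidence interval for the mean response from m independent replications."""
    # Discard the first n0 (transient) observations of every replication
    rep_means = [mean(run[n0:]) for run in replications]
    m = len(rep_means)
    overall = mean(rep_means)
    var_of_means = variance(rep_means)           # variance of replicate means
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (var_of_means / m) ** 0.5
    return overall, half_width

# Hypothetical runs: 2 transient observations, then 2 steady-state observations
replications = [[0.0, 0.0, 4.0, 6.0],
                [0.0, 0.0, 5.0, 7.0],
                [0.0, 0.0, 6.0, 8.0]]
overall, half_width = replication_ci(replications, n0=2)
print(f"{overall:.2f} +/- {half_width:.3f}")
```

A stopping rule could add replications until `half_width` falls below the desired value.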



133


Stopping Criteria: Variance Estimation 3/3

•Batch means

Run a long simulation run, discard initial transient interval, and divide remaining observations into several batches

Compute means for each batch

Compute overall mean

Calculate variance of batch means

Calculate confidence interval for mean response

Will discard only $n_0$ observations (less waste)

Calculate covariance of successive batch means to find correct batch size n

Covariance should be small compared to the variance
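A sketch of the batch-means computation, including the covariance check on successive batch means. The data are hypothetical, incomplete trailing batches are dropped, and the covariance divisor m - 2 is an assumed convention; at least three batches are required.

```python
from statistics import mean, variance

def batch_means_analysis(observations, n0, n):
    """Discard n0 transient observations, then analyze batches of size n."""
    steady = observations[n0:]
    m = len(steady) // n  # number of complete batches (needs m >= 3)
    means = [mean(steady[k * n:(k + 1) * n]) for k in range(m)]
    overall = mean(means)
    var = variance(means)
    # Covariance of successive batch means (lag 1)
    cov = sum((means[k] - overall) * (means[k + 1] - overall)
              for k in range(m - 1)) / (m - 2)
    return means, overall, var, cov

# Hypothetical run: 2 transient observations followed by 8 steady-state ones
observations = [0.0, 0.0, 1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
means, overall, var, cov = batch_means_analysis(observations, n0=2, n=2)
print(means, overall, var, cov)
```

In this toy run the covariance is comparable in size to the variance, so by the slide's criterion a batch size of 2 would be too small and n should be increased.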
