Page 1:

How to do Experiments: Empirical Methods for AI & CS

Paul Cohen (cohen@cs.umass.edu), Ian P. Gent (ipg@dcs.st-and.ac.uk), Toby Walsh (tw@4c.ucc.ie)

Page 2:

Empirical Methods for CS

Can you do Empirical AI?

Page 3:

Can you do empirical AI?

See if you can spot a pattern in the following real empirical data:

(326 dp (T 1 0))
(327 dp (T 1 0))
(328 dp (T 1 0))
(329 dp (T 1 0))
(330 dp (T 1 0))
(331 dp (T 2 1))
(332 dp (T 1 0))
(333 dp (T 1 0))
(334 dp (T 3 2))
(335 dp (T 350163776 62))

This led to an Artificial Intelligence journal paper: Gent & Walsh, "Easy Problems are Sometimes Hard", 1994.

Page 4:

Experiments are Harder than you think!

That pattern was pretty easy to spot, but… to see the pattern, you have to not kill the experiment in the middle of its run, assuming that the pipe to the output had got lost! That's what I did, but fortunately the effect occurred again. That instance took a week to run.

Page 5:

Overview of Tutorial

Can you do Empirical AI? Yes!

Experiments are Harder than you think!

What are empirical methods?

Experiment design

Some Problem Issues

Data analysis & Hypothesis Testing

Summary

Page 6:

Supplementary Material

How not to do it

Case Study: Gregory, Gao, Rosenberg & Cohen (Eight Basic Lessons)

The t-test

Randomization

Page 7:

Our objectives

Outline some of the basic issues: exploration, experimental design, data analysis, ...

Encourage you to consider some of the pitfalls: we have fallen into all of them!

Raise standards: encouraging debate, identifying "best practice"

Learn from your & our experiences: experimenters get better as they get older!

Page 8:

Empirical Methods for CS

Experiments are Harder than you think!

Page 9:

Experiments are Harder than you think!

Flawed problems: A case study from Constraint Satisfaction

40+ experimental papers over 5 years on the nature of hard random CSPs. Authors include … (in alphabetical order!)

Fahiem Bacchus, Christian Bessiere, Rina Dechter, Gene Freuder, Ian Gent, Pedro Meseguer, Patrick Prosser, Barbara Smith, Edward Tsang, Toby Walsh, and many more

Achlioptas et al. spotted a flaw: asymptotically, almost all problems are trivial. This brings into doubt many experimental results: some experiments at typical sizes were affected, but fortunately not many.

Page 10:

Flawed random constraints?

e.g. "Model B", domain size d, parameters p1 and p2:

Pick exactly p1·C constraints (if there are C possible)

For each one, pick exactly p2·d² pairs of values as disallowed (e.g. d=3, p2 = 4/9)

Constraints C1 & C2 below: C2 is flawed, since it makes X=2 impossible

For any p2 ≥ 1/d and p1 > 0, as n → ∞ there will always be one variable with all its values removed

asymptotically, all problems are trivial!

[Tables: constraints C1 and C2 as 3×3 grids over the values of X and Y, with X marking disallowed pairs (p2 = 4/9, so four per constraint); in C2, every pair involving X=2 is disallowed]

Page 11:

Flawless random problems

[Gent et al.] fix the flaw: introduce "flawless" Model B

choose d squares which must always be allowed, all in different rows & columns

choose p2·d² X's to disallow in the other squares

For this model, I proved that these problems are not flawed asymptotically, for any p2 < ½

so we think that we understand how to generate random problems

[Table: a flawless constraint C1 as a 3×3 grid, with O marking the always-allowed squares (one per row and column) and X marking disallowed pairs]

Page 12:

But it wasn’t that simple…

Originally we had two different definitions of "flawless" problems

An undergraduate student showed they were inequivalent! And this was after the paper about it was on the web (the journal paper reference that follows is correct)


Page 13:

Experiments are harder than you think!

This tripped up all constraints researchers who thought about it

It concerned the most fundamental part of the experiments, i.e. generating the input data; a closely analogous flaw has turned up in SAT and in QBF

The flaw was not found by constraints researchers: a fruitful (in the end!) interaction between theory and experiment, with the experimental method justified theoretically

Even the fix was wrong at first

Most experiments still use "flawed" models (which is ok if you know what you're doing: if you make a positive decision with a good reason)

Page 14:

Further reading

D. Achlioptas, L.M. Kirousis, E. Kranakis, D. Krizanc, M. Molloy, and Y. Stamatiou. Random Constraint Satisfaction: A More Accurate Picture. Constraints, 6(4), 2001, pp. 329-344.

I.P. Gent, E. MacIntyre, P. Prosser, B.M. Smith and T. Walsh. Random Constraint Satisfaction: flaws and structures. Constraints, 6(4), 2001, pp. 345-372.

Coincidence of title and publication details not at all coincidental

Page 15:

Empirical Methods for CS

What are Empirical Methods?

Page 16:

What does “empirical” mean?

Relying on observations, data, experiments

Empirical work should complement theoretical work: theories often have holes (e.g., how big is the constant term? Is the current problem a "bad" one?)

Theories are suggested by observations; theories are tested by observations; conversely, theories direct our empirical attention

In addition (in this tutorial at least), empirical means "wanting to understand the behavior of complex systems"

Page 17:

Why We Need Empirical Methods: Cohen's 1990 survey of 150 AAAI papers

Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.

Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.

Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.

Theory papers generally had no applications or empirical work to support them, empirical papers were demonstrations, not experiments, and had no underlying theoretical support.

The essential synergy between theory and empirical work was missing

Page 18:

Theory, not Theorems

Theory-based science need not be all theorems, otherwise science would be mathematics

Consider the theory of QED: based on a model of the behaviour of particles, its predictions are accurate to 10 decimal places (the distance from LA to NY to within 1 human hair); the most accurate theory in the whole of science?

Its success derives from the accuracy of its predictions, not the depth or difficulty or beauty of its theorems

QED is an empirical theory!

Page 19:

Empirical CS/AI

Computer programs are formal objects, so let's reason about them entirely formally?

Two reasons why we can't or won't:

theorems are hard

some questions are empirical in nature, e.g. are Horn clauses adequate to represent the sort of knowledge met in practice? e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?

Page 20:

Empirical CS/AI

Treat computer programs as natural objects, like fundamental particles, chemicals, living organisms

Build (approximate) theories about them:

construct hypotheses, e.g. greedy hill-climbing is important to GSAT

test with empirical experiments, e.g. compare GSAT with other types of hill-climbing

refine hypotheses and modelling assumptions, e.g. greediness not important, but hill-climbing is!

Page 21:

Empirical CS/AI

Many advantages over other sciences:

Cost: no need for expensive super-colliders

Control: unlike the real world, we often have complete command of the experiment

Reproducibility: in theory, computers are entirely deterministic

Ethics: no ethics panels needed before you run experiments

Page 22:

Types of hypothesis

"My search program is better than yours": not very helpful, a beauty competition?

"Search cost grows exponentially with the number of variables for this kind of problem": better, as we can extrapolate to data not yet seen?

"Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems": even better, as we can extrapolate to new situations?

Page 23:

A typical conference conversation

"What are you up to these days?"
"I'm running an experiment to compare the MAC-CBJ algorithm with Forward Checking."
"Why?"
"I want to know which is faster."
"Why?"
"Lots of people use each of these algorithms."
"How will these people use your result?"

...

Page 24:

Keep in mind the BIG picture

"What are you up to these days?"
"I'm running an experiment to compare the MAC-CBJ algorithm with Forward Checking."
"Why?"
"I have this hypothesis that neither will dominate."
"What use is this?"
"A portfolio containing both algorithms will be more robust than either algorithm on its own."

Page 25:

Keep in mind the BIG picture

...
"Why are you doing this?"
"Because many real problems are intractable in theory but need to be solved in practice."
"How does your experiment help?"
"It helps us understand the difference between average and worst case results."
"So why is this interesting?"
"Intractability is one of the BIG open questions in CS!"

Page 26:

Why is empirical CS/AI in vogue?

Inadequacies of theoretical analysis:

problems often aren't as hard in practice as theory predicts in the worst case

average-case analysis is very hard (and often based on questionable assumptions)

Some "spectacular" successes:

phase transition behaviour

local search methods

theory lagging behind algorithm design

Page 27:

Why is empirical CS/AI in vogue?

Compute power ever increasing: even "intractable" problems coming into range; easy to perform large (and sometimes meaningful) experiments

Empirical CS/AI perceived to be "easier" than theoretical CS/AI: often a false perception, as experiments are easier to mess up than proofs

experiments are harder than you think!

Page 28:

Empirical Methods for CS

Experiment design

Page 29:

Experimental Life Cycle

Exploration

Hypothesis construction

Experiment

Data analysis

Drawing of conclusions

Page 30:

Checklist for experiment design*

Consider the experimental procedure: making it explicit helps to identify spurious effects and sampling biases

Consider a sample data table: identifies what results need to be collected; clarifies dependent and independent variables; shows whether the data pertain to the hypothesis

Consider an example of the data analysis: helps you to avoid collecting too little or too much data; especially important when looking for interactions

*From Chapter 3, “Empirical Methods for Artificial Intelligence”, Paul Cohen, MIT Press

Page 31:

Guidelines for experiment design

Consider possible results and their interpretation: they may show that the experiment cannot support/refute the hypotheses under test; unforeseen outcomes may suggest new hypotheses

What was the question again? It's easy to get carried away designing an experiment and lose the BIG picture

Run a pilot experiment to calibrate parameters (e.g., number of processors in the Rosenberg experiment)

Page 32:

Types of experiment

Manipulation experiment

Observation experiment

Factorial experiment

Page 33:

Manipulation experiment

Independent variable, x: x = identity of parser, size of dictionary, …

Dependent variable, y: y = accuracy, speed, …

Hypothesis: x influences y

Manipulation experiment: change x, record y (see the sketch below)
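A minimal Python sketch of this loop (the parser and its accuracy numbers are hypothetical stand-ins, not from the tutorial):

    def run_parser(dictionary_size):
        # Hypothetical stand-in for the system under test; a real experiment
        # would parse a test corpus and return measured accuracy.
        return 0.5 + 0.04 * len(str(dictionary_size))

    # Manipulation experiment: change x (dictionary size), record y (accuracy).
    results = {x: run_parser(x) for x in [1000, 10000, 100000]}
    print(results)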

Page 34:

Observation experiment

Predictor, x: x = volatility of stock prices, …

Response variable, y: y = fund performance, …

Hypothesis: x influences y

Observation experiment: classify according to x, compute y

Page 35:

Factorial experiment

Several independent variables, xi:

there may be no simple causal links

the data may come that way, e.g. individuals will have different sexes, ages, ...

Factorial experiment: every possible combination of the xi is considered; expensive, as its name suggests!

Page 36:

Designing factorial experiments

In general, stick to 2 to 3 independent variables

Solve the same set of problems in each case: this reduces variance due to differences between problem sets

If this is not possible, use the same sample sizes: this simplifies statistical analysis

As usual, the default hypothesis is that no influence exists: it is much easier to fail to demonstrate an influence than to demonstrate one (a sketch of a factorial design follows below)
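A rough illustration of why factorial designs get expensive (the variable names and values are ours, echoing the KOSO case study later in the tutorial):

    from itertools import product

    algorithms = ["KOSO", "KOSO*"]      # independent variable 1
    processors = [3, 6, 10, 20]         # independent variable 2
    problems = list(range(25))          # same problem set in every condition

    # Factorial design: every combination of the independent variables.
    conditions = list(product(algorithms, processors))
    print(len(conditions), "conditions,", len(conditions) * len(problems), "runs")

The run count grows multiplicatively with each added factor, which is why the slide advises keeping to 2 or 3 independent variables.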

Page 37:

Empirical Methods for CS

Some Problem Issues

Page 38:

Some problem issues

Control

Ceiling and Floor effects

Sampling Biases

Page 39:

Control

A control is an experiment in which the hypothesised variation does not occur, so the hypothesized effect should not occur either

BUT remember: placebos cure a large percentage of patients!

Page 40:

Control: a cautionary tale

Macaque monkeys given a vaccine based on human T-cells infected with SIV (a relative of HIV): the macaques gained immunity from SIV

Later, macaques were given uninfected human T-cells, and the macaques still gained immunity!

The control experiment was not originally done, and is not always obvious (you can't control for all variables)

Page 41:

Ceiling and Floor Effects

Well designed experiments (with good controls) can still go wrong

What if all our algorithms do particularly well? Or they all do badly?

We've got little evidence to choose between them

Page 42:

Ceiling and Floor Effects

Ceiling effects arise when test problems are insufficiently challenging; floor effects are the opposite, when problems are too challenging

A problem in AI because we often repeatedly use the same benchmark sets: most benchmarks will lose their challenge eventually? But how do we detect this effect?

Page 43:

Machine learning example

14 datasets from the UCI corpus of benchmarks, used as a mainstay of the ML community

The problem is learning classification rules: each item is a vector of features and a classification

The measure is the classification accuracy of the method (max 100%)

Compare C4 with 1R*, two competing algorithms

Rob Holte, Machine Learning, vol. 11, pp. 63-91, 1993; www.site.uottawa.edu/~holte/Publications/simple_rules.ps

Page 44:

Floor effects: machine learning example

DataSet:  BC    CH    GL    G2    HD    HE    …    Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ...  83.8

Is 1R* above the floor of performance? How would we tell?

Page 45:

Floor effects: machine learning example

DataSet:   BC    CH    GL    G2    HD    HE    …    Mean
C4:        72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*:       72.5  69.2  56.4  77    78    85.1  ...  83.8
Baseline:  70.3  52.2  35.5  53.4  54.5  79.4  …    59.9

The "baseline rule" puts all items in the more popular category. 1R* is above baseline on most datasets (a sketch of the rule follows below).

A bit like the prime number joke? 1 is prime. 3 is prime. 5 is prime. So, the baseline rule is that all odd numbers are prime.
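A minimal sketch of the baseline rule in Python (our own illustration; the example labels are made up):

    from collections import Counter

    def baseline_accuracy(labels):
        # Accuracy of the rule that puts every item in the most popular category.
        most_common_count = Counter(labels).most_common(1)[0][1]
        return most_common_count / len(labels)

    print(baseline_accuracy(["no", "no", "no", "yes"]))  # 0.75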

Page 46:

Ceiling Effects: machine learning

DataSet:  BC    GL    HY    LY    MU     …    Mean
C4:       72    63.2  99.1  77.5  100.0  ...  85.9
1R*:      72.5  56.4  97.2  70.7  98.4   ...  83.8

How do we know that C4 and 1R* are not near the ceiling of performance?

Do the datasets have enough attributes to make perfect classification possible? Obviously for MU, but what about the rest?

Page 47:

Ceiling Effects: machine learning

DataSet:         BC    GL    HY    LY    MU     …    Mean
C4:              72    63.2  99.1  77.5  100.0  ...  85.9
1R*:             72.5  56.4  97.2  70.7  98.4   ...  83.8
max(C4,1R*):     72.5  63.2  99.1  77.5  100.0  …    87.4
max([Buntine]):  72.8  60.4  99.1  66.0  98.6   …    82.0

C4 achieves only about 2% better than 1R*; the best of C4/1R* achieves 87.4% accuracy

We have only weak evidence that C4 is better: both methods appear to be performing near the ceiling of what is possible, so comparison is hard!

Page 48:

Ceiling Effects: machine learning

In fact 1R* only uses one feature (the best one), while C4 uses on average 6.6 features; the extra 5.6 features buy only about 2% improvement

Conclusion? Either real-world learning problems are easy (use 1R*), or we need more challenging datasets

We need to be aware of ceiling effects in results

Page 49:

Sampling bias

Data collection is biased against certain data, e.g. a teacher who says "Girls don't answer maths questions"

Observation might suggest that girls don't answer many questions, when in fact the teacher doesn't ask them many questions

Experienced AI researchers don't do that, right?

Page 50:

Sampling bias: Phoenix case study

AI system to fight (simulated) forest fires

Experiments suggest that wind speed is uncorrelated with the time to put out the fire: obviously incorrect, as high winds spread forest fires

Page 51:

Sampling bias: Phoenix case study

Wind speed vs containment time (max 150 hours):

3:  120 55 79 10 140 26 15 110 12 54 10 103
6:  78 61 58 81 71 57 21 32 70
9:  62 48 21 55 101

What's the problem?

Page 52:

Sampling bias: Phoenix case study

The cut-off of 150 hours introduces sampling bias: many high-wind fires get cut off, not many low-wind ones

On the remaining data, there is no positive correlation between wind speed and containment time (r = -0.53)

In fact, the data shows that a lot of high-wind fires take > 150 hours to contain, and those that don't are similar to low-wind fires

You wouldn't do this, right? You might if you had automated data analysis (see the sketch below).
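A hedged sketch of how a cut-off can hide a real correlation. The linear model, slope, and noise level below are invented for illustration; this is not the Phoenix data:

    import random
    from statistics import correlation  # Python 3.10+

    random.seed(1)
    wind = [random.uniform(0, 10) for _ in range(500)]
    contain = [20 * w + random.gauss(0, 40) for w in wind]  # time grows with wind

    # Censor the data at a 150-hour cut-off, as in the Phoenix experiments.
    kept = [(w, t) for w, t in zip(wind, contain) if t <= 150]
    print(correlation(wind, contain))   # clearly positive on the full data
    print(correlation(*zip(*kept)))     # much weaker on the censored data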

Page 53:

Empirical Methods for CS

Data analysis & Hypothesis Testing

Page 54:

Kinds of data analysis

Exploratory (EDA): looking for patterns in data

Statistical inferences from sample data: testing hypotheses, estimating parameters

Building mathematical models of datasets

Machine learning, data mining…

We will introduce hypothesis testing and computer-intensive methods

Page 55:

The logic of hypothesis testing

Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long run behavior?) and what is your residual uncertainty?

You say, "If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn't fair."

Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (≥ 8 heads) has low probability p, reject the assertion, with residual uncertainty related to p.

Estimate p with a sampling distribution.
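For this particular example the tail probability can also be computed exactly from the binomial distribution; a quick Python check (ours, not the slides'):

    from math import comb

    # Pr(r heads in 10 tosses of a fair coin) = C(10, r) / 2^10
    p = sum(comb(10, r) for r in range(8, 11)) / 2**10
    print(p)  # 0.0546875: eight or more heads is unlikely, but not impossible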

Page 56:

Common tests

Tests that means are equal

Tests that samples are uncorrelated or independent

Tests that slopes of lines are equal

Tests that predictors in rules have predictive power

Tests that frequency distributions (how often events happen) are equal

Tests that classification variables, such as smoking history and heart disease history, are unrelated

...

All follow the same basic logic

Page 57:

Probability of a sample result under a null hypothesis

If the coin were fair (the null hypothesis), what is the probability distribution of r, the number of heads obtained in N tosses of a fair coin? Get it analytically or estimate it by simulation (on a computer):

Loop K times
    r := 0                            ;; r is num. heads in N tosses
    Loop N times                      ;; simulate the tosses
        Generate a random 0 ≤ x ≤ 1.0
        If x < p increment r          ;; p is the probability of a head
    Push r onto sampling_distribution
Print sampling_distribution
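The same simulation as runnable Python (a minimal sketch; the constants K, N and p are our choices):

    import random

    K, N, p = 10_000, 10, 0.5
    sampling_distribution = []
    for _ in range(K):
        r = sum(1 for _ in range(N) if random.random() < p)  # heads in N tosses
        sampling_distribution.append(r)

    # Estimated probability of a result at least as extreme as 8 heads:
    print(sum(r >= 8 for r in sampling_distribution) / K)    # ~0.055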

Page 58:

The logic of hypothesis testing

Establish a null hypothesis: H0, the coin is fair

Establish a statistic: r, the number of heads in N tosses

Figure out the sampling distribution of r given H0

The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8

If this probability is very low, reject H0, the null hypothesis; the residual uncertainty is p

[Figure: the sampling distribution of r over the values 0, 1, 2, …, 10]

Page 59:

The only tricky part is getting the sampling distribution

Sampling distributions can be derived…

Exactly: e.g., binomial probabilities for coins are given by the formula

Pr(r heads in N tosses) = N! / (r!(N-r)!) p^r (1-p)^(N-r)

Analytically: e.g., the central limit theorem tells us that the sampling distribution of the mean approaches a Normal distribution as samples grow to infinity

Estimated by Monte Carlo simulation of the null hypothesis process

Page 60:

A common statistical test: The Z test for different means

A sample of N = 25 computer science students has mean IQ m = 135. Are they "smarter than average"?

The population mean is 100, with standard deviation 15

The null hypothesis, H0, is that the CS students are "average", i.e., the mean IQ of the population of CS students is 100

What is the probability p of drawing the sample if H0 were true? If p is small, then H0 is probably false

Find the sampling distribution of the mean of a sample of size 25, drawn from a population with mean 100

Page 61:

The sampling distribution for the CS student example

If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis), then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation 15/√25 = 3

The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution

Only ~5% of a normal distribution falls more than two standard deviations away from the mean

If the students were average, this would have a roughly zero chance of happening.

Page 62:

The Z test

[Figure: the sampling distribution of the mean (mean 100, std = 3) with the sample statistic 135 marked, and the standardised test-statistic scale (mean 0, std = 1.0) with Z = 11.67 marked]

Z = (x̄ - μ) / (σ/√N) = (135 - 100) / (15/√25) = 35/3 = 11.67
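A minimal Python version of this computation (the function name is ours; the erfc identity for the normal tail is standard):

    from math import sqrt, erfc

    def z_test(sample_mean, pop_mean, pop_std, n):
        z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
        p = 0.5 * erfc(z / sqrt(2))  # Pr(Z >= z) for a standard normal
        return z, p

    print(z_test(135, 100, 15, 25))  # Z = 11.67, p is essentially zero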

Page 63:

Reject the null hypothesis?

Commonly we reject H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.

A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of sample statistic given H0.

We find the probabilities by looking them up in tables, or statistics packages provide them.

For example, Pr(Z ≥ 1.645) = .05 and Pr(Z ≥ 1.96) = .025.

Pr(Z ≥ 11) is approximately zero, so reject H0.

Page 64:

Summary of hypothesis testing

H0 negates what you want to demonstrate; find the probability p of the sample statistic under H0 by comparing the test statistic to the sampling distribution; if the probability is low, reject H0, with residual uncertainty proportional to p.

Example: we want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p ≤ .022

Have we proved CS students are smarter? NO! We have only shown that mean = 135 is unlikely if they aren't. We never prove what we want to demonstrate; we only reject H0, with residual uncertainty.

And failing to reject H0 does not prove H0, either!

Page 65:

Computer-intensive Methods

Basic idea: construct sampling distributions by simulating on a computer the process of drawing samples.

Three main methods: Monte Carlo simulation, when one knows the population parameters; the Bootstrap, when one doesn't; and Randomization, which also assumes nothing about the population.

Enormous advantage: these methods work for any statistic and make no strong parametric assumptions (e.g., normality)

Page 66:

The Bootstrap

Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.

What if you don't? Use the sample as an estimate of the population. Draw samples from the sample! With or without replacement?

Example: the sampling distribution of the mean; check the results against the central limit theorem.

Page 67:

Bootstrapping the sampling distribution of the mean

S is a sample of size N:

Loop K = 1000 times
    Draw a pseudosample S* of size N from S by sampling with replacement
    Calculate the mean of S* and push it on a list L

L is the bootstrapped sampling distribution of the mean*

* This procedure works for any statistic, not just the mean.
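A minimal Python sketch of the procedure above (the function name, default K, and example sample are ours):

    import random

    def bootstrap_means(sample, K=1000):
        # Resample with replacement; collect the mean of each pseudosample.
        n = len(sample)
        return [sum(random.choices(sample, k=n)) / n for _ in range(K)]

    L = bootstrap_means([4.2, 5.1, 3.8, 6.0, 5.5])
    print(min(L), max(L))  # the spread of L estimates the sampling variability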

Page 68:

Randomization

Used to test hypotheses that involve association between elements of two or more groups; very general.

Not going to explain it here, but it's very nice

A practical example with some explanation in: Singer, J., Gent, I.P. and Smaill, A. (2000) "Backbone Fragility and the Local Search Cost Peak", JAIR, Volume 12, pages 235-270.

Page 69:

This is what I’d like more time for…

Bootstrapping and Randomisation are very neat and well worth learning about: e.g. see Cohen's book; they are easy in modern statistical packages, e.g. "R". Sorry I didn't have time to go into them.

Page 70:

Empirical Methods for CS

Summary

Page 71:

Summary

Empirical CS and AI are exacting sciences: there are many ways to do experiments wrong

We are experts in doing experiments badly

As you perform experiments, you'll make many mistakes: learn from those mistakes, and ours!

And remember: Experiments are Harder than you think!

Page 72:

Resources

Web: www.cs.york.ac.uk/~tw/empirical.html (this link is correct!)

Books: "Empirical Methods for AI", Paul Cohen, MIT Press, 1995

Journals: Journal of Experimental Algorithmics, www.jea.acm.org

Conferences: Workshop on Empirical Methods in AI (IJCAI 01, future?); Workshop on Algorithm Engineering and Experiments (ALENEX 03)

Page 73:

Empirical Methods for CS

Appendix: Supplementary Material

If I get to this bit the rest of the talk went faster than expected!

Page 74:

Supplementary Material

How not to do it

Case Study: Gregory, Gao, Rosenberg & Cohen (Eight Basic Lessons)

The t-test

Randomization

Page 75:

Empirical Methods for CS

How Not To Do It

Page 76:

Tales from the coal face

Those ignorant of history are doomed to repeat it: we have committed many howlers

We hope to help others avoid similar ones… and to illustrate how easy it is to screw up!

"How Not to Do It", I. Gent, S.A. Grant, E. MacIntyre, P. Prosser, P. Shaw, B.M. Smith, and T. Walsh, University of Leeds Research Report, May 1997

Every howler we report was committed by at least one of the above authors!

Page 77:

How Not to Do It

Do measure with many instruments: in exploring hard problems we used our best algorithms, and missed the very poor performance of less good algorithms; better algorithms will be bitten by the same effect on larger instances than we considered

Do measure CPU time: in exploratory code, CPU time is often misleading, but it can also be very informative, e.g. a heuristic needed more search but was faster

Page 78:

How Not to Do It

Do vary all relevant factors

Don't change two things at once: we ascribed the effects of the heuristic to the algorithm, having changed heuristic and algorithm at the same time, and didn't perform a factorial experiment

But it's not always easy/possible to do the "right" experiments if there are many factors

Page 79:

How Not to Do It

Do Collect All Data Possible… (within reason): one year Santa Claus had to repeat all our experiments (ECAI/AAAI/IJCAI deadlines are just after New Year!); we had collected the number of branches in the search tree, but performance scaled with backtracks, not branches, so all experiments had to be rerun

Don't Kill Your Machines: we have got into trouble with sysadmins… over experimental data we never used; often the vital experiment is small and quick

Page 80:

How Not to Do It

Do It All Again… (or at least be able to): e.g. by storing the random seeds used in experiments; we didn't do that and might have lost an important result (a sketch follows below)

Do Be Paranoid: "identical" implementations in C and Scheme gave different results

Do Use The Same Problems: reproducibility is a key to science (c.f. cold fusion), and it can reduce variance
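A minimal sketch of storing a random seed so a run can be repeated exactly (the choice of seed source is ours):

    import random
    import time

    seed = int(time.time())          # or any explicitly chosen value
    print("random seed:", seed)      # log it alongside the experimental data
    random.seed(seed)                # rerunning with this seed reproduces the run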

Page 81:

Choosing your test data

We've seen the possible problem of over-fitting (remember the machine learning benchmarks?)

Two common approaches: benchmark libraries and random problems

Both have potential pitfalls

Page 82:

Benchmark libraries

+ve: can be based on real problems; lots of structure

-ve: a library of fixed size, so it is possible to over-fit algorithms to the library; problems have fixed size, so you can't measure scaling

Page 83:

Random problems

+ve: problems can have any size, so you can measure scaling; you can generate any number of problems, so it is hard to over-fit?

-ve: may not be representative of real problems (they lack structure); easy to generate "flawed" problems (CSP, QSAT, …)

Page 84:

Prototyping your algorithm

Often we need to implement an algorithm: usually a novel algorithm, or a variant of an existing one, e.g. a new heuristic in an existing search algorithm

The novelty of the algorithm should imply extra care; more often, it encourages a lax implementation ("it's only a preliminary version")

Page 85:

How Not to Do It

Don't Trust Yourself: a bug in an innermost loop was found by chance; all experiments were re-run against an urgent deadline; curiously, sometimes the bugged version was better!

Do Preserve Your Code: or end up fixing the same error twice. Do use version control!

Page 86:

How Not to Do It

Do Make it Fast Enough: the emphasis is on enough; it's often not necessary to have optimal code, since in the lifecycle of an experiment the extra coding time is not won back; e.g. we have published many papers with inefficient code compared to the state of the art (though the first GSAT version was O(N²), and this really was too slow!)

Do Report Important Implementation Details: intermediate versions produced good results

Page 87:

How Not to Do It

Do Look at the Raw Data: summaries obscure important aspects of behaviour, and many statistical measures are explicitly designed to minimise the effect of outliers

Sometimes outliers are vital: "exceptionally hard problems" dominate the mean; we missed them until they hit us on the head, when experiments "crashed" overnight even though old data on smaller problems had shown clear behaviour

Page 88:

How Not to Do It

Do face up to the consequences of your results: e.g. preprocessing on 450 problems should "obviously" reduce search; it reduced search 448 times and increased search 2 times

Forget the algorithm, it's useless? Or study the two exceptional cases in detail, and achieve a new understanding of an important algorithm

Page 89:

Empirical Methods for CS

A Case Study: Eight Basic Lessons

Page 90:

Rosenberg study

"An Empirical Study of Dynamic Scheduling on Rings of Processors", Gregory, Gao, Rosenberg & Cohen, Proc. of the 8th IEEE Symp. on Parallel & Distributed Processing, 1996

Linked to from www.cs.york.ac.uk/~tw/empirical.html

Page 91:

Problem domain

Scheduling processors on a ring network: jobs spawn as binary trees

KOSO: keep one, send one to my left or right neighbour, chosen arbitrarily

KOSO*: keep one, send one to my least heavily loaded neighbour

Page 92:

Theory

On complete binary trees, KOSO is asymptotically optimal

Thm: using KOSO on a ring of p processors, a binary tree of height n is executed within (2^n - 1)/p + low-order terms

So KOSO* can't be any better?

But the assumptions are unrealistic: the tree is not complete, and asymptotic behaviour is not necessarily the same as behaviour in practice!

Page 93:

Benefits of an empirical study

More realistic trees: a probabilistic generator that makes shallow trees, which are "bushy" near the root but quickly get "scrawny"; similar to the trees generated when performing Trapezoid or Simpson's Rule calculations, where binary trees correspond to interval bisection

Startup costs: the network must be loaded

Page 94:

Lesson 1: Evaluation begins with claims. Lesson 2: Demonstration is good, understanding better

Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better. The "because" phrase indicates a hypothesis about why it works; this is a better hypothesis than a beauty-contest demonstration that KOSO* beats KOSO.

Experiment design: the independent variables are KOSO v KOSO*, no. of processors, no. of jobs, and probability(job will spawn); the dependent variable is the time to complete the jobs

Page 95:

Criticism 1: This experiment design includes no direct measure of the hypothesized effect

Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better

But the experiment design includes no direct measure of load balancing: the independent variables are KOSO v KOSO*, no. of processors, no. of jobs, and probability(job will spawn); the dependent variable is the time to complete the jobs

Page 96:

Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations

T-test on time to complete jobs: t = (2825-2935)/587 = -.19

KOSO* apparently no faster than KOSO (as theory predicted)

Why? Look more closely at the data:

Outliers create excessive variance, so test isn’t significant

[Figure: histograms of time to complete jobs (out to ~20000) under KOSO and KOSO*; a few outliers lie far out in the right tail of both]

Page 97:

Lesson 4: The task of empirical work is to explain variability

Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.

[Diagram: run-time as the sum of contributions from the algorithm (KOSO/KOSO*), the number of processors, the number of jobs, and "random noise" (e.g., outliers)]

Number of processors and number of jobs explain 74% of the variance in run time. Algorithm explains almost none.

Page 98:

Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations

Why does the KOSO/KOSO* choice account for so little of the variance in run time?

Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)

[Figure: distributions of queue length at processor i under KOSO and KOSO*; queues are short in almost all conditions, so processors rarely starve]

Page 99:

Lesson 5: Of sample variance, effect size, and sample size – control the first before touching the last

t = (x̄ - μ) / (s/√N): the magnitude of the effect, over the background variance, moderated by the sample size

This intimate relationship holds for all statistics

Page 100:

Lesson 5 illustrated: A variance reduction method

Let N = num-jobs, P = num-processors, T = run time

Then T = k(N/P), i.e. the run time is k multiples of the theoretical best time, and so k = T/(N/P)

[Figure: histograms of k(KOSO) and k(KOSO*)]

mean k(KOSO) = 1.61, mean k(KOSO*) = 1.4; t = 2.42, p ≤ .02

Page 101:

Where are we?

KOSO* is significantly better than KOSO when the dependent variable is recoded as k, multiples of the optimal run time

The difference between KOSO* and KOSO explains very little of the variance in either dependent variable

Exploratory data analysis tells us that the processors aren't starving, so we shouldn't be surprised

Prediction: the effect of algorithm on run time (or k) increases as the number of jobs decreases or the number of processors increases

This prediction is about interactions between factors

Page 102:

Lesson 6: Most interesting science is about interaction effects, not simple main effects

The data confirm the prediction: KOSO* is superior on larger rings, where starvation is an issue

This is an interaction of independent variables: the choice of algorithm and the number of processors

Interaction effects are essential to explaining how things work

[Figure: multiples of optimal run-time against number of processors (3, 6, 10, 20) for KOSO and KOSO*; KOSO* stays closer to optimal as the ring grows]

Page 103:

Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful?

KOSO* is significantly better than KOSO, but can you use the result? Suppose you wanted to use the knowledge of whether the ring is controlled by KOSO or KOSO* for some prediction.

Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
Pr(trial i under KOSO has k > 1.11) = .57
Pr(trial i under KOSO* has k > 1.11) = .43

Predict for trial i whether its k is above or below the median:
If it's a KOSO* trial you'll say no, making (.43 × 150) = 64.5 errors
If it's a KOSO trial you'll say yes, making ((1 - .57) × 160) = 68.8 errors
If you don't know, you'll make (.5 × 310) = 155 errors

155 - (64.5 + 68.8) ≈ 22: knowing the algorithm reduces the error rate from .5 to .43. Is this enough???

Page 104:

Lesson 8: Keep the big picture in mind

"Why are you studying this?"
"Load balancing is important to get good performance out of parallel computers."
"Why is this important?"
"Parallel computing promises to tackle many of our computational bottlenecks."
"How do we know this?"
"It's in the first paragraph of the paper!"

Page 105:

Case study: conclusions

Evaluation begins with claims

Demonstrations of simple main effects are good; understanding the effects is better

Exploratory data analysis means using your eyes to find explanatory patterns in data

The task of empirical work is to explain variability

Control variability before increasing sample size

Interaction effects are essential to explanations

Significant ≠ meaningful

Keep the big picture in mind

Page 106:

The t test

Same logic as the Z test, but appropriate when the population standard deviation is unknown, samples are small, etc.

The sampling distribution is t, not normal, but it approaches normal as sample size increases

The test statistic has a very similar form, but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal

Page 107:

The t test

Suppose N = 5 students have mean IQ = 135, std = 27

Estimate the standard deviation of the sampling distribution using the sample standard deviation:

t = (x̄ - μ) / (s/√N) = (135 - 100) / (27/√5) = 35/12.1 = 2.89

[Figure: the sampling distribution (mean 100, std = 12.1) with the sample statistic 135 marked, and the test-statistic scale (mean 0, std = 1.0) with t = 2.89 marked]
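The computation in Python (a sketch; for the p-value you would consult a t table or a statistics package with N - 1 = 4 degrees of freedom):

    from math import sqrt

    def one_sample_t(sample_mean, pop_mean, sample_std, n):
        # Like Z, but the sample standard deviation estimates the population's.
        return (sample_mean - pop_mean) / (sample_std / sqrt(n))

    print(one_sample_t(135, 100, 27, 5))  # ~2.89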

Page 108:

Randomization

Used to test hypotheses that involve association between elements of two or more groups; very general.

Example: Paul tosses H H H H, Carole tosses T T T T. Is the outcome independent of the tosser?

Example: 4 women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?

Basic procedure: calculate a statistic f for your sample; randomize one factor relative to the other and calculate your pseudostatistic f*; repeat, and compare f to the sampling distribution of f*.

Page 109:

Example of randomization

Four women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?

f = the difference of the means of the women's and men's scores: 29.25

Under the null hypothesis of no association between gender and score, the score 54 might equally well have been achieved by a male or a female.

Toss all scores in a hopper, draw out four at random and without replacement, call them female*, call the rest male*, and calculate f*, the difference of the means of female* and male*. Repeat to get a distribution of f*. This is an estimate of the sampling distribution of f under H0: no difference between male and female scores.
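A minimal Python sketch of this randomization test, using the scores from the slide (the number of shuffles is our choice):

    import random

    women = [54, 66, 64, 61]
    men = [23, 28, 27, 31, 51, 32]

    def mean(xs):
        return sum(xs) / len(xs)

    f = mean(women) - mean(men)  # observed statistic: 29.25

    scores = women + men
    f_star = []
    for _ in range(10_000):
        random.shuffle(scores)   # randomize the gender labels
        f_star.append(mean(scores[:4]) - mean(scores[4:]))

    # Estimated p-value: how often randomization matches or beats the observed f.
    print(f, sum(x >= f for x in f_star) / len(f_star))

Shuffling and splitting the pooled scores is exactly the "hopper" procedure: drawing four at random without replacement and calling them female*.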