Benchmarking and Performance Evaluations
Todd Mytkowicz, Microsoft Research
Transcript
Page 1: Benchmarking and Performance Evaluations

Benchmarking and Performance Evaluations

Todd Mytkowicz, Microsoft Research

Page 2: Benchmarking and Performance Evaluations

Let’s poll for an upcoming election

I ask 3 of my co-workers who they are voting for.

Page 3: Benchmarking and Performance Evaluations

Let’s poll for an upcoming election

I ask 3 of my co-workers who they are voting for.

• My approach does not deal with:
  – Variability
  – Bias

Page 4: Benchmarking and Performance Evaluations

Issues with my approach

Variability

source: http://www.pollster.com

My approach is not reproducible

Page 5: Benchmarking and Performance Evaluations

Issues with my approach (II)

Bias

source: http://www.pollster.com

My approach is not generalizable

Page 6: Benchmarking and Performance Evaluations

Take Home Message

• Variability and Bias are two different things
  – The difference between reproducible and generalizable!

Page 7: Benchmarking and Performance Evaluations

Take Home Message

• Variability and Bias are two different things
  – The difference between reproducible and generalizable!

Do we have to worry about Variability and Bias when we benchmark?

Page 8: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

Page 9: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

Page 10: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

• Statistics to the rescue:
  – mean
  – confidence interval
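For concreteness, a minimal R sketch of computing the mean and a 95% confidence interval by hand, assuming the timings live in a hypothetical file timings.txt with one number per line:

x <- scan("timings.txt")                        # one timing per line (hypothetical file)
n <- length(x)
m <- mean(x)                                    # sample mean
se <- sd(x) / sqrt(n)                           # standard error of the mean
ci <- m + qt(c(0.025, 0.975), df = n - 1) * se  # 95% confidence interval
c(mean = m, lower = ci[1], upper = ci[2])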

Page 11: Benchmarking and Performance Evaluations

Intuition for T-Test

• 1-6 is uniformly likely (p = 1/6)
• Throw die 10 times: calculate mean

Page 12: Benchmarking and Performance Evaluations

Intuition for T-Test

• 1-6 is uniformly likely (p = 1/6)
• Throw die 10 times: calculate mean

Trial | Mean of 10 throws
------+------------------
  1   |  4.0
  2   |  4.3
  3   |  4.9
  4   |  3.8
  5   |  4.3
  6   |  2.9
  …   |  …

Page 13: Benchmarking and Performance Evaluations

Intuition for T-Test

• 1-6 is uniformly likely (p = 1/6)
• Throw die 10 times: calculate mean

Trial | Mean of 10 throws
------+------------------
  1   |  4.0
  2   |  4.3
  3   |  4.9
  4   |  3.8
  5   |  4.3
  6   |  2.9
  …   |  …
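As a quick illustration of this sampling behavior, a minimal R simulation sketch (not the slide's actual data) that repeats the 10-throw experiment many times:

set.seed(1)                                              # reproducible example
means <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))
head(means)                                              # trial means bounce around 3.5
mean(means); sd(means)                                   # center and spread of the sampling distribution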

Page 14: Benchmarking and Performance Evaluations

Back to our Benchmark: Managing Variability

Page 15: Benchmarking and Performance Evaluations

Back to our Benchmark: Managing Variability

> x = scan('file')
Read 20 items
> t.test(x)

One Sample t-test

data:  x
t = 49.277, df = 19, p-value < 2.2e-16
95 percent confidence interval:
 1.146525 1.248241
sample estimates:
mean of x
 1.197383
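To compare two configurations head-to-head, R's t.test also accepts two samples; a sketch, assuming the -O2 and -O3 timings live in hypothetical files o2.txt and o3.txt:

> o2 = scan('o2.txt')   # timings under gcc -O2 (hypothetical file)
> o3 = scan('o3.txt')   # timings under gcc -O3 (hypothetical file)
> t.test(o2, o3)        # two-sample t-test: do the mean runtimes differ?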

Page 16: Benchmarking and Performance Evaluations

So we can handle Variability. What about Bias?

Page 17: Benchmarking and Performance Evaluations

System              = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 18: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

System              = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 19: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

Todd: speedup = 0.84 ± 0.0002

Conclusion: O3 is bad

System              = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 20: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

Todd: speedup = 0.84 ± 0.0002

Conclusion: O3 is bad

System              = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Why does this happen?

Evaluating compiler optimizations

Page 21: Benchmarking and Performance Evaluations

Madan: HOME=/home/madan
Todd:  HOME=/home/toddmytkowicz

[Figure: side-by-side process memory layouts (text, stack, env) for the two users; the different HOME values change the size of the environment, which shifts the stack.]

Differences in our experimental setup

Page 22: Benchmarking and Performance Evaluations

Runtime of SPEC CPU 2006 perlbench depends on who runs it!
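One way to probe this effect yourself is to rerun the benchmark while padding the environment and watch the runtime move. A rough R sketch, where ./perlbench and the PADDING variable are hypothetical placeholders for whatever binary you are measuring:

# Sketch: vary the environment size and time each run (command is a placeholder).
for (pad in c(0, 1024, 2048, 4096)) {
  filler <- paste(rep("x", pad), collapse = "")
  elapsed <- system.time(
    system2("./perlbench", env = paste0("PADDING=", filler))
  )["elapsed"]
  cat("env padding:", pad, "bytes, elapsed:", elapsed, "s\n")
}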

Page 23: Benchmarking and Performance Evaluations

32 randomly generated linking orders

Bias from linking orders

[Figure: measured speedup for each of the 32 link orders (y-axis: speedup)]

Page 24: Benchmarking and Performance Evaluations

32 randomly generated linking orders

Order of .o files can lead to contradictory conclusions

Bias from linking orders

[Figure: measured speedup for each of the 32 link orders (y-axis: speedup)]
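Generating randomized link orders is cheap to script; a small R sketch with hypothetical object file names:

# Sketch: print link commands for a few random .o orders (file names hypothetical).
objs <- c("main.o", "util.o", "parse.o", "eval.o")
set.seed(42)
for (i in 1:5) {
  cat("gcc -O3", paste(sample(objs), collapse = " "), "-o bench\n")
}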

Page 25: Benchmarking and Performance Evaluations

Where exactly does Bias come from?

Page 26: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: code compiled with -O2 laid out across memory pages N and N+1]

Page 27: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the -O2 layout across pages N and N+1, with dead code marked]

Page 28: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the -O2 layout across pages N and N+1, with the code affected by O3 marked]

Page 29: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the -O2 layout across pages N and N+1, with the hot code marked]

Page 30: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: -O2 vs. -O3 layouts across pages N and N+1]

O3 better than O2

Page 31: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: two different placements of the -O2 and -O3 layouts across pages N and N+1]

O3 better than O2
O2 better than O3

Page 32: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the same layouts, now across cachelines N and N+1]

O3 better than O2
O2 better than O3

Page 33: Benchmarking and Performance Evaluations

Other Sources of Bias

• JIT
• Garbage Collection
• CPU Affinity
• Domain specific (e.g. size of input data)

• How do we manage these?

Page 34: Benchmarking and Performance Evaluations

Other Sources of Bias

How do we manage these?
– JIT:
  • ngen to remove impact of JIT
  • “warmup” phase to JIT code before measurement
– Garbage Collection:
  • Try different heap sizes (JVM)
  • “warmup” phase to build data structures
  • Ensure program is not “leaking” memory
– CPU Affinity:
  • Try to bind threads to CPUs (SetProcessAffinityMask)
– Domain Specific:
  • Up to you!
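The “warmup” idea generalizes beyond JITs; a minimal R sketch of a warmup-then-measure harness, with workload() as a hypothetical stand-in for the code under test:

workload <- function() sum(sort(runif(1e6)))   # hypothetical benchmark body

for (i in 1:5) workload()                      # warmup runs: results discarded
times <- sapply(1:20, function(i) system.time(workload())["elapsed"])
t.test(times)                                  # mean elapsed time with a 95% CI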

Page 35: Benchmarking and Performance Evaluations

R for the T-Test

• Where to download:
  – http://cran.r-project.org

• Simple intro to get data into R
• Simple intro to do t.test
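As a stand-in for the session shown on the following slides, a minimal sketch of both steps, assuming a hypothetical file timings.txt with one measurement per line:

> x = scan('timings.txt')   # get data into R (hypothetical file)
> summary(x)                # sanity-check the measurements
> t.test(x)                 # one-sample t-test: mean and 95% CI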

Page 36: Benchmarking and Performance Evaluations
Page 37: Benchmarking and Performance Evaluations
Page 38: Benchmarking and Performance Evaluations
Page 39: Benchmarking and Performance Evaluations
Page 40: Benchmarking and Performance Evaluations

Some Conclusions

• Performance Evaluations are hard!
  – Variability and Bias are not easy to deal with

• Other experimental sciences go to great effort to work around variability and bias
  – We should too!