CS510 Computer Architectures
Lecture 3: Benchmarks and Performance Metrics

Posted Jan 01, 2016 by Leonard Barber

Transcript
Page 1: Lecture 3 - Benchmarks and Performance Metrics

Page 2: Measurement Tools

• Benchmarks, Traces, Mixes
• Cost, Delay, Area, Power Estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws

Page 3: The Bottom Line: Performance (and Cost)

• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns ... (Performance)
  – Throughput, bandwidth

Plane              Speed      Time (DC-Paris)   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours         470          286,700
BAD/Sud Concorde   1350 mph   3.0 hours         132          178,200

Page 4: The Bottom Line: Performance (and Cost)

"X is n times faster than Y" means:

n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)

Page 5: Performance Terminology

"X is n% faster than Y" means:

n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)

ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

Page 6: Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

ExTime(Y) / ExTime(X) = 15 / 10 = 1.5 / 1.0 = Performance(X) / Performance(Y)

n = 100 x (1.5 - 1.0) / 1.0

n = 50%
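The example can be checked with a few lines of Python (a minimal sketch; `percent_faster` is a hypothetical helper name, and performance is taken as 1/ExTime per the earlier slides):

```python
def percent_faster(extime_y, extime_x):
    """n such that "X is n% faster than Y"."""
    # Performance(X)/Performance(Y) = ExTime(Y)/ExTime(X),
    # so n = 100 * (ratio - 1)
    return 100 * (extime_y / extime_x - 1)

print(percent_faster(15, 10))  # 50.0
```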

Page 7: Programs to Evaluate Processor Performance

• (Toy) Benchmarks
  – 10~100-line programs
  – e.g.: sieve, puzzle, quicksort
• Synthetic Benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
  – e.g., Livermore loops
• Real programs
  – e.g., gcc, spice

Page 8: Benchmarking Games

• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance

Page 9: Common Benchmarking Mistakes

• Only average behavior represented in test workload
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• "Benchmark Engineering"
  – particular optimizations
  – different compilers or preprocessors
  – runtime libraries

Page 10: SPEC: System Performance Evaluation Cooperative

• First Round 1989
  – 10 programs yielding a single number
• Second Round 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Reference machine: VAX-11/780
• Third Round 1995
  – Single flag setting for all programs; new set of programs ("benchmarks useful for 3 years")
  – Reference machine: SPARCstation 10 Model 40

Page 11: SPEC First Round

• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically

[Bar chart: SPECperf (0-800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

Page 12: How to Summarize Performance

• Arithmetic Mean (weighted arithmetic mean)
  – tracks execution time: Σ(T_i)/n or Σ(W_i x T_i)
• Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)
  – tracks execution time: n / Σ(1/R_i) or 1 / Σ(W_i/R_i)
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times; use the geometric mean (Π R_i)^(1/n), where R_i = 1/T_i
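The three means can be sketched in a few lines (the times 1.0 and 1000.0 are computer A's P1/P2 times from a later slide; the function names are illustrative):

```python
import math

def arithmetic_mean(times):
    # tracks total execution time when every program runs equally often
    return sum(times) / len(times)

def harmonic_mean(rates):
    # summarizes rates (e.g., MFLOPS) so the result still tracks execution time
    return len(rates) / sum(1 / r for r in rates)

def geometric_mean(ratios):
    # the safe mean for execution times normalized to a reference machine
    return math.prod(ratios) ** (1 / len(ratios))

times_a = [1.0, 1000.0]                 # computer A on P1 and P2 (secs)
print(arithmetic_mean(times_a))         # 500.5
print(harmonic_mean([1 / t for t in times_a]))  # about 2/1001 jobs per second
```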

Page 13: Comparing and Summarizing Performance

For program P1, A is 10 times faster than B; for program P2, B is 10 times faster than A, and so on...

The relative performance of the computers is unclear with Total Execution Times:

                    Computer A   Computer B   Computer C
P1 (secs)                    1           10           20
P2 (secs)                1,000          100           20
Total time (secs)        1,001          110           40

Page 14: Summary Measure

Arithmetic Mean: (1/n) Σ(i=1..n) Execution Time_i

  – Good if programs are run equally in the workload

Harmonic Mean (when performance is expressed as rates): n / Σ(i=1..n) (1 / Rate_i)

  where Rate_i is a function of 1 / Execution Time_i

Page 15: Unequal Job Mix

• Weighted Execution Time
  – Weighted Arithmetic Mean: Σ(i=1..n) Weight_i x Execution Time_i
  – Weighted Harmonic Mean: 1 / Σ(i=1..n) (Weight_i / Rate_i)

Relative Performance

• Execution Time normalized to a reference machine
  – Arithmetic Mean
  – Geometric Mean: (Π(i=1..n) Execution Time Ratio_i)^(1/n)

Page 16: Weighted Arithmetic Mean

WAM(i) = Σ(j=1..n) W(i)_j x Time_j

            A          B        C       W(1)   W(2)    W(3)
P1 (secs)   1.00       10.00    20.00   0.50   0.909   0.999
P2 (secs)   1,000.00   100.00   20.00   0.50   0.091   0.001

WAM(1)      500.50     55.00    20.00   (e.g., 1.0 x 0.5 + 1,000 x 0.5)
WAM(2)      91.91      18.19    20.00
WAM(3)      2.00       10.09    20.00
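The WAM rows can be reproduced with a short sketch (weights and times are the slide's; the dict layout is just one way to organize them):

```python
def wam(weights, times):
    # weighted arithmetic mean of execution times; weights sum to 1
    return sum(w * t for w, t in zip(weights, times))

machines = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
weightings = [[0.5, 0.5], [0.909, 0.091], [0.999, 0.001]]

for i, w in enumerate(weightings, start=1):
    print(f"WAM({i})", [round(wam(w, t), 2) for t in machines.values()])
```

This reproduces 500.50, 55.00, 20.00 for W(1), and so on down the table.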

Page 17: Normalized Execution Time

                  Normalized to A     Normalized to B     Normalized to C
                  A     B     C       A      B     C      A      B     C
P1                1.0   10.0  20.0    0.1    1.0   2.0    0.05   0.5   1.0
P2                1.0   0.1   0.02    10.0   1.0   0.2    50.0   5.0   1.0
Arithmetic mean   1.0   5.05  10.01   5.05   1.0   1.1    25.03  2.75  1.0
Geometric mean    1.0   1.0   0.63    1.0    1.0   0.63   1.58   1.58  1.0
Total time        1.0   0.11  0.04    9.1    1.0   0.36   25.03  2.75  1.0

Geometric Mean = (Π(i=1..n) Execution Time Ratio_i)^(1/n)

(Raw times: P1 = 1.00, 10.00, 20.00 secs and P2 = 1,000.00, 100.00, 20.00 secs on A, B, C)
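The arithmetic-vs-geometric contrast in the table can be checked numerically (a sketch using the slide's raw times, normalized to machine A):

```python
import math

def geomean(xs):
    return math.prod(xs) ** (1 / len(xs))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
ref = times["A"]                      # normalize to machine A

for name, ts in times.items():
    ratios = [t / r for t, r in zip(ts, ref)]
    am = round(sum(ratios) / len(ratios), 2)
    gm = round(geomean(ratios), 2)
    print(name, am, gm)   # A 1.0 1.0 / B 5.05 1.0 / C 10.01 0.63
```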

Page 18: Disadvantages of Arithmetic Mean

Performance varies depending on the reference machine:

                  Normalized to A     Normalized to B     Normalized to C
                  A     B     C       A      B     C      A      B     C
P1                1.0   10.0  20.0    0.1    1.0   2.0    0.05   0.5   1.0
P2                1.0   0.1   0.02    10.0   1.0   0.2    50.0   5.0   1.0
Arithmetic mean   1.0   5.05  10.01   5.05   1.0   1.1    25.03  2.75  1.0

Normalized to A: B is 5 times slower than A; C is slowest
Normalized to B: A is 5 times slower than B
Normalized to C: C is fastest

Page 19: The Pros and Cons of Geometric Means

• Independent of running times of the individual programs
• Independent of the reference machines
• Do not predict execution time
  – the performance of A and B is the same: only true when P1 runs 100 times for every occurrence of P2
    1(P1) x 100 + 1000(P2) x 1 = 10(P1) x 100 + 100(P2) x 1

                  Normalized to A     Normalized to B     Normalized to C
                  A     B     C       A      B     C      A      B     C
P1                1.0   10.0  20.0    0.1    1.0   2.0    0.05   0.5   1.0
P2                1.0   0.1   0.02    10.0   1.0   0.2    50.0   5.0   1.0
Geometric mean    1.0   1.0   0.63    1.0    1.0   0.63   1.58   1.58  1.0


Page 22: Amdahl's Law

Speedup due to enhancement E:

Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then ExTime(E) and Speedup(E) are as follows:

Page 23: Amdahl's Law

ExTime_E = ExTime x ((1 - Fraction_E) + Fraction_E / Speedup_E)

Speedup = ExTime / ExTime_E = 1 / ((1 - Fraction_E) + Fraction_E / Speedup_E) = 1 / ((1 - F) + F/S)
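The law is a one-liner in code (a minimal sketch; the function name is illustrative). The FP example on the next slide falls out directly:

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the task is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# 10% of the work made 2x faster, as in the FP example
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```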

Page 24: Amdahl's Law

Floating-point instructions are improved to run 2 times faster (100% improvement), but only 10% of actual instructions are FP:

Speedup = 1 / ((1 - F) + F/S) = 1 / ((1 - 0.1) + 0.1/2) = 1 / 0.95 = 1.053

A 5.3% improvement

Page 25: Corollary (Amdahl): Make the Common Case Fast

• All instructions require an instruction fetch; only a fraction require a data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality: spatial locality and temporal locality
• Access to small memories is faster
  – Provide a storage hierarchy (Registers - Cache - Memory - Disk/Tape) such that the most frequent accesses are to the smallest (closest) memories

Page 26: Locality of Access

Spatial Locality: There is a high probability that a set of data items whose addresses differ by a small amount will be accessed within a short time of each other.

Temporal Locality: There is a high probability that recently referenced data will be referenced again in the near future.

Page 27: Rule of Thumb

• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware (faster) and be sure the rest can be handled correctly in software

Page 28: Metrics of Performance

[Diagram: each level of the system stack has a natural metric]

Application                                  Answers per month; Operations per second
Programming Language / Compiler
ISA                                          (millions of) Instructions per second: MIPS
                                             (millions of) (FP) operations per second: MFLOP/s
Datapath / Control                           Megabytes per second
Function Units / Transistors, Wires, Pins    Cycles per second (clock rate)

Page 29: Aspects of CPU Performance

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X

Page 30: Marketing Metrics

MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)

• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Not correlated with performance

MFLOP/s = FP Operations / (Time x 10^6)

• Machine dependent
• Often not where time is spent

Normalized FP operation counts:
  add, sub, compare, mult   1
  divide, sqrt              4
  exp, sin, ...             8
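The two MIPS formulas above are equivalent because Time = Instruction Count x CPI / Clock Rate. A sketch with a hypothetical machine (100 MHz clock, average CPI of 1.5, 3M instructions; all values invented for illustration):

```python
def mips_from_counts(instr_count, time_s):
    return instr_count / (time_s * 1e6)

def mips_from_cpi(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

# hypothetical machine
clock, cpi, ic = 100e6, 1.5, 3_000_000
time_s = ic * cpi / clock                      # execution time in seconds

print(round(mips_from_cpi(clock, cpi), 1))     # 66.7
print(round(mips_from_counts(ic, time_s), 1))  # 66.7 -- same answer
```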

Page 31: Cycles Per Instruction

CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count
  (average cycles per instruction)

CPU time = Cycle Time x Σ(i=1..n) CPI_i x I_i    (I_i = count of instructions of type i)

CPI = Σ(i=1..n) CPI_i x F_i, where F_i = I_i / Instruction Count    (instruction frequency)

Invest resources where time is spent!

Page 32: Organizational Trade-offs

[Diagram: Application, Programming Language, and Compiler determine the Instruction Mix; the ISA and Datapath/Control determine CPI; Function Units, Transistors, Wires, and Pins determine Cycle Time]

Page 33: Example: Calculating CPI

Base Machine (Reg / Reg), Typical Mix:

Op       Freq   CPI(i)   CPI   (% Time)
ALU      50%    1        .5    (33%)
Load     20%    2        .4    (27%)
Store    10%    2        .2    (13%)
Branch   20%    2        .4    (27%)
Total CPI                1.5
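The table's CPI and the per-class time shares follow directly from CPI = Σ CPI_i x F_i (a minimal sketch; the mix is the slide's):

```python
# instruction mix from the slide: op -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = round(sum(f * c for f, c in mix.values()), 2)
print(cpi)  # 1.5

for op, (f, c) in mix.items():
    # each class's share of total execution time
    print(op, f"{100 * f * c / cpi:.0f}%")   # ALU 33%, Load 27%, Store 13%, Branch 27%
```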

Page 34: Example

Base Machine (Reg / Reg), Typical Mix:

Op       Freq   CPI(i)
ALU      50%    1
Load     20%    2
Store    10%    2
Branch   20%    2

Add register/memory operations (R/M):
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2

Some Load instructions can be eliminated by having an R/M-type ADD instruction [ADD R1, X]. The branch cycle count increases to 3.

What fraction of the loads must be eliminated for this to pay off?

Page 35: Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op       Freq(i)   CPI(i)   CPI
ALU      .50       1        .5
Load     .20       2        .4
Store    .10       2        .2
Branch   .20       2        .4
Total    1.00               1.5

Performance CS510 Computer Architectures Lecture 3 - 36

Example SolutionExample Solution

Exec Time = Instr Cnt x CPI x Clock

CPINEW must be normalized to new instruction frequency

NewFreqi CPIi CPINEW

.5 - X 1 .5 - X

.2 - X 2 .4 - 2X

.1 2 .2

.2 3 .6

X 2 2X

1 - X (1.7 - X)/(1 - X)

OldOp Freqi CPIi CPI

ALU .50 1 .5

Load .20 2 .4

Store .10 2 .2

Branch .20 2 .4

Reg/Mem

1.00 1.5

Page 37: Example Solution

Exec Time = Instr Cnt x CPI x Clock

Instr Cnt_old x CPI_old x Clock = Instr Cnt_new x CPI_new x Clock
1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
1.5 = 1.7 - X
X = 0.2

All loads must be eliminated for this to be a win!
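The break-even algebra can be sanity-checked numerically (a sketch; X is the fraction of all instructions converted to Reg/Mem ADDs, each eliminating one Load, with the cycle counts from the table):

```python
def new_exec_time(x, clock=1.0):
    # new instruction count and total cycles per original instruction
    instr_cnt = 1.0 - x
    cycles = (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2
    cpi = cycles / instr_cnt                 # = (1.7 - x) / (1 - x)
    return instr_cnt * cpi * clock           # = cycles

old = 1.00 * 1.5 * 1.0                       # Instr Cnt x CPI x Clock
print(round(new_exec_time(0.2), 2))          # 1.5 -- matches old only at X = 0.2
```

Since the Load frequency is .20, X = 0.2 means every load must be eliminated.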

Page 38: Fallacies and Pitfalls

• Fallacy: MIPS is an accurate measure for comparing performance among computers
  – dependent on the instruction set
  – varies between programs on the same computer
  – can vary inversely to performance
• Fallacy: MFLOPS is a consistent and useful measure of performance
  – dependent on the machine and on the program
  – not applicable outside floating-point performance
  – the set of floating-point operations is not consistent across machines