Assessing and Understanding Performance Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.

Assessing and Understanding Performance

Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available by Dr Mary Jane Irwin, Penn State University.

CS.210 Computer Systems and Architecture
<http://spider.science.strath.ac.uk/spider/spider/showClass.php?class=cs210>

and CS.305 Computer Architecture
<local.cis.strath.ac.uk/teaching/ug/classes/CS.305>


Performance Metrics

Purchasing perspective: given a collection of machines, which has the
• best performance?
• least cost?
• best cost/performance?

Design perspective: faced with design options, which has the
• best performance improvement?
• least cost?
• best cost/performance?

Both require
• a basis for comparison
• a metric for evaluation

Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

CS210_305_05/2


Defining (Speed) Performance

Normally interested in reducing
• Response time (aka execution time) – the time between the start and the completion of a task
  • Important to individual users
• Thus, to maximize performance, need to minimize execution time
• Throughput – the total amount of work done in a given time
  • Important to data center managers
• Decreasing response time almost always improves throughput

\[ \text{Performance}_X = \frac{1}{\text{Execution time}_X} \]

If X is n times faster than Y, then

\[ \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \]
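As a quick check of the ratio above, here is a minimal Python sketch. The function name `speedup` and the two execution times are illustrative assumptions, not values from the slides.

```python
def speedup(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is the reciprocal of execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Hypothetical measurements: X runs the task in 10 s, Y in 15 s.
n = speedup(time_x=10.0, time_y=15.0)
print(f"X is {n:.2f} times faster than Y")
```

Note that the slower machine's time goes in the numerator: a smaller execution time for X gives a larger n.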



Performance Factors

Want to distinguish elapsed time and the time spent on our task

CPU execution time (CPU time) – time the CPU spends working on a task
• Does not include time waiting for I/O or running other programs

\[ \text{CPU time}_{\text{program}} = \text{CPU clock cycles}_{\text{program}} \times \text{Clock cycle time} \]

or

\[ \text{CPU time}_{\text{program}} = \frac{\text{CPU clock cycles}_{\text{program}}}{\text{Clock rate}} \]

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
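The two forms of the CPU time equation can be checked against each other with a short Python sketch. The cycle count and the 2 GHz clock are made-up numbers chosen for illustration, and the function names are not from the slides.

```python
def cpu_time_from_cycle_time(clock_cycles: float, cycle_time_s: float) -> float:
    """CPU time = CPU clock cycles x clock cycle time."""
    return clock_cycles * cycle_time_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical program: 20 billion clock cycles on a 2 GHz machine,
# i.e. a clock cycle time of 500 ps.
cycles = 20e9
t_from_period = cpu_time_from_cycle_time(cycles, 500e-12)
t_from_rate = cpu_time_from_clock_rate(cycles, 2e9)
print(t_from_period, t_from_rate)  # both forms give roughly 10 s
```

Halving either the cycle count or the cycle time halves the CPU time, which is the point the slide makes about the two ways to improve performance.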



Review: Machine Clock Rate

Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period):

\[ \text{Clock cycle} = \frac{1}{\text{Clock rate}} \qquad \text{Clock rate} = \frac{1}{\text{Clock cycle}} \]

[Figure: clock waveform with one clock period marked]

10 nsec clock cycle => 100 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
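The conversions listed on this slide can be reproduced with a minimal Python sketch. The helper names `rate_from_period` and `period_from_rate` are hypothetical; the four clock periods are the ones from the slide.

```python
def rate_from_period(period_s: float) -> float:
    """Clock rate in Hz for a clock period given in seconds."""
    return 1.0 / period_s


def period_from_rate(rate_hz: float) -> float:
    """Clock period in seconds for a clock rate given in Hz."""
    return 1.0 / rate_hz


# Clock periods taken from the slide, expressed in seconds.
EXAMPLES = [
    ("10 nsec", 10e-9),
    ("1 nsec", 1e-9),
    ("500 psec", 500e-12),
    ("250 psec", 250e-12),
]

for label, period_s in EXAMPLES:
    rate_hz = rate_from_period(period_s)
    print(f"{label} clock cycle => {rate_hz / 1e6:,.0f} MHz clock rate")
```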



Clock Cycles per Instruction (CPI)

Not all instructions take the same amount of time to execute
• One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

\[ \frac{\text{clock cycles}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{average clock cycles}}{\text{instruction}} \]

Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
• A way to compare two different implementations of the same ISA

CPI for this instruction class

        A   B   C
CPI     1   2   3
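To see how the per-class CPI values turn into a cycle count, here is a small Python sketch. The class CPIs (A = 1, B = 2, C = 3) come from the table on this slide; the two code sequences and their instruction counts are hypothetical.

```python
# CPI per instruction class, as given in the slide's table.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}


def clock_cycles(instruction_counts: dict) -> int:
    """Total cycles = sum over classes of (instructions x CPI of the class)."""
    return sum(CLASS_CPI[cls] * count for cls, count in instruction_counts.items())


def average_cpi(instruction_counts: dict) -> float:
    """Average CPI = total clock cycles / total instructions."""
    total_instructions = sum(instruction_counts.values())
    return clock_cycles(instruction_counts) / total_instructions


# Two hypothetical code sequences for the same task.
sequence_1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

for name, seq in (("sequence 1", sequence_1), ("sequence 2", sequence_2)):
    print(name, clock_cycles(seq), "cycles, CPI", average_cpi(seq))
```

With these assumed counts the second sequence executes more instructions yet needs fewer cycles, so instruction count alone does not decide which is faster.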



Effective CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

\[ \text{Overall effective CPI} = \sum_{i=1}^{n} \left( CPI_i \times IC_i \right) \]

• Where IC_i is the count of instructions of class i executed, expressed as a fraction (percentage) of the total
• CPI_i is the (average) number of clock cycles per instruction for that instruction class
• n is the number of instruction classes

The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
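If raw per-class instruction counts are available rather than percentages, the same weighted average can be written as below. The symbols N_i (raw count for class i) and N (total instruction count) are introduced here for illustration and are not notation from the slides.

\[ \text{Overall effective CPI} = \frac{\sum_{i=1}^{n} CPI_i \times N_i}{\sum_{i=1}^{n} N_i} = \sum_{i=1}^{n} CPI_i \times \frac{N_i}{N}, \qquad IC_i = \frac{N_i}{N} \]

Without the division by N, the sum would give total clock cycles rather than cycles per instruction.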



THE Performance Equation

Our basic performance equation is then

\[ \text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{Clock\_rate}} \]

or

\[ \text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{Clock\_cycle} \]

These equations separate the three key factors that affect performance
• Can measure the CPU execution time by running the program
• The clock rate is usually given
• Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
• CPI varies by instruction type and ISA implementation, for which we must know the implementation details
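A minimal Python sketch of the performance equation, used to compare two implementations of the same ISA. The instruction count, CPIs and cycle times below are invented for illustration, and `cpu_time` is a hypothetical helper name.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_s: float) -> float:
    """CPU time = Instruction_count x CPI x Clock_cycle."""
    return instruction_count * cpi * clock_cycle_s


# Same ISA and same program, so the instruction count is identical.
INSTRUCTION_COUNT = 1e9  # assumed value

# Hypothetical implementation A: 250 ps cycle, CPI of 2.0.
time_a = cpu_time(INSTRUCTION_COUNT, cpi=2.0, clock_cycle_s=250e-12)
# Hypothetical implementation B: 500 ps cycle, CPI of 1.2.
time_b = cpu_time(INSTRUCTION_COUNT, cpi=1.2, clock_cycle_s=500e-12)

print(f"A: {time_a:.2f} s, B: {time_b:.2f} s, A is {time_b / time_a:.2f}x faster")
```

In this made-up case B has the better CPI but A still wins because of its shorter clock cycle, which is why no single factor should be compared in isolation.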





Determinants of CPU Performance

                         Instruction count   CPI   Clock cycle
Algorithm                        X            X
Programming language             X            X
Compiler                         X            X
ISA                              X            X         X
Processor organization                        X         X
Technology                                              X





A Simple Example

Op       Freq   CPI_i   Freq × CPI_i    Q1     Q2     Q3
ALU      50%    1       .5              .5     .5     .25
Load     20%    5       1.0             .4     1.0    1.0
Store    10%    3       .3              .3     .3     .3
Branch   20%    2       .4              .4     .2     .4
Σ =                     2.2             1.6    2.0    1.95

Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
    CPU time new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster

Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster

Q3: What if two ALU instructions could be executed at once?
    CPU time new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster
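The three answers can be recomputed with a short Python sketch. The frequencies and CPIs are those in the table; the helper names `effective_cpi` and `with_cpi` are hypothetical, and Q3 is modelled, as on the slide, by halving the ALU CPI to 0.5.

```python
# Instruction mix from the table: op -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def effective_cpi(mix: dict) -> float:
    """Overall effective CPI = sum of (frequency x CPI) over all classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict, op: str, new_cpi: float) -> dict:
    """Return a copy of the mix with the CPI of one class replaced."""
    changed = dict(mix)
    freq, _old_cpi = changed[op]
    changed[op] = (freq, new_cpi)
    return changed


base = effective_cpi(BASE_MIX)
scenarios = {
    "Q1 (load takes 2 cycles)": with_cpi(BASE_MIX, "Load", 2.0),
    "Q2 (branch takes 1 cycle)": with_cpi(BASE_MIX, "Branch", 1.0),
    "Q3 (two ALU ops at once)": with_cpi(BASE_MIX, "ALU", 0.5),
}

print(f"base CPI {base:.2f}")
for name, mix in scenarios.items():
    new = effective_cpi(mix)
    # IC and clock cycle are unchanged, so the speedup is the CPI ratio.
    print(f"{name}: CPI {new:.2f}, {100 * (base / new - 1):.1f}% faster")
```

Because instruction count and clock cycle are the same in every scenario, the ratio of old to new CPI is also the ratio of old to new CPU time.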



Comparing and Summarizing Performance

Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

How do we summarize the performance for a benchmark set with a single number?
• The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

\[ AM = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i \]

• Where Time_i is the execution time for the ith program of a total of n programs in the workload
• A smaller mean indicates a smaller average execution time and thus improved performance
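A minimal Python sketch of the arithmetic mean as a summary number. The execution times for the two machines are invented for illustration.

```python
def arithmetic_mean(times_s: list) -> float:
    """AM = (1/n) * sum of the n execution times."""
    if not times_s:
        raise ValueError("need at least one execution time")
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) of the same three programs.
machine_a = [2.0, 10.0, 30.0]
machine_b = [4.0, 8.0, 24.0]

am_a = arithmetic_mean(machine_a)
am_b = arithmetic_mean(machine_b)
print(f"AM A = {am_a} s (total {sum(machine_a)} s)")
print(f"AM B = {am_b} s (total {sum(machine_b)} s)")
```

Since both means divide by the same n, comparing the means is the same as comparing total execution time, even though machine A is faster on the first program.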



SPEC Benchmarks (http://www.spec.org/)

Integer benchmarks                       FP benchmarks
gzip      compression                    wupwise   Quantum chromodynamics
vpr       FPGA place & route             swim      Shallow water model
gcc       GNU C compiler                 mgrid     Multigrid solver in 3D fields
mcf       Combinatorial optimization     applu     Parabolic/elliptic PDE
crafty    Chess program                  mesa      3D graphics library
parser    Word processing program        galgel    Computational fluid dynamics
eon       Computer visualization         art       Image recognition (NN)
perlbmk   perl application               equake    Seismic wave propagation simulation
gap       Group theory interpreter       facerec   Facial image recognition
vortex    Object oriented database       ammp      Computational chemistry
bzip2     compression                    lucas     Primality testing
twolf     Circuit place & route          fma3d     Crash simulation FEM
                                         sixtrack  Nuclear physics accel
                                         apsi      Pollutant distribution


Example SPEC Ratings



Other Performance Metrics

Power consumption – especially in the embedded market where battery life (and cooling) is important
• For power-limited applications, the most important metric is energy efficiency



Other Performance Metrics - (Native) MIPS

(Native) MIPS - and What Is Wrong with Them

The dangers of using metrics other than time in performance measurement can be shown by looking at several popular alternatives.

One such alternative is MIPS - Millions of Instructions Per Second:

\[ \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}} \]

The rightmost form is sometimes convenient since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,

\[ \text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^{6}} \]
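The three MIPS relations can be cross-checked with a minimal Python sketch. The machine parameters (500 MHz, CPI of 2.0, 100 million instructions) are assumed values for illustration only.

```python
def mips_from_time(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = Instruction count / (Execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time_from_mips(instruction_count: float, mips: float) -> float:
    """Execution time = Instruction count / (MIPS x 10^6)."""
    return instruction_count / (mips * 1e6)


# Hypothetical machine and program.
CLOCK_RATE_HZ = 500e6
CPI = 2.0
INSTRUCTION_COUNT = 100e6

rating = mips_from_clock(CLOCK_RATE_HZ, CPI)
time_s = execution_time_from_mips(INSTRUCTION_COUNT, rating)
print(f"{rating:.0f} MIPS, {time_s:.2f} s")
print(f"cross-check from time: {mips_from_time(INSTRUCTION_COUNT, time_s):.0f} MIPS")
```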



The Problem With MIPS

The problem with using MIPS as a measure of comparison is threefold:
• MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets;
• MIPS varies between programs on the same computer; and most importantly,
• MIPS can vary inversely to performance!

The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Machines with the option yield faster executing programs yet have a lower MIPS rating.
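The floating-point case can be illustrated with made-up numbers. The sketch below assumes that hardware floating point executes few instructions at a high CPI, while software floating-point routines execute many simple instructions at a low CPI; every figure in it is hypothetical.

```python
def run(instruction_count: float, cpi: float, clock_rate_hz: float) -> tuple:
    """Return (execution time in seconds, native MIPS rating)."""
    time_s = instruction_count * cpi / clock_rate_hz
    mips = instruction_count / (time_s * 1e6)
    return time_s, mips


CLOCK_RATE_HZ = 100e6  # assumed 100 MHz clock for both configurations

# With FP hardware: fewer instructions, each taking more cycles (assumed).
hw_time, hw_mips = run(10e6, cpi=4.0, clock_rate_hz=CLOCK_RATE_HZ)
# Without it: FP emulated by many simple integer instructions (assumed).
sw_time, sw_mips = run(60e6, cpi=1.5, clock_rate_hz=CLOCK_RATE_HZ)

print(f"FP hardware: {hw_time:.2f} s at {hw_mips:.1f} MIPS")
print(f"FP software: {sw_time:.2f} s at {sw_mips:.1f} MIPS")
```

Under these assumptions the configuration with floating-point hardware finishes sooner but reports the lower MIPS figure, which is the inversion the slide warns about.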



Peak MIPS

Beware of so-called peak MIPS. This type of rating is obtained by choosing an instruction mix that minimises the CPI even if that instruction mix is totally impractical. For instance:
• A program composed entirely of arithmetic and logic operations but no jumps, branches, or load/stores!
• Or, as in a famous case, a program consisting only of NOPs (No OPerations)

In other words: Peak MIPS - a level of performance that will never be attained ;-)



Relative MIPS

MIPS can fail to give a true picture of performance in that it does not track execution time. An alternative type of MIPS rating is relative MIPS - as opposed to native MIPS - derived by using a particular machine as a reference point:

\[ \text{Relative MIPS} = \frac{\text{Time}_{\text{reference}}}{\text{Time}_{\text{unrated}}} \times \text{MIPS}_{\text{reference}} \]

where
• Time_reference = Execution time of a program on a reference machine
• Time_unrated = Execution time of the same program on a machine to be rated
• MIPS_reference = Agreed-upon MIPS rating of the reference machine

The advantage of this form of MIPS is small since execution time, program, and program input still must be known to have meaningful information.
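A minimal Python sketch of the relative MIPS calculation. The reference rating of 1 MIPS and both execution times are assumptions chosen to keep the arithmetic easy.

```python
def relative_mips(time_reference_s: float, time_unrated_s: float,
                  mips_reference: float) -> float:
    """Relative MIPS = (Time_reference / Time_unrated) x MIPS_reference."""
    return (time_reference_s / time_unrated_s) * mips_reference


# Hypothetical: a reference machine agreed to be 1 MIPS runs the program
# in 100 s; the machine being rated runs the same program in 4 s.
rating = relative_mips(time_reference_s=100.0, time_unrated_s=4.0,
                       mips_reference=1.0)
print(f"relative MIPS rating: {rating}")
```

The rating is just a scaled execution-time ratio, so it says nothing that the two measured times do not already say.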



Summary: Evaluating ISAs

Design-time metrics:
• Can it be implemented, in how long, at what cost?
• Can it be programmed? Ease of compilation?

Static Metrics:
• How many bytes does the program occupy in memory?

Dynamic Metrics:
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?

Best Metric: Time to execute the program!
• Depends on the instruction set, the processor organization, and compilation techniques, through its three factors: Inst. Count, CPI, and Cycle Time.



Fallacies and Pitfalls

Pitfall: Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement.
• Example: Suppose a program takes 100 seconds to run and multiply operations account for 80 seconds of this time. How much do you need to improve the speed of multiplication to make the program run five times faster?

A: Using Amdahl’s Law:

\[ \text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected} \]




…A: Using Amdahl’s Law (and the problem set):

\[ \text{Execution time after improvement} = \frac{80\ \text{seconds}}{n} + (100 - 80\ \text{seconds}) \]

To get 5 times faster the new execution time must be 20 seconds:

\[ 20\ \text{seconds} = \frac{80\ \text{seconds}}{n} + 20\ \text{seconds} \]

\[ 0 = \frac{80\ \text{seconds}}{n} \]

I.e. there is no amount by which we can enhance multiply to achieve a fivefold improvement in execution time!

Making the common case fast will tend to enhance performance better than optimising the rare case.
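The Amdahl's Law argument can be replayed numerically with a minimal Python sketch. The 80 s and 20 s figures are from the example; the function names and the extra 25-second target are assumptions added for comparison.

```python
def time_after_improvement(affected_s: float, unaffected_s: float,
                           improvement: float) -> float:
    """Amdahl's Law: affected time / amount of improvement + unaffected time."""
    return affected_s / improvement + unaffected_s


def required_improvement(affected_s: float, unaffected_s: float,
                         target_s: float):
    """Improvement factor n needed to hit target_s, or None if unreachable."""
    remaining_s = target_s - unaffected_s
    if remaining_s <= 0:
        # The unaffected part alone already uses up the whole target time.
        return None
    return affected_s / remaining_s


# Multiply accounts for 80 s of a 100 s program; the other 20 s is unaffected.
print(required_improvement(80.0, 20.0, target_s=25.0))  # 4x overall speedup
print(required_improvement(80.0, 20.0, target_s=20.0))  # 5x overall speedup

# Even very large multiply speedups only approach the 20 s floor.
for n in (2, 10, 100, 1000):
    print(n, time_after_improvement(80.0, 20.0, n))
```

The second call returns `None` because the 20 seconds of unaffected time already equals the target, matching the slide's conclusion that no multiply speedup yields a fivefold improvement.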




Pitfall: Comparing computers using only one or two of three performance metrics: clock rate, CPI, and instruction count.

Pitfall: Using peak performance to compare machines.

Fallacy: Synthetic benchmarks predict performance.
• Synthetic benchmarks are small artificial programs that attempt to represent the execution frequency of statements found in a larger set of benchmarks or in real-world programs. Whetstone and Dhrystone are examples.
• Since these are not natural programs they can (be used to) distort performance statistics by, e.g.:
  • compilers discarding large sections of the code!
  • compilers targeting other optimisation 'opportunities' specifically for a benchmark and, hence, artificially inflating the performance stats. E.g. 20% to 30% improvement by using a string copy 'optimisation' for Dhrystone that could not be applied in over 99% of normal programs!!



Concluding Remarks

The task a computer designer faces is a complex one: determine what attributes are important for a new machine, then design a machine to maximise performance while staying within cost constraints. Performance can be measured as throughput or response time; which of the two matters depends on the environment/application and should be borne in mind.

Amdahl's Law is a valuable tool to determine what performance improvement an architectural enhancement may give.

Knowing what cases are the most frequent is critical to improving performance. Based on empirical studies of instruction sets, tradeoffs can be made by deciding which instructions are the most important and what cases to try to make fast.

Computer designs will always be measured by cost and performance and finding the best balance will always be the art of computer design, just as in any engineering task.
