Computer Architecture Lecture Notes Spring 2005 Dr. Michael P. Frank Competency Area 2: Performance Metrics Lecture 1.


Performance Metrics

• Why is it necessary for us to study performance?

— Performance is usually the key to the effectiveness of a system (hardware + software).

— Performance is critical to customers (purchasers), thus, we as designers and architects must also make it a priority.

— Performance must be assessed and understood in order for a system to communicate efficiently with peripheral devices.

Performance Metrics

• How can we determine performance?

Consider this example from the transportation industry:

Aircraft          Passenger   Fuel Capacity   Cruising Speed   Range     Throughput   Cost
                  Capacity    (liters)        (km/h)           (km)
Boeing 747-400    421         216,847         920              10,734    387,320      0.048
Boeing 767-300    270         91,380          853              10,548    230,310      0.032
Airbus 340-300    284         139,681         869              12,493    246,796      0.039
Airbus 340-300    120         23,859          837              4,442     100,440      0.045
BAE-146-200       77          11,750          708              2,406     54,516       0.063
Concorde          132         119,501         2,180            6,230     287,760      0.145
Dash-8            50          3,202           531              1,389     26,550       0.046
Car               5           60              100              700       500          0.017

Performance Example

• Fuel Capacity in liters
• Range in kilometers
• Speed in kilometers/hour
• Throughput is defined as
    (# of passengers) × (cruising speed)
• Cost is given as
    (fuel capacity) / (passengers × range)

Which mode of transportation has the “best” performance?

Performance Example

• It depends on how we define performance.
• Consider raw speed:
—Getting from one place to another quickly
– Best: Concorde (2,180 km/h)
– Worst: Car (100 km/h)

Performance Example

• What if we’re interested in the rate at which people are carried (throughput)?
– Best: Boeing 747-400 (387,320)
– Worst: Car (500)

Performance Example

• Oftentimes we relate performance and cost. Thus we can consider the amount of fuel used per passenger-kilometer (the Cost column):
– Best plane: Boeing 767-300 (0.032)
– Best overall: Car (0.017)
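The rankings in this example can be reproduced with a short Python sketch. The tuple layout, the function names `throughput` and `cost`, and the seat-count suffixes used to tell apart the two rows labeled Airbus 340-300 are assumptions made for illustration; the numbers are copied from the table.

```python
# Each row: (name, passengers, fuel capacity [liters], cruising speed [km/h], range [km]).
# Values come from the table. The "(284 seats)" / "(120 seats)" suffixes are added
# here only to distinguish the two rows that the table labels "Airbus 340-300".
VEHICLES = [
    ("Boeing 747-400", 421, 216_847, 920, 10_734),
    ("Boeing 767-300", 270, 91_380, 853, 10_548),
    ("Airbus 340-300 (284 seats)", 284, 139_681, 869, 12_493),
    ("Airbus 340-300 (120 seats)", 120, 23_859, 837, 4_442),
    ("BAE-146-200", 77, 11_750, 708, 2_406),
    ("Concorde", 132, 119_501, 2_180, 6_230),
    ("Dash-8", 50, 3_202, 531, 1_389),
    ("Car", 5, 60, 100, 700),
]


def throughput(passengers, speed_kmh):
    """Throughput = (# of passengers) x (cruising speed)."""
    return passengers * speed_kmh


def cost(fuel_liters, passengers, range_km):
    """Cost = (fuel capacity) / (passengers x range)."""
    return fuel_liters / (passengers * range_km)


# Raw speed: higher is better.
fastest = max(VEHICLES, key=lambda v: v[3])[0]
slowest = min(VEHICLES, key=lambda v: v[3])[0]

# Throughput: higher is better.
most_throughput = max(VEHICLES, key=lambda v: throughput(v[1], v[3]))[0]
least_throughput = min(VEHICLES, key=lambda v: throughput(v[1], v[3]))[0]

# Cost: lower is better.
cheapest = min(VEHICLES, key=lambda v: cost(v[2], v[1], v[4]))[0]
planes_only = [v for v in VEHICLES if v[0] != "Car"]
cheapest_plane = min(planes_only, key=lambda v: cost(v[2], v[1], v[4]))[0]

print("speed:      best", fastest, "| worst", slowest)
print("throughput: best", most_throughput, "| worst", least_throughput)
print("cost:       best plane", cheapest_plane, "| best overall", cheapest)
```

The same data gives three different "winners", which is the point of the example: the ranking depends entirely on which metric is chosen.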

Performance Metrics

• Similar measures of performance are used for computers.
— Number of computations done per unit of time
— Cost of computations
— Possibly several aspects of cost can be considered, including initial purchase price, operating cost, cost of training users of the system, etc.

• Common performance measures are

1. RESPONSE TIME – the amount of time it takes a program to complete (a.k.a execution time)

2. THROUGHPUT – the total amount of work done in a given amount of time

Performance Metrics

Example: Given the following actions:

1. Replacing the processor with a faster version
2. Adding additional processors to perform separate tasks in a multiprocessor system

do they (a) increase throughput, (b) decrease response time, or (c) both?

Defining Performance

• Our focus will be primarily on execution time.
• To maximize performance implies a minimization in execution time:

$$\text{Performance}_X = \frac{1}{\text{ExecutionTime}_X}$$

• For two machines X and Y, if

$$\text{Performance}_Y > \text{Performance}_X$$

then

$$\frac{1}{\text{ExecutionTime}_Y} > \frac{1}{\text{ExecutionTime}_X} \quad\Longrightarrow\quad \text{ExecutionTime}_X > \text{ExecutionTime}_Y$$

• We say that machine Y is faster than machine X.

Performance Metrics

Notes:

(1) If X is n times faster than Y, then

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = n$$

Also,

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = n$$

(2) To avoid confusion, we’ll use the following terminology:

We say                        We mean
“improve performance”         increase performance
“improve execution time”      decrease execution time

Performance Example

If machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, how much faster is A than B?

Since $\text{ET}_B > \text{ET}_A$, we have $\text{Perf}_A > \text{Perf}_B$, so

$$\frac{\text{perf}_A}{\text{perf}_B} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{ET}_B}{\text{ET}_A} = \frac{15\ \text{sec}}{10\ \text{sec}} = 1.5$$

Machine A is 1.5 times faster than B.
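The relative-performance ratio used in this example can be written as a small helper. This is a minimal sketch; the function name `times_faster` and its argument names are illustrative choices, not part of the lecture.

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that machine X is n times faster than machine Y.

    Uses Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X.
    Both times must be positive and measured on the same program.
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Machine A takes 10 s, machine B takes 15 s on the same program.
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n} times faster than B")
```

A result below 1 would mean that X is actually the slower machine.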

Measuring Performance

• Quite simply, TIME is the measure of computer performance!

• The most straightforward definition of time is wall-clock time, also called elapsed time or response time:

Total time to complete a task, including system overhead activities such as Input/Output tasks, disk and memory accesses, etc.


• CPU Time is the time it takes to complete a task excluding the time it takes for I/O waits. It has two components:

— USER CPU TIME: The time the CPU is busy executing the user’s code.
— SYSTEM CPU TIME: The time the CPU spends performing operating system tasks.

Note: Sometimes system and user CPU times are difficult to distinguish since it is hard to assign responsibility for OS activities.


Example: To understand the concept of CPUTime, consider the UNIX command ‘time’. Once typed, it may return a response similar to

90.7u 12.9s 2:39 65%

What do these numbers mean?

90.7u    User CPU Time (seconds)
12.9s    System CPU Time (seconds)
2:39     Elapsed Time (minutes:seconds, i.e. 159 seconds)
65%      % of elapsed time that is CPU time

a. What is the total CPUTime?

$$\text{CPUTime} = 90.7 + 12.9 = 103.6\ \text{sec}$$

b. Percentage of time spent on I/O and other programs?

$$\frac{159 - 103.6}{159} \times 100 \approx 35\%$$
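The arithmetic for this example can be sketched in Python. The parser below assumes exactly the four-field layout shown in the lecture's sample line (user, system, elapsed as minutes:seconds, percentage); the output of `time` differs between shells, so the helper names `parse_elapsed` and `summarize` and the format handling are assumptions for illustration only.

```python
def parse_elapsed(text):
    """Convert an elapsed time written as 'm:ss' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def summarize(line):
    """Return (total CPU seconds, elapsed seconds, % of time not spent on this CPU work).

    Assumes the sample layout '<user>u <system>s <m:ss> <pct>%'.
    """
    user, system, elapsed, _pct = line.split()
    user_s = float(user.rstrip("u"))
    system_s = float(system.rstrip("s"))
    elapsed_s = parse_elapsed(elapsed)
    cpu_s = user_s + system_s
    other_pct = 100.0 * (elapsed_s - cpu_s) / elapsed_s
    return cpu_s, elapsed_s, other_pct


cpu_s, elapsed_s, other_pct = summarize("90.7u 12.9s 2:39 65%")
print(f"CPU time: {cpu_s:.1f} s of {elapsed_s:.0f} s elapsed")
print(f"I/O and other programs: {other_pct:.0f}%")
```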


• Other notes:

1. SYSTEM PERFORMANCE – reciprocal of elapsed time on an unloaded system (e.g., no other user applications running)

2. CPU PERFORMANCE – reciprocal of user CPU time

3. CLOCK CYCLES (CC) – discrete time intervals measured by the processor clock running at a constant rate.

4. CLOCK PERIOD – time it takes to complete a clock cycle

5. CLOCK RATE – inverse of clock period


• Consider CPU performance:

$$\text{CPUTime} = (\text{CPU Clock Cycles for a program}) \times (\text{Clock Cycle Time},\ t_{CC})$$

Also,

$$\text{CPUTime} = \frac{\text{CPU Clock Cycles for a program}}{\text{Clock Rate},\ f_{CC}}$$


• Since the execution time clearly depends on the number of instructions for a program, we must also define another performance metric:

CPI = average number of clock cycles per instruction

$$\text{CPI} = \frac{\text{CPU Clock Cycles for a program}}{\text{Instruction Count}}$$


• Now we have two more equations that we can define for CPUTime:

$$\text{CPUTime} = IC \times CPI \times t_{CC}$$

$$\text{CPUTime} = \frac{IC \times CPI}{f_{CC}}$$


• In summary, performance metrics include:

Components of Performance    Units of Measure
CPUTime                      Seconds for a program
IC                           # of instructions for a program
CPI                          Average # of clock cycles per instruction
tCC                          Seconds per clock cycle


Example: Suppose Machine A implements the same ISA as Machine B. Given $t_{CC,A} = 1$ ns and $\text{CPI}_A = 2.0$ for some program, and $t_{CC,B} = 2$ ns and $\text{CPI}_B = 1.2$ for the same program, determine which machine is faster and by how much.
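One way to work this example is sketched below, reading the given values as a 1 ns cycle time with CPI 2.0 for Machine A and a 2 ns cycle time with CPI 1.2 for Machine B. Because both machines implement the same ISA and run the same program, the instruction count IC is identical and cancels out of the ratio, so comparing time per instruction is enough. The function name is an illustrative assumption.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI x clock cycle time (CPUTime / IC)."""
    return cpi * cycle_time_ns


# Values as read from the example statement (assumed: A = 1 ns, CPI 2.0; B = 2 ns, CPI 1.2).
t_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=1.0)
t_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=2.0)

# Performance_A / Performance_B = CPUTime_B / CPUTime_A; IC cancels.
ratio = t_b / t_a
faster = "A" if ratio > 1 else "B"
print(f"A: {t_a} ns/instr, B: {t_b} ns/instr")
print(f"Machine {faster} is faster by a factor of {max(ratio, 1 / ratio):.2f}")
```

Under these values Machine A comes out ahead even though it has the higher CPI, because its shorter cycle time more than compensates.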

Breakdown by Instruction Category

• Recall CPI = Clock cycles (CC) per instruction
• But, CPI depends on many factors, including:
—Memory system behavior
—Processor structure
—Availability of special processor features
– E.g., floating point, graphics, etc.

• To characterize the effect of changing specific aspects of the architecture, we find it helpful to break down CC into components due to different classes (categories) of instructions:

$$CC = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

—Where:
– ICi = instruction count for class i
– CPIi = avg. cycles for insts. in class i
– n = the number of instruction classes

Example

• Suppose a processor has 3 categories of instructions A, B, C with the following CPIs:

Instr. Class    CPIi
A               1
B               2
C               3

• And, suppose a compiler designer is comparing two code sequences for a given program that have the following instruction counts:

Code Seq.    ICA    ICB    ICC
1            2      1      2
2            4      1      1

• Determine:
(i) Which code sequence executes the most instructions?
(ii) Which will be faster?
(iii) What is the average CPI for each code sequence?

Solution to Example

• Part (i):
— ICseq1 = 2 + 1 + 2 = 5 instructions
— ICseq2 = 4 + 1 + 1 = 6 instructions
⇒ Code sequence 2 executes more instructions.

• Part (ii):
— CCseq1 = ∑i(CPIi × ICi) = 1×2 + 2×1 + 3×2 = 10 cycles
— CCseq2 = ∑i(CPIi × ICi) = 1×4 + 2×1 + 3×1 = 9 cycles
⇒ Code sequence 2 takes fewer cycles, so it is faster!

• Part (iii):
— CPIseq1 = {CC/IC}seq1 = 10 cyc./5 inst. = 2
— CPIseq2 = {CC/IC}seq2 = 9 cyc./6 inst. = 1.5

• Which part should we consult to tell us which code sequence has better performance?
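The three parts of the solution can be checked with a short Python sketch of the per-class cycle sum. The dictionary layout and function names are illustrative assumptions; the CPIs and instruction counts are the ones given in the example.

```python
# CPI for each instruction class, from the example.
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences, from the example.
SEQ1 = {"A": 2, "B": 1, "C": 2}
SEQ2 = {"A": 4, "B": 1, "C": 1}


def instruction_count(counts):
    """Total instructions executed: sum of IC_i over all classes."""
    return sum(counts.values())


def total_cycles(counts):
    """Total clock cycles: sum of CPI_i x IC_i over all classes."""
    return sum(CPI[cls] * n for cls, n in counts.items())


def average_cpi(counts):
    """Average CPI for the whole sequence: CC / IC."""
    return total_cycles(counts) / instruction_count(counts)


for name, seq in (("seq1", SEQ1), ("seq2", SEQ2)):
    print(name, instruction_count(seq), "instr,",
          total_cycles(seq), "cycles, CPI =", average_cpi(seq))
```

With the same clock on both runs, the cycle count from part (ii) is the quantity that decides execution time; instruction count and average CPI taken alone can each point the wrong way.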

Importance of Benchmarks

• How do we evaluate and compare the performance of different architectures?
—We use benchmarks: programs that are specifically chosen to measure performance.
– A workload is a set of programs.
– Benchmarks consist of workloads that (the user hopes) will predict the performance of the actual workload.
– It is important that benchmarks consist of realistic workloads, not simple toy programs or code fragments.
– Manufacturers often try to fine-tune their machines to do well on popular benchmarks that are too simple.
– This does not always mean the machine will do well on real programs!

SPEC benchmark

• A popular source of benchmarks is SPEC
—Standard Performance Evaluation Corporation

• General CPU benchmarks: CPU2000.
—Includes programs such as:
– gzip (compression), vpr (FPGA place & route), gcc (compiler), crafty (chess), vortex (database)

• SPEC also offers specialized benchmarks for:
—Graphics, parallel computing, Java, mail servers, network fileservers, web servers

• They publish reports on benchmark results for various systems.
—Main metric: SPECRatio
– Proportional to average inverse execution time. The bigger, the better!

• Reproducibility of results is very important!

Summarizing Performance

• How do we summarize performance in a way that accurately compares different machines?
—One common approach: Total Execution Time (TET)
– Based on:

$$\frac{\text{Perf}_A}{\text{Perf}_B} = \frac{\text{ET}_B}{\text{ET}_A}$$

—Or, if the workload includes n different programs, we can calculate the average or Arithmetic Mean (AM):

$$AM = \frac{1}{n} \sum_{i=1}^{n} \text{time}_i$$

– Smaller AM ⇒ improved performance

—Other methods are also used:
– Weighted arithmetic mean, geometric mean ratio.

Performance Improvement

• Recall the formula: CPUTime = IC × CPI / fcc.
—Thus, CPU performance is Perf = fcc / (IC × CPI).

• Thus we can see 3 basic ways to improve CPU performance on a given task:
—Increase clock frequency
—Decrease CPI
– By improved processor organization
—Decrease instruction count
– By compiler enhancement,
– change in ISA design (new instructions), or
– a more efficient application algorithm.

• However, we have to be careful!
—Sometimes, improving one of these can hurt others!

Generalized Cost Measures

• In this course, we will often be focusing on ways to minimize execution time of programs.
—Either CPU time, or number of clock cycles.

• Execution time is one example of what we may call a generalized cost measure (GCM).
—A GCM is any property of a HW/SW design that tells us how much of some valued resource is used up when the system is manufactured or used.

• Other examples of important GCMs include:
—Energy consumed by a computation
—Silicon chip area used up by a circuit design
—Dollar cost to manufacture a computer component

• We will study some general engineering principles that apply to the minimization of any GCM in any system.

Additive Cost Measures

• Let us suppose we have a GCM C for a system.
• Many times, the total cost C can be represented as a sum of independent cost components:
—E.g., C = C1 + C2 + … + Cn, or

$$C = \sum_{i=1}^{n} C_i$$

• These could correspond to the resources used by individual subsystems of the whole system.
—Or, used in doing particular categories of tasks.

• For example, execution time T can be broken down as the sum of time Tfp taken by floating-point instructions and the time Toth for others.
—That is, T = Tfp + Toth.

Improving Part of a System

• Suppose a GCM is broken down as C = A + B.
—The total cost is the sum of two components A & B.

• Now suppose you are considering making an improvement to the system design that affects only cost component B.
—Suppose you reduce it by a factor f, to B′ = B/f.

• The new total cost is then C′ = A + B′.
—The cost of component A is unaffected.

• Overall (total) cost has therefore been reduced by the factor:

$$f_{\text{overall}} = \frac{C}{C'} = \frac{A + B}{A + B'} = \frac{A + B}{A + B/f}$$

Diminishing Returns

• Suppose we continue improving (reducing) a cost component by larger and larger factors.
—Does this mean the system’s total cost will be reduced by correspondingly large factors? NO!

• Even if we improved one cost component (B in our example) by a factor of f = ∞, note that:

$$f_{\text{overall,max}} = \lim_{f \to \infty} \frac{A + B}{A + B/f} = \frac{A + B}{A + 0} = \frac{A + B}{A} = 1 + \frac{B}{A}$$

• Even here, the overall cost reduction factor $f_{\text{overall}}$ would still be only the finite value 1 + B/A!
—The system can only be improved by at most this factor, if we improve just the one component B.

Diminishing Returns Example

• Suppose a particular chip contains B = 1 cm² of logic circuits, and A = 2 cm² of cache memory.
—The total cost (in terms of area) is C = A + B = 3 cm².

• Now, let’s go crazy trying to simplify and shrink the design of just the logic circuit…
—What is the maximum factor by which this tactic can reduce the area cost of the whole design (logic + memory)?

• Obviously, this can reduce the total area from 3 cm² to no less than 2 cm² (area of memory alone),
—or, shrink it by a factor of $f_{\text{overall}}$ = 3/2 = 1.5.

• Note we could have obtained this same answer using the equation $f_{\text{overall,max}}$ = 1 + B/A as well.

[Figure: chip area divided into Logic (1 cm²) and Memory (2 cm²)]
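The ceiling of 1.5 in this chip example can also be seen numerically. The sketch below evaluates (A + B) / (A + B/f) for the chip's areas with a few increasingly aggressive shrink factors; the function name `overall_factor` and the particular values of f tried are arbitrary illustrative choices.

```python
def overall_factor(a, b, f):
    """Overall cost reduction when only component B is reduced by factor f.

    Implements f_overall = (A + B) / (A + B / f).
    """
    return (a + b) / (a + b / f)


A_MEMORY_CM2 = 2.0  # unchanged part (cache memory)
B_LOGIC_CM2 = 1.0   # part being shrunk (logic)

for f in (2, 10, 100, 1_000_000):
    reduction = overall_factor(A_MEMORY_CM2, B_LOGIC_CM2, f)
    print(f"logic shrunk {f:>9,}x -> whole chip shrunk {reduction:.4f}x")

# Upper bound from the limit f -> infinity: 1 + B/A.
limit = 1 + B_LOGIC_CM2 / A_MEMORY_CM2
print("upper bound:", limit)
```

Most of the attainable gain arrives early: going from f = 10 to f = 1,000,000 changes the overall factor only in the second decimal place.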

Graph Showing Diminishing Returns

[Graph not reproduced. Recovered labels: improvement factor ( f ); initial part/rest ratio (B/A).]

Important Lessons to Take from This

• It’s probably not worth spending significant design time extensively improving just a single component of a system,
—Unless that component accounts for a dominant part of the total cost (by some measure) to begin with (B/A >> 1).

• It’s only worth improving a given component up to the point where it is no longer dominant.
—Reducing it further won’t make a lot of difference.

• Therefore, all components with significant costs must be improved together in order to significantly improve an entire design.
—Well-engineered systems will tend to have roughly comparable costs in all of their major components.

Other Ways to Calculate foverall

• Earlier, we saw this formula for the overall improvement factor $f_{\text{overall}}$ resulting from improving component B by the factor f:

$$f_{\text{overall}} = \frac{A + B}{A + B/f}$$

• But, what if we don’t know the values of A and B?
—What if we only know their relative sizes?
– Fortunately, it turns out that we can still calculate $f_{\text{overall}}$.

• Let us define $\text{frac}_{\text{enh}}$ = B/C = B/(A+B) to be the fraction of the original total system cost that is accounted for by the particular part B that is going to be enhanced.
—Then, the fraction of cost accounted for by A (the rest of the system) is

$$1 - \text{frac}_{\text{enh}} = 1 - \frac{B}{A + B} = \frac{A + B}{A + B} - \frac{B}{A + B} = \frac{A}{A + B}$$

• Our equation for $f_{\text{overall}}$ can then be reexpressed in terms of the quantities $\text{frac}_{\text{enh}}$ and $1 - \text{frac}_{\text{enh}}$, as follows…

Calculating foverall in terms of fracenh

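• Let’s re-express $f_{\text{overall}}$ in terms of $\text{frac}_{\text{enh}}$:

$$f_{\text{overall}} = \frac{A + B}{A + B/f} = \frac{1}{\dfrac{A + B/f}{A + B}} = \frac{1}{\dfrac{A}{A + B} + \dfrac{B/f}{A + B}} = \frac{1}{(1 - \text{frac}_{\text{enh}}) + \dfrac{\text{frac}_{\text{enh}}}{f}}$$

• We will call this form for $f_{\text{overall}}$ the Generalized Amdahl’s Law. (We’ll see why in a moment.)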

Amdahl’s Law Proper

• We saw that execution time is one valid cost measure.
—In such a case, note that the factor by which a cost is reduced is the speedup, or the factor by which performance is improved.

• We thus rename the improvement factor f of B (the enhanced part) to $\text{speedup}_{\text{enh}}$, and the overall improvement factor $f_{\text{overall}}$ becomes $\text{speedup}_{\text{overall}}$, and we get:

$$\text{speedup}_{\text{overall}} = \frac{1}{(1 - \text{frac}_{\text{enh}}) + \dfrac{\text{frac}_{\text{enh}}}{\text{speedup}_{\text{enh}}}}$$

• This is called Amdahl’s Law, and it is one of the most widely hyped quantitative principles of processor design.
—But as we can see, it is not a special law of CPU architecture, but just an application of the universal engineering principle of diminishing returns which we discussed earlier.
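Amdahl's Law translates directly into a one-line function. The sketch below is a minimal illustration; the function name, the argument checks, and the sample inputs (40% of execution time enhanced by 10×) are hypothetical values chosen for the demonstration, not figures from the lecture.

```python
def amdahl_speedup(frac_enh, speedup_enh):
    """Overall speedup when a fraction frac_enh of execution time is sped up by speedup_enh.

    speedup_overall = 1 / ((1 - frac_enh) + frac_enh / speedup_enh)
    """
    if not 0.0 <= frac_enh <= 1.0:
        raise ValueError("frac_enh must lie in [0, 1]")
    if speedup_enh <= 0:
        raise ValueError("speedup_enh must be positive")
    return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh)


# Hypothetical case: 40% of the run time is enhanced by a factor of 10.
example = amdahl_speedup(frac_enh=0.4, speedup_enh=10)
print(f"overall speedup: {example:.4f}")

# The best possible result for that fraction, as speedup_enh grows without bound.
ceiling = 1.0 / (1.0 - 0.4)
print(f"ceiling for frac_enh = 0.4: {ceiling:.4f}")
```

Even a 10× enhancement of 40% of the work yields well under 2× overall, and no enhancement of that part alone can exceed the ceiling 1 / (1 − frac_enh).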

Key Points from This Module

• Throughput vs. Response Time
• Performance as Inverse Execution Time
• Speedup Factors
• Averaging Benchmark Results
• CPU Performance Equation:
—Execution time = IC × CPI × tcc
—Performance = fcc / (IC × CPI)
• Amdahl’s Law:
— C′ = A + B/f
— Implies:

$$\text{speedup}_{\text{overall}} = \frac{1}{(1 - \text{frac}_{\text{enh}}) + \dfrac{\text{frac}_{\text{enh}}}{\text{speedup}_{\text{enh}}}}$$

Where:
C′ = Execution time after improvement
B = Part of execution time affected by improvement
f = Factor of improvement (speedup of enhanced part)
A = Part of execution time unaffected by improvement

Example Performance Calculation

• Suppose a program takes 10 secs. on computer A
—And suppose computer A has a 4 GHz clock

• Want new computer B to run the program in 6 seconds.
—Suppose that increasing the clock speed is only possible with a substantial processor redesign,
– which will result in 1.2× as many clock cycles being needed to execute the program.

• What clock rate is needed?
— Answer: 4 GHz × (10/6) × 1.2 = 8 GHz
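The answer follows from CPUTime = clock cycles / clock rate, solved for the clock rate. The sketch below spells out the intermediate cycle counts; the function and parameter names are illustrative assumptions.

```python
def required_clock_rate_hz(old_rate_hz, old_time_s, new_time_s, cycle_growth):
    """Clock rate needed to hit new_time_s when the redesign multiplies cycles by cycle_growth."""
    old_cycles = old_time_s * old_rate_hz   # cycles = CPUTime x clock rate
    new_cycles = old_cycles * cycle_growth  # redesign needs more cycles
    return new_cycles / new_time_s          # clock rate = cycles / CPUTime


rate_b = required_clock_rate_hz(
    old_rate_hz=4e9,   # computer A: 4 GHz
    old_time_s=10.0,   # program takes 10 s on A
    new_time_s=6.0,    # target on B
    cycle_growth=1.2,  # B needs 1.2x as many cycles
)
print(f"required clock rate: {rate_b / 1e9:.1f} GHz")
```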

Another Example

• Consider two different implementations of a given ISA, running a given benchmark:
— Processor A has a cycle time of 250 ps
– And a CPI of 2.0
— Processor B has a cycle time of 500 ps
– And a CPI of 1.2

• Which computer is faster on this benchmark, and by what factor?
— Processor A takes 250 ps × 2.0 = 500 ps / instr.
— Processor B takes 500 ps × 1.2 = 600 ps / instr.
— Thus, A is faster by a factor of 6/5 = 1.2×.

Another example

• Suppose some Java application takes 15 seconds on a certain machine.

• A new Java compiler is released that requires only 0.6 as many dynamic instructions to run the application.
— Unfortunately, it also increases the CPI by 1.1×
– Presumably, it uses more multi-cycle instructions.

• How fast will the application run when compiled using the new compiler?
—It will take 15 × 0.6 × 1.1 = 9.9 seconds to run
—It will be 15/9.9 = 50/33 = 1.515…× faster
– Only slightly more than 50% faster than before.
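Because CPUTime = IC × CPI / clock rate, a change to any one factor scales the execution time multiplicatively. The sketch below applies that scaling to the compiler example; the helper name and the optional `clock_factor` parameter (left at 1.0 since the machine is unchanged) are illustrative assumptions.

```python
def scaled_execution_time(old_time_s, ic_factor, cpi_factor, clock_factor=1.0):
    """New CPU time after scaling instruction count, CPI, and clock rate.

    Follows from CPUTime = IC x CPI / clock rate.
    """
    return old_time_s * ic_factor * cpi_factor / clock_factor


old_time = 15.0
new_time = scaled_execution_time(old_time, ic_factor=0.6, cpi_factor=1.1)
speedup = old_time / new_time
print(f"new time: {new_time:.1f} s, speedup: {speedup:.3f}x")
```

The 40% reduction in instruction count does not turn into a 1/0.6 ≈ 1.67× speedup, because the higher CPI gives part of the gain back.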

Another Example

• Consider the following measurements of execution time:

Program    Computer A    Computer B
1          2 sec.        4 sec.
2          5 sec.        2 sec.

• Which of the following statements are true?
— A is faster than B for program 1.
— A is faster than B for program 2.
— A is faster than B for a workload with equal numbers of executions of programs 1 and 2.
— A is faster than B for a workload with twice as many executions of program 1 as of program 2.
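Each statement can be tested by computing the total execution time of the corresponding workload on both computers. The sketch below is one way to do that; the data layout and the function name `total_time` are illustrative assumptions, while the timings are the ones in the table.

```python
# Execution time in seconds for each program on each computer, from the table.
TIMES = {
    "A": {1: 2.0, 2: 5.0},
    "B": {1: 4.0, 2: 2.0},
}


def total_time(computer, runs):
    """Total time for a workload given as {program: number of executions}."""
    return sum(TIMES[computer][prog] * count for prog, count in runs.items())


WORKLOADS = {
    "program 1 only": {1: 1},
    "program 2 only": {2: 1},
    "equal mix": {1: 1, 2: 1},
    "2x program 1": {1: 2, 2: 1},
}

for label, runs in WORKLOADS.items():
    t_a, t_b = total_time("A", runs), total_time("B", runs)
    winner = "A" if t_a < t_b else "B"
    print(f"{label:15s} A = {t_a:4.1f} s, B = {t_b:4.1f} s -> {winner} is faster")
```

The winner changes with the mix of programs, which is why a single summary number has to state the workload weighting it assumes.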
