Purchasing perspective: given a collection of machines, which has the
• best performance?
• least cost?
• best cost/performance?
Design perspective: faced with design options, which has the
• best performance improvement?
• least cost?
• best cost/performance?
Both require a basis for comparison and a metric for evaluation.
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors.
Performance Metrics
CS210_305_05/2
Assessing and Understanding Performance
Defining (Speed) Performance
Normally we are interested in reducing one of two things:
Response time (aka execution time): the time between the start and the completion of a task
• Important to individual users
• Thus, to maximize performance, we need to minimize execution time

$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

If X is n times faster than Y, then

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

Throughput: the total amount of work done in a given time
• Important to data center managers
• Decreasing response time almost always improves throughput
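A minimal Python sketch of the "n times faster" ratio for response time. The function name and the two timings are illustrative assumptions, not values from the slides.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that machine X is n times faster than machine Y.

    Performance is defined as 1 / execution time, so the performance
    ratio X/Y equals the execution-time ratio Y/X.
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Hypothetical timings: X finishes the task in 10 s, Y needs 15 s.
n = relative_performance(exec_time_x=10.0, exec_time_y=15.0)
print(f"X is {n:.2f} times faster than Y")
```

A value of n below 1 would mean that X is the slower machine.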
Performance Factors
We want to distinguish elapsed time from the time spent on our task.
CPU execution time (CPU time): the time the CPU spends working on a task
• Does not include time waiting for I/O or running other programs

$$\text{CPU time}_{\text{program}} = \text{CPU clock cycles}_{\text{program}} \times \text{Clock cycle time}$$

or

$$\text{CPU time}_{\text{program}} = \frac{\text{CPU clock cycles}_{\text{program}}}{\text{Clock rate}}$$

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Review: Machine Clock Rate
Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period):

$$\text{Clock cycle} = \frac{1}{\text{Clock rate}} \qquad \text{Clock rate} = \frac{1}{\text{Clock cycle}}$$

[Figure: clock waveform with one clock period marked]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
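The period/rate pairs listed on this slide can be checked with a short Python sketch. The helper names are assumptions made for illustration; periods are kept in whole picoseconds so the arithmetic stays exact.

```python
PS_PER_SECOND = 1_000_000_000_000  # picoseconds in one second


def clock_rate_hz(period_ps: int) -> float:
    """Clock rate in Hz for a clock period given in picoseconds."""
    return PS_PER_SECOND / period_ps


def clock_period_ps(rate_hz: float) -> float:
    """Clock period in picoseconds for a clock rate given in Hz."""
    return PS_PER_SECOND / rate_hz


# The six clock periods from the slide, expressed in picoseconds.
for period_ps in (10_000, 5_000, 2_000, 1_000, 500, 250):
    rate_mhz = clock_rate_hz(period_ps) / 1e6
    print(f"{period_ps:>6} ps clock cycle => {rate_mhz:g} MHz clock rate")
```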
Clock Cycles per Instruction (CPI)
Not all instructions take the same amount of time to execute.
• One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

$$\frac{\text{clock cycles}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{average clock cycles}}{\text{instruction}}$$

Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
• A way to compare two different implementations of the same ISA

         CPI for this instruction class
         A    B    C
CPI      1    2    3
Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging:

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

where
• IC_i is the count of instructions of class i executed, expressed as a fraction (percentage) of all instructions executed
• CPI_i is the (average) number of clock cycles per instruction for that instruction class
• n is the number of instruction classes
The overall effective CPI varies by instruction mix, a measure of the dynamic frequency of instructions across one or many programs.
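A small Python sketch of the effective-CPI sum, starting from raw per-class instruction counts and normalising them to fractions. The per-class CPIs (A=1, B=2, C=3) are the ones from the earlier CPI slide; the instruction counts are made-up numbers for illustration only.

```python
def effective_cpi(counts: dict, cpis: dict) -> float:
    """Overall effective CPI = sum over classes of CPI_i * IC_i,
    where IC_i is the fraction of executed instructions in class i."""
    total_instructions = sum(counts.values())
    if total_instructions == 0:
        raise ValueError("no instructions executed")
    total_cycles = sum(counts[cls] * cpis[cls] for cls in counts)
    return total_cycles / total_instructions


cpis = {"A": 1, "B": 2, "C": 3}             # CPI per instruction class
counts = {"A": 5000, "B": 3000, "C": 2000}  # hypothetical dynamic counts

print(f"effective CPI = {effective_cpi(counts, cpis):.2f}")
```

Dividing total cycles by total instructions is the same weighted average as the formula above, since each count divided by the total is that class's fraction IC_i.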
THE Performance Equation
Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{Clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{Clock\_rate}}$$

These equations separate the three key factors that affect performance:
• Can measure the CPU execution time by running the program
• The clock rate is usually given
• Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
• CPI varies by instruction type and ISA implementation, for which we must know the implementation details
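Both forms of the performance equation can be sketched in Python as below. The function names and the sample machine (10^9 instructions, CPI 2.2, 500 MHz) are assumptions chosen only to show that the two forms agree.

```python
def cpu_time_from_rate(instruction_count: float, cpi: float,
                       clock_rate_hz: float) -> float:
    """CPU time = Instruction_count x CPI / Clock_rate (seconds)."""
    return instruction_count * cpi / clock_rate_hz


def cpu_time_from_cycle(instruction_count: float, cpi: float,
                        clock_cycle_s: float) -> float:
    """CPU time = Instruction_count x CPI x Clock_cycle (seconds)."""
    return instruction_count * cpi * clock_cycle_s


# Hypothetical program and machine.
ic, cpi, rate_hz = 1e9, 2.2, 500e6
t_rate = cpu_time_from_rate(ic, cpi, rate_hz)
t_cycle = cpu_time_from_cycle(ic, cpi, 1 / rate_hz)
print(f"{t_rate:.2f} s via clock rate, {t_cycle:.2f} s via clock cycle")
```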
Determinants of CPU Performance

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{Clock\_cycle}$$

                         Instruction count    CPI    Clock cycle
Algorithm                        X             X
Programming language             X             X
Compiler                         X             X
ISA                              X             X          X
Processor organization                         X          X
Technology                                                X
A Simple Example

Op       Freq   CPIi   Freq x CPIi     Q1     Q2     Q3
ALU      50%    1         .5           .5     .5     .25
Load     20%    5        1.0           .4    1.0    1.0
Store    10%    3         .3           .3     .3     .3
Branch   20%    2         .4           .4     .2     .4
                       = 2.2          1.6    2.0    1.95

Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
• CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
• CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
Q3: What if two ALU instructions could be executed at once?
• CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
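The three answers can be reproduced with a short Python sketch. The instruction mix and CPIs are taken from the table on this slide; the helper names are illustrative, and modelling Q3 as an ALU CPI of 0.5 is the same assumption the slide makes (its ALU contribution drops from .5 to .25).

```python
# Instruction mix from the slide: op -> (frequency, CPI)
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def effective_cpi(mix: dict) -> float:
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict, op: str, new_cpi: float) -> dict:
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, new_cpi)
    return changed


def percent_faster(old_cpi: float, new_cpi: float) -> float:
    """IC and clock cycle are unchanged, so speedup = old CPI / new CPI."""
    return (old_cpi / new_cpi - 1) * 100


base = effective_cpi(BASE_MIX)
scenarios = {
    "Q1 (load takes 2 cycles)": with_cpi(BASE_MIX, "Load", 2),
    "Q2 (branch takes 1 cycle)": with_cpi(BASE_MIX, "Branch", 1),
    "Q3 (two ALU ops at once)": with_cpi(BASE_MIX, "ALU", 0.5),
}
for name, mix in scenarios.items():
    new = effective_cpi(mix)
    print(f"{name}: CPI {new:.2f}, {percent_faster(base, new):.1f}% faster")
```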
Comparing and Summarizing Performance
The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, and the specific computer configuration, such as clock rate, cache sizes and speed, memory size and speed).
How do we summarize the performance for a benchmark set with a single number?
• The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

• Where Time_i is the execution time for the ith program of a total of n programs in the workload
• A smaller mean indicates a smaller average execution time and thus improved performance
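As a brief Python sketch, the arithmetic mean of a workload's execution times can be computed as below. The two sets of timings are invented for illustration; they are not SPEC results.

```python
def arithmetic_mean(times_s) -> float:
    """AM = (1/n) * sum of Time_i over the n programs in the workload."""
    times_s = list(times_s)
    if not times_s:
        raise ValueError("workload must contain at least one program")
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) for the same three programs.
machine_a = [2.0, 10.0, 6.0]
machine_b = [3.0, 7.0, 5.0]

print(f"AM(A) = {arithmetic_mean(machine_a):.1f} s, total {sum(machine_a):.1f} s")
print(f"AM(B) = {arithmetic_mean(machine_b):.1f} s, total {sum(machine_b):.1f} s")
```

Because n is the same for both machines, ranking by AM is the same as ranking by total execution time.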
SPEC Benchmarks www.spec.org
Integer benchmarks                 FP benchmarks
gzip   compression                 wupwise   Quantum chromodynamics
vpr    FPGA place & route          swim      Shallow water model
gcc    GNU C compiler              mgrid     Multigrid solver in 3D fields
Power consumption: especially in the embedded market, where battery life (and cooling) is important
• For power-limited applications, the most important metric is energy efficiency
Other Performance Metrics - (Native) MIPS
(Native) MIPS - and What Is Wrong with Them
The dangers of using metrics other than time in performance measurement can be shown by looking at several popular alternatives.
One such alternative is MIPS - Millions of Instructions Per Second:

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

The second form is sometimes convenient since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,

$$\text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^6}$$
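The three MIPS relations can be sketched in Python as follows. The function names and the sample values (1.1 x 10^9 instructions, CPI 2.2, 500 MHz) are assumptions for illustration.

```python
def mips_from_time(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = Instruction count / (Execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time_s(instruction_count: float, mips: float) -> float:
    """Execution time = Instruction count / (MIPS x 10^6)."""
    return instruction_count / (mips * 1e6)


# Hypothetical program and machine.
ic, cpi, rate_hz = 1.1e9, 2.2, 500e6
time_s = ic * cpi / rate_hz  # from the basic performance equation

print(f"MIPS from time:  {mips_from_time(ic, time_s):.2f}")
print(f"MIPS from clock: {mips_from_clock(rate_hz, cpi):.2f}")
print(f"time recovered:  {execution_time_s(ic, mips_from_clock(rate_hz, cpi)):.2f} s")
```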
The Problem With MIPS
The problem with using MIPS as a measure of comparison is threefold:
• MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets;
• MIPS varies between programs on the same computer; and most importantly,
• MIPS can vary inversely to performance!
The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Machines with the option yield faster executing programs yet have a lower MIPS rating: a floating-point instruction takes more clock cycles than the simple integer instructions that software floating-point routines execute in far greater numbers.
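A hedged numeric illustration of the floating-point case, in Python. All numbers here (the 100 MHz clock, the instruction counts and the CPIs) are invented to show the effect; they do not describe any real machine.

```python
CLOCK_RATE_HZ = 100e6  # hypothetical 100 MHz machine


def run(instruction_count: float, cpi: float):
    """Return (execution time in seconds, native MIPS) for one program run."""
    time_s = instruction_count * cpi / CLOCK_RATE_HZ
    mips = instruction_count / (time_s * 1e6)
    return time_s, mips


# Same hypothetical program, two configurations.
# With FP hardware: few instructions, but multi-cycle FP instructions.
hw_time, hw_mips = run(instruction_count=10e6, cpi=4.0)
# With software FP routines: many simple integer instructions, low CPI.
sw_time, sw_mips = run(instruction_count=60e6, cpi=1.5)

print(f"FP hardware: {hw_time:.2f} s at {hw_mips:.1f} MIPS")
print(f"FP software: {sw_time:.2f} s at {sw_mips:.1f} MIPS")
```

Under these assumed numbers the hardware configuration finishes sooner while reporting the lower MIPS rating.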
Peak MIPS
Beware of so-called peak MIPS. This type of rating is obtained by choosing an instruction mix that minimises the CPI even if that instruction mix is totally impractical. For instance:
• A program composed entirely of arithmetic and logic operations but no jumps, branches, or load/stores!
• Or, as in a famous case, a program consisting only of NOPs (No OPerations)
In other words: Peak MIPS - a level of performance that will never be attained ;-)
Relative MIPS
MIPS can fail to give a true picture of performance in that it does not track execution time. An alternative type of MIPS rating is relative MIPS - as opposed to native MIPS - derived by using a particular machine as a reference point:

$$\text{Relative MIPS} = \frac{\text{Time}_{\text{reference}}}{\text{Time}_{\text{unrated}}} \times \text{MIPS}_{\text{reference}}$$

• Time_reference = Execution time of a program on a reference machine
• Time_unrated = Execution time of the same program on a machine to be rated
• MIPS_reference = Agreed-upon MIPS rating of the reference machine
The advantage of this form of MIPS is small since execution time, program, and program input still must be known to have meaningful information.
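A minimal Python sketch of the relative MIPS formula. The function name and the sample timings and reference rating are hypothetical.

```python
def relative_mips(time_reference_s: float, time_unrated_s: float,
                  mips_reference: float) -> float:
    """Relative MIPS = (Time_reference / Time_unrated) x MIPS_reference."""
    if time_unrated_s <= 0:
        raise ValueError("execution time must be positive")
    return time_reference_s / time_unrated_s * mips_reference


# Hypothetical: the reference machine is rated at 1 MIPS and runs the
# program in 100 s; the machine being rated runs it in 25 s.
rating = relative_mips(time_reference_s=100.0, time_unrated_s=25.0,
                       mips_reference=1.0)
print(f"relative MIPS = {rating:.1f}")
```

The rating is just the execution-time ratio scaled by a constant, so it carries no information beyond the two measured times.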
Summary: Evaluating ISAs
Design-time metrics:
• Can it be implemented, in how long, at what cost?
• Can it be programmed? Ease of compilation?
Static Metrics:
• How many bytes does the program occupy in memory?
Dynamic Metrics:
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?
Best Metric: Time to execute the program!
• Depends on the instruction set, the processor organization, and compilation techniques.
[Figure: Inst. Count, CPI, Cycle Time]
Fallacies and Pitfalls
Pitfall: Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement.
• Example: Suppose a program takes 100 seconds to run and multiply operations account for 80 seconds of this time. How much do you need to improve the speed of multiplication to make the program run five times faster?
A: Using Amdahl's Law:

$$\text{Execution time after improvement} = \left( \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected} \right)$$
Fallacies and Pitfalls
…A: Using Amdahl's Law (and the numbers from the problem):

$$\text{Execution time after improvement} = \left( \frac{80\ \text{seconds}}{n} + (100 - 80)\ \text{seconds} \right)$$

To get 5 times faster the new execution time must be 20 seconds:

$$20\ \text{seconds} = \left( \frac{80\ \text{seconds}}{n} + 20\ \text{seconds} \right)$$

$$0 = \frac{80\ \text{seconds}}{n}$$

I.e. there is no amount by which we can enhance multiply to achieve a fivefold improvement in execution time!
Making the common case fast will tend to enhance performance better than optimising the rare case.
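The Amdahl's Law calculation above can be explored with a short Python sketch. The 80 s / 20 s split is from the example; the function name and the list of improvement factors tried are illustrative assumptions.

```python
def time_after_improvement(affected_s: float, unaffected_s: float,
                           improvement: float) -> float:
    """Amdahl's Law: affected time shrinks by the improvement factor,
    unaffected time stays the same."""
    if improvement <= 0:
        raise ValueError("improvement factor must be positive")
    return affected_s / improvement + unaffected_s


AFFECTED_S, UNAFFECTED_S = 80.0, 20.0   # multiply time, everything else
ORIGINAL_S = AFFECTED_S + UNAFFECTED_S  # 100 seconds in total

# Try ever larger speedups of the multiply operations.
for n in (2, 4, 10, 100, 1_000_000):
    new_time = time_after_improvement(AFFECTED_S, UNAFFECTED_S, n)
    print(f"multiply {n:>7}x faster: {new_time:8.4f} s, "
          f"overall speedup {ORIGINAL_S / new_time:.3f}x")
```

However large n becomes, the new time stays above the 20 unaffected seconds, so the overall speedup approaches but never reaches 5.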
Fallacies and Pitfalls
Pitfall: Comparing computers using only one or two of three performance metrics: clock rate, CPI, and instruction count.
Pitfall: Using peak performance to compare machines.
Fallacy: Synthetic benchmarks predict performance.
• Synthetic benchmarks are small artificial programs that attempt to represent the execution frequency of statements found in a larger set of benchmarks or in real-world programs. Whetstone and Dhrystone are examples.
• Since these are not natural programs, they can (be used to) distort performance statistics by, e.g.:
  • compilers discarding large sections of the code!
  • compilers targeting other optimisation 'opportunities' specifically for a benchmark and, hence, artificially inflating the performance stats. E.g. a 20% to 30% improvement by using a string copy 'optimisation' for Dhrystone that could not be applied in over 99% of normal programs!!
Concluding Remarks
The task a computer designer faces is a complex one: determine what attributes are important for a new machine, then design a machine to maximise performance while staying within cost constraints. Performance can be measured as throughput or response time; which of the two matters depends on the environment/application and should be borne in mind.
Amdahl's Law is a valuable tool to determine what performance improvement an architectural enhancement may give.
Knowing what cases are the most frequent is critical to improving performance. Based on empirical studies of instruction sets, tradeoffs can be made by deciding which instructions are the most important and what cases to try to make fast.
Computer designs will always be measured by cost and performance and finding the best balance will always be the art of computer design, just as in any engineering task.