Lec 2 Performance

7/31/2019 Lec 2 Performance


Processor Performance

Ajit Pal

Professor, Department of Computer Science and Engineering

Indian Institute of Technology Kharagpur

INDIA-721302

High Performance Computer Architecture



Outline

Introduction

Defining Performance

The Iron Law of Processor Performance

Processor performance enhancement

Performance Evaluation Approaches

Performance Reporting

Amdahl's Law



Introduction

Performance measurement is important:

Helps us determine whether one processor (or computer) works faster than another

Helps us quantify how much performance has improved after incorporating a performance enhancement feature

Helps to see through the marketing hype!

Provides answers to the following questions:

Why is some hardware better than others for different programs?

What factors affect system performance: hardware, OS, or compiler?

How does the machine's instruction set affect performance?




Defining Performance in Terms of Time

Time is the final measure of computer performance

A computer exhibits higher performance if it executes programs faster

Response time (elapsed time, latency): an individual user's concern

How long does it take for my job to run?

How long does it take to execute my job, from start to finish?

How long must I wait for the database query?

Throughput: a systems manager's concern

How many jobs can the machine run at once?

What is the average execution rate?

How much work is getting done?




Execution Time

Elapsed Time

counts everything (disk and memory accesses, waiting for I/O, running other programs, etc.) from start to finish

a useful number, but often not good for comparison purposes

elapsed time = CPU time + wait time (I/O, other programs, etc.)

CPU time

doesn't count waiting for I/O or time spent running other programs

can be divided into user CPU time and system CPU time (OS calls)

CPU time = user CPU time + system CPU time

elapsed time = user CPU time + system CPU time + wait time

Our focus: user CPU time (CPU execution time or, simply, execution time): the time spent executing the lines of code that are in our program
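The split between elapsed, user CPU, and system CPU time can be observed directly. The following is a minimal sketch, assuming Python's standard `os.times()` and `time.perf_counter()`; the helper names `measure` and `busy_then_wait` are hypothetical and only serve the illustration.

```python
import os
import time


def measure(work):
    """Run work() once and return elapsed, user CPU, and system CPU seconds."""
    cpu_before = os.times()            # cumulative user/system CPU time of this process
    wall_before = time.perf_counter()  # wall-clock (elapsed) timer
    work()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return {
        "elapsed": wall_after - wall_before,
        "user": cpu_after.user - cpu_before.user,
        "system": cpu_after.system - cpu_before.system,
    }


def busy_then_wait():
    """Hypothetical workload: some computation followed by a wait."""
    sum(i * i for i in range(200_000))  # CPU-bound part: mostly user CPU time
    time.sleep(0.05)                    # waiting: adds to elapsed time only


if __name__ == "__main__":
    t = measure(busy_then_wait)
    wait = t["elapsed"] - (t["user"] + t["system"])
    print(f"elapsed={t['elapsed']:.3f}s user={t['user']:.3f}s "
          f"system={t['system']:.3f}s wait~{wait:.3f}s")
```

The sleep shows up in the elapsed time but not in either CPU time, which is why elapsed time is often a poor basis for comparing processors.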




Measuring Performance

For some program running on machine X:

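$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

"X is n times faster than Y" means:

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$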
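As a quick check of the definition, here is a small sketch that computes n from two execution times. The function names and the 10 s and 15 s timings are assumptions chosen for illustration, not measurements from the lecture.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """Return n such that machine X is n times faster than machine Y."""
    return performance(time_x_s) / performance(time_y_s)


if __name__ == "__main__":
    # Hypothetical timings: X takes 10 s, Y takes 15 s for the same program.
    n = times_faster(10.0, 15.0)
    print(f"X is {n:.2f} times faster than Y")
```

Note that the ratio of performances equals the inverse ratio of execution times, so the slower machine's time goes in the numerator.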




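The Iron Law of Processor Performance

Processor performance is measured by the execution time per program:

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

Term                    Label          Stage             Responsible
Instructions/Program    (code size)    Architecture      Compiler Designer
Cycles/Instruction      (CPI)          Implementation    Processor Designer
Time/Cycle              (cycle time)   Realization       Chip Designer

Architecture --> Implementation --> Realization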




Instructions/Program (instruction count)
  Instructions executed, not static code size
  Determined by algorithm, compiler, ISA

Cycles/Instruction (CPI)
  Determined by ISA and CPU organization
  Overlap among instructions reduces this term

Time/Cycle (cycle time)
  Determined by technology, organization, clever circuit design
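The product of the three terms can be sketched in a few lines. The numbers below are hypothetical and chosen only to show that lowering one term does not help if another term rises by more.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical design points, both with a 1 ns cycle time (1 GHz clock).
baseline = cpu_time(instruction_count=2_000_000, cpi=1.5, cycle_time_s=1e-9)
# 20% fewer instructions, but each instruction needs more cycles on average.
fewer_instructions = cpu_time(instruction_count=1_600_000, cpi=2.0, cycle_time_s=1e-9)

if __name__ == "__main__":
    print(f"baseline:           {baseline * 1e3:.2f} ms")
    print(f"fewer instructions: {fewer_instructions * 1e3:.2f} ms")
```

In this made-up case the second design executes fewer instructions yet takes longer, because the product is what determines execution time.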



Processor Performance Enhancement

All processor performance enhancement techniques boil down to reducing one or more of these three terms

Some techniques reduce one term without affecting the others:

Improved hardware technology

Compiler optimization techniques

Such performance optimization techniques are preferred

Some techniques reduce one of the terms but may increase others (the terms are inter-related):

A CISC ISA reduces instruction count but increases CPI

Loop unrolling reduces instruction count but increases CPI



MIPS and MFLOPS

Used extensively 30 years back.

MIPS: Millions of Instructions Processed per Second

MFLOPS: Millions of Floating-point Operations completed per Second

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Exec. Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$




Problems with MIPS

Three significant problems with using MIPS:

So severe that someone coined the expansion "Meaningless Information about Processing Speed"

Problem 1:

MIPS is instruction set dependent.

Problem 2:

MIPS varies between programs on the same computer.

Problem 3:

MIPS can vary inversely to performance!

Let's look at an example of why MIPS doesn't work




A MIPS Example

Consider the following computer:

Instruction counts (in millions) for each instruction class:

Code type     A (1 cycle)   B (2 cycles)   C (3 cycles)
Compiler 1    5             1              1
Compiler 2    10            1              1

The machine runs at 100 MHz.

Instruction A requires 1 clock cycle, instruction B requires 2 clock cycles, and instruction C requires 3 clock cycles.

$$\text{CPI} = \frac{\text{CPU Clock Cycles}}{\text{Instruction Count}} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times N_i}{\text{Instruction Count}}$$




A MIPS Example

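$$\text{CPI}_1 = \frac{[(5 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(5 + 1 + 1) \times 10^6} = \frac{10}{7} = 1.43$$

$$\text{MIPS}_1 = \frac{100\ \text{MHz}}{1.43} = 69.9$$

$$\text{CPI}_2 = \frac{[(10 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(10 + 1 + 1) \times 10^6} = \frac{15}{12} = 1.25$$

$$\text{MIPS}_2 = \frac{100\ \text{MHz}}{1.25} = 80.0$$

In each CPI fraction, the numerator is the cycle count and the denominator is the instruction count.

So, compiler 2 has a higher MIPS rating and should be faster?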




A MIPS Example

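Now let's compare CPU time:

$$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

$$\text{CPU Time}_1 = \frac{7 \times 10^6 \times 1.43}{100 \times 10^6} = 0.10\ \text{seconds}$$

$$\text{CPU Time}_2 = \frac{12 \times 10^6 \times 1.25}{100 \times 10^6} = 0.15\ \text{seconds}$$

Therefore program 1 is faster despite a lower MIPS rating!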
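The whole example can be reproduced with a short script. This is a sketch using the instruction counts and the 100 MHz clock given above; the names `analyze`, `COUNTS`, and `CYCLES_PER_CLASS` are made up for the illustration. Because it does not round CPI to 1.43 first, it reports a MIPS rating of 70.0 for compiler 1 instead of 69.9.

```python
CLOCK_RATE_HZ = 100e6                      # the machine runs at 100 MHz
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}
COUNTS = {                                 # instructions per class, per compiler
    "compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def analyze(counts):
    """Return (CPI, MIPS rating, CPU time in seconds) for one instruction mix."""
    instructions = sum(counts.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time_s = cycles / CLOCK_RATE_HZ    # same as instructions * cpi / clock rate
    return cpi, mips, cpu_time_s


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        cpi, mips, cpu_time_s = analyze(counts)
        print(f"{name}: CPI={cpi:.2f} MIPS={mips:.1f} CPU time={cpu_time_s:.2f} s")
```

Compiler 2 gets the higher MIPS rating only because it executes many more cheap instructions; its total cycle count, and hence its CPU time, is larger.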




Example: Calculating Overall CPI

Typical Instruction Mix

Operation   ISA    CPI(i)   Freq
ALU         50%    1        (40%)
Load        20%    2        (27%)
Store       10%    2        (13%)
Branch      20%    5        (20%)

Overall CPI = 1*0.4 + 2*0.27 + 2*0.13 + 5*0.2 = 2.2
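The overall CPI is a frequency-weighted sum of the per-class CPIs. The sketch below uses the CPI(i) and Freq values from the table; the names `MIX`, `overall_cpi`, and `cycle_share` are assumptions for illustration. It also reports each class's share of the total cycles, which the slide does not show.

```python
MIX = {                 # operation: (CPI(i), frequency used in the slide's sum)
    "ALU": (1, 0.40),
    "Load": (2, 0.27),
    "Store": (2, 0.13),
    "Branch": (5, 0.20),
}


def overall_cpi(mix):
    """Weighted sum of CPI(i) * frequency over all instruction classes."""
    return sum(cpi * freq for cpi, freq in mix.values())


def cycle_share(mix):
    """Fraction of all cycles spent in each instruction class."""
    total = overall_cpi(mix)
    return {op: cpi * freq / total for op, (cpi, freq) in mix.items()}


if __name__ == "__main__":
    print(f"overall CPI = {overall_cpi(MIX):.2f}")
    for op, share in cycle_share(MIX).items():
        print(f"{op:<7}{share:.0%} of cycles")
```

With these numbers, branches make up 20% of the instructions but close to half of the cycles, which suggests where an enhancement would pay off most.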




Five levels of Benchmarks

1. Real Applications

Examples: compilers/editors, scientific applications, graphics, etc.

Problem: portability, due to dependence on OS and compiler

2. Modified Applications

Real applications modified/tailored to improve portability or to test specific features of the CPU

3. Kernels

Programs that are much simpler than real applications

Kernels: small, key pieces of real applications

Examples: Livermore Loops (24 loop kernels), Linpack (linear algebra package)

Measuring Performance Using Benchmarks




Synthetic Benchmarks

4. Toy benchmarks

10 to 100 lines of simple programs

Easy to type and run on almost all computers

Examples: Quick sort, Merge sort, etc.

5. Synthetic Benchmarks

Basic principle: analyze the distribution of instructions over a large number of practical programs.

Synthesize a program that has the same instruction distribution as a typical program; it need not compute anything meaningful.

Dhrystone, Khornerstone, and Linpack are some of the older synthetic benchmarks




SPEC

A popular recent approach is to put together collections of benchmarks that measure the performance of a variety of applications

SPEC: System Performance Evaluation Cooperative

A non-profit organization (www.spec.org)

CPU-intensive benchmarks for evaluating the processor performance of workstations

Generations: SPEC89, SPEC92, SPEC95, and SPEC2000

SPEC2000 emphasizes memory system performance




SPEC

Sponsored by industry but independent and self-managed; trusted by code developers and machine vendors

Clear guides for testing, see www.spec.org

Regular updates (benchmarks are dropped and new ones added periodically according to relevance)

Specialized benchmarks for particular classes of applications

Can still be abused, by selective optimization!




SPEC History

First Round: SPEC CPU89

10 programs yielding a single number

Second Round: SPEC CPU92

SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point programs)

compiler flags can be set differently for different programs

Third Round: SPEC CPU95

new set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10 floating point programs)

single flag setting for all programs

Fourth Round: SPEC CPU2000

new set of programs: SPEC CINT2000 (12 integer programs) and SPEC CFP2000 (14 floating point programs)

single flag setting for all programs

programs in C, C++, Fortran 77, and Fortran 90




CINT2000

(Integer component of SPEC CPU2000)

Program        Language   What It Is
164.gzip       C          Compression
175.vpr        C          FPGA Circuit Placement and Routing
176.gcc        C          C Programming Language Compiler
181.mcf        C          Combinatorial Optimization
186.crafty     C          Game Playing: Chess
197.parser     C          Word Processing
252.eon        C++        Computer Visualization
253.perlbmk    C          PERL Programming Language
254.gap        C          Group Theory, Interpreter
255.vortex     C          Object-oriented Database
256.bzip2      C          Compression
300.twolf      C          Place and Route Simulator




CFP2000

(Floating point component of SPEC CPU2000)

Program        Language     What It Is
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution




SPEC CPU2000 Reporting

Refer to the SPEC website, www.spec.org, for documentation

Any measure that summarizes performance should reflect execution time

Single-number result: arithmetic mean or geometric mean of normalized ratios for each code in the suite

Weighted arithmetic mean summarizes performance while tracking execution time

Report a precise description of the machine (platform)

Report the compiler flag settings
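To make the single-number summary concrete, here is a minimal sketch that normalizes execution times against a reference machine and summarizes the ratios with both means. All timings are hypothetical, and the function names are assumptions for illustration; this is not SPEC's own tooling.

```python
import math


def normalized_ratios(reference_times, measured_times):
    """Ratio > 1 means the measured machine is faster than the reference."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical execution times in seconds for three benchmark programs.
    reference = [100.0, 200.0, 400.0]
    machine = [50.0, 100.0, 400.0]
    ratios = normalized_ratios(reference, machine)
    print("ratios:", ratios)
    print(f"arithmetic mean = {arithmetic_mean(ratios):.3f}")
    print(f"geometric mean  = {geometric_mean(ratios):.3f}")
```

The two means generally disagree, so a report should state which one was used alongside the machine description and the compiler flags.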




Amdahl's Law

Quantifies the overall performance gain due to an improvement in part of a computation.

Amdahl's Law:

The performance improvement gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.

$$\text{Speedup} = \frac{\text{Execution time for the task without enhancement}}{\text{Execution time for the task using enhancement}}$$




Amdahl's Law and Speedup

Speedup tells us how much faster a machine will run due to an enhancement.

To use Amdahl's Law, two things should be considered:

1st: Fraction of the computation time on the original machine that can use the enhancement

If a program executes in 30 seconds and 15 seconds of execution uses the enhancement, fraction = 15/30 = 1/2

2nd: Improvement gained by the enhancement

If the enhanced task takes 3.5 seconds and the original task took 7 seconds, we say the speedup is 2.




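Amdahl's Law Equations

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

Use the previous equation and solve for speedup:

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Don't just try to memorize these equations and plug numbers into them.

It's always important to think about the problem too!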
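The two equations translate directly into code. The sketch below is one possible formulation; the function names are hypothetical. Combining a fraction of 1/2 with an enhanced speedup of 2 in a single scenario is an assumption made for illustration, since the lecture introduces those two numbers in separate examples.

```python
def new_execution_time(old_time_s, fraction_enhanced, speedup_enhanced):
    """Execution time after enhancing a fraction of the original run time."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    unenhanced = 1.0 - fraction_enhanced            # part that runs as before
    enhanced = fraction_enhanced / speedup_enhanced  # part that runs faster
    return old_time_s * (unenhanced + enhanced)


def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Old execution time divided by new execution time."""
    return 1.0 / new_execution_time(1.0, fraction_enhanced, speedup_enhanced)


if __name__ == "__main__":
    # Assumed scenario: a 30 s program, half of which is sped up by 2x.
    print(f"new time = {new_execution_time(30.0, 0.5, 2.0):.1f} s")
    print(f"overall speedup = {overall_speedup(0.5, 2.0):.3f}")
    # Even an enormous enhancement cannot beat 1 / (1 - fraction).
    print(f"limit check = {overall_speedup(0.5, 1e9):.3f}")
```

The last line illustrates the point of the law: with only half of the time enhanced, the overall speedup stays below 2 however fast the enhanced mode becomes.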




Points to Remember

Processor performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

(code size) x (CPI) x (cycle time)

Terms are inter-related

Minimize time, which is the product, NOT isolated terms

Use of a benchmark suite to measure performance

Reporting by a single number



Thanks!
