Lec 2 Performance

Apr 04, 2018

himanshu_agra
  • 7/31/2019 Lec 2 Performance

    Processor Performance

    Ajit Pal

    Professor, Department of Computer Science and Engineering

    Indian Institute of Technology Kharagpur

    INDIA-721302

    High Performance Computer Architecture

  • 7/31/2019 Lec 2 Performance

    Outline

    Introduction

    Defining Performance

    The Iron Law of Processor Performance

    Processor performance enhancement

    Performance Evaluation Approaches

    Performance Reporting

    Amdahl's Law

  • 7/31/2019 Lec 2 Performance

    Introduction

    Performance measurement is important:

    Helps us determine whether one processor (or computer) works faster than another

    Helps us know how much performance improvement has taken place after incorporating some performance enhancement feature

    Helps to see through the marketing hype!

    Provides answers to the following questions:

    Why is some hardware better than others for different programs?

    What factors affect system performance? Hardware, OS, or compiler?

    How does the machine's instruction set affect performance?

  • 7/31/2019 Lec 2 Performance

    Defining Performance in Terms of Time

    Time is the final measure of computer performance

    A computer exhibits higher performance if it executes programs faster

    Response time (elapsed time, latency): individual user concerns

    How long does it take for my job to run?

    How long does it take to execute (start to finish) my job?

    How long must I wait for the database query?

    Throughput: systems manager concerns

    How many jobs can the machine run at once?

    What is the average execution rate?

    How much work is getting done?

  • 7/31/2019 Lec 2 Performance

    Execution Time

    Elapsed time

    Counts everything (disk and memory accesses, waiting for I/O, running other programs, etc.) from start to finish

    A useful number, but often not good for comparison purposes

    elapsed time = CPU time + wait time (I/O, other programs, etc.)

    CPU time

    Doesn't count waiting for I/O or time spent running other programs

    Can be divided into user CPU time and system CPU time (OS calls)

    CPU time = user CPU time + system CPU time

    elapsed time = user CPU time + system CPU time + wait time

    Our focus: user CPU time (CPU execution time or, simply, execution time): time spent executing the lines of code that are in our program

  • 7/31/2019 Lec 2 Performance

    Measuring Performance

    For some program running on machine X:

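    $$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

    "X is n times faster than Y" means:

    $$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$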
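The definition above can be checked with a short Python sketch. The function names and the 10 s and 15 s timings are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / execution_time_s


def times_faster(exec_time_x_s: float, exec_time_y_s: float) -> float:
    """Return n such that machine X is n times faster than machine Y."""
    return performance(exec_time_x_s) / performance(exec_time_y_s)


# Hypothetical timings: the same program takes 10 s on X and 15 s on Y.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Because performance is a reciprocal, the ratio of performances equals the inverse ratio of execution times, so the faster machine is the one with the smaller time.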

  • 7/31/2019 Lec 2 Performance

    The Iron Law of Processor Performance

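    Processor performance is governed by the time taken per program:

    $$\frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{(code size)}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{(CPI)}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{(cycle time)}}$$

    Architecture (compiler designer) --> Implementation (processor designer) --> Realization (chip designer)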

  • 7/31/2019 Lec 2 Performance

    The Iron Law of Processor Performance

    Instructions/Program (instruction count)

    Instructions executed, not static code size

    Determined by algorithm, compiler, ISA

    Cycles/Instruction (CPI)

    Determined by ISA and CPU organization

    Overlap among instructions reduces this term

    Time/Cycle (cycle time)

    Determined by technology, organization, clever circuit design
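As a minimal sketch of how the three terms combine, the Python below multiplies them to get CPU time. The function name and every number are assumptions made up for illustration; none of them come from the lecture.

```python
def cpu_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 2 billion executed instructions, CPI of 1.5,
# and a 1 ns cycle time (a 1 GHz clock).
baseline = cpu_time_s(2e9, 1.5, 1e-9)

# Hypothetical change: 20% fewer instructions, but the CPI rises to 2.0.
changed = cpu_time_s(1.6e9, 2.0, 1e-9)

print(f"baseline: {baseline:.2f} s, changed: {changed:.2f} s")
```

In this made-up case the instruction count drops, yet the total time grows, because only the product of the three terms matters.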

  • 7/31/2019 Lec 2 Performance

    Processor Performance Enhancement

    All processor performance enhancement techniques boil down to reducing one or more of these three terms

    Some techniques can reduce one term without affecting the others

    Improved hardware technology

    Compiler optimization techniques

    Such performance optimization techniques are preferred

    Some techniques can reduce one of the terms, but may increase other terms (inter-related)

    CISC ISA reduces instruction count but increases CPI

    Loop unrolling reduces instruction count but increases CPI

  • 7/31/2019 Lec 2 Performance

    MIPS and MFLOPS

    Used extensively 30 years back.

    MIPS: Millions of Instructions Processed per Second

    MFLOPS: Millions of Floating-point Operations completed per Second

    $$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Exec. Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$

  • 7/31/2019 Lec 2 Performance

    Problems with MIPS

    Three significant problems with using MIPS:

    So severe that someone coined the expansion: Meaningless Information about Processing Speed

    Problem 1: MIPS is instruction set dependent.

    Problem 2: MIPS varies between programs on the same computer.

    Problem 3: MIPS can vary inversely to performance!

    Let's look at an example of why MIPS doesn't work

  • 7/31/2019 Lec 2 Performance

    A MIPS Example

    Consider the following computer:

    Instruction counts (in millions) for each instruction class:

    Code type     A (1 cycle)   B (2 cycles)   C (3 cycles)
    Compiler 1    5             1              1
    Compiler 2    10            1              1

    The machine runs at 100 MHz.

    Instruction A requires 1 clock cycle, instruction B requires 2 clock cycles, and instruction C requires 3 clock cycles.

    $$\text{CPI} = \frac{\text{CPU Clock Cycles}}{\text{Instruction Count}} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times N_i}{\text{Instruction Count}}$$

  • 7/31/2019 Lec 2 Performance

    A MIPS Example

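    In each CPI fraction the numerator is the total number of cycles and the denominator is the instruction count.

    $$\text{CPI}_1 = \frac{[(5 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(5 + 1 + 1) \times 10^6} = \frac{10}{7} = 1.43$$

    $$\text{MIPS}_1 = \frac{100 \times 10^6}{1.43 \times 10^6} = 69.9$$

    (69.9 uses the rounded CPI of 1.43; the exact value 10/7 gives 70.0.)

    $$\text{CPI}_2 = \frac{[(10 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(10 + 1 + 1) \times 10^6} = \frac{15}{12} = 1.25$$

    $$\text{MIPS}_2 = \frac{100 \times 10^6}{1.25 \times 10^6} = 80.0$$

    So, compiler 2 has a higher MIPS rating and should be faster?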

  • 7/31/2019 Lec 2 Performance

    A MIPS Example

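    Now let's compare CPU time:

    $$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

    $$\text{CPU Time}_1 = \frac{7 \times 10^6 \times 1.43}{100 \times 10^6} = 0.10\ \text{seconds}$$

    $$\text{CPU Time}_2 = \frac{12 \times 10^6 \times 1.25}{100 \times 10^6} = 0.15\ \text{seconds}$$

    Therefore program 1 is faster despite a lower MIPS!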
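The whole example can be reproduced with a short Python sketch. The instruction counts, cycle costs, and 100 MHz clock are taken from the slides; the names `summarize`, `COUNTS`, and `CYCLES_PER_CLASS` are illustrative choices, not part of the lecture.

```python
CLOCK_RATE_HZ = 100e6  # 100 MHz, as stated in the example

# Clock cycles needed by each instruction class (from the example).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class produced by each compiler (from the example).
COUNTS = {
    "compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def summarize(counts):
    """Return (CPI, MIPS, CPU time in seconds) for one instruction mix."""
    instruction_count = sum(counts.values())
    clock_cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    cpi = clock_cycles / instruction_count
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time_s = clock_cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time_s


results = {name: summarize(counts) for name, counts in COUNTS.items()}
for name, (cpi, mips, cpu_time_s) in results.items():
    print(f"{name}: CPI={cpi:.2f}  MIPS={mips:.1f}  CPU time={cpu_time_s:.2f} s")
```

The sketch keeps the CPI unrounded, so it reports 70.0 MIPS for compiler 1 rather than the 69.9 obtained from the rounded CPI of 1.43. The conclusion is unchanged: compiler 2 has the higher MIPS rating but the longer CPU time.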

  • 7/31/2019 Lec 2 Performance

    Example: Calculating Overall CPI

    Typical Instruction Mix

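    Operation   ISA    CPI(i)   Freq
    ALU         50%    1        (40%)
    Load        20%    2        (27%)
    Store       10%    2        (13%)
    Branch      20%    5        (20%)

    Overall CPI, weighting each CPI(i) by the frequency in the last column:

    $$\text{CPI} = 1 \times 0.4 + 2 \times 0.27 + 2 \times 0.13 + 5 \times 0.2 = 2.2$$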
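The weighted sum can be sketched in Python as follows. The CPI values and the frequencies (40%, 27%, 13%, 20%) are the ones used in the slide's calculation; the variable names and the per-class share of cycles are additions for illustration only.

```python
# (operation, CPI of the class, frequency used in the slide's calculation)
INSTRUCTION_MIX = [
    ("ALU", 1, 0.40),
    ("Load", 2, 0.27),
    ("Store", 2, 0.13),
    ("Branch", 5, 0.20),
]

# Overall CPI is the frequency-weighted sum of the per-class CPIs.
overall_cpi = sum(cpi * freq for _, cpi, freq in INSTRUCTION_MIX)
print(f"Overall CPI = {overall_cpi:.2f}")

# Share of all cycles spent in each class (not shown on the slide).
for name, cpi, freq in INSTRUCTION_MIX:
    share = cpi * freq / overall_cpi
    print(f"{name:<7} contributes {share:.0%} of the cycles")
```

Listing the share of cycles per class shows where an enhancement would pay off most: in this mix, branches are only 20% of the instructions but account for the largest share of the cycles.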

  • 7/31/2019 Lec 2 Performance

    Measuring Performance Using Benchmarks

    Five levels of benchmarks

    1. Real applications

    Examples: compilers/editors, scientific applications, graphics, etc.

    Problem: portability, due to dependence on OS and compiler

    2. Modified applications

    Real applications modified/tailored to improve portability or to test specific features of the CPU

    3. Kernels

    Programs that are much simpler than real applications

    Kernels: small and key pieces of real applications

    Examples: Livermore Loops (24 loop kernels), Linpack (linear algebra package)

  • 7/31/2019 Lec 2 Performance

    Synthetic Benchmarks

    4. Toy benchmarks

    10 to 100 lines of simple programs

    Easy to type and run on almost all computers

    Examples: Quick sort, Merge sort, etc.

    5. Synthetic benchmarks

    Basic principle: analyze the distribution of instructions over a large number of practical programs.

    Synthesize a program that has the same instruction distribution as a typical program; it need not compute something meaningful.

    Dhrystone, Khornerstone, and Linpack are some of the older synthetic benchmarks

  • 7/31/2019 Lec 2 Performance

    SPEC

    A recently popular approach is to put together collections of benchmarks measuring the performance of a variety of applications

    SPEC: System Performance Evaluation Cooperative (now the Standard Performance Evaluation Corporation)

    A non-profit organization (www.spec.org)

    CPU-intensive benchmarks for evaluating processor performance of workstations

    Generations: SPEC89, SPEC92, SPEC95, and SPEC2000

    SPEC2000 emphasizes memory system performance

  • 7/31/2019 Lec 2 Performance

    SPEC

    Sponsored by industry but independent and self-managed; trusted by code developers and machine vendors

    Clear guides for testing, see www.spec.org

    Regular updates (benchmarks are dropped and new ones added periodically according to relevance)

    Specialized benchmarks for particular classes of applications

    Can still be abused, by selective optimization!

  • 7/31/2019 Lec 2 Performance

    SPEC History

    First Round: SPEC CPU89

    10 programs yielding a single number

    Second Round: SPEC CPU92

    SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point programs)

    Compiler flags can be set differently for different programs

    Third Round: SPEC CPU95

    New set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10 floating point)

    Single flag setting for all programs

    Fourth Round: SPEC CPU2000

    New set of programs: SPEC CINT2000 (12 integer programs) and SPEC CFP2000 (14 floating point)

    Single flag setting for all programs

    Programs in C, C++, Fortran 77, and Fortran 90

  • 7/31/2019 Lec 2 Performance

    CINT2000 (integer component of SPEC CPU2000)

    Program        Language   What It Is
    164.gzip       C          Compression
    175.vpr        C          FPGA Circuit Placement and Routing
    176.gcc        C          C Programming Language Compiler
    181.mcf        C          Combinatorial Optimization
    186.crafty     C          Game Playing: Chess
    197.parser     C          Word Processing
    252.eon        C++        Computer Visualization
    253.perlbmk    C          PERL Programming Language
    254.gap        C          Group Theory, Interpreter
    255.vortex     C          Object-oriented Database
    256.bzip2      C          Compression
    300.twolf      C          Place and Route Simulator

  • 7/31/2019 Lec 2 Performance

    CFP2000 (floating point component of SPEC CPU2000)

    Program        Language     What It Is
    168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
    171.swim       Fortran 77   Shallow Water Modeling
    172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
    173.applu      Fortran 77   Parabolic / Elliptic Differential Equations
    177.mesa       C            3-D Graphics Library
    178.galgel     Fortran 90   Computational Fluid Dynamics
    179.art        C            Image Recognition / Neural Networks
    183.equake     C            Seismic Wave Propagation Simulation
    187.facerec    Fortran 90   Image Processing: Face Recognition
    188.ammp       C            Computational Chemistry
    189.lucas      Fortran 90   Number Theory / Primality Testing
    191.fma3d      Fortran 90   Finite-element Crash Simulation
    200.sixtrack   Fortran 77   High Energy Physics Accelerator Design
    301.apsi       Fortran 77   Meteorology: Pollutant Distribution

  • 7/31/2019 Lec 2 Performance

    SPEC CPU2000 Reporting

    Refer to the SPEC website www.spec.org for documentation

    Any measure that summarizes performance should reflect execution time

    Single-number result: arithmetic mean or geometric mean of normalized ratios for each code in the suite

    Weighted arithmetic mean summarizes performance while tracking execution time

    Report a precise description of the machine (platform)

    Report compiler flag settings
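A small Python sketch shows how a single-number summary might be computed from normalized ratios. It assumes that each ratio is a reference machine's time divided by the measured time; the function names and all timings below are hypothetical and are not SPEC data.

```python
import math


def normalized_ratios(reference_times_s, measured_times_s):
    """Ratio of reference time to measured time for each program (assumed convention)."""
    return [ref / t for ref, t in zip(reference_times_s, measured_times_s)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed through logarithms for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times (seconds) for a three-program suite.
reference = [100.0, 200.0, 400.0]
measured = [50.0, 100.0, 50.0]

ratios = normalized_ratios(reference, measured)
am = arithmetic_mean(ratios)
gm = geometric_mean(ratios)
print(f"ratios={ratios}  arithmetic mean={am:.2f}  geometric mean={gm:.2f}")
```

With these made-up numbers the two means differ noticeably, because the arithmetic mean is pulled up by the one program with a large ratio. That is one reason the choice of mean has to be reported together with the result.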

  • 7/31/2019 Lec 2 Performance

    Amdahl's Law

    Quantifies the overall performance gain due to an improvement in a part of a computation.

    Amdahl's Law: the performance improvement gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.

    $$\text{Speedup} = \frac{\text{Execution time for the task without enhancement}}{\text{Execution time for the task using enhancement}}$$

  • 7/31/2019 Lec 2 Performance

    Amdahl's Law and Speedup

    Speedup tells us how much faster a machine will run due to an enhancement.

    To use Amdahl's Law, two things should be considered:

    1st: Fraction of the computation time in the original machine that can use the enhancement

    If a program executes in 30 seconds and 15 seconds of execution uses the enhancement, fraction = 15/30 = 1/2

    2nd: Improvement gained by the enhancement

    If the enhanced task takes 3.5 seconds and the original task took 7 seconds, we say the speedup is 2.

  • 7/31/2019 Lec 2 Performance

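    Amdahl's Law Equations

    $$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

    Use the previous equation and solve for speedup:

    $$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

    Don't just try to memorize these equations and plug numbers into them. It's always important to think about the problem too!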
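The overall-speedup equation can be tried out with a minimal Python sketch. The function name is illustrative. The inputs reuse the two separate illustrations from the lecture (15 of 30 seconds, and 7 seconds reduced to 3.5 seconds); combining them in one calculation is an assumption made here for demonstration.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


fraction = 15 / 30   # half of the original time can use the enhancement
speedup = 7 / 3.5    # the enhanced part runs twice as fast

overall = overall_speedup(fraction, speedup)
# Upper bound: even an infinitely fast enhancement leaves (1 - F) untouched.
limit = 1.0 / (1.0 - fraction)

print(f"overall speedup = {overall:.3f}, upper bound = {limit:.1f}")
```

Doubling the speed of half the program yields an overall speedup of about 1.33, and no enhancement of that half can push the overall speedup past 2. This is the sense in which the gain is limited by the fraction of time the enhancement is used.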

  • 7/31/2019 Lec 2 Performance

    Points to Remember

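    Processor performance:

    $$\frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{(code size)}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{(CPI)}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{(cycle time)}}$$

    Terms are inter-related

    Minimize time, which is the product, NOT isolated terms

    Use of a benchmark suite to measure performance

    Reporting by a single number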

  • 7/31/2019 Lec 2 Performance

    Thanks!