Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen
7. November 2006
Holger Brunst, Matthias Müller: Leistungsanalyse
Summary of Previous Lecture (1)
A ten step approach to a systematic performance evaluation:
1. State the goals of the study, define the system
2. List services and outcomes
3. Select metrics
4. List parameters that affect performance
5. Select factors to study
6. Select evaluation techniques
7. Select workload
8. Design sequence of experiments
9. Analyze and interpret data
10. Present results
Summary of previous lecture (2)
Commonly used metrics:
– Clock rate
– MIPS
– MFLOPS
– SPEC metrics
– Response time
– Throughput
– Utilization
– MTBF
– …
Summary of previous lecture (3)
Evaluation techniques
– Analytical Modeling
– Simulation
– Measurement
Summary of previous lecture (4)
Comparison of sequential and parallel algorithms
Speedup:
– n is the number of processors
– T1 is the execution time of the sequential algorithm
– Tn is the execution time of the parallel algorithm with n processors
Efficiency:
– Its value estimates how well p processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (covered later)
Sn = T1 / Tn
Ep = Sp / p
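As a sanity check, the two definitions can be sketched in a few lines of Python (the function names and timings are illustrative, not part of the lecture):

```python
def speedup(t1, tn):
    """Speedup Sn = T1 / Tn (sequential time over parallel time)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Efficiency En = Sn / n for n processors."""
    return speedup(t1, tn) / n

# Hypothetical timings: 100 s sequentially, 15 s on 8 processors.
print(speedup(100.0, 15.0))        # ≈ 6.67
print(efficiency(100.0, 15.0, 8))  # ≈ 0.83
```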
Amdahl’s Law
Find the maximum expected improvement to an overall system when only part of the system is improved
Serial execution time = s+p
Parallel execution time = s+p/n
– Normalizing with respect to serial time (s+p) = 1 results in:
• Sn = 1/(s+p/n)
– Drops off rapidly as serial fraction increases
– Maximum speedup possible = 1/s, independent of n, the number of processors!
Bad news: If an application has only 1% serial work (s = 0.01) then you will never see a speedup greater than 100. So, why do we build systems with more than 100 processors?
What is wrong with this argument?
Sn = (s + p) / (s + p/n)
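The behaviour described above is easy to verify numerically. The sketch below assumes the normalization s + p = 1 from the slide; the function name is mine:

```python
def amdahl_speedup(s, n):
    """Amdahl: Sn = 1 / (s + (1 - s)/n), with serial fraction s and s + p = 1."""
    return 1.0 / (s + (1.0 - s) / n)

# With 1% serial work the speedup saturates near 1/s = 100, no matter how large n gets:
print(amdahl_speedup(0.01, 100))        # ≈ 50.25
print(amdahl_speedup(0.01, 1_000_000))  # ≈ 99.99
```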
Scaled Speedup (Gustafson-Barsis’ Law)
Amdahl’s speedup equation assumes p is independent of n, in other words the problem size remains the same
Gustafson-Barsis’ law states that any sufficiently large problem can be efficiently parallelized
More realistic to assume “runtime” remains the same, NOT the problem size
If the problem size scales up, does the serial part also increase?
Parallel execution time = s+p
Serial execution time = s+np
– Normalizing with respect to parallel execution time (s + p) = 1 results in:
• Sn = (s + n·p)/(s + p) = s + n·p
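With that normalization, the scaled speedup (s + n·p)/(s + p) reduces to s + n·(1 − s), which the sketch below evaluates (function name is mine):

```python
def scaled_speedup(s, n):
    """Gustafson-Barsis: Sn = s + n*(1 - s), with parallel time normalized to s + p = 1."""
    return s + n * (1.0 - s)

# Unlike Amdahl's bound, the speedup keeps growing with n even for 1% serial work:
print(scaled_speedup(0.01, 100))   # ≈ 99.01
print(scaled_speedup(0.01, 1000))  # ≈ 990.01
```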
What and who is SPEC?
What is SPEC?
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops suites of benchmarks and also reviews and publishes submitted results from our member organizations and other benchmark licensees.
For more details see http://www.spec.org
SPEC Members
SPEC Members:
3DLabs * Acer Inc. * Advanced Micro Devices * Apple Computer, Inc. * ATI Research * Azul Systems, Inc. * BEA Systems * Borland * Bull S.A. * CommuniGate Systems * Dell * EMC * Exanet * Fabric7 Systems, Inc. * Freescale Semiconductor, Inc. * Fujitsu Limited * Fujitsu Siemens * Hewlett-Packard * Hitachi Data Systems * Hitachi Ltd. * IBM * Intel * ION Computer Systems * JBoss * Microsoft * Mirapoint * NEC - Japan * Network Appliance * Novell * NVIDIA * Openwave Systems * Oracle * P.A. Semi * Panasas * PathScale * The Portland Group * S3 Graphics Co., Ltd. * SAP AG * SGI * Sun Microsystems * Super Micro Computer, Inc. * Sybase * Symantec Corporation * Unisys * Verisign * Zeus Technology
SPEC Associates:
California Institute of Technology * Center for Scientific Computing (CSC) * Defence Science and Technology Organisation - Stirling * Dresden University of Technology * Duke University * JAIST * Kyushu University * Leibniz Rechenzentrum - Germany * National University of Singapore * New South Wales Department of Education and Training * Purdue University * Queen's University * Rightmark * Stanford University * Technical University of Darmstadt * Texas A&M University * Tsinghua University * University of Aizu - Japan * University of California - Berkeley * University of Central Florida * University of Illinois - NCSA * University of Maryland * University of Modena * University of Nebraska, Lincoln * University of New Mexico * University of Pavia * University of Stuttgart * University of Texas at Austin * University of Texas at El Paso * University of Tsukuba * University of Waterloo * VA Austin Automation Center
SPEC members in Dresden: Workshop June 2007
SPEC groups
Open Systems Group (desktop systems, high-end workstations and servers)
– CPU (CPU benchmarks)
– JAVA (Java client- and server-side benchmarks)
– MAIL (mail server benchmarks)
– SFS (file server benchmarks)
– WEB (web server benchmarks)
High Performance Group (HPC systems)
– OMP (OpenMP benchmark)
– HPC (HPC application benchmark)
– MPI (MPI application benchmark)
Graphics Performance Groups (Graphics)
– Apc (Graphics application benchmarks)
– Opc (OpenGL performance benchmarks)
SPEC HPG = SPEC High-Performance Group
Founded in 1994
Mission: To establish, maintain, and endorse a suite of benchmarks that are representative of real-world high-performance computing applications.
SPEC/HPG includes members from both industry and academia.
SPEC CPU 2006
From John Henning’s talk at SPEC Workshop
June 2007, Dresden
SPEC CPU2006 History
Released August 2006
Replaces CPU2000 (retired February 2007)
5th CPU benchmark
– SPECmark (later called “CPU89”)
– SPEC92 (later called “CPU92”)
– CPU95
– CPU2000
– CPU2006
Note: these updates are required to stay representative
Question to the audience: What kind of application would you add?
CINT 2006
Benchmark Lang. Application Area Brief Description
400.perlbench C Programming Language Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 C Compression Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc C C Compiler Based on gcc Version 3.2, generates code for Opteron.
429.mcf C Combinatorial Optimization Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk C Artificial Intelligence: Go Plays the game of Go, a simply described but deeply complex game.
456.hmmer C Search Gene Sequence Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng C Artificial Intelligence: Chess A highly-ranked chess program that also plays several chess variants.
462.libquantum C Physics / Quantum Computing Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref C Video Compression A reference implementation of H.264/AVC, encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp C++ Discrete Event Simulation Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar C++ Path-finding Algorithms Pathfinding library for 2D maps, including the well-known A* algorithm.
483.xalancbmk C++ XML Processing A modified version of Xalan-C++, which transforms XML documents to other document types.
CFP 2006 (part I)
Benchmark Lang. Application Area Brief Description
416.gamess Fortran Quantum Chemistry. Implements a wide range of quantum chemical computations. The SPECworkload does self-consistent field calculations using the Restricted Hartree Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field
433.milc C Physics/QCD A gauge field generating program for lattice gauge theory with dynamicalquarks.
434.zeusmp Fortran Physics / CFD ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinoisat Urbana-Champaign) for the simulation of astrophysical
phenomena.
435.gromacs C, Fortran Biochemistry Molecular dynamics, i.e. simulate Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM C,Fortran Physics / General Relativity Solves the Einstein evolution equations using a staggered-leapfrog numerical method
437.leslie3d Fortran Fluid Dynamics Computational Fluid Dynamics (CFD) using Large-Eddy Simulations withLinear-Eddy Model in 3D. Uses MacCormack Predictor-Corrector timeintegration
444.namd C++ Biology Molecular Dynamics Simulates biomolecular systems. Test case has 92,224 atoms of apolipoprotein A-I.
447.dealII C++ FE Analysis deal.II is a C++ library targeted at adaptive finite elements and error estimation. The testcase solves a Helmholtz-type equation with non-constant coefficients.
CFP 2006 (part II)
Benchmark Language Application Area Brief Description
450.soplex C++ Linear Programming, Solves a linear program using a simplex algorithm and sparse linear algebra. Test Optimization cases includerailroad planning and military airlift models.
453.povray C++ Image Ray-tracing Image rendering. The testcase is a 1280x1024 anti-aliased image of a landscape with some abstract objectswith textures using a Perlin noise function.
454.calculix C, F Structural Mechanics Finite element code for 3D structural applications. Usesthe SPOOLES solver library.
459.GemsFDTD F Electromagnetics Solves Maxwell equations in 3D using finite-difference time-domain (FDTD) method.
465.tonto Fortran Quantum Chemistry An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case placesa constraint on a molecular Hartree-Fock wavefunctioncalculation to better match experimental X-ray diffractiondata.
470.lbm C Fluid Dynamics Implements the "Lattice-Boltzmann Method" to simulateincompressible fluids in 3D
481.wrf C,F Weather Weather modeling from scales of meters to thousands ofkilometers. The test case is from a 30km area over 2 days.
482.sphinx3 C Speech recognition A widely-known speech recognition system from Carnegie Mellon University
Code growth
Metrics
Speed
– SPECint_base2006 (Required Base result)
– SPECint2006 (Optional Peak result)
– SPECfp_base2006 (Required Base result)
– SPECfp2006 (Optional Peak result)
Throughput
– SPECint_rate_base2006 (Required Base result)
– SPECint_rate2006 (Optional Peak result)
– SPECfp_rate_base2006 (Required Base result)
– SPECfp_rate2006 (Optional Peak result)
Speed Metric for Single Benchmark
For each benchmark in the suite, compute the ratio vs. time on a reference system
– A 1997 Sun system with 296 MHz UltraSPARC II
– Similar but not identical to CPU2000 ref machine
Example:
– 400.perlbench on a year 2006 iMac took 948 seconds
– On the reference system, took 9770 seconds
– SPECratio = 10.3 (9770/948)
– If your workload looks like perl, you might find that this modern iMac runs around 10x faster than a state-of-the-1997-art workstation.
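The ratio computation on this slide can be reproduced directly (a trivial sketch using the slide's numbers; the function name is mine):

```python
def spec_ratio(ref_seconds, measured_seconds):
    """SPECratio = reference time / measured time; larger means faster."""
    return ref_seconds / measured_seconds

# 400.perlbench: 9770 s on the 1997 reference system, 948 s on the 2006 iMac.
print(round(spec_ratio(9770, 948), 1))  # 10.3
```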
Overall Speed Metric
To obtain the overall speed metrics: geometric mean of the individual SPECratios
Why geometric mean?
Because this is the best answer to the question
“Without knowing how much time I will spend in text processing vs. network mapping vs. compiling vs. video compression, please tell me about how much faster this machine will be than the reference system.”
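A minimal sketch of that aggregation (the individual SPECratios below are hypothetical, not published results):

```python
import math

def overall_metric(spec_ratios):
    """Geometric mean: the n-th root of the product of the individual SPECratios."""
    return math.exp(sum(math.log(r) for r in spec_ratios) / len(spec_ratios))

print(round(overall_metric([10.3, 8.0, 12.5, 9.1]), 2))  # 9.84
```

Computing the mean via logarithms avoids overflow when a suite has many benchmarks with large ratios.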
Motivation for Throughput Metric
Differs from speed
Stove analogy:
– One big flame cooks one big pot with one hogshead in one hour
– 6 little flames cook 6 little pots, each holding one firkin, in 15 minutes
– Which is better?
Well, big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour
Throughput vs. Speed
Big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour, but the six little flames together deliver ~960 liters/hour
Alternatives:
– If I only need to heat up an UNOPENED container holding 1 gallon of soup, supper can be served most quickly if I put it on the big flame
– If I need to heat up one butt of soup (= 2 hogsheads), and if I can open the container, I'd be better off using many small flames
In IT business:
– Processing one image in Photoshop or Gimp vs.
– Rendering the next movie with thousands of pictures
CPU2006 Throughput Metric
Formula: number of copies run * reference time for the benchmark / elapsed time in seconds
Example: A Sun Fire E25K runs 144 copies of 400.perlbench in 1066 seconds: 144 * 9770 / 1066 = 1320
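A sketch of the rate formula, reproducing the E25K example from the slide (the function name is mine):

```python
def spec_rate(copies, ref_seconds, elapsed_seconds):
    """CPU2006 rate metric: number of copies * reference time / elapsed time."""
    return copies * ref_seconds / elapsed_seconds

# Sun Fire E25K: 144 copies of 400.perlbench (reference time 9770 s) in 1066 s.
print(round(spec_rate(144, 9770, 1066)))  # 1320
```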
Summary of Metrics
Two different kind of metrics
– speed (single application turnaround)
– rate (throughput)
Run rules make the difference between base and peak
– Base: conservative optimization, less freedom
– Peak: more aggressive optimization, more freedom
Two benchmark sets, SPECint and SPECfp: 2^3 = 8 different metrics
If you look at the single application results you get 2*2*(12+17) = 116 different metrics
Example for Run Rules
Base does not allow feedback directed optimization (still legal in peak)
An unlimited number of flags may be set in base
– Why? Because flag counting is not worth arguing about.
– For example, is -fast:np27 one flag, two, or three? Prove it.