Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen
7. November 2006
Holger Brunst, Matthias Müller: Leistungsanalyse
Summary of Previous Lecture (1)
A ten step approach to a systematic performance evaluation:
1. State the goals of the study, define the system
2. List services and outcomes
3. Select metrics
4. List parameters that affect performance
5. Select factors to study
6. Select evaluation techniques
7. Select workload
8. Design sequence of experiments
9. Analyze and interpret data
10. Present results
Summary of previous lecture (2)
Commonly used metrics:
– Clock rate
– MIPS
– MFLOPS
– SPEC metrics
– Response time
– Throughput
– Utilization
– MTBF
– …
Summary of previous lecture (3)
Evaluation techniques
– Analytical Modeling
– Simulation
– Measurement
Summary of previous lecture (4)
Comparison of sequential and parallel algorithms
Speedup:
– n is the number of processors
– T1 is the execution time of the sequential algorithm
– Tn is the execution time of the parallel algorithm with n processors
Efficiency:
– Its value estimates how well p processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (covered later)
Sn = T1 / Tn
Ep = Sp / p
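As a sanity check, the two definitions can be sketched in a few lines of Python (the function names and timings are illustrative, not part of the lecture):

```python
def speedup(t1, tn):
    """Speedup Sn = T1 / Tn (sequential time over parallel time)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Efficiency En = Sn / n for n processors."""
    return speedup(t1, tn) / n

# Hypothetical timings: 100 s sequentially, 15 s on 8 processors.
print(speedup(100.0, 15.0))        # ≈ 6.67
print(efficiency(100.0, 15.0, 8))  # ≈ 0.83
```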
Amdahl’s Law
Find the maximum expected improvement to an overall system when only part of the system is improved
Serial execution time = s+p
Parallel execution time = s+p/n
– Normalizing with respect to serial time (s+p) = 1 results in:
• Sn = 1/(s+p/n)
– Drops off rapidly as serial fraction increases
– Maximum speedup possible = 1/s, independent of n, the number of processors!
Bad news: If an application has only 1% serial work (s = 0.01) then you will never see a speedup greater than 100. So, why do we build systems with more than 100 processors?
What is wrong with this argument?
Sn = (s + p) / (s + p/n)
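The behaviour described above is easy to verify numerically. The sketch below assumes the normalization s + p = 1 from the slide; the function name is mine:

```python
def amdahl_speedup(s, n):
    """Amdahl: Sn = 1 / (s + (1 - s)/n), with serial fraction s and s + p = 1."""
    return 1.0 / (s + (1.0 - s) / n)

# With 1% serial work the speedup saturates near 1/s = 100, no matter how large n gets:
print(amdahl_speedup(0.01, 100))        # ≈ 50.25
print(amdahl_speedup(0.01, 1_000_000))  # ≈ 99.99
```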
Scaled Speedup (Gustafson-Barsis’ Law)
Amdahl’s speedup equation assumes p is independent of n, in other words the problem size remains the same
Gustafson-Barsis’ law states that any sufficiently large problem can be efficiently parallelized
More realistic to assume “runtime” remains the same, NOT the problem size
If the problem size scales up, does the serial part also increase?
Parallel execution time = s+p
Serial execution time = s+np
– Normalizing with respect to parallel execution time (s + p) = 1 results in:
• Sn = (s + n·p)/(s + p) = s + n·p
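With that normalization, the scaled speedup (s + n·p)/(s + p) reduces to s + n·(1 − s), which the sketch below evaluates (function name is mine):

```python
def scaled_speedup(s, n):
    """Gustafson-Barsis: Sn = s + n*(1 - s), with parallel time normalized to s + p = 1."""
    return s + n * (1.0 - s)

# Unlike Amdahl's bound, the speedup keeps growing with n even for 1% serial work:
print(scaled_speedup(0.01, 100))   # ≈ 99.01
print(scaled_speedup(0.01, 1000))  # ≈ 990.01
```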
What and who is SPEC?
What is SPEC?
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops suites of benchmarks and also reviews and publishes submitted results from our member organizations and other benchmark licensees.
For more details see http://www.spec.org
SPEC Members
SPEC Members:
3DLabs * Acer Inc. * Advanced Micro Devices * Apple Computer, Inc. * ATI Research * Azul Systems, Inc. * BEA Systems * Borland * Bull S.A. * CommuniGate Systems * Dell * EMC * Exanet * Fabric7 Systems, Inc. * Freescale Semiconductor, Inc. * Fujitsu Limited * Fujitsu Siemens * Hewlett-Packard * Hitachi Data Systems * Hitachi Ltd. * IBM * Intel * ION Computer Systems * JBoss * Microsoft * Mirapoint * NEC - Japan * Network Appliance * Novell * NVIDIA * Openwave Systems * Oracle * P.A. Semi * Panasas * PathScale * The Portland Group * S3 Graphics Co., Ltd. * SAP AG * SGI * Sun Microsystems * Super Micro Computer, Inc. * Sybase * Symantec Corporation * Unisys * Verisign * Zeus Technology
SPEC Associates:
California Institute of Technology * Center for Scientific Computing (CSC) * Defence Science and Technology Organisation - Stirling * Dresden University of Technology * Duke University * JAIST * Kyushu University * Leibniz Rechenzentrum - Germany * National University of Singapore * New South Wales Department of Education and Training * Purdue University * Queen's University * Rightmark * Stanford University * Technical University of Darmstadt * Texas A&M University * Tsinghua University * University of Aizu - Japan * University of California - Berkeley * University of Central Florida * University of Illinois - NCSA * University of Maryland * University of Modena * University of Nebraska, Lincoln * University of New Mexico * University of Pavia * University of Stuttgart * University of Texas at Austin * University of Texas at El Paso * University of Tsukuba * University of Waterloo * VA Austin Automation Center
SPEC members in Dresden: Workshop June 2007
SPEC groups
Open Systems Group (desktop systems, high-end workstations and servers)
– CPU (CPU benchmarks)
– JAVA (Java client- and server-side benchmarks)
– MAIL (mail server benchmarks)
– SFS (file server benchmarks)
– WEB (web server benchmarks)
High Performance Group (HPC systems)
– OMP (OpenMP benchmark)
– HPC (HPC application benchmark)
– MPI (MPI application benchmark)
Graphics Performance Groups (Graphics)
– Apc (Graphics application benchmarks)
– Opc (OpenGL performance benchmarks)
SPEC HPG = SPEC High-Performance Group
Founded in 1994
Mission: To establish, maintain, and endorse a suite of benchmarks that are representative of real-world high-performance computing applications.
SPEC/HPG includes members from both industry and academia.
SPEC CPU 2006
From John Henning’s talk at SPEC Workshop
June 2007, Dresden
SPEC CPU2006 History
Released August 2006
Replaces CPU2000 (retired February 2007)
5th CPU benchmark
– SPECmark (later called “CPU89”)
– SPEC92 (later called “CPU92”)
– CPU95
– CPU2000
– CPU2006
Note: these updates are required to stay representative
Question to the audience: What kind of application would you add?
CINT 2006
Benchmark Lang. Application Area Brief Description
400.perlbench C Programming Language Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 C Compression Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc C C Compiler Based on gcc Version 3.2, generates code for Opteron.
429.mcf C Combinatorial Optimization Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk C Artificial Intelligence: Go Plays the game of Go, a simply described but deeply complex game.
456.hmmer C Search Gene Sequence Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng C Artificial Intelligence: Chess A highly-ranked chess program that also plays several chess variants.
462.libquantum C Physics / Quantum Computing Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref C Video Compression A reference implementation of H.264/AVC, encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp C++ Discrete Event Simulation Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar C++ Path-finding Algorithms Pathfinding library for 2D maps, including the well-known A* algorithm.
483.xalancbmk C++ XML Processing A modified version of Xalan-C++, which transforms XML documents to other document types.
CFP 2006 (part I)
Benchmark Lang. Application Area Brief Description
416.gamess Fortran Quantum Chemistry. Implements a wide range of quantum chemical computations. The SPECworkload does self-consistent field calculations using the Restricted Hartree Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field
433.milc C Physics/QCD A gauge field generating program for lattice gauge theory with dynamicalquarks.
434.zeusmp Fortran Physics / CFD ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinoisat Urbana-Champaign) for the simulation of astrophysical
phenomena.
435.gromacs C, Fortran Biochemistry Molecular dynamics, i.e. simulate Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM C,Fortran Physics / General Relativity Solves the Einstein evolution equations using a staggered-leapfrog numerical method
437.leslie3d Fortran Fluid Dynamics Computational Fluid Dynamics (CFD) using Large-Eddy Simulations withLinear-Eddy Model in 3D. Uses MacCormack Predictor-Corrector timeintegration
444.namd C++ Biology Molecular Dynamics Simulates biomolecular systems. Test case has 92,224 atoms of apolipoprotein A-I.
447.dealII C++ FE Analysis deal.II is a C++ library targeted at adaptive finite elements and error estimation. The testcase solves a Helmholtz-type equation with non-constant coefficients.
CFP 2006 (part II)
Benchmark Language Application Area Brief Description
450.soplex C++ Linear Programming, Solves a linear program using a simplex algorithm and sparse linear algebra. Test Optimization cases includerailroad planning and military airlift models.
453.povray C++ Image Ray-tracing Image rendering. The testcase is a 1280x1024 anti-aliased image of a landscape with some abstract objectswith textures using a Perlin noise function.
454.calculix C, F Structural Mechanics Finite element code for 3D structural applications. Usesthe SPOOLES solver library.
459.GemsFDTD F Electromagnetics Solves Maxwell equations in 3D using finite-difference time-domain (FDTD) method.
465.tonto Fortran Quantum Chemistry An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case placesa constraint on a molecular Hartree-Fock wavefunctioncalculation to better match experimental X-ray diffractiondata.
470.lbm C Fluid Dynamics Implements the "Lattice-Boltzmann Method" to simulateincompressible fluids in 3D
481.wrf C,F Weather Weather modeling from scales of meters to thousands ofkilometers. The test case is from a 30km area over 2 days.
482.sphinx3 C Speech recognition A widely-known speech recognition system from Carnegie Mellon University
Code growth
Metrics
Speed
– SPECint_base2006 (Required Base result)
– SPECint2006 (Optional Peak result)
– SPECfp_base2006 (Required Base result)
– SPECfp2006 (Optional Peak result)
Throughput
– SPECint_rate_base2006 (Required Base result)
– SPECint_rate2006 (Optional Peak result)
– SPECfp_rate_base2006 (Required Base result)
– SPECfp_rate2006 (Optional Peak result)
Speed Metric for Single Benchmark
For each benchmark in the suite, compute the ratio vs. time on a reference system
– A 1997 Sun system with 296 MHz UltraSPARC II
– Similar but not identical to CPU2000 ref machine
Example:
– 400.perlbench on a year 2006 iMac took 948 seconds
– On the reference system, took 9770 seconds
– SPECratio = 10.3 (9770/948)
– If your workload looks like perl, you might find that this modern iMac runs around 10x faster than a state-of-the-1997-art workstation.
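The ratio computation on this slide can be reproduced directly (a trivial sketch using the slide's numbers; the function name is mine):

```python
def spec_ratio(ref_seconds, measured_seconds):
    """SPECratio = reference time / measured time; larger means faster."""
    return ref_seconds / measured_seconds

# 400.perlbench: 9770 s on the 1997 reference system, 948 s on the 2006 iMac.
print(round(spec_ratio(9770, 948), 1))  # 10.3
```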
Overall Speed Metric
To obtain the overall speed metrics: geometric mean of the individual SPECratios
Why geometric mean?
Because this is the best answer to the question
“Without knowing how much time I will spend in text processing vs. network mapping vs. compiling vs. video compression, please tell me about how much faster this machine will be than the reference system.”
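A minimal sketch of that aggregation (the individual SPECratios below are hypothetical, not published results):

```python
import math

def overall_metric(spec_ratios):
    """Geometric mean: the n-th root of the product of the individual SPECratios."""
    return math.exp(sum(math.log(r) for r in spec_ratios) / len(spec_ratios))

print(round(overall_metric([10.3, 8.0, 12.5, 9.1]), 2))  # 9.84
```

Computing the mean via logarithms avoids overflow when a suite has many benchmarks with large ratios.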
Motivation for Throughput Metric
Differs from speed
Stove analogy:
– One big flame cooks one big pot with one hogshead in one hour
– 6 little flames cook 6 little pots, each holding one firkin, in 15 minutes
– Which is better?
Well, big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour
Throughput vs. Speed
Big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour, but the six little flames together deliver ~960 liters/hour
Alternatives:
– If I only need to heat up an UNOPENED container holding 1 gallon of soup, supper can be served most quickly if I put it on the big flame
– If I need to heat up one butt of soup (= 2 hogsheads), and if I can open the container, I'd be better off using many small flames
In IT business:
– Processing one image in Photoshop or Gimp vs.
– Rendering the next movie with thousands of pictures
CPU2006 Throughput Metric
Formula: number of copies run * reference time for the benchmark / elapsed time in seconds
Example: A Sun Fire E25K runs 144 copies of 400.perlbench in 1066 seconds: 144 * 9770 / 1066 = 1320
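A sketch of the rate formula, reproducing the E25K example from the slide (the function name is mine):

```python
def spec_rate(copies, ref_seconds, elapsed_seconds):
    """CPU2006 rate metric: number of copies * reference time / elapsed time."""
    return copies * ref_seconds / elapsed_seconds

# Sun Fire E25K: 144 copies of 400.perlbench (reference time 9770 s) in 1066 s.
print(round(spec_rate(144, 9770, 1066)))  # 1320
```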
Summary of Metrics
Two different kind of metrics
– speed (single application turnaround)
– rate (throughput)
Run rules make the difference between base and peak
– Base: conservative optimization, less freedom
– Peak: more aggressive optimization, more freedom
Two benchmark sets, SPECint and SPECfp: 2^3 = 8 different metrics
If you look at the single application results you get 2*2*(12+17) = 116 different metrics
Example for Run Rules
Base does not allow feedback directed optimization (still legal in peak)
An unlimited number of flags may be set in base
– Why? Because flag counting is not worth arguing about.
– For example, is -fast:np27 one flag, two, or three? Prove it.