Center for Information Services and High Performance ... · 1995 Intel Paragon XP/S MP 6768 281.1 128600 25700 338 1994 Intel Paragon XP/S MP 6768 281.1 128600 25700 338 1993 Fujitsu

Nöthnitzer Straße 46

Raum 1026

Tel. +49 351 - 463 - 35048

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Center for Information Services and High Performance Computing (ZIH)

Leistungsanalyse

von Rechnersystemen

9. November 2011


Raum 1026

Tel. +49 351 - 463 - 35048




Summary of Previous Lecture

Holger Brunst, Matthias Müller: Leistungsanalyse

Summary of Previous Lecture

Different workloads:

– Test workload

– Real workload

– Synthetic workload

Historical examples for test workloads:

– Addition instruction

– Instruction mixes

– Kernels

– Synthetic programs

– Application benchmarks


Excursion on Speedup and Efficiency Metrics

Comparison of sequential and parallel algorithms

Speedup:

– n is the number of processors

– T1 is the execution time of the sequential algorithm

– Tn is the execution time of the parallel algorithm with n processors

Efficiency:

– Its value estimates how well-utilized p processors solve a given problem

– Usually between zero and one. Exception: Super linear speedup (later)

Sn =T1

Tn

E p =Sp

p


Amdahl s Law

Find the maximum expected improvement to an overall system when only part of the system is improved

Serial execution time = s+p

Parallel execution time = s+p/n

– Normalizing with respect to serial time (s+p) = 1 results in:

• Sn = 1/(s+p/n)

– Drops off rapidly as serial fraction increases

– Maximum speedup possible = 1/s, independent of n the number of processors!

Bad news: If an application has only 1% serial work (s = 0.01) then you will never see a speedup greater than 100. So, why do we build system with more than 100 processors?

What is wrong with this argument?

Sn =s + p

s +p

n


Popilar and historic benchmarks

Popular benchmarks:

– Eratosthenes sieve algorithm

– Ackermann s Function

– Whetstone

– LINPACK

– Dhrystone

– Lawrence Livermore Loops

– TPC-C

– SPEC


Workload description

Level of Detail of the workload description - Examples:

– Most frequent request (e.g. Addition)

– Frequency of request type (instruction mix)

– Time-stamped sequence of requests

– Average resource demand (e.g. 20 I/O requests per second)

– Distribution of resource demands (not only the average, but also probability distribution)


Characterization of Benchmarks

There are many metrics, each one has its purpose

– Raw machine performance: Tflops

– Microbenchmarks: Stream

– Algorithmic benchmarks: Linpack

– Compact Apps/Kernels: NAS benchmarks

– Application Suites: SPEC

– User-specific applications: Custom benchmarks

Computer Hardware

Applications


Comparison of different benchmark classes

coverage relevance Identify problems

Time evolution

Micro 0 0 ++ +

Algorithmic - 0 + ++

Kernels 0 0 + +

SPEC + + + +

Apps - ++ 0 0


SPEC Benchmarks: CPU 2006

Application Benchmarks

Different metrics:

– Integer, floatingpoint

– Standard and rate

– Base, peak

Run rules


Raum 1026

Tel. +49 351 - 463 - 35048




Example for a Microbenchmark:

Stream


Stream Benchmark

Author: John McCalpin ( Mr Bandwidth )

John McCalpin Memory Bandwidth and Machine Balance in High Performance Computers , IEEE TCCA Newsletter, December 1995

http://www.cs.virginia.edu/stream/

STREAM: measure memory bandwidth with the operations:

– Copy: a(i) = b(i)

– Scale: a(i)=s*b(i)

– Add: a(i)=b(i)+c(i)

– Triad: a(i)=b(i)+s*c(i)

STREAM2: measures memory hierarchy bandwidth with the operations:

– Fill: a(i)=0

– Copy: a(i)=b(i)

– Daxpy: a(i) = a(i) +q*b(i)

– Sum: sum += a(i)


Stream 2 properties


Stream Results: TOP 10 in 2011

STREAM Memory Bandwidth --- John D. McCalpin, [email protected] Revised to Thu Jul 21 13:02:04 CDT 2011 All results are in MB/s --- 1 MB=10^6 B, *not* 2^20 B -------------------------------------------------------------------------------------------- Sub. Date Machine ID ncpus COPY SCALE ADD TRIAD -------------------------------------------------------------------------------------------- 2011.04.05 SGI_Altix_UV_1000 2048 5321074.0 5346667.0 5823380.0 5859367.0 data 2006.07.10 SGI_Altix_4700 1024 3661963.0 3677482.0 4385585.0 4350166.0 data 2011.06.06 ScaleMP_Xeon_X6560_64B 768 1493963.0 2112630.0 2252598.0 2259709.0 data 2004.12.22 SGI_Altix_3700_Bx2 512 906388.0 870211.0 1055179.0 1119913.0 data 2003.11.13 SGI_Altix_3000 512 854062.0 854338.0 1008594.0 1007828.0 data 2003.10.02 NEC_SX-7 32 876174.7 865144.1 869179.2 872259.1 data 2008.04.07 IBM_Power_595 64 679207.2 624707.8 777334.8 805804.6 data 1999.12.07 NEC_SX-5-16A 16 607492.0 590390.0 607412.0 583069.0 data 2009.08.10 ScaleMP_XeonX5570_vSMP_16B 128 437571.0 431726.0 442722.0 445869.0 data 1997.06.10 NEC_SX-4 32 434784.0 432886.0 437358.0 436954.0 data --------------------------------------------------------------------------------------------


Stream 2 Results


Raum 1026

Tel. +49 351 - 463 - 35048




Linpack and TOP500

Slides courtesy Jack Dongarra

The Linpack Benchmark is a measure of a computer s floating-point rate of execution.

It is determined by running a computer program that solves a dense system of linear equations.

Over the years the characteristics of the benchmark has changed a bit.

In fact, there are three benchmarks included in the Linpack Benchmark report.

LINPACK Benchmark Dense linear system solve with LU factorization using partial pivoting Operation count is: 2/3 n3 + O(n2)

Benchmark Measure: MFlop/s Original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results:

Please send the results of this run to:

Jack J. Dongarra

Computer Science Department

University of Tennessee Knoxville, Tennessee 37996-1300

Fax: 865-974-8296

Internet: [email protected]

norm. resid resid machep x(1) x(n)

1.67005097E+00 7.41628980E-14 2.22044605E-16 1.00000000E+00 1.00000000E+00

times are reported for matrices of order 100

dgefa dgesl total mflops unit ratio

times for array with leading dimension of 201

1.540E-03 6.888E-05 1.609E-03 4.268E+02 4.686E-03 2.873E-02

1.509E-03 7.084E-05 1.579E-03 4.348E+02 4.600E-03 2.820E-02 1.509E-03 7.003E-05 1.579E-03 4.348E+02 4.600E-03 2.820E-02

1.502E-03 6.593E-05 1.568E-03 4.380E+02 4.567E-03 2.800E-02

times for array with leading dimension of 200

1.431E-03 6.716E-05 1.498E-03 4.584E+02 4.363E-03 2.675E-02 1.424E-03 6.694E-05 1.491E-03 4.605E+02 4.343E-03 2.663E-02

1.431E-03 6.699E-05 1.498E-03 4.583E+02 4.364E-03 2.676E-02

1.432E-03 6.439E-05 1.497E-03 4.588E+02 4.360E-03 2.673E-02

Time

Factor

Time

Solve

Total

Time Mflop/s

rate

In the beginning there was the Linpack 100 Benchmark (1977)

n=100 (80KB); size that would fit in all the machines

Fortran; 64 bit floating point arithmetic

No hand optimization (only compiler options)

Year Computer Number of

Processors Cycle time Mflop/s

2006 Intel Pentium Woodcrest (3 GHz) 1 3 GHz 3018

2005 NEC SX-8/1 (1 proc) 1 2 GHz 2177

2004 Intel Pentium Nocona (1 proc 3.6 GHz) 1 3.6 GHz 1803

2003 HP Integrity Server rx2600 (1 proc 1.5GHz) 1 1.5 GHz 1635

2002 Intel Pentium 4 (3.06 GHz) 1 2.06 GHz 1414

2001 Fujitsu VPP5000/1 1 3.33 nsec 1156

2000 Fujitsu VPP5000/1 1 3.33 nsec 1156

1999 CRAY T916 4 2.2 nsec 1129

1995 CRAY T916 1 2.2 nsec 522

1994 CRAY C90 16 4.2 nsec 479

1993 CRAY C90 16 4.2 nsec 479

1992 CRAY C90 16 4.2 nsec 479

1991 CRAY C90 16 4.2 nsec 403

1990 CRAY Y-MP 8 6.0 nsec 275

1989 CRAY Y-MP 8 6.0 nsec 275

1988 CRAY Y-MP 1 6.0 nsec 74

1987 ETA 10-E 1 10.5 nsec 52

1986 NEC SX-2 1 6.0 nsec 46

1985 NEC SX-2 1 6.0 nsec 46

1984 CRAY X-MP 1 9.5 nsec 21

1983 CRAY 1 1 12.5 nsec 12

1979 CRAY 1 1 12.5 nsec 3.4

In the beginning there was the Linpack 100 Benchmark (1977)

n=100 (80KB); size that would fit in all the machines

Fortran; 64 bit floating point arithmetic

No hand optimization (only compiler options)

Linpack 1000 (1986)

n=1000 (8MB); wanted to see higher performance levels

Any language; 64 bit floating point arithmetic

Hand optimization OK

Linpack TPP (1991) (Top500; 1993)

Any size (n as large as you can; n=106; 8TB; ~6 hours);

Any language; 64 bit floating point arithmetic

Hand optimization OK Strassen s method not allowed (confuses the op count and rate)

Reference implementation available

In all cases results are verified by looking at:

Operations count for factorization ; solve

|| ||(1)

|| || || ||

Ax bO

A x n=

3 22 1

3 2

n n2

2n

LINPACK NxN benchmark Solves system of linear equations by some method Allows the vendors to choose size of problem for benchmark Measures execution time for each size problem

LINPACK NxN report Nmax – the size of the chosen problem run on a machine Rmax – the performance in Gflop/s for the chosen size problem run on the machine N1/2 – the size where half the Rmax execution rate is achieved Rpeak – the theoretical peak performance Gflop/s for the machine

LINPACK NxN is used to rank TOP500 fastest computers in the world

Size

Rate

Nmax

Rmax

N1/2

Size

Rate

TPP performance

(Entries for this table began in 1991.)

Year Computer # of

Procs Measured

Gflop/s Size of

Problem Size of

1/2 Perf Theoretical

Peak Gflop/s

2005 - 2006 IBM Blue Gene/L 131072 280600 1769471 367001

2002– 2004 Earth Simulator Computer,

NEC 5104 35610 1041216 265408 40832

2001 ASCI White-Pacific, IBM SP

Power 3 7424 7226 518096 179000 11136

2000 ASCI White-Pacific, IBM SP

Power 3 7424 4938 430000 11136

1999 ASCI Red Intel Pentium II Xeon

core 9632 2379 362880 75400 3207

1998 ASCI Blue-Pacific SST, IBM SP

604E 5808 2144 431344 3868

1997 Intel ASCI Option Red (200

MHz Pentium Pro) 9152 1338 235000 63000 1830

1996 Hitachi CP-PACS 2048 368.2 103680 30720 614

1995 Intel Paragon XP/S MP 6768 281.1 128600 25700 338

1994 Intel Paragon XP/S MP 6768 281.1 128600 25700 338

1993 Fujitsu NWT 140 124.5 31920 11950 236

1992 NEC SX-3/44 4 20.0 6144 832 22

1991 Fujitsu VP2600/10 1 4.0 1000 200 5

Performance of Supercomputers at ZIH

0,0001

0,001

0,01

0,1

1

10

100

1000

10000 T

FL

OP

S

Jahr

Cray T3E 28 GFlops

Platz 237 VP200-EX 472 MFlops

Platz 500

SGI Origin 2000 16,5 GFlops

Platz 236

SGI Origin 3800 85,4 GFlops

Platz 351

Rang 1

Rang 10 Rang 500

PC-Farm 10,88 TFlops

Platz 79

SGI Altix 11,9 TFlops

Platz 49

HRSK-II Stufe 1

HRSK-II Stufe 2

TOP10 (June 2011)

Rank Site Computer/Year Vendor Cores Rmax Rpeak Power

1 RIKEN Advanced Institute for Computational Science (AICS) Japan

K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect / 2011 Fujitsu

548352 8162.00 8773.63 9898.56

2 National Supercomputing Center in Tianjin China

Tianhe-1A - NUDT TH MPP, X5670 2.93Ghz 6C, NVIDIA GPU, FT-1000 8C / 2010 NUDT

186368 2566.00 4701.00 4040.00

3 DOE/SC/Oak Ridge National Laboratory United States

Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz / 2009 Cray Inc.

224162 1759.00 2331.00 6950.60

4 National Supercomputing Centre in Shenzhen (NSCS) China

Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU / 2010 Dawning

120640 1271.00 2984.30 2580.00

5 GSIC Center, Tokyo Institute of Technology Japan

TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows / 2010 NEC/HP

73278 1192.00 2287.63 1398.61

6 DOE/NNSA/LANL/SNL United States

Cielo - Cray XE6 8-core 2.4 GHz / 2011 Cray Inc.

142272 1110.00 1365.81 3980.00

7 NASA/Ames Research Center/NAS United States

Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 Ghz, Infiniband / 2011 SGI

111104 1088.00 1315.33 4102.00

8 DOE/SC/LBNL/NERSC United States

Hopper - Cray XE6 12-core 2.1 GHz / 2010 Cray Inc.

153408 1054.00 1288.63 2910.00

9 Commissariat a l'Energie Atomique (CEA) France

Tera-100 - Bull bullx super-node S6010/S6030 / 2010 Bull SA

138368 1050.00 1254.55 4590.00

10 DOE/NNSA/LANL United States

Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2009 IBM

122400 1042.00 1375.78 2345.50

Trends: Architectures

?

Trends: Processor Family

Trends: Interconnect Family

?

Trends: Operating System Family

?

Matthias Müller ([email protected])


HPCC Benchmark

Slides courtesy Jack Dongara

From Linpack Benchmark and Top500: “no single number can reflect overall performance”

Clearly need something more than Linpack

HPC Challenge Benchmark

Test suite stresses not only the processors, but the memory system and the interconnect.

The real utility of the HPCC benchmarks are that architectures can be described with a wider range of metrics than just Flop/s from Linpack.

Linpack Benchmark

Good

One number

Simple to define & easy to rank

Allows problem size to change with machine and over time

Bad

Emphasizes only peak CPU speed and number of CPUs

Does not stress local bandwidth

Does not stress the network

Does not test gather/scatter

Ignores Amdahl s Law (Only does weak scaling)

…

Ugly

Benchmarketeering hype

Consists of basically 7 benchmarks; Think of it as a framework or harness for adding benchmarks of interest.

1. HPL (LINPACK) MPI Global (Ax = b)

2. STREAM Local; single CPU *STREAM Embarrassingly parallel

3. PTRANS (A A + BT) MPI Global

4. RandomAccess Local; single CPU *RandomAccess Embarrassingly parallel RandomAccess MPI Global

5. BW and Latency – MPI

6. FFT - Global, single CPU, and EP

7. Matrix Multiply – single CPU and EP

35

HPCC was developed by HPCS to assist in testing new HEC systems Each benchmark focuses on a different part of the memory hierarchy HPCS performance targets attempt to

Flatten the memory hierarchy Improve real application performance Make programming easier

HPC Challenge

Performance Targets

HPL: linear system solve Ax = b

STREAM: vector operations A = B + s * C

FFT: 1D Fast Fourier Transform Z = fft(X)

RandomAccess: integer update T[i] = XOR( T[i], rand)

Cache(s)

Local Memory

Registers

Remote Memory

Disk

Tape

Instructions

Memory Hierarchy

Operands

Lines Blocks

Messages

Pages

Local - only a single processor is performing computations.

Embarrassingly Parallel - each processor in the entire system is performing computations but they do no communicate with each other explicitly.

Global - all processors in the system are performing computations and they explicitly communicate with each other.

Computational

resources

CPU computational

speed

Memory bandwidth

Node

Interconnect bandwidth

Computational

resources

CPU computational

speed

Memory bandwidth

Node

Interconnect bandwidth

HPL

Matrix Multiply

STREAM Random & Natural Ring

Bandwidth & Latency

Memory Access Patterns

Memory Access Patterns

Size

Rate

TPP performance

TPP Linpack Benchmark Used for the Top500 ratings Solve Ax=b, dense problem, matrix is random

Uses LU decomposition with partial pivoting Based on the ScaLAPACK routines but optimized The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per processor memory usage In double precision (64-bit) arithmetic Run on all processors Problem size set by user

These settings used for the other tests

Requires An implementation of the MPI An implementation of the Basic Linear Algebra Subprograms (BLAS)

Reports total TFlop/s achieved for set of processors Takes the most time

Considering stopping the process after say 25% Still check to see if correct

The STREAM Benchmark is a standard benchmark for the measurement of computer memory bandwidth Measures bandwidth sustainable from standard operations -- not the theoretical "peak bandwidth" provided by most vendors Four operations

COPY, SCALE ADD, TRIAD

Measures: Machine Balance - relative cost of memory accesses vs arithmetic Vector lengths chosen to fill local memory

Tested on a single processor Tested on all processors in the set in an embarrassingly parallel fashion

Reports total GB/s achieved per processor

------------------------------------------------------------------

name kernel bytes/iter FLOPS/iter

------------------------------------------------------------------

COPY: a(i) = b(i) 16 0

SCALE: a(i) = q*b(i) 16 1

SUM: a(i) = b(i) + c(i) 24 1

TRIAD: a(i) = b(i) + q*c(i) 24 2

------------------------------------------------------------------

Implements parallel matrix transpose

A = A + BT

The matrices A and B are distributed across the processors

Two-dimensional block-cyclic storage Same storage as for HPL

Exercises the communications pattern where pairs of processors communicate with each other simultaneously.

Large (out-of-cache) data transfers across the network

Stresses the global bisection bandwidth

Reports total GB/s achieved for set of processors

Integer Read-modify-write to random address No spatial or temporal locality Measures memory latency or the ability to hide memory latency

Architecture stresses Latency to cache and main memory Architectures which can generate enough outstanding memory operations to tolerate the latency, change this into a main memory bandwidth constrained benchmark

Three forms Tested on a single processor Tested on all processors in the set in an embarrassingly parallel fashion

Tested with an MPI version across the set of processors

Each processor caches updates then all processors perform MPI all-to-all communication to perform updates across processors

Reports Gup/s (Giga updates per second) per processor

Ping-Pong test between pairs of processors

Send a message from proci to prock then return message from prock to proci

proci MPI_Send() - prock MPI_Recv() proci MPI_Recv() - prock MPI_Send() Other processors doing MPI_Waitall()

time += MPI_Wtime() time /= 2 The test is performed between as many possible distinct pairs of processors.

There is an upper bound on the time for the test Tries to find the weakest link amongst all pairs

Minimum bandwidth Maximum latency Not necessarily the same link will be the worst for bandwidth and latency

Message 8B used for latency test; take max time Message 2MB used for bandwidth test; take min GB/s

Two types of rings: Naturally ordered

(use MPI_COMM_WORLD): 0,1,2, ... P-1.

Randomly ordered (30 rings tested)

eg.: 7, 2, 5, 0, 3, 1, 4, 6

Each node posts two sends (to its left and right neighbor) and two receives (from its left and right neighbor).

Two types of communication routines are used: combined send/receive and non-blocking send/receive.

MPI_Sendrecv( TO: right_neighbor,FROM: left_neighbor) MPI_Irecv( left_neighbor )MPI_Irecv( right_neighbor ) and MPI_Isend( right_neighbor )MPI_Isend( left_neighbor )

The smaller (better) time for each is taken (which one is smaller depends on the MPI implementation).

Message 8B used for latency test; Message 2MB used for bandwidth test;

Using FFTE software Daisuke Takahashi code from University of Tsukuba 64 bit complex 1-D FFT

Uses 64 bit addressing Global transpose with MPI_Alltoall() Three transposes (data is never scrambled)

Single program to download and run Simple input file similar to HPL input

Base Run and Optimization Run Base run must be made

User supplies MPI and the BLAS Optimized run allowed to replace certain routines

User specifies what was done

Results upload via website html table and Excel spreadsheet generated with performance results

Intentionally we are not providing a single figure of merit (no over all ranking)

Goal: no more than 2 X the time to execute HPL.

49

1.Download

2.Install

3.Run

4.Upload results

5.Confirm via @email@

6.Tune

7.Run

8.Upload results

9.Confirm via @email@

Only some routines can be replaced Data layout needs to be preserved Multiple languages can be used

Provide detailed installation and

execution environment

Results are immediately available on the web site: Interactive HTML XML MS Excel Kiviat charts (radar plots)

Optional

Prequesites: C compiler

BLAS MPI

Center for Information Services and High Performance ... · 1995 Intel Paragon XP/S MP 6768 281.1 128600 25700 338 1994 Intel Paragon XP/S MP 6768 281.1 128600 25700 338 1993 Fujitsu

Documents