Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems)
9 November 2011
Holger Brunst ([email protected]), Matthias S. Mueller ([email protected])
Nöthnitzer Straße 46, Room 1026, Tel. +49 351 463 35048
Summary of Previous Lecture
Different workloads:
– Test workload
– Real workload
– Synthetic workload
Historical examples for test workloads:
– Addition instruction
– Instruction mixes
– Kernels
– Synthetic programs
– Application benchmarks
Excursion on Speedup and Efficiency Metrics
Comparison of sequential and parallel algorithms
Speedup:
– Sn = T1 / Tn
– n is the number of processors
– T1 is the execution time of the sequential algorithm
– Tn is the execution time of the parallel algorithm with n processors
Efficiency:
– En = Sn / n
– Its value indicates how well the n processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (later)
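A quick worked example (numbers assumed for illustration): if the sequential run takes T1 = 100 s and the parallel run on n = 8 processors takes T8 = 16 s, then S8 = 100/16 = 6.25 and E8 = 6.25/8 ≈ 0.78, i.e. the 8 processors are roughly 78% utilized.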
Amdahl's Law
Find the maximum expected improvement to an overall system when only part of the system is improved
Serial execution time = s+p
Parallel execution time = s+p/n
– Normalizing with respect to serial time (s+p) = 1 results in:
• Sn = 1/(s+p/n)
– Speedup drops off rapidly as the serial fraction increases
– Maximum possible speedup = 1/s, independent of the number of processors n!
Bad news: if an application has only 1% serial work (s = 0.01), you will never see a speedup greater than 100. So why do we build systems with more than 100 processors?
What is wrong with this argument?
General form: Sn = (s + p) / (s + p/n)
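A quick worked example (values assumed for illustration): with serial fraction s = 0.05 and p = 0.95, S16 = 1/(0.05 + 0.95/16) ≈ 9.1 and S1024 = 1/(0.05 + 0.95/1024) ≈ 19.6, while the limit is 1/s = 20; even 1024 processors cannot push the speedup past 20.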
Popular and historic benchmarks
Popular benchmarks:
– Eratosthenes sieve algorithm
– Ackermann's Function
– Whetstone
– LINPACK
– Dhrystone
– Lawrence Livermore Loops
– TPC-C
– SPEC
Workload description
Level of Detail of the workload description - Examples:
– Most frequent request (e.g. Addition)
– Frequency of request type (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also probability distribution)
Linpack 100 performance over time (n = 100):

Year  Computer                                     # Procs  Cycle time  Mflop/s
2003  HP Integrity Server rx2600 (1 proc 1.5 GHz)     1     1.5 GHz      1635
2002  Intel Pentium 4 (3.06 GHz)                      1     3.06 GHz     1414
2001  Fujitsu VPP5000/1                               1     3.33 nsec    1156
2000  Fujitsu VPP5000/1                               1     3.33 nsec    1156
1999  CRAY T916                                       4     2.2 nsec     1129
1995  CRAY T916                                       1     2.2 nsec      522
1994  CRAY C90                                       16     4.2 nsec      479
1993  CRAY C90                                       16     4.2 nsec      479
1992  CRAY C90                                       16     4.2 nsec      479
1991  CRAY C90                                       16     4.2 nsec      403
1990  CRAY Y-MP                                       8     6.0 nsec      275
1989  CRAY Y-MP                                       8     6.0 nsec      275
1988  CRAY Y-MP                                       1     6.0 nsec       74
1987  ETA 10-E                                        1     10.5 nsec      52
1986  NEC SX-2                                        1     6.0 nsec       46
1985  NEC SX-2                                        1     6.0 nsec       46
1984  CRAY X-MP                                       1     9.5 nsec       21
1983  CRAY 1                                          1     12.5 nsec      12
1979  CRAY 1                                          1     12.5 nsec       3.4
In the beginning there was the Linpack 100 benchmark (1977)
– n = 100 (80 KB); a size that would fit on all the machines
– Fortran; 64-bit floating-point arithmetic
– No hand optimization (only compiler options)
Linpack 1000 (1986)
– n = 1000 (8 MB); wanted to see higher performance levels
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK
Linpack TPP (1991) (Top500: 1993)
– Any size (n as large as you can; n = 10^6 needs 8 TB and ~6 hours)
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK; Strassen's method not allowed (it confuses the operation count and rate)
– Reference implementation available
In all cases results are verified by checking that the scaled residual is of order one:
||Ax − b|| / (||A|| ||x|| n) = O(1)
Operation counts: factorization 2/3 n^3 + 1/2 n^2; solve 2 n^2
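A minimal sketch of this verification in C (illustrative only: naive norms and a plain matrix-vector product, with the infinity norm chosen here for simplicity; the reference implementation differs in detail):

    #include <math.h>
    #include <stdlib.h>

    /* Infinity norm of an n-vector. */
    static double norm_inf(const double *v, int n) {
        double m = 0.0;
        for (int i = 0; i < n; i++)
            if (fabs(v[i]) > m) m = fabs(v[i]);
        return m;
    }

    /* Scaled residual ||Ax - b|| / (||A|| ||x|| n) for a row-major
       n x n matrix A; a correct solve yields an O(1) result. */
    double scaled_residual(const double *A, const double *x,
                           const double *b, int n) {
        double *r = malloc(n * sizeof *r);
        double normA = 0.0;
        for (int i = 0; i < n; i++) {
            double rowsum = 0.0, dot = 0.0;
            for (int j = 0; j < n; j++) {
                rowsum += fabs(A[i * n + j]);   /* row sum for ||A|| */
                dot += A[i * n + j] * x[j];     /* (Ax)_i */
            }
            if (rowsum > normA) normA = rowsum;
            r[i] = dot - b[i];                  /* residual entry */
        }
        double res = norm_inf(r, n) / (normA * norm_inf(x, n) * n);
        free(r);
        return res;
    }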
LINPACK NxN benchmark
– Solves a system of linear equations by some method
– Allows the vendors to choose the size of the problem for the benchmark
– Measures execution time for each problem size
LINPACK NxN report
– Nmax: the size of the chosen problem run on a machine
– Rmax: the performance in Gflop/s for the chosen size problem run on the machine
– N1/2: the size where half the Rmax execution rate is achieved
– Rpeak: the theoretical peak performance in Gflop/s for the machine
LINPACK NxN is used to rank the TOP500 fastest computers in the world
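As a worked example (numbers assumed for illustration): for N = 100000 the operation count is about 2/3 N^3 ≈ 6.7 · 10^14 flops, so a run that takes 1000 s is reported as roughly 670 Gflop/s.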
[Figure: TPP performance, rate vs. problem size; the curve reaches Rmax at size Nmax and half of Rmax at size N1/2.]
TPP performance over time (entries for this table began in 1991):

Year       Computer               # of Procs  Measured Gflop/s  Size of Problem  Size of 1/2 Perf  Theoretical Peak Gflop/s
2005-2006  IBM Blue Gene/L          131072        280600           1769471              –                  367001
1995       Intel Paragon XP/S MP      6768           281.1          128600            25700                   338
1994       Intel Paragon XP/S MP      6768           281.1          128600            25700                   338
HPCC Benchmark
Slides courtesy of Jack Dongarra
From Linpack Benchmark and Top500: “no single number can reflect overall performance”
Clearly need something more than Linpack
HPC Challenge Benchmark
Test suite stresses not only the processors, but the memory system and the interconnect.
The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just the Flop/s from Linpack.
Linpack Benchmark
Good:
– One number
– Simple to define & easy to rank
– Allows problem size to change with machine and over time
Bad:
– Emphasizes only peak CPU speed and number of CPUs
– Does not stress local bandwidth
– Does not stress the network
– Does not test gather/scatter
– Ignores Amdahl's Law (only does weak scaling)
– …
Ugly:
– Benchmarketeering hype
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. HPL (LINPACK): MPI Global (Ax = b)
2. STREAM: Local, single CPU; *STREAM: Embarrassingly parallel
3. PTRANS (A = A + B^T): MPI Global
4. RandomAccess: Local, single CPU; *RandomAccess: Embarrassingly parallel; RandomAccess: MPI Global
5. Bandwidth and Latency: MPI
6. FFT: Global, single CPU, and EP
7. Matrix Multiply: single CPU and EP
HPCC was developed by HPCS to assist in testing new HEC systems
Each benchmark focuses on a different part of the memory hierarchy
HPCS performance targets attempt to:
– Flatten the memory hierarchy
– Improve real application performance
– Make programming easier
Three modes of operation:
– Local: only a single processor is performing computations
– Embarrassingly Parallel: each processor in the entire system is performing computations, but they do not communicate with each other explicitly
– Global: all processors in the system are performing computations and they explicitly communicate with each other
[Figure: HPCC tests mapped to targeted computational resources: HPL and Matrix Multiply stress CPU computational speed, STREAM stresses memory bandwidth, and Random & Natural Ring Bandwidth & Latency stress node interconnect bandwidth; the tests differ in their memory access patterns.]
TPP Linpack Benchmark
– Used for the Top500 ratings
– Solves Ax = b; dense problem; the matrix is random
– Uses LU decomposition with partial pivoting; based on the ScaLAPACK routines but optimized
– The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per-processor memory usage
– In double precision (64-bit) arithmetic
– Run on all processors; problem size set by the user; these settings are used for the other tests
– Requires an implementation of MPI and an implementation of the Basic Linear Algebra Subprograms (BLAS)
– Reports total TFlop/s achieved for the set of processors
– Takes the most time; considering stopping the process after, say, 25%, while still checking that the result is correct
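As a single-process analogue (a sketch, not HPL itself: LAPACK's dgesv performs the same LU factorization with partial pivoting on one node, assuming the LAPACKE C interface is available; HPL distributes the matrix block-cyclically over MPI ranks on top of a tuned BLAS):

    #include <lapacke.h>
    #include <stdlib.h>

    /* Solve Ax = b in double precision via LU with partial pivoting.
       A is n x n row-major and is overwritten by its LU factors;
       b is overwritten by the solution x. Returns 0 on success. */
    int solve_dense(double *A, double *b, int n) {
        lapack_int *ipiv = malloc(n * sizeof *ipiv);  /* pivot indices */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1,
                                        A, n, ipiv, b, 1);
        free(ipiv);
        return (int)info;
    }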
The STREAM benchmark is a standard benchmark for the measurement of computer memory bandwidth
– Measures bandwidth sustainable from standard operations, not the theoretical "peak bandwidth" quoted by most vendors
– Four operations: COPY, SCALE, ADD, TRIAD
– Measures machine balance: the relative cost of memory accesses vs. arithmetic
– Vector lengths are chosen to fill local memory
– Tested on a single processor, and on all processors in the set in an embarrassingly parallel fashion
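For reference, the four operations are essentially the following loops (a minimal serial sketch; the real benchmark adds timing, repetitions, and result verification, and the array length used here is only an illustrative choice that must comfortably exceed the caches):

    #define N 10000000   /* array length; must be much larger than the caches */
    static double a[N], b[N], c[N];

    void stream_kernels(double scalar) {
        for (long j = 0; j < N; j++) c[j] = a[j];                 /* COPY  */
        for (long j = 0; j < N; j++) b[j] = scalar * c[j];        /* SCALE */
        for (long j = 0; j < N; j++) c[j] = a[j] + b[j];          /* ADD   */
        for (long j = 0; j < N; j++) a[j] = b[j] + scalar*c[j];   /* TRIAD */
    }

COPY moves 16 bytes per iteration (one read, one write) with no arithmetic, while TRIAD moves 24 bytes and performs two flops; comparing such ratios is what makes the machine-balance measurement possible.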
PTRANS (parallel matrix transpose)
– The matrices A and B are distributed across the processors
– Two-dimensional block-cyclic storage; same storage as for HPL
– Exercises the communication pattern where pairs of processors communicate with each other simultaneously
– Large (out-of-cache) data transfers across the network
– Stresses the global bisection bandwidth
– Reports total GB/s achieved for the set of processors
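A serial sketch of the underlying operation A = A + B^T (illustrative only; in the benchmark both matrices are block-cyclically distributed, so almost every element of B^T has to be fetched from another processor, which produces the pairwise exchanges described above):

    /* A = A + B^T for n x n row-major matrices. */
    void ptrans_local(double *A, const double *B, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i * n + j] += B[j * n + i];   /* add the transposed entry */
    }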
RandomAccess
– Integer read-modify-write to random addresses; no spatial or temporal locality
– Measures memory latency, or the ability to hide memory latency
– Architecture stresses: latency to cache and main memory
– Architectures that can generate enough outstanding memory operations to tolerate the latency turn this into a main-memory-bandwidth-constrained benchmark
– Three forms: tested on a single processor; tested on all processors in the set in an embarrassingly parallel fashion; tested with an MPI version across the set of processors
– In the MPI version, each processor caches updates, then all processors perform MPI all-to-all communication to apply the updates across processors
– Reports Gup/s (giga-updates per second) per processor
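The heart of the single-processor form is an update loop along these lines (a sketch modeled on the published HPCC kernel; the table size here is an illustrative choice, whereas the real benchmark sizes the table to a large fraction of memory):

    #include <stdint.h>

    #define POLY 0x0000000000000007ULL
    #define TABLE_BITS 20                      /* 2^20 words = 8 MB, for illustration */
    #define TABLE_SIZE (1ULL << TABLE_BITS)

    static uint64_t Table[TABLE_SIZE];

    /* Read-modify-write at pseudo-random addresses: no locality, so the
       rate is bound by memory latency, or by how many misses the
       architecture can keep in flight. */
    void random_access(uint64_t num_updates) {
        uint64_t ran = 1;
        for (uint64_t i = 0; i < num_updates; i++) {
            ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);  /* next pseudo-random value */
            Table[ran & (TABLE_SIZE - 1)] ^= ran;              /* the update */
        }
    }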
Bandwidth and Latency
– Ping-pong test between pairs of processors
– Send a message from proc_i to proc_k, then return the message from proc_k to proc_i
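A minimal MPI ping-pong sketch between ranks 0 and 1 (illustrative; the HPCC test runs this over many processor pairs and message sizes). Half of the averaged round-trip time estimates the one-way latency:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { NBYTES = 8, REPS = 1000 };
        char buf[NBYTES] = {0};
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {            /* proc_i: send, then wait for the echo */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* proc_k: receive, then echo back */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency ~ %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * REPS) * 1e6);
        MPI_Finalize();
        return 0;
    }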