Transcript
Page 1: Super Computer

Supercomputer Performance Characterization

Presented By: IQxplorer

Page 2: Super Computer

Here are some important computer performance questions

• What key computer system parameters determine performance?

• What synthetic benchmarks can be used to characterize these system parameters?

• How does performance on synthetics compare between computers?

• How does performance on applications compare between computers?

• How does performance scale (i.e., vary with processor count)?

Page 3: Super Computer

Comparative performance results have been obtained on six computers at NCSA & SDSC, all with > 1,000 processors

Computer | Site | System vendor | Compute processors: p/node | Clock speed (GHz) | Processor type | Switch type
Blue Gene | SDSC | IBM | 2,048: 2 | 0.7 | PowerPC | Custom: 3-D torus + tree
Cobalt | NCSA | SGI | 1,024: 512 | 1.6 | Itanium 2 | NUMAlink 4 + InfiniBand 4x
DataStar | SDSC | IBM | 1,408: 8 & 768: 8 | 1.5 & 1.7 | Power4+ | Federation: fat tree
Mercury | NCSA | IBM | 1,262: 2 | 1.5 | Itanium 2 | Myrinet 2000
Tungsten | NCSA | Dell | 2,560: 2 | 3.2 | Xeon | Myrinet 2000
T2 | NCSA | Dell | 1,024: 2 | 3.6 | EM64T | InfiniBand 4x

Page 4: Super Computer

These computers have shared-memory nodes of widely varying size connected by different switch types

• Blue Gene
  • Massively parallel processor system with low-power, 2p nodes
  • Two custom switches for point-to-point and collective communication

• Cobalt
  • Cluster of two large, 512p nodes (also called a constellation)
  • Custom switch within nodes & commodity switch between nodes

• DataStar
  • Cluster of 8p nodes
  • Custom high-performance switch called Federation

• Mercury, Tungsten, & T2
  • Clusters of 2p nodes
  • Commodity switches

Page 5: Super Computer

Performance can be better understood with a simple model

• Total run time can be split into three components: ttot = tcomp + tcomm + tio

• Overlap may exist. If so, it can be handled as follows (see the instrumentation sketch at the end of this page):
  • tcomp = computation time
  • tcomm = communication time that can't be overlapped with tcomp
  • tio = I/O time that can't be overlapped with tcomp & tcomm

• Relative values vary depending upon computer, application, problem, & number of processors
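To make the decomposition concrete, here is a minimal C/MPI sketch (not from the presentation) of how the three components can be accumulated with MPI_Wtime(); the do_compute, do_communicate, and do_io routines are empty placeholders standing in for an application's real computation, non-overlapped communication, and non-overlapped I/O.

    /* Minimal sketch of the ttot = tcomp + tcomm + tio split, instrumented
     * with MPI_Wtime(). The do_* routines are empty placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_compute(void)     { /* local computation      */ }
    static void do_communicate(void) { /* e.g., halo exchange    */ }
    static void do_io(void)          { /* e.g., checkpoint write */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t_comp = 0.0, t_comm = 0.0, t_io = 0.0;
        for (int step = 0; step < 100; step++) {
            double t0 = MPI_Wtime();
            do_compute();
            double t1 = MPI_Wtime();
            do_communicate();
            double t2 = MPI_Wtime();
            do_io();
            double t3 = MPI_Wtime();

            t_comp += t1 - t0;
            t_comm += t2 - t1;
            t_io   += t3 - t2;
        }

        /* With no overlap, the accumulated pieces sum to the total run time. */
        printf("ttot = %g s (comp %g, comm %g, io %g)\n",
               t_comp + t_comm + t_io, t_comp, t_comm, t_io);

        MPI_Finalize();
        return 0;
    }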

Page 6: Super Computer

Run-time components depend upon system parameters & code features

Run-time component | System parameter | Code feature
tcomp | Flop speed | Computation in cache
tcomp | Memory bandwidth | Strided memory access
tcomp | Memory latency | Random memory access
tcomm | Interconnect bandwidth | Large message transfers
tcomm | Interconnect latency | Small message transfers
tio | I/O rate | Large transfers to disk

Differences between point-to-point & collective communication are important too
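To illustrate the tcomp rows of the table above, here is a rough C sketch (an assumption, not benchmark code) contrasting a unit-stride sweep, which is limited mainly by memory bandwidth, with a dependent pointer chase through a shuffled index array, which is limited mainly by memory latency.

    /* Strided vs. random memory access on an array well beyond cache. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1u << 24)                 /* 16 M elements */

    int main(void)
    {
        double *a    = malloc(N * sizeof *a);
        size_t *next = malloc(N * sizeof *next);

        /* Strided (unit-stride) access: streams through memory,
         * so memory bandwidth sets the rate. */
        double sum = 0.0;
        for (size_t i = 0; i < N; i++) a[i] = (double)i;
        for (size_t i = 0; i < N; i++) sum += a[i];

        /* Random access: shuffle an index array, then chase it so that
         * every load depends on the previous one and latency can't be hidden. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        size_t k = 0;
        for (size_t i = 0; i < N; i++) k = next[k];

        printf("sum = %g, final index = %zu\n", sum, k);
        free(a);
        free(next);
        return 0;
    }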

Page 7: Super Computer

Compute, communication, & I/O speeds have been measured for many synthetic & application benchmarks

• Synthetic benchmarks
  • sloops (includes daxpy & dot)
  • HPL (Linpack)
  • HPC Challenge
  • NAS Parallel Benchmarks
  • IOR

• Application benchmarks
  • Amber 9 PMEMD (biophysics: molecular dynamics)
  • …
  • WRF (atmospheric science: weather prediction)

Page 8: Super Computer

Normalized memory access profiles for daxpy show better memory access, but more memory contention, on Blue Gene compared to DataStar

[Chart: memory bandwidth (mem ops/processor-clock, log scale 0.1 to 10) vs. memory accessed, 16n (B), from 1E+03 to 1E+09. Curves: DS 1p/n O4 qnoipa, DS 8p/n O4 qnoipa, BG CO: 1p/n O3 440d, BG VN: 2p/n O3 440d, BG CO: 1p/n O3 440.]

daxpy: a(i) = a(i) + s*b(i)

DataStar: 1.5-GHz Power4+ processors; Blue Gene: 700-MHz PowerPC processors
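For reference, here is a simplified C sketch (an assumption, not the actual sloops benchmark) of a daxpy probe like the one plotted above: it sweeps the vector length n so that the 16n bytes touched range from cache-resident to far beyond cache, and reports an effective memory-access rate at each size.

    /* daxpy memory-access sweep: a(i) = a(i) + s*b(i) over increasing n. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void)
    {
        const double s = 3.0;
        for (size_t n = 1u << 10; n <= 1u << 25; n <<= 2) {
            double *a = malloc(n * sizeof *a);
            double *b = malloc(n * sizeof *b);
            for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

            double t0 = now();
            for (size_t i = 0; i < n; i++)
                a[i] = a[i] + s * b[i];          /* daxpy */
            double dt = now() - t0;

            /* Each iteration loads a[i] and b[i] and stores a[i]: ~3 mem ops. */
            printf("16n = %10zu B  rate = %8.3g mem ops/s  (a[0] = %g)\n",
                   16 * n, 3.0 * n / dt, a[0]);
            free(a);
            free(b);
        }
        return 0;
    }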

Page 9: Super Computer

Each HPCC synthetic benchmark measures one or two system parameters in varying combinations

Primary system parameters that determine performance:

HPCC benchmark | Primary system parameters
HPL | Flop speed, interconnect bandwidth
DGEMM | Flop speed
STREAM | Memory bandwidth
PTRANS | Memory bandwidth, interconnect bandwidth
RandomAccess | Memory latency, interconnect latency
FFTE | Memory bandwidth, interconnect bandwidth
bench_lat_bw | Interconnect bandwidth, interconnect latency
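As one example from the table, here is a simplified, single-processor C sketch (an assumption, not the HPCC source) of what RandomAccess measures: random read-modify-write updates to a table much larger than cache, so memory latency, and in the global G-RandomAccess version interconnect latency, dominates rather than bandwidth or flop speed.

    /* Serial stand-in for the RandomAccess (GUPS) update loop. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_BITS 24
    #define TABLE_SIZE (1ULL << TABLE_BITS)      /* 16 M entries, 128 MB */
    #define NUPDATE    (4 * TABLE_SIZE)

    int main(void)
    {
        uint64_t *table = malloc(TABLE_SIZE * sizeof *table);
        for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;

        /* A simple 64-bit LCG stands in for the benchmark's random stream. */
        uint64_t r = 1;
        for (uint64_t i = 0; i < NUPDATE; i++) {
            r = r * 6364136223846793005ULL + 1442695040888963407ULL;
            table[r >> (64 - TABLE_BITS)] ^= r;  /* one random update */
        }

        printf("table[0] = %llu\n", (unsigned long long)table[0]);
        free(table);
        return 0;
    }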

Page 10: Super Computer

Relative speeds are shown for HPCC benchmarks on 6 computers at 1,024p; 4 different computers are fastest depending upon benchmark; 2 of these are also slowest, depending upon benchmark

[Chart: speed relative to 1.5-GHz DataStar (log scale, 0.1 to 10) vs. HPCC benchmark on 1,024p: G-HPL, G-PTRANS, G-FFTE, G-RandomAccess, EP-STREAM Triad, EP-DGEMM, RandomRing Bandwidth, RandomRing Latency. Systems: DataStar (1.5-GHz Power4+), Cobalt (1.6-GHz Itanium 2), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), Blue Gene (0.7-GHz PowerPC).]

Data available soon at CIP Web site: www.ci-partnership.org

Page 11: Super Computer

Absolute speeds are shown for HPCC & IOR benchmarks on SDSC computers; TG processors are fastest, BG & DS interconnects are fastest, & all three computers have similar I/O rates

Computer | EP-DGEMM (Gflop/s) | EP-STREAM Triad (GB/s) | Ping Pong bandwidth (GB/s) | Ping Pong latency (µs) | IOR write rate (GB/s) | IOR read rate (GB/s)
Blue Gene | 2.2 | 0.87 | 0.16 | 4 | 3.4 | 2.7
DataStar | 3.7 | 1.68 | 1.40 | 6 | 3.8 | 2.0
TeraGrid IA-64 | 5.6 | 1.90 | 0.23 | 11 | 4.2 | 3.1

Page 12: Super Computer

Relative speeds are shown for 5 applications on 6 computers at various processor counts; Cobalt & DataStar are generally fastest

[Chart: speed relative to 1.5-GHz DataStar (log scale, 0.1 to 10) vs. application: GAMESS lg (384p), MILC lg (1,024p), NAMD ApoA1 (512p), PARATEC lg (256p), WRF std (256p). Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), Blue Gene (0.7-GHz PowerPC).]

Page 13: Super Computer

Good scaling is essential to take advantage of high processor counts

• Two types of scaling are of interest
  • Strong: performance vs processor count (p) for fixed problem size
  • Weak: performance vs p for fixed work per processor

• There are several ways of plotting scaling (see the sketch after this list)
  • Run time (t) vs p
  • Speed (1/t) vs p
  • Speed/p vs p

• Scaling depends significantly on the computer, application, & problem

• Use log-log plot to preserve ratios when comparing computers
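Here is an illustrative C sketch with made-up timings (not measured data) showing how the plotted quantities relate for a strong-scaling scan: run time t, speed relative to the smallest processor count, and speed per processor (i.e., parallel efficiency).

    /* Strong-scaling quantities from a set of (p, t) measurements. */
    #include <stdio.h>

    int main(void)
    {
        int    p[] = { 16, 32, 64, 128, 256, 512 };
        double t[] = { 100.0, 51.0, 26.5, 14.0, 8.0, 5.5 };  /* hypothetical seconds */
        int n = sizeof p / sizeof p[0];

        printf("%6s %10s %10s %10s\n", "p", "t (s)", "speedup", "speed/p");
        for (int i = 0; i < n; i++) {
            double speedup = t[0] / t[i];            /* speed relative to p = 16 */
            double perproc = speedup * p[0] / p[i];  /* parallel efficiency      */
            printf("%6d %10.2f %10.2f %10.2f\n", p[i], t[i], speedup, perproc);
        }
        return 0;
    }

Plotting the speedup or speed/p column against p on log-log axes, as the last bullet suggests, keeps the ratios between computers visible across the whole range of processor counts.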

Page 14: Super Computer

AWM 512^3 problem shows good strong scaling to 2,048p on Blue Gene & to 512p on DataStar, but not on TeraGrid cluster

Data from Yifeng Cui

[Chart: AWM 512^3 execution time without I/O; wall-clock time per step (log scale, 0.01 to 10) vs. number of processors (64 to 2,048). Curves: DS, BG, TG, plus DS ideal, BG ideal, & TG ideal scaling lines.]

Page 15: Super Computer

MILC medium problem shows superlinear speedup on Cobalt, Mercury, & DataStar at small processor counts; strong scaling ends for DataStar & Blue Gene above 2,048p

[Chart: MILC medium; speed/processor relative to 1.5-GHz DataStar at 16p (log scale, 0.1 to 10) vs. processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), Mercury (1.5-GHz Itanium 2), DataStar (1.5-GHz Power4+), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), Blue Gene (0.7-GHz PowerPC).]

Page 16: Super Computer

NAMD ApoA1 problem scales best on DataStar & Blue Gene; Cobalt is fastest below 512p, but the same speed as DataStar at 512p

[Chart: NAMD ApoA1 with 92k atoms; speed/processor relative to 1.5-GHz DataStar at 16p (log scale, 0.1 to 10) vs. processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), Tungsten (3.2-GHz Xeon), Blue Gene (0.7-GHz PowerPC).]

Page 17: Super Computer

WRF standard problem scales best on DataStar; Cobalt is fastest below 512p, but the same speed as DataStar at 512p

[Chart: WRF standard; speed/processor relative to 1.5-GHz DataStar at 16p (log scale, 0.1 to 10) vs. processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon).]

Page 18: Super Computer

Communication fraction generally grows with processor count in strong scaling scans, such as for WRF standard problem on DataStar

[Chart: communication fraction (0.0 to 1.0) vs. processors (16 to 2,048) for the WRF standard problem on DataStar (1.5-GHz Power4+).]

Page 19: Super Computer

A more careful look at Blue Gene shows many pluses

+ Hardware is more reliable than for other high-end systems installed at SDSC in recent years

+ Compute times are extremely reproducible
+ Networks scale well
+ I/O performance with GPFS is good at high p
+ Price per peak flop/s is low
+ Power per flop/s is low
+ Footprint is small

Page 20: Super Computer

But there are also some minuses

- Processors are relatively slow
  • Clock speed is 700 MHz
  • Compilers seldom use the second FPU in each processor (though optimized libraries do)

- Applications must scale well to get high absolute performance

- Memory is only 512 MB/node, so some problems don't fit
  • Coprocessor mode can be used (with 1p/node), but this is inefficient
  • Some problems still don't fit even in coprocessor mode

- Cross-compiling complicates software development for complex codes

Page 21: Super Computer

Major applications ported and being run on BG at SDSC span various disciplines

Code name | Discipline | Description | Implementors
Amber 9 PMEMD | Biophysics | Molecular dynamics | Ross Walker (SDSC)
AWM | Geophysics | 3-D seismic wave propagation | Yifeng Cui (SDSC)
DNS (ESSL) | Engineering | Direct numerical simulation of 3-D turbulence | Diego Donzis (Georgia Tech) & Dmitry Pekurovsky (SDSC)
DOT (FFTW) | Biophysics | Protein docking | Susan Lindsey (SDSC) & Wayne Pfeiffer (SDSC)
MILC * | Physics | Quantum chromodynamics | Doug Toussaint (Arizona)
mpcugles | Engineering | 3-D fluid dynamics | Giri Chukkapalli (SDSC)
NAMD 2.6b1 (FFTW) * | Biophysics | Molecular dynamics | Sameer Kumar (IBM)
Rosetta * | Biophysics | Protein folding | Ross Walker (SDSC)
SPECFEM3D | Geophysics | 3-D seismic wave propagation | Brian Savage (Carnegie Institution)

* Most heavily used

Page 22: Super Computer

Speed of BG relative to DataStar varies about the clock speed ratio (0.47 = 0.7/1.5) for applications on ≥ 512p; CO & VN modes perform similarly (per MPI p)

[Chart: speed relative to 1.5-GHz DataStar (log scale, 0.1 to 1) for BG in CO mode & VN mode, with a reference line at the clock speed ratio of 0.47. Applications: Amber 9 PMEMD Cellulose (768p), AWM 512^3 w/o I/O (1,024p), DNS (ESSL) 1024^3 (1,024p), DOT (FFTW) UDG/UGI 54k rots (512p), MILC large (1,024p), mpcugles forward prop w/o I/O (512p), NAMD 2.6b1 (FFTW) ApoA1 (512p), SPECFEM3D Tonga-Fiji (1,024p).]

Page 23: Super Computer

DNS scaling on BG is generally better than on DataStar, but shows unusual variation; VN mode is somewhat slower than CO mode (per MPI p)

[Chart: DNS 1024^3; speed/processor relative to 1.5-GHz DataStar (log scale, 0.1 to 1) vs. MPI processors (16 to 2,048). Curves: 1.5-GHz DataStar, Blue Gene CO mode, Blue Gene VN mode.]

Data from Dmitry Pekurovsky

Page 24: Super Computer

If the number of allocated processors is considered, then VN mode is faster than CO mode, and both modes show unusual variation

[Chart: DNS 1024^3; speed/processor relative to 1.5-GHz DataStar (log scale, 0.1 to 1) vs. allocated processors (16 to 2,048). Curves: 1.5-GHz DataStar, Blue Gene CO mode, Blue Gene VN mode.]

Data from Dmitry Pekurovsky

Page 25: Super Computer

IOR weak scaling scans using GPFS-WAN show BG in VN mode achieves 3.4 GB/s for writes (~DS) & 2.7 GB/s for reads (>DS)

[Chart: I/O rate (MB/s, log scale 10 to 10,000) vs. MPI processors (1 to 2,048). Curves: CO peak, CO write, CO read, VN peak, VN write, VN read.]

Blue Gene in CO & VN mode using gpfs-wan (default mapping); noncollective read/write via IOR with 256 or 128 MB/p (-a POSIX -e -t 1m -b 256m or -b 128m)

Data from 3/7/06

Page 26: Super Computer

Blue Gene has more limited applicability than DataStar, but is a good choice if the application is right

+ Some applications run relatively fast & scale well
+ Turnaround is good with only a few users
+ Hardware is reliable & easy to maintain
- Other applications run relatively slowly and/or don't scale well
- Some typical problems need to run in CO mode to fit in memory
- Other typical problems won't fit at all