An Evaluation of Current High-Performance Networks

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael Welcome, Kathy Yelick

Lawrence Berkeley National Lab & U.C. Berkeley

http://upc.lbl.gov
Transcript
Page 1: Title

An Evaluation of Current High-Performance Networks

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael Welcome, Kathy Yelick

Lawrence Berkeley National Lab & U.C. Berkeley

http://upc.lbl.gov

Page 2: Motivation

• Benchmark a variety of current high-speed networks
  - Measure latency and software overhead, not just bandwidth
  - Does one-sided communication provide advantages over two-sided MPI?
• Global Address Space (GAS) languages
  - UPC, Titanium (Java), Co-Array Fortran
  - Small message performance (8 bytes)
  - Support sparse/irregular/adaptive programs
  - Programming model: incremental optimization
  - Overlapping messages can hide latency

Page 3: Systems Evaluated

System           Network       Bus          Bandwidth (per sec)   1-sided hardware   APIs
Cray T3E         Custom        Custom       330 MB                •                  SHMEM, E-registers
IBM SP           SP Switch 2   GX bus       2 GB                                     LAPI
HP AlphaServer   Quadrics      PCI 64/66    532 MB                •                  SHMEM
IBM Netfinity    Myrinet       PCI 32/66    266 MB                •                  GM
PC cluster       GigE          PCI 64/66    532 MB                                   VIPL

(• marks networks with hardware support for one-sided communication)

Page 4: Modified LogGP Model

• LogGP: no overlap
• Observed: overheads can overlap, so L can be negative

[Timing diagrams: in classic LogGP, P0's send overhead (o_send), the transport latency L, and P1's receive overhead (o_recv) occur strictly in sequence; in the observed case, o_send and o_recv overlap in time]

EEL: end-to-end latency (measured directly, instead of the transport latency L)
g: minimum time between small message sends
G: additional gap per byte for larger messages
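Written out explicitly (my notation, using only the definitions above): classic LogGP charges the overheads and the wire latency strictly in sequence, while the modified model measures EEL directly and lets the residual L absorb any overlap between the two overheads, even going negative.

```latex
% Classic LogGP cost of an n-byte message (no overlap assumed):
\[
  T(n) \;=\; o_{\mathrm{send}} \;+\; L \;+\; (n-1)\,G \;+\; o_{\mathrm{recv}}
\]
% Modified model for small (8-byte) messages: EEL is measured end to
% end, and o_send and o_recv may overlap in time, so the residual L
% that balances the equation can come out negative.
\[
  \mathrm{EEL} \;=\; o_{\mathrm{send}} \;+\; L \;+\; o_{\mathrm{recv}},
  \qquad L < 0 \text{ is possible}
\]
```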

Page 5: Microbenchmarks

[Timing diagrams for the Flood Test, CPU Test 1, and CPU Test 2: each shows P0 issuing sends (o_send), the gap between them, and where CPU compute time fits in]

1) Ping-pong test: measures EEL (end-to-end latency)

2) Flood test: measures gap (g/G)

3) CPU overlap test: measures software overheads

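As an illustration of test (1), here is a minimal MPI ping-pong that estimates EEL for 8-byte messages. This is a sketch in the spirit of the benchmark, not the authors' harness, which also exercised the one-sided APIs (SHMEM, E-registers, LAPI, GM, VIPL) alongside MPI.

```c
/* Minimal MPI ping-pong sketch: estimates one-way end-to-end
 * latency (EEL) for 8-byte messages. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define NITER 10000

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {        /* bounce an 8-byte message ... */
            MPI_Send(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* ... back and forth NITER times */
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* Halving the round-trip time gives the one-way EEL. */
    if (rank == 0)
        printf("EEL (1-way): %.2f usec\n", (t1 - t0) * 1e6 / (2.0 * NITER));

    MPI_Finalize();
    return 0;
}
```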

Page 6: Latencies for 8-byte 'puts'

[Bar chart: one-way end-to-end latency for 8-byte puts, in usec (0-25), for T3E/Shm, T3E/E-Reg, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI]

Page 7: 8-byte 'put' Latencies with Software Overheads

[Stacked bar chart (usec, 0-25): 8-byte put latency per system, broken into send overhead only (S), overhead overlap (V), receive overhead only (R), and other]
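Reading the legend, the stacked bars presumably decompose the measured latency so that the full send overhead is S + V and the full receive overhead is V + R; under that reading:

```latex
\[
  \mathrm{EEL} \;=\; S \;+\; V \;+\; R \;+\; \text{other}
\]
```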

Page 8: Gap varies with message clustering

[Bar chart: gap between messages, in usec (0-30), per system at queue depths q = 1, 2, 4, 8, 16]

Clustering messages both uses idle cycles and reduces the number of idle cycles that need to be filled.
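A sketch of how such a clustered flood might be measured with MPI nonblocking sends (illustrative only; the constant Q below plays the role of the queue depth q in the chart, and this is not the authors' harness):

```c
/* Flood test at queue depth Q: rank 0 floods rank 1 with 8-byte
 * nonblocking sends, Q at a time, and reports the average gap. */
#include <mpi.h>
#include <stdio.h>

#define NMSGS 100000
#define Q     8              /* messages kept in flight per batch */

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};
    MPI_Request reqs[Q];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < NMSGS; i += Q) {
            for (int j = 0; j < Q; j++)   /* issue a cluster of sends */
                MPI_Isend(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &reqs[j]);
            /* idle cycles here could instead run computation */
            MPI_Waitall(Q, reqs, MPI_STATUSES_IGNORE);
        }
    } else if (rank == 1) {
        for (int i = 0; i < NMSGS; i++)   /* sink the flood */
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* average inter-send gap at this queue depth */
        printf("gap (q=%d): %.3f usec/msg\n", Q, (t1 - t0) * 1e6 / NMSGS);

    MPI_Finalize();
    return 0;
}
```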

Page 9: Potential for CPU overlap during clustered message sends

Hardware support for 1-way communication provides more opportunity for computational overlap

[Bar chart (usec, 0-12): gap (g), send overhead, and receive overhead per system]
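One way to quantify the chart (my summary, not a formula from the slide): during a clustered flood the sending CPU is busy for roughly o_send out of every gap g, leaving per message about

```latex
\[
  t_{\mathrm{avail}} \;\approx\; g - o_{\mathrm{send}}
\]
```

of free compute time, which is why hardware support for one-sided communication (small o_send) leaves more room for overlap.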

Page 10: Fixed message cost (g) vs. per-byte cost (G)

[Bar chart (usec, 0-12): per-message cost (g) and per-KByte cost (G × 1024) per system]

Page 11: "Large" Messages

The minimum size needed for a message to qualify as "large" (i.e., for the bandwidth term to dominate the fixed per-message cost) varies by a factor of 6 across these networks.
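This threshold follows directly from the cost model: the per-byte term dominates once G·n exceeds the fixed cost g, so the crossover size is roughly

```latex
\[
  n^{*} \;\approx\; \frac{g}{G}
\]
```

and the factor-of-6 spread reflects how differently g and G are balanced across these networks.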

[Bar chart: crossover message size between g and G·n, in bytes (0-4000), per system]

Page 12: Small message performance over time

Software send overhead for 8-byte messages, tracked over time: it is not improving much, even in absolute terms.

Page 13: Conclusion

• Latency and software overhead vary widely among today's HPC networks
  - This affects the ability to mask communication latency, with a large effect on the viability of GAS languages
  - Software overhead matters most, since latency can be hidden
• These parameters have historically been overlooked in benchmarks and vendor evaluations
  - Hopefully this will change
  - Recent discussions with vendors are promising
  - Incorporating these measurements into standard benchmarks would be welcome

http://upc.lbl.gov