An Evaluation of Current High-Performance Networks

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael Welcome, Kathy Yelick

Lawrence Berkeley National Lab & U.C. Berkeley

http://upc.lbl.gov
Transcript
Page 1: Title

An Evaluation of Current High-Performance Networks

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael Welcome, Kathy Yelick

Lawrence Berkeley National Lab & U.C. Berkeley

http://upc.lbl.gov

Page 2: Motivation

• Benchmark a variety of current high-speed networks
  - Measure latency and software overhead, not just bandwidth
  - Does one-sided communication provide advantages over two-sided MPI?
• Global Address Space (GAS) languages
  - UPC, Titanium (Java), Co-Array Fortran
  - Small message performance (8 bytes)
  - Support sparse/irregular/adaptive programs
  - Programming model: incremental optimization
  - Overlapping messages can hide latency

Page 3: Systems Evaluated

System           Network       Bus          Bandwidth (per sec)   1-sided hardware   APIs
Cray T3E         Custom        Custom       330 MB                •                  SHMEM, E-registers
IBM SP           SP Switch 2   GX bus       2 GB                                     LAPI
HP AlphaServer   Quadrics      PCI 64/66    532 MB                •                  SHMEM
IBM Netfinity    Myrinet       PCI 32/66    266 MB                •                  GM
PC cluster       GigE          PCI 64/66    532 MB                                   VIPL

(• marks networks with hardware support for one-sided communication)

Page 4: Modified LogGP Model

• LogGP: no overlap
• Observed: overheads can overlap, so L can be negative

[Timing diagrams: in classic LogGP, P0's send overhead (o_send), the transport latency L, and P1's receive overhead (o_recv) occur strictly in sequence; in the observed case, o_send and o_recv overlap in time]

EEL: end-to-end latency (measured directly, instead of the transport latency L)
g: minimum time between small message sends
G: additional gap per byte for larger messages
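Written out explicitly (my notation, using only the definitions above): classic LogGP charges the overheads and the wire latency strictly in sequence, while the modified model measures EEL directly and lets the residual L absorb any overlap between the two overheads, even going negative.

```latex
% Classic LogGP cost of an n-byte message (no overlap assumed):
\[
  T(n) \;=\; o_{\mathrm{send}} \;+\; L \;+\; (n-1)\,G \;+\; o_{\mathrm{recv}}
\]
% Modified model for small (8-byte) messages: EEL is measured end to
% end, and o_send and o_recv may overlap in time, so the residual L
% that balances the equation can come out negative.
\[
  \mathrm{EEL} \;=\; o_{\mathrm{send}} \;+\; L \;+\; o_{\mathrm{recv}},
  \qquad L < 0 \text{ is possible}
\]
```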

Page 5: Microbenchmarks

[Timing diagrams for the Flood Test, CPU Test 1, and CPU Test 2: each shows P0 issuing sends (o_send), the gap between them, and where CPU compute time fits in]

1) Ping-pong test: measures EEL (end-to-end latency)

2) Flood test: measures gap (g/G)

3) CPU overlap test: measures software overheads

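As an illustration of test (1), here is a minimal MPI ping-pong that estimates EEL for 8-byte messages. This is a sketch in the spirit of the benchmark, not the authors' harness, which also exercised the one-sided APIs (SHMEM, E-registers, LAPI, GM, VIPL) alongside MPI.

```c
/* Minimal MPI ping-pong sketch: estimates one-way end-to-end
 * latency (EEL) for 8-byte messages. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define NITER 10000

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {        /* bounce an 8-byte message ... */
            MPI_Send(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* ... back and forth NITER times */
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* Halving the round-trip time gives the one-way EEL. */
    if (rank == 0)
        printf("EEL (1-way): %.2f usec\n", (t1 - t0) * 1e6 / (2.0 * NITER));

    MPI_Finalize();
    return 0;
}
```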

Page 6: Latencies for 8-byte 'puts'

[Bar chart: one-way end-to-end latency for 8-byte puts, in usec (0-25), for T3E/Shm, T3E/E-Reg, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI]

Page 7: 8-byte 'put' Latencies with Software Overheads

[Stacked bar chart (usec, 0-25): 8-byte put latency per system, broken into send overhead only (S), overhead overlap (V), receive overhead only (R), and other]
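Reading the legend, the stacked bars presumably decompose the measured latency so that the full send overhead is S + V and the full receive overhead is V + R; under that reading:

```latex
\[
  \mathrm{EEL} \;=\; S \;+\; V \;+\; R \;+\; \text{other}
\]
```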

Page 8: Gap varies with message clustering

[Bar chart: gap between messages, in usec (0-30), per system at queue depths q = 1, 2, 4, 8, 16]

Clustering messages both uses idle cycles and reduces the number of idle cycles that need to be filled.
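A sketch of how such a clustered flood might be measured with MPI nonblocking sends (illustrative only; the constant Q below plays the role of the queue depth q in the chart, and this is not the authors' harness):

```c
/* Flood test at queue depth Q: rank 0 floods rank 1 with 8-byte
 * nonblocking sends, Q at a time, and reports the average gap. */
#include <mpi.h>
#include <stdio.h>

#define NMSGS 100000
#define Q     8              /* messages kept in flight per batch */

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};
    MPI_Request reqs[Q];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < NMSGS; i += Q) {
            for (int j = 0; j < Q; j++)   /* issue a cluster of sends */
                MPI_Isend(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &reqs[j]);
            /* idle cycles here could instead run computation */
            MPI_Waitall(Q, reqs, MPI_STATUSES_IGNORE);
        }
    } else if (rank == 1) {
        for (int i = 0; i < NMSGS; i++)   /* sink the flood */
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* average inter-send gap at this queue depth */
        printf("gap (q=%d): %.3f usec/msg\n", Q, (t1 - t0) * 1e6 / NMSGS);

    MPI_Finalize();
    return 0;
}
```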

Page 9: Potential for CPU overlap during clustered message sends

Hardware support for 1-way communication provides more opportunity for computational overlap

[Bar chart (usec, 0-12): gap (g), send overhead, and receive overhead per system]
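One way to quantify the chart (my summary, not a formula from the slide): during a clustered flood the sending CPU is busy for roughly o_send out of every gap g, leaving per message about

```latex
\[
  t_{\mathrm{avail}} \;\approx\; g - o_{\mathrm{send}}
\]
```

of free compute time, which is why hardware support for one-sided communication (small o_send) leaves more room for overlap.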

Page 10: Fixed message cost (g) vs. per-byte cost (G)

[Bar chart (usec, 0-12): per-message cost (g) and per-KByte cost (G × 1024) per system]

Page 11: "Large" Messages

The minimum size needed for a message to qualify as "large" (i.e., for the bandwidth term to dominate the fixed per-message cost) varies by a factor of 6 across these networks.
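This threshold follows directly from the cost model: the per-byte term dominates once G·n exceeds the fixed cost g, so the crossover size is roughly

```latex
\[
  n^{*} \;\approx\; \frac{g}{G}
\]
```

and the factor-of-6 spread reflects how differently g and G are balanced across these networks.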

[Bar chart: crossover message size between g and G·n, in bytes (0-4000), per system]

Page 12: Small message performance over time

Software send overhead for 8-byte messages, tracked over time: it is not improving much, even in absolute terms.

Page 13: Conclusion

• Latency and software overhead vary widely among today's HPC networks
  - This affects the ability to mask communication latency, with a large effect on the viability of GAS languages
  - Software overhead matters most, since latency can be hidden
• These parameters have historically been overlooked in benchmarks and vendor evaluations
  - Hopefully this will change
  - Recent discussions with vendors are promising
  - Incorporating these measurements into standard benchmarks would be welcome

http://upc.lbl.gov