Locality-aware Connection Management and Rank Assignment for Wide-area MPI
Hideo Saito, Kenjiro Taura
The University of Tokyo
May 16, 2007
Background
Increase in the bandwidth of WANs ➭ more opportunities to perform parallel computation using multiple clusters
Requirements for Wide-area MPI
Wide-area connectivity
Firewalls and private addresses: only some nodes can connect to each other
Perform routing using the connections that happen to be possible
Reqs. for Wide-area MPI (2): Scalability
The number of conns. must be limited in order to scale to thousands of nodes
Various allocation limits of the system (e.g., memory, file descriptors, router sessions)
Simplistic schemes that may result in O(n²) connections won't scale
Lazy connect strategies work for many apps, but not for those that involve all-to-all communication
Reqs. for Wide-area MPI (3): Locality awareness
To achieve high performance with few conns., select conns. in a locality-aware manner
Many connections with nearby nodes, few connections with faraway nodes
[Figure: many conns. within a cluster, few conns. between clusters]
Reqs. for Wide-area MPI (4): Application awareness
Select connections according to the application's communication pattern
Assign ranks* according to the application's communication pattern
Adaptivity: automatically, without tedious manual configuration
* rank = process ID in MPI
Contributions of Our Work
Locality-aware connection management: uses latency and traffic information obtained from a short profiling run
Locality-aware rank assignment: uses the same info. to discover rank-process mappings with low comm. overhead
➭ Multi-Cluster MPI (MC-MPI), a wide-area-enabled MPI library
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Grid-enabled MPI Libraries
MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]
Locality-aware communication optimizations, e.g., wide-area-aware collective operations (broadcast, reduction, ...)
However, they do not work with firewalls
Grid-enabled MPI Libraries (cont'd)
MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]
Forwarding mechanisms that allow nodes to communicate even in the presence of FWs
Manual configuration: the amount of necessary config. becomes overwhelming as more resources are used
P2P Overlays
Pastry [Rowstron et al. '00]
Each node maintains just O(log n) connections; messages are routed using those connections
Highly scalable, but the routing properties are unfavorable for high performance computing
Few connections between nearby nodes: messages between nearby nodes need to be forwarded, causing large latency penalties
Adaptive MPI [Huang et al. '06]
Performs load balancing by migrating virtual processors: balance the exec. times of the physical processors and minimize inter-processor communication
Adapts to apps. by tracking the amount of communication performed between procs.
Assumes that the communication cost of every processor pair is the same; MC-MPI takes differences in communication costs into account
Lazy Connect Strategies
MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]
Establish connections only on demand
Reduces the number of conns. if each proc. only communicates with a few other procs.
Some apps. generate all-to-all comm. patterns, resulting in many connections (e.g., IS in the NAS Parallel Benchmarks)
Doesn't extend to wide-area environments where some communication may be blocked
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Overview of Our Method
Short profiling run ➭ latency matrix (L) and traffic matrix (T) ➭ locality-aware connection management and locality-aware rank assignment ➭ optimized real run
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Latency Matrix
Latency matrix L = {l_ij}; l_ij: latency between processes i and j in the target environment
Each process autonomously measures the RTT between itself and other processes
Reduce the num. of measurements by using the triangle inequality to estimate RTTs
[Figure: processes p, q, r with RTTs rtt_pq, rtt_pr, rtt_rq]
Estimation rule: if rtt_pr > α·rtt_rq, then rtt_pq = rtt_pr (α: constant)
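The rule above can be read as: if some already-measured process r is much closer to q than p is to r, then p can reuse its measured RTT to r as an estimate of its RTT to q. Below is a minimal Python sketch of applying that rule. The data layout (`measured[x]` holding the RTTs process x measured and shared) and the value of α are assumptions made here for illustration, not MC-MPI's actual internals.

```python
ALPHA = 4.0  # illustrative value only; the slide just says "alpha: constant"

def estimate_rtt(p, q, measured, alpha=ALPHA):
    """Estimate rtt(p, q) without a direct measurement.

    `measured[x]` is assumed to map peers to the RTTs that process x actually
    measured. If p measured some r, r measured q, and rtt(p, r) dominates
    rtt(r, q) by a factor alpha, then q is "near" r as seen from p, so
    rtt(p, q) is taken to be rtt(p, r).
    """
    if q in measured.get(p, {}):
        return measured[p][q]                # direct measurement available
    for r, rtt_pr in measured.get(p, {}).items():
        rtt_rq = measured.get(r, {}).get(q)
        if rtt_rq is not None and rtt_pr > alpha * rtt_rq:
            return rtt_pr                    # rtt_pq ~= rtt_pr
    return None                              # no estimate; fall back to measuring
```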
Traffic Matrix
Traffic matrix T = {t_ij}; t_ij: traffic between ranks i and j in the target application
Many applications repeat similar communication patterns ➭ execute the application for a short amount of time (e.g., one iteration of an iterative app.) and make t_ij the number of transmitted messages
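As a rough illustration of how such a matrix could be accumulated during the short profiling run, here is a small Python sketch. The `record_send` hook is hypothetical (the slides do not show MC-MPI's instrumentation); it simply counts messages per rank pair.

```python
from collections import defaultdict

class TrafficProfile:
    """Accumulates t_ij = number of messages sent from rank i to rank j."""

    def __init__(self, n_ranks):
        self.n = n_ranks
        self.counts = defaultdict(int)      # (src, dst) -> message count

    def record_send(self, src, dst):
        # Hypothetical hook: call once per message during the profiling run,
        # e.g., over one iteration of an iterative application.
        self.counts[(src, dst)] += 1

    def matrix(self):
        """Return T as a dense n x n nested list."""
        T = [[0] * self.n for _ in range(self.n)]
        for (i, j), c in self.counts.items():
            T[i][j] = c
        return T
```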
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Connection Management
Steps: bounding graph ➭ spanning tree ➭ lazy connection establishment
Candidate connections are selected during MPI_Init; during the application body, candidate connections are established on demand
Selection of Candidate Connections
Each process selects O(log n) neighbors based on L and T
A parameter controls the connection density; n: number of processes
[Figure: the parameter's worth of connections to the nearest processes, half as many to the next band, a quarter to the next, ... — many nearby processes, few faraway processes]
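One possible reading of the selection scheme, as a hedged Python sketch: processes are sorted by estimated latency, grouped into bands that grow with distance, and progressively fewer neighbors are picked per band. The exact band layout, the random choice within a band, and the omission of the traffic matrix T are assumptions made here for brevity, not details taken from the slides.

```python
import random

def select_candidates(latency_to, density):
    """Pick candidate neighbors for one process: many nearby, few faraway.

    `latency_to` maps every other process to its (estimated) RTT from this
    process; `density` stands in for the connection-density parameter on the
    slide.
    """
    ranked = sorted(latency_to, key=latency_to.get)     # nearest first
    candidates = set()
    start, band_size, take = 0, max(1, int(density)), float(density)
    while start < len(ranked) and take >= 1:
        band = ranked[start:start + band_size]
        candidates.update(random.sample(band, min(int(take), len(band))))
        start += band_size
        band_size *= 2       # bands cover more processes as distance grows ...
        take /= 2            # ... while the number picked per band shrinks
    return candidates
```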
Bounding Graph
Procs. try to establish temporary conns. to their selected neighbors
The collective set of successful connections ➭ bounding graph (some conns. may fail due to FWs)
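In code form, this step is essentially a filter over the candidate set. The `try_temp_connect` helper below is a hypothetical stand-in for whatever small-buffer temporary transport the library uses; it is only meant to show the shape of the step.

```python
def build_bounding_edges(candidates, try_temp_connect):
    """Keep only the candidate neighbors that are actually reachable.

    `try_temp_connect(peer)` is a hypothetical helper that attempts a
    temporary connection and returns True on success (it may fail when the
    peer sits behind a firewall). The union of every process's successful
    edges forms the bounding graph.
    """
    return {peer for peer in candidates if try_temp_connect(peer)}
```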
Routing Table Construction
Construct a routing table using just the bounding graph, then close the temporary connections
Conns. of the bounding graph are reestablished lazily as "real" conns.
Temporary conns. => small bufs.; real conns. => large bufs.
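The slides do not say how the table is computed; one natural choice, shown here purely as an assumption, is latency-weighted shortest paths (Dijkstra) over the bounding graph, yielding a next-hop table per process.

```python
import heapq
import itertools

def build_routing_table(me, bounding_graph, latency):
    """Next-hop routing table for `me` over the bounding graph.

    `bounding_graph[u]` is the set of processes u has a candidate connection
    to, and `latency(u, v)` returns the edge weight. Dijkstra over
    latency-weighted edges is an assumption; the slide only says the table is
    built "using just the bounding graph".
    """
    tie = itertools.count()            # breaks ties so the heap never compares nodes
    dist = {me: 0.0}
    next_hop = {}                      # destination -> neighbor of `me` to forward via
    heap = [(0.0, next(tie), me, None)]
    while heap:
        d, _, u, hop = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                   # stale queue entry
        if hop is not None:
            next_hop.setdefault(u, hop)
        for v in bounding_graph[u]:
            nd = d + latency(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v, v if u == me else hop))
    return next_hop
```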
Lazy Connection Establishment
A lazy connect may fail due to a FW
In that case, send a connect request to the peer using the spanning tree, and the peer connects in the reverse direction
[Figure: bounding graph, spanning tree, and the connect-request / reverse-connect sequence across a FW]
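A minimal sketch of that fallback logic follows. `try_direct_connect` and `send_via_spanning_tree` are hypothetical helpers standing in for the library's transport, and the reverse connection is assumed to arrive asynchronously.

```python
def lazy_connect(me, peer, try_direct_connect, send_via_spanning_tree):
    """Establish a candidate connection on demand, falling back to a
    reverse-direction connect when a firewall blocks the outgoing attempt."""
    conn = try_direct_connect(peer)
    if conn is not None:
        return conn                     # normal case: direct connect succeeded
    # Blocked (e.g., peer is behind a firewall): ask the peer, via the
    # always-available spanning tree, to connect back to us instead.
    send_via_spanning_tree(peer, {"type": "connect_request", "from": me})
    return None                         # the peer's reverse connection arrives later
```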
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Commonly-used Method
Sort the processes by host name (or IP address) and assign ranks in that order
Assumptions:
Most communication takes place between processes with close ranks
The communication cost between processes with close host names is low
However, applications have various comm. patterns, and host names don't necessarily correlate with communication costs
Our Rank Assignment Scheme
Find a rank-process mapping with low communication overhead
Map the rank assignment problem to the Quadratic Assignment Problem (QAP)
QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes Σ_i Σ_j t_ij · l_p(i)p(j)
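To make the objective concrete, here is a small Python sketch that evaluates the QAP cost of a mapping and improves it with a naive pairwise-swap local search. This stands in for, and is much weaker than, the GRASP-based solver mentioned on the next slide; it only illustrates what is being minimized.

```python
def qap_cost(perm, L, T):
    """Cost of mapping rank i to process perm[i]: sum_ij T[i][j] * L[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(T[i][j] * L[perm[i]][perm[j]] for i in range(n) for j in range(n))

def improve_by_swaps(perm, L, T):
    """Naive local search: keep applying cost-reducing swaps of two ranks'
    processes until no swap helps. Illustrative only, not GRASP."""
    perm = list(perm)
    improved = True
    while improved:
        improved = False
        for a in range(len(perm)):
            for b in range(a + 1, len(perm)):
                before = qap_cost(perm, L, T)
                perm[a], perm[b] = perm[b], perm[a]
                if qap_cost(perm, L, T) < before:
                    improved = True                        # keep the swap
                else:
                    perm[a], perm[b] = perm[b], perm[a]    # undo it
    return perm
```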
Solving QAPs
NP-hard, but there are heuristics for finding good suboptimal solutions
We use a library based on GRASP [Resende et al. '96]
Tested against QAPLIB [Burkard et al. '97]: instances of up to n = 256 (n processors for problem size n)
Obtained approximate solutions within one to two percent of the best known solution in under one second
Outline
1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion
Experimental Environment
Xeon/Pentium M nodes running Linux
Intra-cluster RTT: 60-120 microsecs
TCP send/recv bufs: 256 KB each
[Figure: four 64-node clusters (sheepXX, istbsXXX, hongoXXX, chibaXXX); inter-cluster RTTs of 0.3 ms, 4.3 ms, 4.4 ms, 6.8 ms, 6.9 ms, and 10.8 ms; a FW between some of the clusters]
Experiment 1: Conn. Management
Measure the performance of the NPB with limited numbers of connections
MC-MPI: limit the number of connections to 10%, 20%, ..., 100% by varying the connection-density parameter
Random: establish a comparable number of connections randomly
BT, LU, MG and SP
[Figure: relative performance (50%-100%) vs. maximum % of connections (0%-100%) for MC-MPI and Random; panel: LU (Lower-Upper, Successive Over-Relaxation)]
BT, LU, MG and SP (2)
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panels: MG (Multi-Grid) and BT (Block Tridiagonal)]
BT, LU, MG and SP (3)
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panel: SP (Scalar Pentadiagonal)]
The % of connections actually established was lower than that shown on the x-axis because of lazy connection establishment (discussed in more detail later)
EP
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panel: EP (Embarrassingly Parallel)]
EP involves very little communication
IS
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random, and relative performance vs. buffer size (0-128 KB) for 20%, 60% and 100% of connections; panel: IS (Integer Sort)]
Performance decrease due to congestion!
Experiment 2: Lazy Conn. Establishment
Compare our lazy conn. establishment method with an MPICH-like method
MC-MPI: select the connection-density parameter so that the maximum number of allowed connections is 30%
MPICH-like: establish connections on demand without preselecting candidate connections (equivalently, preselect all connections)
Experiment 2: Results
[Figure: two bar charts over the benchmarks BT, EP, IS, MG, LU and SP comparing MC-MPI and MPICH-like — connections established (%) and relative performance (%)]
Comparable number of conns. except for IS; comparable performance except for IS
Experiment 3: Rank Assignment
Compare 3 assignment algorithms:
Random
Hostname (24 patterns): real host names (1); what if istbsXXX were named sheepXX, etc. (23)
MC-MPI (QAP)
LU and MG
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: LU and MG]
BT and SP
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: BT and SP]
BT and SP (cont'd)
[Figure: traffic matrix (source rank vs. destination rank) and the rank assignments produced by Hostname and MC-MPI (QAP) across clusters A-D]
EP and IS
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: EP and IS]
Outline
1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion
Conclusion
MC-MPI
Connection management: high performance with connections between just 10% of all process pairs
Rank assignment: up to 300% faster than locality-unaware assignments
Future Work
An API to perform profiling within a single run
Integration of adaptive collectives