Locality-aware Connection Management and Rank Assignment for Wide-area MPI
Hideo Saito, Kenjiro Taura
The University of Tokyo
May 16, 2007
Background
Increase in the bandwidth of WANs ➭ more opportunities to perform parallel computation using multiple clusters
Requirements for Wide-area MPI
Wide-area connectivity
Firewalls and private addresses: only some nodes can connect to each other
Perform routing using the connections that happen to be possible
Reqs. for Wide-area MPI (2): Scalability
The number of conns. must be limited in order to scale to thousands of nodes
Various allocation limits of the system (e.g., memory, file descriptors, router sessions)
Simplistic schemes that may result in O(n²) connections won't scale
Lazy connect strategies work for many apps, but not for those that involve all-to-all communication
Reqs. for Wide-area MPI (3): Locality awareness
To achieve high performance with few conns., select conns. in a locality-aware manner
Many connections with nearby nodes, few connections with faraway nodes
[Figure: many conns. within a cluster, few conns. between clusters]
Reqs. for Wide-area MPI (4): Application awareness
Select connections according to the application's communication pattern
Assign ranks* according to the application's communication pattern
Adaptivity: automatically, without tedious manual configuration
* rank = process ID in MPI
Contributions of Our Work
Locality-aware connection management: uses latency and traffic information obtained from a short profiling run
Locality-aware rank assignment: uses the same info. to discover rank-process mappings with low comm. overhead
➭ Multi-Cluster MPI (MC-MPI), a wide-area-enabled MPI library
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Grid-enabled MPI Libraries
MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]
Locality-aware communication optimizations, e.g., wide-area-aware collective operations (broadcast, reduction, ...)
However, they do not work with firewalls
Grid-enabled MPI Libraries (cont'd)
MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]
Forwarding mechanisms that allow nodes to communicate even in the presence of FWs
Manual configuration: the amount of necessary config. becomes overwhelming as more resources are used
P2P Overlays
Pastry [Rowstron et al. '00]
Each node maintains just O(log n) connections; messages are routed using those connections
Highly scalable, but the routing properties are unfavorable for high performance computing
Few connections between nearby nodes: messages between nearby nodes need to be forwarded, causing large latency penalties
Adaptive MPI [Huang et al. '06]
Performs load balancing by migrating virtual processors: balance the exec. times of the physical processors and minimize inter-processor communication
Adapts to apps. by tracking the amount of communication performed between procs.
Assumes that the communication cost of every processor pair is the same; MC-MPI takes differences in communication costs into account
Lazy Connect Strategies
MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]
Establish connections only on demand
Reduces the number of conns. if each proc. only communicates with a few other procs.
Some apps. generate all-to-all comm. patterns, resulting in many connections (e.g., IS in the NAS Parallel Benchmarks)
Doesn't extend to wide-area environments where some communication may be blocked
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Overview of Our Method
Short profiling run ➭ latency matrix (L) and traffic matrix (T) ➭ locality-aware connection management and locality-aware rank assignment ➭ optimized real run
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Latency Matrix
Latency matrix L = {l_ij}; l_ij: latency between processes i and j in the target environment
Each process autonomously measures the RTT between itself and other processes
Reduce the num. of measurements by using the triangle inequality to estimate RTTs
[Figure: processes p, q, r with RTTs rtt_pq, rtt_pr, rtt_rq]
Estimation rule: if rtt_pr > α·rtt_rq, then rtt_pq = rtt_pr (α: constant)
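The rule above can be read as: if some already-measured process r is much closer to q than p is to r, then p can reuse its measured RTT to r as an estimate of its RTT to q. Below is a minimal Python sketch of applying that rule. The data layout (`measured[x]` holding the RTTs process x measured and shared) and the value of α are assumptions made here for illustration, not MC-MPI's actual internals.

```python
ALPHA = 4.0  # illustrative value only; the slide just says "alpha: constant"

def estimate_rtt(p, q, measured, alpha=ALPHA):
    """Estimate rtt(p, q) without a direct measurement.

    `measured[x]` is assumed to map peers to the RTTs that process x actually
    measured. If p measured some r, r measured q, and rtt(p, r) dominates
    rtt(r, q) by a factor alpha, then q is "near" r as seen from p, so
    rtt(p, q) is taken to be rtt(p, r).
    """
    if q in measured.get(p, {}):
        return measured[p][q]                # direct measurement available
    for r, rtt_pr in measured.get(p, {}).items():
        rtt_rq = measured.get(r, {}).get(q)
        if rtt_rq is not None and rtt_pr > alpha * rtt_rq:
            return rtt_pr                    # rtt_pq ~= rtt_pr
    return None                              # no estimate; fall back to measuring
```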
Traffic Matrix
Traffic matrix T = {t_ij}; t_ij: traffic between ranks i and j in the target application
Many applications repeat similar communication patterns ➭ execute the application for a short amount of time (e.g., one iteration of an iterative app.) and make t_ij the number of transmitted messages
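As a rough illustration of how such a matrix could be accumulated during the short profiling run, here is a small Python sketch. The `record_send` hook is hypothetical (the slides do not show MC-MPI's instrumentation); it simply counts messages per rank pair.

```python
from collections import defaultdict

class TrafficProfile:
    """Accumulates t_ij = number of messages sent from rank i to rank j."""

    def __init__(self, n_ranks):
        self.n = n_ranks
        self.counts = defaultdict(int)      # (src, dst) -> message count

    def record_send(self, src, dst):
        # Hypothetical hook: call once per message during the profiling run,
        # e.g., over one iteration of an iterative application.
        self.counts[(src, dst)] += 1

    def matrix(self):
        """Return T as a dense n x n nested list."""
        T = [[0] * self.n for _ in range(self.n)]
        for (i, j), c in self.counts.items():
            T[i][j] = c
        return T
```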
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Connection Management
Steps: bounding graph ➭ spanning tree ➭ lazy connection establishment
Candidate connections are selected during MPI_Init; during the application body, candidate connections are established on demand
Selection of Candidate Connections
Each process selects O(log n) neighbors based on L and T
A parameter controls the connection density; n: number of processes
[Figure: the parameter's worth of connections to the nearest processes, half as many to the next band, a quarter to the next, ... — many nearby processes, few faraway processes]
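One possible reading of the selection scheme, as a hedged Python sketch: processes are sorted by estimated latency, grouped into bands that grow with distance, and progressively fewer neighbors are picked per band. The exact band layout, the random choice within a band, and the omission of the traffic matrix T are assumptions made here for brevity, not details taken from the slides.

```python
import random

def select_candidates(latency_to, density):
    """Pick candidate neighbors for one process: many nearby, few faraway.

    `latency_to` maps every other process to its (estimated) RTT from this
    process; `density` stands in for the connection-density parameter on the
    slide.
    """
    ranked = sorted(latency_to, key=latency_to.get)     # nearest first
    candidates = set()
    start, band_size, take = 0, max(1, int(density)), float(density)
    while start < len(ranked) and take >= 1:
        band = ranked[start:start + band_size]
        candidates.update(random.sample(band, min(int(take), len(band))))
        start += band_size
        band_size *= 2       # bands cover more processes as distance grows ...
        take /= 2            # ... while the number picked per band shrinks
    return candidates
```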
Bounding Graph
Procs. try to establish temporary conns. to their selected neighbors
The collective set of successful connections ➭ bounding graph (some conns. may fail due to FWs)
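In code form, this step is essentially a filter over the candidate set. The `try_temp_connect` helper below is a hypothetical stand-in for whatever small-buffer temporary transport the library uses; it is only meant to show the shape of the step.

```python
def build_bounding_edges(candidates, try_temp_connect):
    """Keep only the candidate neighbors that are actually reachable.

    `try_temp_connect(peer)` is a hypothetical helper that attempts a
    temporary connection and returns True on success (it may fail when the
    peer sits behind a firewall). The union of every process's successful
    edges forms the bounding graph.
    """
    return {peer for peer in candidates if try_temp_connect(peer)}
```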
Routing Table Construction
Construct a routing table using just the bounding graph, then close the temporary connections
Conns. of the bounding graph are reestablished lazily as "real" conns.
Temporary conns. => small bufs.; real conns. => large bufs.
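The slides do not say how the table is computed; one natural choice, shown here purely as an assumption, is latency-weighted shortest paths (Dijkstra) over the bounding graph, yielding a next-hop table per process.

```python
import heapq
import itertools

def build_routing_table(me, bounding_graph, latency):
    """Next-hop routing table for `me` over the bounding graph.

    `bounding_graph[u]` is the set of processes u has a candidate connection
    to, and `latency(u, v)` returns the edge weight. Dijkstra over
    latency-weighted edges is an assumption; the slide only says the table is
    built "using just the bounding graph".
    """
    tie = itertools.count()            # breaks ties so the heap never compares nodes
    dist = {me: 0.0}
    next_hop = {}                      # destination -> neighbor of `me` to forward via
    heap = [(0.0, next(tie), me, None)]
    while heap:
        d, _, u, hop = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                   # stale queue entry
        if hop is not None:
            next_hop.setdefault(u, hop)
        for v in bounding_graph[u]:
            nd = d + latency(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v, v if u == me else hop))
    return next_hop
```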
Lazy Connection Establishment
A lazy connect may fail due to a FW
In that case, send a connect request to the peer using the spanning tree, and the peer connects in the reverse direction
[Figure: bounding graph, spanning tree, and the connect-request / reverse-connect sequence across a FW]
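A minimal sketch of that fallback logic follows. `try_direct_connect` and `send_via_spanning_tree` are hypothetical helpers standing in for the library's transport, and the reverse connection is assumed to arrive asynchronously.

```python
def lazy_connect(me, peer, try_direct_connect, send_via_spanning_tree):
    """Establish a candidate connection on demand, falling back to a
    reverse-direction connect when a firewall blocks the outgoing attempt."""
    conn = try_direct_connect(peer)
    if conn is not None:
        return conn                     # normal case: direct connect succeeded
    # Blocked (e.g., peer is behind a firewall): ask the peer, via the
    # always-available spanning tree, to connect back to us instead.
    send_via_spanning_tree(peer, {"type": "connect_request", "from": me})
    return None                         # the peer's reverse connection arrives later
```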
Outline
1. Introduction
2. Related Work
3. Proposed Method: Profiling Run, Connection Management, Rank Assignment
4. Experimental Results
5. Conclusion
Commonly-used Method
Sort the processes by host name (or IP address) and assign ranks in that order
Assumptions:
Most communication takes place between processes with close ranks
The communication cost between processes with close host names is low
However, applications have various comm. patterns, and host names don't necessarily correlate with communication costs
Our Rank Assignment Scheme
Find a rank-process mapping with low communication overhead
Map the rank assignment problem to the Quadratic Assignment Problem (QAP)
QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes Σ_i Σ_j t_ij · l_p(i)p(j)
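To make the objective concrete, here is a small Python sketch that evaluates the QAP cost of a mapping and improves it with a naive pairwise-swap local search. This stands in for, and is much weaker than, the GRASP-based solver mentioned on the next slide; it only illustrates what is being minimized.

```python
def qap_cost(perm, L, T):
    """Cost of mapping rank i to process perm[i]: sum_ij T[i][j] * L[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(T[i][j] * L[perm[i]][perm[j]] for i in range(n) for j in range(n))

def improve_by_swaps(perm, L, T):
    """Naive local search: keep applying cost-reducing swaps of two ranks'
    processes until no swap helps. Illustrative only, not GRASP."""
    perm = list(perm)
    improved = True
    while improved:
        improved = False
        for a in range(len(perm)):
            for b in range(a + 1, len(perm)):
                before = qap_cost(perm, L, T)
                perm[a], perm[b] = perm[b], perm[a]
                if qap_cost(perm, L, T) < before:
                    improved = True                        # keep the swap
                else:
                    perm[a], perm[b] = perm[b], perm[a]    # undo it
    return perm
```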
Solving QAPs
NP-hard, but there are heuristics for finding good suboptimal solutions
We use a library based on GRASP [Resende et al. '96]
Tested against QAPLIB [Burkard et al. '97]: instances of up to n = 256 (n processors for problem size n)
Obtained approximate solutions within one to two percent of the best known solution in under one second
Outline
1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion
Experimental Environment
Xeon/Pentium M nodes running Linux
Intra-cluster RTT: 60-120 microsecs
TCP send/recv bufs: 256 KB each
[Figure: four 64-node clusters (sheepXX, istbsXXX, hongoXXX, chibaXXX); inter-cluster RTTs of 0.3 ms, 4.3 ms, 4.4 ms, 6.8 ms, 6.9 ms, and 10.8 ms; a FW between some of the clusters]
Experiment 1: Conn. Management
Measure the performance of the NPB with limited numbers of connections
MC-MPI: limit the number of connections to 10%, 20%, ..., 100% by varying the connection-density parameter
Random: establish a comparable number of connections randomly
BT, LU, MG and SP
[Figure: relative performance (50%-100%) vs. maximum % of connections (0%-100%) for MC-MPI and Random; panel: LU (Lower-Upper, Successive Over-Relaxation)]
BT, LU, MG and SP (2)
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panels: MG (Multi-Grid) and BT (Block Tridiagonal)]
BT, LU, MG and SP (3)
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panel: SP (Scalar Pentadiagonal)]
The % of connections actually established was lower than that shown on the x-axis because of lazy connection establishment (discussed in more detail later)
EP
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random; panel: EP (Embarrassingly Parallel)]
EP involves very little communication
IS
[Figure: relative performance vs. maximum % of connections for MC-MPI and Random, and relative performance vs. buffer size (0-128 KB) for 20%, 60% and 100% of connections; panel: IS (Integer Sort)]
Performance decrease due to congestion!
Experiment 2: Lazy Conn. Establishment
Compare our lazy conn. establishment method with an MPICH-like method
MC-MPI: select the connection-density parameter so that the maximum number of allowed connections is 30%
MPICH-like: establish connections on demand without preselecting candidate connections (equivalently, preselect all connections)
Experiment 2: Results
[Figure: two bar charts over the benchmarks BT, EP, IS, MG, LU and SP comparing MC-MPI and MPICH-like — connections established (%) and relative performance (%)]
Comparable number of conns. except for IS; comparable performance except for IS
Experiment 3: Rank Assignment
Compare 3 assignment algorithms:
Random
Hostname (24 patterns): real host names (1); what if istbsXXX were named sheepXX, etc. (23)
MC-MPI (QAP)
LU and MG
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: LU and MG]
BT and SP
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: BT and SP]
BT and SP (cont'd)
[Figure: traffic matrix (source rank vs. destination rank) and the rank assignments produced by Hostname and MC-MPI (QAP) across clusters A-D]
EP and IS
[Figure: speedup vs. Random for Random, Hostname, Hostname (Best), Hostname (Worst) and MC-MPI (QAP); panels: EP and IS]
Outline
1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion
Conclusion
MC-MPI
Connection management: high performance with connections between just 10% of all process pairs
Rank assignment: up to 300% faster than locality-unaware assignments
Future Work
An API to perform profiling within a single run
Integration of adaptive collectives