Multi-core and Network Aware MPI Topology Functions

Mohammad J. Rashti, Jonathan Green, Pavan Balaji, Ahmad Afsahi, and William D. Gropp

Department of Electrical and Computer Engineering, Queen’s University

Mathematics and Computer Science, Argonne National Laboratory

Department of Computer Science, University of Illinois at Urbana-Champaign

Ahmad Afsahi Parallel Processing Research Laboratory

Presentation Outline

Introduction
Background and Motivation
  MPI Graph and Cartesian Topology Functions
Related Work
Design and Implementation of Topology Functions
Experimental Framework and Performance Results
  Micro-benchmark Results
  Applications Results
Concluding Remarks and Future Work


Introduction

MPI is the main standard for communication in HPC clusters.
Scalability is the major concern for MPI over large-scale hierarchical systems.
System topology awareness is essential for MPI scalability:
  Being aware of the performance implications at every level of the machine’s architectural hierarchy
  Efficiently mapping processes to processor cores, based on the application’s communication pattern
Such functionality should be embedded in the MPI topology interface.


Background and Motivation

MPI topology functions:
  Define the communication topology of the application
    o Logical process arrangement, or virtual topology
  Possibly reorder the processes to map them efficiently onto the system architecture (physical topology) for better performance
Virtual topology models:
  Cartesian topology: multi-dimensional Cartesian arrangement
  Graph topology: non-specific graph arrangement
Graph topology representation:
  Non-distributed: easier to manage, less scalable
  Distributed: new to the standard, more scalable


Background and Motivation (II)

However, topology functions are mostly used only to construct the process arrangement (i.e., the virtual topology).
  Most MPI applications do not use them for performance improvement
In addition, MPI implementations offer only trivial functionality for these functions.
  Mainly constructing the virtual topology
  No reordering of the ranks; thus no performance improvement
This work designs topology functions with reordering ability:
  Designing non-distributed API functions
  Supporting multi-hierarchy nodes and networks


MPI Graph and Cartesian Topology Functions

MPI defines a set of virtual topology definition functions for graph and Cartesian structures.
The non-distributed functions MPI_Graph_create and MPI_Cart_create:
  Are collective calls that accept a virtual topology
  Return a new MPI communicator enclosing the desired topology
  The input topology is in a non-distributed form
  All nodes have a full view of the entire structure
    o Pass the whole information to the function
  If the user opts for reordering, the function may reorder the ranks for an efficient process-to-core mapping.


MPI Graph and Cartesian Topology Functions (II)

MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart)

  comm_old   [in]   input communicator without topology (handle)
  ndims      [in]   number of dimensions of Cartesian grid (integer)
  dims       [in]   integer array of size ndims specifying the number of processes in each dimension
  periods    [in]   logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
  reorder    [in]   ranking may be reordered (true) or not (false) (logical)
  comm_cart  [out]  communicator with Cartesian topology (handle)

Example: 4x2 2D-Torus

  Dimension   #Processes
  1           4
  2           2

  ndims = 2
  dims = 4, 2
  periods = 1, 0

[Figure: the 4x2 grid of processes, ranks 0 to 7]
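The example above can be made concrete with a small sketch of how ranks, coordinates, and neighbours relate in such a grid. It is written in Python purely for illustration and does not call MPI; the helper names `rank_to_coords`, `coords_to_rank`, and `cart_neighbors` are hypothetical. It assumes the row-major rank numbering that MPI uses for Cartesian topologies and no reordering.

```python
def rank_to_coords(rank, dims):
    """Row-major coordinates of a rank in a Cartesian grid."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


def cart_neighbors(rank, dims, periods):
    """Ranks one step away from `rank` along each dimension.

    A step past the boundary wraps around only in periodic dimensions.
    """
    coords = rank_to_coords(rank, dims)
    neighbors = []
    for d, (extent, periodic) in enumerate(zip(dims, periods)):
        for step in (-1, 1):
            c = coords[d] + step
            if periodic:
                c %= extent
            elif not 0 <= c < extent:
                continue  # no neighbour past a non-periodic boundary
            shifted = coords[:d] + (c,) + coords[d + 1:]
            n = coords_to_rank(shifted, dims)
            if n != rank and n not in neighbors:
                neighbors.append(n)
    return neighbors


# The slide's example: ndims = 2, dims = 4, 2, periods = 1, 0
dims, periods = (4, 2), (1, 0)
for r in range(8):
    print(r, rank_to_coords(r, dims), cart_neighbors(r, dims, periods))
```

With periods = 1, 0 only the first dimension wraps, so rank 0 at (0, 0) has rank 6 at (3, 0) as a neighbour but nothing "below" it in the second dimension.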


MPI Graph and Cartesian Topology Functions (III)

MPI_Graph_create(comm_old, nnodes, index, edges, reorder, comm_graph)

  comm_old    [in]   input communicator without topology (handle)
  nnodes      [in]   number of nodes in graph (integer)
  index       [in]   array of integers describing node degrees
  edges       [in]   array of integers describing graph edges
  reorder     [in]   ranking may be reordered (true) or not (false) (logical)
  comm_graph  [out]  communicator with graph topology added (handle)

Example:

  Process   Neighbors
  0         1, 3
  1         0
  2         3
  3         0, 2

  nnodes = 4
  index = 2, 3, 4, 6
  edges = 1, 3, 0, 3, 0, 2

[Figure: the four-process graph, ranks 0 to 3]
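The relationship between the neighbour table and the index and edges arrays can be sketched as follows. This is an illustrative Python fragment, not MPI code; `graph_arrays` and `neighbors_of` are hypothetical helper names. It relies on the convention that index holds cumulative neighbour counts and edges is the concatenation of the neighbour lists.

```python
def graph_arrays(neighbors):
    """Flatten per-process neighbour lists into index and edges arrays.

    index[i] is the total number of neighbours of processes 0..i;
    edges is all neighbour lists laid end to end.
    """
    index, edges = [], []
    for nbrs in neighbors:
        edges.extend(nbrs)
        index.append(len(edges))
    return index, edges


def neighbors_of(rank, index, edges):
    """Recover one process's neighbour list from the flat arrays."""
    start = index[rank - 1] if rank > 0 else 0
    return edges[start:index[rank]]


# The slide's example graph
neighbors = [[1, 3], [0], [3], [0, 2]]
index, edges = graph_arrays(neighbors)
print(index, edges)  # [2, 3, 4, 6] [1, 3, 0, 3, 0, 2]
print(neighbors_of(3, index, edges))  # [0, 2]
```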


Related Work (I)

Hatazaki and Träff worked on topology mapping using graph embedding algorithms (Euro PVM/MPI 1998, SC 2002)
Träff et al. proposed extending the MPI-1 topology interface (HIPS 2003, Euro PVM/MPI 2006)
  To support weighted-edge topologies and dynamic process reordering
  To provide architectural clues to the applications for a better mapping
MPI Forum introduced distributed topology functionality in MPI-2.2 (2009)
Hoefler et al. proposed guidelines for efficient implementation of distributed topology functionality (CCPE 2010)


Related Work (II)

Mercier et al. studied efficient process-to-core mapping (Euro PVM/MPI 2009, EuroPar 2010)
  Using external libraries for node architecture discovery and graph mapping
  Using weighted graphs and/or trees, and outside the MPI topology interface
How is our work different from the related work?
  Supports a physical topology spanning nodes and the network
  Uses edge replication to support weighted edges in virtual topology graphs
  Integrates the above functionality in the MPI non-distributed topology interface


Design of MPI Topology Functions (I)

Both Cartesian and graph interfaces are treated as graphs at the underlying layers
  Cartesian topology is internally copied to a graph topology
Virtual topology graph:
  Vertices: MPI processes
  Edges: existence, or significance, of communication between any two processes
  Significance of communication: normalized total communication volume between any pair of processes, used as edge weights
  Edge replication is used to represent graph edge weight
    o Recap: MPI non-distributed interface does not support weighted edges
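Edge replication can be pictured as listing a neighbour several times in the edges array, once per unit of weight. The Python sketch below illustrates that idea only; the slides do not give the exact replication scheme, so repeating a neighbour exactly `w` times is an assumption, and `replicate_edges` is a hypothetical helper name.

```python
def replicate_edges(weights, nprocs):
    """Encode integer edge weights by repeating neighbours.

    weights: dict {(u, v): w}, one entry per undirected pair, w >= 1.
    Returns index and edges arrays in which v appears w times among
    u's neighbours and u appears w times among v's (assumed scheme).
    """
    neighbors = [[] for _ in range(nprocs)]
    for (u, v), w in sorted(weights.items()):
        neighbors[u].extend([v] * w)
        neighbors[v].extend([u] * w)
    index, edges = [], []
    for nbrs in neighbors:
        edges.extend(sorted(nbrs))
        index.append(len(edges))
    return index, edges


# Hypothetical weights: processes 0 and 1 communicate three times as
# heavily as the other pairs.
weights = {(0, 1): 3, (0, 3): 1, (2, 3): 1}
index, edges = replicate_edges(weights, 4)
print(index)  # [4, 7, 8, 10]
print(edges)  # [1, 1, 1, 3, 0, 0, 0, 3, 0, 2]
```

A mapper that counts edges between two processes then sees the heavier pair as three edges, which is how an unweighted interface can carry weight information.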


Design of MPI Topology Functions (II)

Physical topology graph:
  Integrated node and network architecture
  Vertices: architectural components such as:
    o Network nodes
    o Cores
    o Caches
  Edges: communication links between the components
  Edge weights: communication performance between components
    o Processor cores: closer cores have higher edge weight
    o Network nodes: closer nodes have higher edge weight
    o Farthest on-node cores get higher weight than closest network nodes


Physical Topology Distance Example

d1 will have the highest load value in the graph.
The path between N2 and N3 (d4) will have the lowest load value, indicating the lowest performance path.

d1 > d2 > d3 > d4 = 1
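One way to obtain weights with this ordering is to count how much of the hardware hierarchy two cores share. The Python sketch below is an assumption-laden illustration, not the authors' weight model: the four-level path (switch, node, socket, core), the helper name `link_weight`, and the concrete values 4, 3, 2, 1 are all hypothetical. It only reproduces the properties stated on the slides: closer components get higher weights, the farthest on-node pair outranks the closest network pair, and the lowest weight is 1.

```python
def link_weight(a, b):
    """Edge weight between two distinct cores; higher means closer.

    Each core is described by a hypothetical path
    (switch, node, socket, core) from the top of the hierarchy down.
    The weight is 1 plus the number of leading levels the paths share.
    """
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return 1 + shared


core = (0, 0, 0, 0)
print(link_weight(core, (0, 0, 0, 1)))  # same socket
print(link_weight(core, (0, 0, 1, 0)))  # same node, different socket
print(link_weight(core, (0, 1, 0, 0)))  # same switch, different node
print(link_weight(core, (1, 0, 0, 0)))  # different switch
```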


Tools for Implementation of Topology Functions

HWLOC library for extracting node architecture:
  A tree architecture, with nodes at the top level and cores at the leaves
  Cores with lower-level parents (such as caches) are considered to have higher communication performance
IB subnet manager (ibtracert) for extracting network distances:
  Do the discovery offline, before the application run
  Make a pre-discovered network distance file
Scotch library for mapping virtual to physical topologies:
  Source and target graphs are weighted and undirected
  Uses recursive bi-partitioning for graph mapping


Implementation of Topology Functions

Communication pattern profiling:
  Probes are placed inside the MPI library to profile applications’ communication pattern.
  Pairwise communication volume is normalized to the range 0...10, with 0 meaning no edge between the two vertices.
All processes perform node architecture discovery
One process performs network discovery for all
Make the physical architecture view unified across the processes (using Allgather)
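The normalization step can be sketched as below. The slides state only the target range and that 0 means no edge; scaling by the largest pairwise volume and rounding up (so that any non-zero traffic keeps an edge) are assumptions, and `normalize_volumes` is a hypothetical helper name. The sketch is in Python for illustration and uses integer byte counts.

```python
def normalize_volumes(volumes, top=10):
    """Map pairwise byte counts onto integer weights in 0..top.

    volumes: dict {(u, v): bytes} with non-negative integers.
    Assumed scheme: scale by the largest volume and round up, so a
    pair gets weight 0 only if it exchanged no data at all.
    """
    peak = max(volumes.values(), default=0)
    if peak == 0:
        return {pair: 0 for pair in volumes}
    # Integer ceiling division avoids floating-point rounding surprises.
    return {pair: (top * v + peak - 1) // peak for pair, v in volumes.items()}


# Hypothetical profile of four process pairs
profile = {(0, 1): 5000, (1, 2): 2500, (0, 3): 120, (2, 3): 0}
print(normalize_volumes(profile))
# {(0, 1): 10, (1, 2): 5, (0, 3): 1, (2, 3): 0}
```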


Flow of Functionalities

[Figure: flow chart of the topology functions. Recoverable content:]

Inputs: application profiling; input virtual topology graph

Graph topology: graph topology initialization
  No Reorder: trivial graph topology creation; creating the new MPI communicator
  Reorder: physical topology creation, Scotch graph mapping, and reordered communicator construction, as listed here
Cartesian topology: Cartesian topology initialization
  No Reorder: trivial Cartesian topology creation; creating the new MPI communicator
  Reorder: creating equivalent graph topology, then the same steps as for the graph topology

Creating physical topology: by extracting and merging node and network architectures.
  1. Initialize Scotch architecture.
  2. Extract network topology (if required).
  3. Extract node topology.
  4. Merge node and network topology.
  5. Distribute the merged topology among processes (using allgather).
  6. Build Scotch physical topology.

SCOTCH graph mapping: by constructing the Scotch weighted virtual topology from the input graph and mapping it to the extracted physical topology.
  1. Initialize and build the Scotch virtual topology graph.
  2. Initialize the mapping algorithms’ strategy in Scotch.
  3. Map the virtual topology graph to the extracted physical topology.

Constructing a new reordered communicator: using the Scotch mapping of the previous step.

External libraries and tools: SCOTCH, HWLOC, IB subnet manager

Legend: existing MPICH function; new function added to MPICH; external library utilized; calling a function; following a function in the code


Experimental Framework

Cluster A (4 servers, 32 cores total)
  Hosts: 2-way quad-core AMD Opteron 2350 servers, with 2MB shared L3 cache per processor, and 8GB RAM
  Network: QDR InfiniBand, 3 switches at 2 levels
  Software: Fedora 12, Kernel 2.6.27, MVAPICH2 1.5, OFED 1.5.2
Cluster B (16 servers, 192 cores total)
  Hosts: 2-way hexa-core Intel Xeon X5670 servers, with a 12MB multi-level cache per processor, and 24GB RAM
  Network: QDR InfiniBand, 4 switches at 2 levels
  Software: RHEL 5, Kernel 2.6.18.94, MVAPICH2 1.5, OFED 1.5.2


MPI Applications – Some Statistics

MPI Application   Communication Primitives

NPB CG            - MPI Send/Irecv: ~100% of the calls
                  - MPI Barrier: ~0% of the calls

NPB MG            - MPI Send/Irecv: 98.5% of the calls, ~100% of the volume
                  - MPI Allreduce, Reduce, Barrier, Bcast: 1.5% of the calls, ~0.002% of the volume

LAMMPS            - MPI Send/Recv/Irecv/Sendrecv: 95% of the calls, 99% of the volume
                  - MPI Allreduce, Reduce, Barrier, Bcast, Scatter, Allgather, Allgatherv: 5% of the calls, 1% of the volume


Exchange Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

[Figure: two panels plotting improvement (%) against exchange message size (1 byte to 128K bytes) for three series: non-weighted graph, weighted graph, and weighted and network-aware graph]
  Panel 1: 8x4x4 3D-Torus with heavy communication on the longer dimension (128-core cluster B)
  Panel 2: 4x4x2 3D-Torus with heavy communication on the longer dimension (32-core cluster A)



Exchange Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

[Figure: two panels plotting improvement (%) against message size (1 byte to 128K bytes) for three series: non-weighted graph, weighted graph, and weighted and network-aware graph]
  Panel 1: 5D-Hypercube with heavy communication on one dimension (32-core cluster A)
  Panel 2: 7D-Hypercube with heavy communication on one dimension (128-core cluster B)



Collective Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

[Figure: two panels plotting improvement (%) against message size (1 byte to 128K bytes) for three series: non-weighted graph, weighted graph, and weighted and network-aware graph]
  Panel 1: 8x4 2D-Torus with Alltoall collective on the longer dimension (32-core cluster A)
  Panel 2: 16x2 2D-Torus with Alltoall collective on the longer dimension (32-core cluster A)



Applications: Topology-aware Mapping Improvement over Cyclic Mapping (%)

32-core cluster A

[Figure: two bar charts, Communication Time Improvement and Run-time Improvement, for the applications LAMMPS-Friction, LAMMPS-Pour, LAMMPS-Couple, CG.A.32, CG.B.32, CG.C.32, MG.A.32, MG.B.32, and MG.C.32; series: non-weighted graph, weighted graph, and weighted and network-aware graph]
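The two baselines named in these comparisons, block and cyclic mapping, are not defined on the slides. The Python sketch below assumes their usual meaning: block mapping fills one node with consecutive ranks before moving to the next, while cyclic mapping deals ranks out to nodes round-robin. The helper names and the ring communication pattern are hypothetical; the node count and cores per node match cluster A.

```python
def block_mapping(nprocs, cores_per_node):
    """Node of each rank when consecutive ranks fill a node first."""
    return [rank // cores_per_node for rank in range(nprocs)]


def cyclic_mapping(nprocs, nnodes):
    """Node of each rank when ranks are dealt out round-robin."""
    return [rank % nnodes for rank in range(nprocs)]


def intra_node_pairs(pairs, node_of):
    """Number of communicating pairs whose two ranks share a node."""
    return sum(1 for u, v in pairs if node_of[u] == node_of[v])


# 32 processes on 4 nodes of 8 cores, each rank talking to rank + 1
nprocs, nnodes, cores_per_node = 32, 4, 8
ring = [(r, (r + 1) % nprocs) for r in range(nprocs)]
print(intra_node_pairs(ring, block_mapping(nprocs, cores_per_node)))  # 28
print(intra_node_pairs(ring, cyclic_mapping(nprocs, nnodes)))         # 0
```

For a nearest-neighbour pattern like this one, block mapping already keeps most traffic on-node and cyclic mapping keeps none, which is one reason improvements measured against the two baselines can differ.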


Applications: Topology-aware Mapping Improvement over Block Mapping (%)

32-core cluster A

[Figure: two bar charts, Communication Time Improvement and Run-time Improvement, for the applications LAMMPS-Friction, LAMMPS-Pour, LAMMPS-Couple, CG.A.32, CG.B.32, CG.C.32, MG.A.32, MG.B.32, and MG.C.32; series: non-weighted graph, weighted graph, and weighted and network-aware graph]


Applications: Topology-aware Mapping Improvement over Cyclic Mapping (%)

128-core cluster B

[Figure: two bar charts, Communication Time Improvement and Run-time Improvement, for the applications LAMMPS-Friction, LAMMPS-Pour, LAMMPS-Couple, CG.D.128, and MG.D.128; series: non-weighted graph, weighted graph, and weighted and network-aware graph]


Applications: Topology-aware Mapping Improvement over Block Mapping (%)

128-core cluster B

[Figure: two bar charts, Communication Time Improvement and Run-time Improvement, for the applications LAMMPS-Friction, LAMMPS-Pour, LAMMPS-Couple, CG.D.128, and MG.D.128; series: non-weighted graph, weighted graph, and weighted and network-aware graph]


Communicator Creation time in MPI_Graph_create for LAMMPS

System      # Processes   Trivial (ms)   Non-weighted Graph (ms)   Weighted Graph (ms)   Network-aware Graph (ms)
Cluster A   8             0.3            7.3                       7.3                   7.9
Cluster A   16            0.3            7.6                       7.7                   8.1
Cluster A   32            0.5            8.6                       8.7                   9
Cluster B   16            0.9            5.7                       5.9                   6.6
Cluster B   32            1.2            6.4                       6.4                   7.2
Cluster B   64            2.5            9                         9.4                   10.1
Cluster B   128           4.7            18.8                      18.9                  19.4


Concluding Remarks

We presented the design and implementation of MPI non-distributed graph and Cartesian functions in MVAPICH2 for multi-core nodes connected through multi-level InfiniBand networks.

The micro-benchmarks showed that the effect of reordering process ranks can be significant, and when the communication is heavier on one dimension the benefits of using weighted and network-aware graphs (instead of non-weighted graph) are considerable.

We also modified MPI applications to use MPI_Graph_create. The evaluation results showed that MPI applications can benefit from a topology-aware MPI_Graph_create.


Future Work

We intend to evaluate the effect of topology awareness on other MPI applications.

We would also like to run our applications on a larger testbed.

We would like to design a more general communication cost/weight model for graph mapping, and use other libraries.

We also intend to design and implement MPI distributed topology functions for better scalability.


Acknowledgment


Thank you!

Contacts:
  Mohammad Javad Rashti: [email protected]
  Jonathan Green: [email protected]
  Pavan Balaji: [email protected]
  Ahmad Afsahi: [email protected]
  William D. Gropp: [email protected]

Backup Slides


Flow of function calls in MVAPICH code

[Figure: call-flow diagram. Recoverable labels:]
  Entry points: MPI_Graph_create, MPI_Cart_create
  Functions: MPIR_Graph_create (shown twice), MPIR_Graph_create_reorder, MPIR_Cart_create_reorder, MPIR_Topo_create, MPIU_Get_scotch_arch, SCOTCH_Graph_build/map, MPIR_Comm_copy (shown twice), MPIR_Comm_copy_reorder
  Branches: No Reorder, Reorder
  Other labels: Scotch mapping, SCOTCH, HWLOC, IB Subnet manager
  Legend: existing MPICH function; new function added to MPICH; external library utilized; calling a function; following a function in the code
