Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS
Abdelhalim Amer*, Huiwei Lu*, Pavan Balaji*, Satoshi Matsuoka+
*Argonne National Laboratory, IL, USA  +Tokyo Institute of Technology, Tokyo, Japan
PPMM'15, in conjunction with CCGRID'15, May 4-7, 2015, Shenzhen, Guangdong, China
Evolution of High-End Systems
• Systems with massive core counts are already in production
– Tianhe-2: 3,120,000 cores
– Mira: 3,145,728 hardware threads
• Core density is increasing
• Other resources do not scale at the same rate
– Memory per core is shrinking
– Network endpoints
[Figure: evolution of the memory capacity per core in the Top500 list [1]]
[1] Peter Kogge. PIM & Memory: The Need for a Revolution in Architecture. The Argonne Training Program on Extreme-Scale Computing (ATPESC), 2013.
Parallelism with Message Passing
[Figure: a problem domain is decomposed and mapped onto the target architecture: four nodes (Node 0 to Node 3), each with four cores (Core 0 to Core 3).]
MPI vs. MPI+X: Domain Decomposition
• MPI-only = core-granularity domain decomposition
– one process per core, communication between processes
– boundary data is replicated (extra memory)
• MPI+X = node-granularity domain decomposition
– threads within a process operate on shared data (single copy)
– communication between processes
• The process model has inherent limitations
• Sharing is becoming a requirement
• Using threads requires a careful thread-safety implementation

Process Model vs. Threading Model with MPI
Processes | Threads
Data is all private | Global data is all shared
Sharing requires extra work (e.g., MPI-3 shared memory) | Sharing is given; consistency is not, and implies protection
Fine-grained communication (core-to-core) | Coarse-grained communication (typically node-to-node)
High space overhead (buffers, boundary data, MPI runtime, etc.) | Reduced space overhead
Contention only for system resources | Contention for system resources and shared data
No thread-safety overheads | Thread-safety overheads whose magnitude depends on the application and the MPI runtime
MPI thread levels, from most restrictive (low thread-safety cost) to most flexible (high thread-safety cost):
• MPI_THREAD_SINGLE – no additional threads
• MPI_THREAD_FUNNELED – only the master thread communicates
• MPI_THREAD_SERIALIZED – threaded communication, serialized
• MPI_THREAD_MULTIPLE – no restrictions
MPI + Threads Interoperation by the Standard
• An MPI process is allowed to spawn multiple threads
• Threads share the same rank
• A thread blocking for communication must not block other threads
• Applications can specify the way threads interoperate with MPI
Breadth-First Search (BFS)
• Searches a graph level by level, visiting all neighbors of a vertex first
• Solves many problems in graph theory
• The Graph500 benchmark BFS kernel, with a Kronecker graph as input
• Communication: two-sided nonblocking
[Figure: a small, synthetic graph generated by a method called Kronecker multiplication. Larger versions of this generator, modeling real-world graphs, are used in the Graph500 benchmark. (Courtesy of Jeremiah Willcock, Indiana University) [Sandia National Laboratory]]
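For reference, the level-by-level traversal that the distributed kernel implements can be sketched serially as follows (an illustrative adjacency-matrix BFS, not the Graph500 reference code; `bfs` and its parameters are hypothetical names):

```c
#include <string.h>

#define NVERT 8

/* Visit vertices level by level: every neighbor of the current
 * frontier is discovered before moving one level deeper. */
static void bfs(int nvert, int adj[][NVERT], int root, int parent[])
{
    int frontier[NVERT], next[NVERT];
    int nfront = 0, nnext, u, v;

    for (v = 0; v < nvert; v++)
        parent[v] = -1;              /* -1 = not yet visited */
    parent[root] = root;
    frontier[nfront++] = root;

    while (nfront > 0) {
        nnext = 0;
        for (int i = 0; i < nfront; i++) {
            u = frontier[i];
            for (v = 0; v < nvert; v++)
                if (adj[u][v] && parent[v] == -1) {
                    parent[v] = u;       /* first discovery wins */
                    next[nnext++] = v;   /* expand on the next level */
                }
        }
        memcpy(frontier, next, nnext * sizeof(int));
        nfront = nnext;
    }
}
```

In the distributed version, each process owns a slice of the vertices and the frontier expansion triggers the nonblocking two-sided communication discussed next.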
Non-Scalable Sub-Routines
• Eager polling for communication progress, O(P) tests per poll:

    for (each process P) {
        MPI_Test(sendreq[P], flag);
        if (flag)
            buffer_free[P] = 1;
    }

• Global synchronization with O(P^2) empty messages (2.75G messages for 512K cores):

    Synchronize() {
        for (each process P)
            MPI_Isend(buf, 0, P, sendreq[P]);
        while (!all_procs_done)
            Check_Incom_Msgs();
    }
Fixing the Scalability Issues
• Use a lazy polling (LP) policy
• Use the MPI-3 nonblocking barrier, MPI_Ibarrier (IB)
[Figure: weak-scaling results, performance in GTEPS (0 to 12) vs. number of cores (128 to 524,288), comparing MPI-Only, Hybrid, MPI-Only-Optimized, and Hybrid-Optimized.]
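The lazy-polling idea can be illustrated without MPI: instead of testing every outstanding request on every loop iteration, test only every `interval` iterations, cutting the number of MPI_Test-style calls by that factor (a simplified model; `polled_tests` is a hypothetical name, and the real policy lives inside the BFS communication loop):

```c
/* Count how many request tests a loop issues over `iters` iterations
 * with `nreq` outstanding requests, polling only every `interval`
 * iterations. interval == 1 reproduces the eager O(P) policy. */
static long polled_tests(long iters, long nreq, long interval)
{
    long tests = 0;
    for (long i = 0; i < iters; i++)
        if (i % interval == 0)
            tests += nreq;   /* one MPI_Test-like call per request */
    return tests;
}
```

With 512 outstanding requests, polling every 10th iteration issues one tenth of the test calls of eager polling, at the cost of slightly delayed completion detection.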
Thread Contention in the MPI Runtime
• Default: a global critical section, which avoids extra overheads in uncontended cases
• A fine-grained critical section can be used for highly contended scenarios
[Figure: MPI_Test latency, average MPI_Test time (thousands of cycles, 1 to 1000, log scale) vs. number of threads per node (1 to 100), comparing a global critical section (Global-CS) with per-object critical sections (Per-Object-CS).]
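The difference between the two locking schemes can be sketched with plain pthreads: a single global mutex serializes all threads, while per-object mutexes let threads working on different objects (e.g., different MPI requests) proceed in parallel. This is an illustrative sketch with hypothetical names; the actual critical sections are inside the MPI runtime:

```c
#include <pthread.h>

#define NOBJ  4
#define NITER 100000

/* Global critical section: every operation takes the same lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fine-grained: one lock per object, so threads touching
 * different objects do not contend with each other. */
static pthread_mutex_t obj_lock[NOBJ] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static long obj_count[NOBJ];

static void *worker(void *arg)
{
    int obj = (int)(long)arg;
    for (int i = 0; i < NITER; i++) {
        if (i % 2) {
            /* coarse: serializes all threads */
            pthread_mutex_lock(&global_lock);
            obj_count[obj]++;
            pthread_mutex_unlock(&global_lock);
        } else {
            /* fine: contends only with threads on the same object */
            pthread_mutex_lock(&obj_lock[obj]);
            obj_count[obj]++;
            pthread_mutex_unlock(&obj_lock[obj]);
        }
    }
    return NULL;
}

/* Spawn one thread per object and wait for all of them. */
static void run_workers(void)
{
    pthread_t tid[NOBJ];
    for (long i = 0; i < NOBJ; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NOBJ; i++)
        pthread_join(tid[i], NULL);
}
```

Under contention the fine-grained path scales with the number of distinct objects, which is the effect the Per-Object-CS curve shows.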
Performance with Fine-Grained Concurrency
[Figure: profiling with 1K nodes, breakdown of BFS time (0% to 100%) vs. number of threads per node (1 to 64) into Compute, OMP_Sync, User Polling, MPI_Test, and other MPI.]
[Figure: weak-scaling performance in GTEPS (0 to 18) vs. number of cores (128 to 524,288), comparing Processes+LP+IB, Hybrid+LP+IB, and Hybrid+LP+IB+FG.]
• The coarse-grained MPI+X communication model is generally more scalable
• In BFS, for example, MPI+X reduced
– the O(P) polling overhead
– the O(P^2) empty messages for global synchronization
• The model does not fix root scalability issues
• Thread-safety overheads can be a significant source of inefficiency, but they are not inevitable:
– various techniques can be used to reduce thread contention and thread-safety overheads
– we are actively working on improving multithreading support in MPICH (MPICH derivatives can benefit from it)
• Characterizing MPI+shared-memory vs. MPI+threads models is being considered for a future study