Introduction to MPI
Dhabaleswar K. (DK) Panda, The Ohio State University (http://www.cse.ohio-state.edu/~panda)
Presented at the HPC Advisory Council Workshop, Lugano 2011, by Sayantan Sur, The Ohio State University (http://www.cse.ohio-state.edu/~surs)
Performance Issues (Cont'd)
– Contention at the source and destination adapter(s)
– CPU involvement/overhead
• Different algorithms based on system size and message size (see the sketch below)
• Multi-core-aware algorithms for the emerging multi-core platforms
• Topology-aware algorithms that adapt dynamically to the underlying network topology
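To make size-based algorithm selection concrete, here is a minimal sketch, not MVAPICH's actual logic: a broadcast wrapper that uses a binomial tree for small messages and falls back to the library's MPI_Bcast for large ones. The names my_bcast, bcast_binomial and the threshold value are purely illustrative.

/* Hedged sketch: threshold-based collective algorithm selection.
   Chooses a broadcast algorithm by message size; threshold and
   algorithms are illustrative, not MVAPICH's internal tuning. */
#include <mpi.h>
#include <stddef.h>

#define SMALL_MSG_THRESHOLD 8192   /* hypothetical switch-over point (bytes) */

/* Binomial-tree broadcast: log2(P) steps, good for small (latency-bound) messages. */
static void bcast_binomial(void *buf, int count, MPI_Datatype dt,
                           int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;       /* rank relative to root */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rel < mask) {                         /* already has the data: forward it */
            int dst = rel + mask;
            if (dst < size)
                MPI_Send(buf, count, dt, (dst + root) % size, 0, comm);
        } else if (rel < (mask << 1)) {           /* receives exactly once, this step */
            int src = rel - mask;
            MPI_Recv(buf, count, dt, (src + root) % size, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}

/* Pick an algorithm based on the message size in bytes. */
void my_bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(dt, &type_size);
    if ((size_t)count * type_size <= SMALL_MSG_THRESHOLD)
        bcast_binomial(buf, count, dt, root, comm);   /* latency-bound regime */
    else
        MPI_Bcast(buf, count, dt, root, comm);        /* defer to the library */
}

Production libraries add further regimes (medium messages, shared-memory leader-based algorithms for multi-core nodes, topology-aware trees), but the switch-on-size structure is the basic idea.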
Obtaining Scalable Performance
• The performance of an application should increase as the system size increases
– Strong scaling: the problem size is kept constant as the system size increases
– Weak scaling: the problem size grows along with the system size (both regimes are sketched below)
• Depends on
– Structure of the application
– Underlying algorithms being used
– Performance of the MPI library
• All performance issues (as indicated earlier) matter for the MPI library
• Additional issues
– Network topology
– Mapping of processes to cores (block and cyclic, across nodes and within nodes)
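As a concrete illustration of the two scaling regimes, here is a minimal sketch; the problem sizes and the "work" are hypothetical placeholders for a real solver.

/* Hedged sketch: per-process workload under strong vs. weak scaling. */
#include <mpi.h>
#include <stdio.h>

#define GLOBAL_N   (1L << 24)   /* hypothetical total problem size (strong scaling) */
#define PER_PROC_N (1L << 20)   /* hypothetical per-process size (weak scaling)     */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Strong scaling: total work is fixed, so each process gets less as P grows. */
    long strong_local_n = GLOBAL_N / nprocs;

    /* Weak scaling: per-process work is fixed, so total work grows with P. */
    long weak_local_n = PER_PROC_N;

    if (rank == 0)
        printf("P=%d  strong: %ld elems/proc  weak: %ld elems/proc (total %ld)\n",
               nprocs, strong_local_n, weak_local_n, weak_local_n * (long)nprocs);

    MPI_Finalize();
    return 0;
}

Under strong scaling the per-process computation shrinks as P grows, so communication overheads in the MPI library increasingly dominate the run time, which is why the issues listed above matter most at large scale.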
Memory Scalability of MPI Library in Large-Scale Systems
• Does the memory needed by the MPI library increase with system size?
• Different transport protocols with IB
– Reliable Connection (RC) is the most common
– Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
– The buffer requirement can increase with system size
• Connections need to be established across processes under RC
– Each connection requires a certain amount of memory for its associated data structures
– The memory required for all connections can increase with system size (a rough estimate is sketched below)
• Both issues have become critical as large-scale IB deployments have taken place
– Being addressed by the IB specification (SRQ, XRC, UD/RC/XRC hybrid) and the MPI library (will be discussed more on Day 2)
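A back-of-the-envelope estimate of how per-process memory can grow under a fully connected RC model. All constants below are made-up placeholders, not measured MVAPICH or InfiniBand values.

/* Hedged sketch: per-process MPI memory footprint with one RC connection
   and a set of pre-posted buffers per peer. Sizes are illustrative only. */
#include <stdio.h>

#define QP_BYTES_PER_CONN   (64 * 1024)  /* hypothetical memory per RC connection  */
#define PREPOSTED_BUF_BYTES (8 * 1024)   /* hypothetical pre-posted receive buffer */
#define BUFS_PER_PEER       16           /* hypothetical buffers posted per sender */

int main(void)
{
    for (long p = 1024; p <= 1024L * 1024; p *= 8) {
        /* Fully connected RC: one connection plus a buffer set per peer. */
        long rc_bytes = (p - 1) * (QP_BYTES_PER_CONN +
                                   (long)BUFS_PER_PEER * PREPOSTED_BUF_BYTES);
        printf("%8ld processes: ~%8.1f MB per process for RC connections/buffers\n",
               p, rc_bytes / (1024.0 * 1024.0));
    }
    return 0;
}

This linear growth per process is exactly what the mechanisms named above target: SRQ removes the per-peer buffer term by sharing receive buffers, while XRC and UD/RC hybrids shrink the per-connection term.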
Fault-Tolerance and Resiliency
• Millions of cores and components in next-generation multi-PetaFlop and ExaFlop systems
• Components are bound to fail
• The Mean Time Between Failures (MTBF) has to remain high so that Exascale applications can run efficiently (a rough estimate follows below)
• Two broad kinds of failures
– Network failures (adapter, link and switch)
– Node or process failures
• InfiniBand provides multiple schemes, such as CRC, end-to-end reliability, Reliable Connection (RC) mode and Automatic Path Migration (APM), to handle network-related errors
• Can the MPI library be made resilient? (Day 2)
• Can the MPI library support efficient checkpoint-restart and process migration for process/node failures? (Day 2)
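Why MTBF matters at scale: assuming independent component failures, the system-level MTBF shrinks roughly in proportion to the number of components. With hypothetical numbers (a 5-year component MTBF and 100,000 failure-prone components, both illustrative assumptions):

MTBF_system ≈ MTBF_component / N_components ≈ 43,800 h / 100,000 ≈ 0.44 h ≈ 26 minutes

At that rate, an application that runs for more than about half an hour cannot expect to finish without some resiliency mechanism, which is what motivates the MPI-level checkpoint-restart and migration support discussed on Day 2.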
Power-Aware Designs
• Power consumption is becoming a significant issue for the design and deployment of multi-PetaFlop and ExaFlop systems
• All hardware components (CPU, memory, storage, network adapter, switches and links) are being re-designed with lower power consumption in mind
• The targeted goal is 20 MW for an ExaFlop system in 2018-2020
• Can we make the MPI library power-aware?
– Polling-based schemes are common in MPI libraries to receive messages and act upon them quickly (contrasted with blocking progress in the sketch below)
– Continuous polling by the CPU consumes a lot of power
– Can the CPUs be run at a lower speed while large collective operations are taking place?
– Can we design power-aware collective schemes? (Day 2)
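To make the polling-vs-blocking trade-off concrete, here is a minimal sketch using only standard MPI calls; it is not MVAPICH's internal progress engine, and it assumes at least two processes. The first branch burns CPU cycles polling MPI_Test; the second lets the library block in MPI_Wait, which an implementation with blocking progress can turn into an interrupt-driven, lower-power wait.

/* Hedged sketch: polling vs. blocking completion of a pending receive.
   Run with at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, use_polling = 1, payload = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Irecv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);

        if (use_polling) {
            /* Polling: the CPU spins at 100% until the message arrives;
               reaction time is quick, but the power cost is high. */
            int done = 0;
            while (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        } else {
            /* Blocking: the library may sleep until the network signals
               completion, letting the core drop into a low-power state. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        printf("received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}

The MVAPICH 1.2 feature list later in this deck notes both polling and blocking support for communication progress, which is exactly this trade-off exposed to the user.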
Presentation Overview
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
MVAPICH/MVAPICH2 Software
• High-Performance MPI Library for IB, 10GE/iWARP and RoCE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Latest releases: MVAPICH 1.2 and MVAPICH2 1.6
– Used by more than 1,500 organizations in 60 countries (registered at the OSU site voluntarily)
– More than 57,000 downloads from the OSU site directly
– Empowering many TOP500 production clusters during the last eight years
– Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros
– Also supports the uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/
MVAPICH-1 Architecture
[Architecture diagram: MVAPICH (MPI-1) 1.2 sits on top of five interfaces: #1 OpenFabrics/Gen2 (single-rail), #2 OpenFabrics/Gen2-Hybrid (single-rail), #3 PSM, #4 Shared-Memory and #5 TCP/IP; also shown are VAPI, Gen2-Multirail and uDAPL (deprecated). Underlying hardware: InfiniBand (Mellanox) over PCI-X, PCIe and PCIe-Gen2 (SDR, DDR & QDR); InfiniBand (QLogic) over PCIe & HT (SDR, DDR & QDR); single nodes/laptops with multi-core processors. Major computing platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ...]
Major Features of MVAPICH 1.2
• OpenFabrics-Gen2
– Scalable job start-up with mpirun_rsh, support for SLURM
– RC and XRC support
– Flexible message coalescing
– Multi-core-aware point-to-point communication
– User-defined processor affinity for multi-core platforms
– Multi-core-optimized collective communication
– Asynchronous and scalable on-demand connection management
– RDMA Write and RDMA Read-based protocols
– Lock-free asynchronous progress for better overlap between computation and communication
– Polling and blocking support for communication progress
– Multi-pathing support leveraging the LMC mechanism on large fabrics
– Network-level fault tolerance with Automatic Path Migration (APM)
– Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
– Network Fault Resiliency
Major Features of MVAPICH 1.2 (Continued)
• OpenFabrics-Gen2-Hybrid
– Interface introduced in 1.1; replaces the UD interface in 1.0
– Targeted at emerging multi-thousand-core clusters to achieve the best performance with a minimal memory footprint
– Most of the features as in Gen2
– Adaptive selection during run-time (based on application and system characteristics) to switch between RC and UD (or between XRC and UD) transports
– Multiple buffer organization with XRC support
MVAPICH2 Architecture (Latest Release 1.6)
[Architecture diagram, analogous to the MVAPICH-1 diagram above, covering all the different PCI interfaces. Major computing platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ...]
MVAPICH2 1.6 Features
• Support for GPUDirect
• Using LiMIC2 for true one-sided intra-node RMA transfer to avoid extra memory copies (a basic RMA example is sketched below)
• Upgraded to LiMIC2 version 0.5.4
• Removed the limitation on the number of concurrent windows in RMA operations
• Support for InfiniBand Quality of Service (QoS) with multiple virtual lanes
• Support for 3D torus topology
• Enhanced support for multi-threaded applications
• Fast checkpoint-restart support with an aggregation scheme
• Job pause-migration-restart framework for pro-active fault tolerance
• Support for the new standardized Fault-Tolerance Backplane (FTB) events for the CR and migration frameworks
• Dynamic detection of multiple InfiniBand adapters, used by default in multi-rail configurations
• Support for process-to-rail binding policies (bunch, scatter and user-defined) in multi-rail configurations
• Enhanced and optimized algorithms for MPI_Reduce and MPI_Allreduce for small and medium message sizes
• XRC support with the Hydra process manager
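For readers unfamiliar with the one-sided (RMA) operations and windows mentioned above, here is a minimal, generic MPI-2 example. It uses only standard MPI calls and nothing MVAPICH-specific; the intra-node LiMIC2 path is transparent to code like this.

/* Minimal MPI-2 one-sided example: every rank puts its rank id
   into a window exposed by rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Rank 0 exposes an array of nprocs ints; other ranks expose nothing. */
    int *base = NULL;
    MPI_Aint win_size = (rank == 0) ? (MPI_Aint)nprocs * sizeof(int) : 0;
    if (rank == 0)
        base = calloc(nprocs, sizeof(int));

    MPI_Win win;
    MPI_Win_create(base, win_size, sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* Fence-synchronized epoch: each rank writes directly into rank 0's memory. */
    MPI_Win_fence(0, win);
    MPI_Put(&rank, 1, MPI_INT, 0 /* target rank */, rank /* displacement */,
            1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("slot %d = %d\n", i, base[i]);
        free(base);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}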
Support for Multiple Interfaces/Adapters
• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
– All IB adapters supporting OpenFabrics/Gen2
• QLogic/PSM
– QLogic adapters
• OpenFabrics/Gen2-iWARP
– Chelsio and Intel-NetEffect
• RoCE
– ConnectX-EN
• uDAPL
– Linux-IB, Solaris-IB and any other adapter supporting uDAPL
• TCP/IP
– Any adapter supporting the TCP/IP interface
• Shared-Memory channel
– For running applications on a node with multi-core processors (laptops, SMP systems)
Presentation Overview
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
MVAPICH2 Inter-Node Performance: Ping-Pong Latency
[Two charts: latency (us) vs. message size (bytes) for MVAPICH2-1.6, one panel for small messages and one for large messages. Small-message latency is 1.56 us. Platform: Intel Westmere 2.53 GHz with a Mellanox ConnectX-2 QDR adapter. The measurement follows the standard ping-pong pattern sketched below.]
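A minimal sketch of the ping-pong latency measurement behind such charts, in the spirit of the OSU micro-benchmarks but simplified; the iteration count and the single fixed message size are illustrative, and the reported latency is half the average round-trip time. Run with two processes.

/* Hedged sketch: two-process ping-pong latency test. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, msg_size = 8;               /* message size in bytes (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = calloc(msg_size, 1);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               msg_size, elapsed * 1e6 / (2.0 * ITERATIONS));

    free(buf);
    MPI_Finalize();
    return 0;
}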
MVAPICH2 Inter-Node Performance: Bandwidth
[Two charts: bandwidth (MB/s) vs. message size (bytes) for MVAPICH2-1.6, one panel for uni-directional bandwidth and one for bi-directional bandwidth. Peak uni-directional bandwidth is 3394 MB/s; peak bi-directional bandwidth is 6539 MB/s. Platform: Intel Westmere 2.53 GHz with a Mellanox ConnectX-2 QDR adapter. The uni-directional measurement typically uses a window of non-blocking sends, as sketched below.]
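A minimal sketch of a uni-directional bandwidth test, again in the spirit of the OSU micro-benchmarks; the window size, iteration count and message size are illustrative choices. Run with two processes.

/* Hedged sketch: uni-directional bandwidth test. Rank 0 streams a window
   of non-blocking sends; rank 1 pre-posts matching receives and acks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                          /* messages in flight per iteration */
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, msg_size = 1 << 20;          /* 1 MB messages (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(msg_size);
    MPI_Request reqs[WINDOW];
    char ack = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        double total_mb = (double)msg_size * WINDOW * ITERS / 1e6;
        printf("%d bytes: %.1f MB/s\n", msg_size, total_mb / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}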
Performance of HPC Applications on TACC Ranger using MVAPICH + IB