Performance Oriented MPI
Posted on 15-Jan-2016
Performance Oriented MPI
Jeffrey M. Squyres
Andrew Lumsdaine
NERSC/LBNL and U. Notre Dame
Overview
Overview and History of MPI
Performance Oriented Point to Point
Collectives, Data Types
Diagnostics and Tuning
Rules of Thumb and Gotchas
Scope of This Talk
Beginning to intermediate user
General principles and rules of thumb
When and where performance might be available
Omit (advanced) low-level issues
Overview and History of MPI
Library (not language) specification
Goals
– Portability
– Efficiency
– Functionality (small and large)
Safety (communicators)
Conservative (current best practices)
Performance in MPI
MPI includes many performance-oriented features
These features are only potentially high-performance
The standard seeks not to preclude performance; it does not mandate it
Progress might only be made during MPI function calls
(Potential) Performance Features
Non-blocking operations
Persistent operations
Collective operations
MPI Datatypes
Basic Point to Point
"Six function MPI" includes
MPI_Send()
MPI_Recv()
These are useful, but there is more
Basic Point to Point
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
} else {
MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
}
Non-Blocking Operations
MPI_Isend()
MPI_Irecv()
"I" is for immediate
Paired with MPI_Test()/MPI_Wait()
Non-Blocking Operations
MPI_Comm_rank(comm,&rank);
if (rank == 0) {
MPI_Isend(sendbuf,count,MPI_REAL,1,tag,comm,&request);
/* Do some computation */
MPI_Wait(&request,&status);
} else {
MPI_Irecv(recvbuf,count,MPI_REAL,0,tag,comm,&request);
/* Do some computation */
MPI_Wait(&request,&status);
}
Persistent Operations
MPI_Send_init()
MPI_Recv_init()
Creates a request but does not start it
MPI_Start() begins the communication
A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations
MPI_Comm_rank(comm, &rank);
if (rank == 0)
MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
else
MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
/* … */
for (i = 0; i < n; i++) {
MPI_Start(&request);
/* Do some work */
MPI_Wait(&request, &status);
}
Collective Operations
May be layered on point to point
May use tree communication patterns for efficiency
Synchronization! (No non-blocking collectives)
Collective Operations
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
O(P) (linear) vs. O(log P) (tree)
MPI Datatypes
May allow MPI to send a message directly from memory
May avoid copying/packing
(General) high performance implementations not widely available
Quiz: MPI_Send()
After I call MPI_Send()
– The recipient has received the message
– I have sent the message
– I can write to the message buffer without corrupting the message
Answer: I can write to the message buffer
Sidenote: MPI_Ssend()
MPI_Ssend() has the (perhaps) expected semantics
When MPI_Ssend() returns, the recipient has (at least) started to receive the message
Useful for debugging (replace MPI_Send() with MPI_Ssend())
Quiz: MPI_Isend()
After I call MPI_Isend()
– The recipient has started to receive the message
– I have started to send the message
– I can write to the message buffer without corrupting the message
Answer: None of the above (I must call MPI_Test() or MPI_Wait())
Quiz: MPI_Isend()
True or False
– I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait()
Answer: False (in many/most cases)
Communication is Still Computation
A CPU, usually the main one, must do the communication work
– Part of your process (inside MPI calls)
– Another process on main CPU
– Another thread on main CPU
– Another processor
No Free Lunch
Part of your process (most common)
– Fast but no overlap
Another process (daemons)
– Overlap, but slow (extra copies)
Another thread (rare)
– Overlap and fast, but difficult
Another processor (emerging)
– Overlap and fast, but more hardware
– E.g., Myri/gm, VIA
How Do I Get Performance?
Minimize time spent communicating
– Minimize data copies
Minimize synchronization
– I.e., time waiting for communication
Minimizing Communication Time
Bandwidth
Latency
Minimizing Latency
Collect small messages together (if you can)
– One 1024-byte message instead of 1024 one-byte messages
Minimize other overhead (e.g., copying)
Overlap with computation (if you can)
Example: Domain Decomposition
Naïve Approach
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++)
MPI_Send(…);
for (i = 0; i < 4; i++)
MPI_Recv(…);
}
Naïve Approach
Deadlock! (Maybe)
Can fix with careful coordination of receiving versus sending on alternate processes
But this can still serialize
MPI_Sendrecv()
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++) {
MPI_Sendrecv(…);
}
}
Immediate Operations
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++) {
MPI_Isend(…);
MPI_Irecv(…);
}
MPI_Waitall(…);
}
Receive Before Sending
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++)
MPI_Irecv(…);
for (i = 0; i < 4; i++)
MPI_Isend(…);
MPI_Waitall(…);
}
Persistent Operations
for (i = 0; i < 4; i++) {
MPI_Recv_init(…);
MPI_Send_init(…);
}
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
MPI_Startall(…);
MPI_Waitall(…);
}
Overlapping
while (!done) {
MPI_Startall(…); /* Start exchanges */
do_inner_red(D); /* Internal computation */
for (i = 0; i < 4; i++) {
MPI_Waitany(…); /* As information arrives */
do_received_red(D); /* Process */
}
MPI_Startall(…);
do_inner_black(D);
for (i = 0; i < 4; i++) {
MPI_Waitany(…);
do_received_black(D);
}
}
Advanced Overlap
MPI_Startall(…); /* Start all receives */
/* … */
while (!done) {
MPI_Startall(…); /* Start sends */
do_inner_red(D); /* Internal computation */
for (i = 0; i < 4; i++) {
MPI_Waitany(…); /* Wait on receives */
if (received) {
do_received_red(D); /* Process */
MPI_Start(…); /* Restart receive */
}
}
/* Repeat for black */
}
MPI Data Types
MPI_Type_vector
MPI_Type_struct
Etc.
MPI_Pack might be better
Minimizing Synchronization
At synchronization point (e.g., with collective communication) all processes must arrive at collective call
Can spend lots of time waiting
This is often an algorithmic issue
– E.g., check for convergence every 5 iterations instead of every iteration
Gotchas
MPI_Probe
– Guarantees extra memory copy
MPI_ANY_SOURCE
– Can cause additional (internal) looping
MPI_Alltoall
– All pairs must communicate
– Synchronization (avoid in general)
Diagnostic Tools
TotalView
Prism
Upshot
XMPI
Summary
Receive before sending
Collect small messages together
Overlap (if possible)
Use immediate operations
Use persistent operations
Use diagnostic tools