Performance Oriented MPI
Posted on 15-Jan-2016
Performance Oriented MPI
Jeffrey M. Squyres
Andrew Lumsdaine
NERSC/LBNL and U. Notre Dame
Overview
Overview and History of MPI
Performance Oriented Point to Point
Collectives, Data Types
Diagnostics and Tuning
Rules of Thumb and Gotchas
Scope of This Talk
Beginning to intermediate user
General principles and rules of thumb
When and where performance might be available
Omit (advanced) low-level issues
Overview and History of MPI
Library (not language) specification
Goals
– Portability
– Efficiency
– Functionality (small and large)
Safety (communicators)
Conservative (current best practices)
Performance in MPI
MPI includes many performance-oriented features
These features are only potentially high-performance
The standard seeks not to preclude performance; it does not mandate it
Progress might only be made during MPI function calls
(Potential) Performance Features
Non-blocking operations
Persistent operations
Collective operations
MPI Datatypes
Basic Point to Point
"Six function MPI" includes
MPI_Send()
MPI_Recv()
These are useful, but there is more
Basic Point to Point
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
} else {
MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
}
Non-Blocking Operations
MPI_Isend()
MPI_Irecv()
"I" is for immediate
Paired with MPI_Test()/MPI_Wait()
Non-Blocking Operations
MPI_Comm_rank(comm,&rank);
if (rank == 0) {
MPI_Isend(sendbuf,count,MPI_REAL,1,tag,comm,&request);
/* Do some computation */
MPI_Wait(&request,&status);
} else {
MPI_Irecv(recvbuf,count,MPI_REAL,0,tag,comm,&request);
/* Do some computation */
MPI_Wait(&request,&status);
}
Persistent Operations
MPI_Send_init()
MPI_Recv_init()
Creates a request but does not start it
MPI_Start() begins the communication
A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations
MPI_Comm_rank(comm, &rank);
if (rank == 0)
MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
else
MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
/* … */
for (i = 0; i < n; i++) {
MPI_Start(&request);
/* Do some work */
MPI_Wait(&request, &status);
}
Collective Operations
May be layered on point to point
May use tree communication patterns for efficiency
Synchronization! (No non-blocking collectives)
Collective Operations
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
O(P) (linear) vs. O(log P) (tree)
MPI Datatypes
May allow MPI to send a message directly from memory
May avoid copying/packing
(General) high performance implementations not widely available
Quiz: MPI_Send()
After I call MPI_Send()
– The recipient has received the message
– I have sent the message
– I can write to the message buffer without corrupting the message
Answer: I can write to the message buffer
Sidenote: MPI_Ssend()
MPI_Ssend() has the (perhaps) expected semantics
When MPI_Ssend() returns, the recipient has (at least) started to receive the message
Useful for debugging (replace MPI_Send() with MPI_Ssend())
Quiz: MPI_Isend()
After I call MPI_Isend()
– The recipient has started to receive the message
– I have started to send the message
– I can write to the message buffer without corrupting the message
Answer: None of the above (I must call MPI_Test() or MPI_Wait())
Quiz: MPI_Isend()
True or False
– I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait()
Answer: False (in many/most cases)
Communication is Still Computation
A CPU, usually the main one, must do the communication work
– Part of your process (inside MPI calls)
– Another process on main CPU
– Another thread on main CPU
– Another processor
No Free Lunch
Part of your process (most common)
– Fast but no overlap
Another process (daemons)
– Overlap, but slow (extra copies)
Another thread (rare)
– Overlap and fast, but difficult
Another processor (emerging)
– Overlap and fast, but more hardware
– E.g., Myri/gm, VIA
How Do I Get Performance?
Minimize time spent communicating
– Minimize data copies
Minimize synchronization
– I.e., time waiting for communication
Minimizing Communication Time
Bandwidth
Latency
Minimizing Latency
Collect small messages together (if you can)
– One 1024-byte message instead of 1024 one-byte messages
Minimize other overhead (e.g., copying)
Overlap with computation (if you can)
Example: Domain Decomposition
Naïve Approach
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++)
MPI_Send(…);
for (i = 0; i < 4; i++)
MPI_Recv(…);
}
Naïve Approach
Deadlock! (Maybe)
Can fix with careful coordination of receiving versus sending on alternate processes
But this can still serialize
MPI_Sendrecv()
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++) {
MPI_Sendrecv(…);
}
}
Immediate Operations
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++) {
MPI_Isend(…);
MPI_Irecv(…);
}
MPI_Waitall(…);
}
Receive Before Sending
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
for (i = 0; i < 4; i++)
MPI_Irecv(…);
for (i = 0; i < 4; i++)
MPI_Isend(…);
MPI_Waitall(…);
}
Persistent Operations
for (i = 0; i < 4; i++) {
MPI_Recv_init(…);
MPI_Send_init(…);
}
while (!done) {
exchange(D, neighbors, myrank);
dored(D);
exchange(D, neighbors, myrank);
doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
MPI_Startall(…);
MPI_Waitall(…);
}
Overlapping
while (!done) {
MPI_Startall(…); /* Start exchanges */
do_inner_red(D); /* Internal computation */
for (i = 0; i < 4; i++) {
MPI_Waitany(…); /* As information arrives */
do_received_red(D); /* Process */
}
MPI_Startall(…);
do_inner_black(D);
for (i = 0; i < 4; i++) {
MPI_Waitany(…);
do_received_black(D);
}
}
Advanced Overlap
MPI_Startall(…); /* Start all receives */
/* … */
while (!done) {
MPI_Startall(…); /* Start sends */
do_inner_red(D); /* Internal computation */
for (i = 0; i < 4; i++) {
MPI_Waitany(…); /* Wait on receives */
if (received) {
do_received_red(D); /* Process */
MPI_Start(…); /* Restart receive */
}
}
/* Repeat for black */
}
MPI Data Types
MPI_Type_vector
MPI_Type_struct
Etc.
MPI_Pack might be better
Minimizing Synchronization
At synchronization point (e.g., with collective communication) all processes must arrive at collective call
Can spend lots of time waiting
This is often an algorithmic issue
– E.g., check for convergence every 5 iterations instead of every iteration
Gotchas
MPI_Probe
– Guarantees extra memory copy
MPI_ANY_SOURCE
– Can cause additional (internal) looping
MPI_Alltoall
– All pairs must communicate
– Synchronization (avoid in general)
Diagnostic Tools
TotalView
Prism
Upshot
XMPI
Summary
Receive before sending
Collect small messages together
Overlap (if possible)
Use immediate operations
Use persistent operations
Use diagnostic tools