1
What is MPI?
MPI = Message Passing Interface
Specification of message passing libraries for developers and users
Not a library by itself, but specifies what such a library should be
Specifies the application programming interface (API) for such libraries
Many libraries implement such APIs on different platforms – MPI libraries
Goal: provide a standard for writing message passing programs: portable, efficient, flexible
Language bindings: C, C++, FORTRAN
2
History & Evolution
1980s – 1990s: incompatible libraries and software tools; need for a standard
1994: MPI 1.0; 1995: MPI 1.1, revision and clarification of MPI 1.0
  Major milestone; C and FORTRAN bindings; fully implemented in all MPI libraries
1997: MPI 1.2 – corrections and clarifications to MPI 1.1
1997: MPI 2 – major extension (and clarifications) to MPI 1.1
  C++, C, FORTRAN bindings; partially implemented in most libraries; a few full implementations (e.g. ANL MPICH2)
MPI Evolution
3
Why Use MPI?
Standardization: the de facto standard for parallel computing
  Not an IEEE or ISO standard, but an “industry standard”
  Has practically replaced all previous message passing libraries
Portability: supported on virtually all HPC platforms
  No need to modify source code when migrating to different machines
Performance: so far the best; high performance and high scalability
Rich functionality: MPI 1.1 – 125 functions; MPI 2 – 152 functions
If you know 6 MPI functions, you can do almost everything in parallel.
4
Programming Model
Message passing model: data exchange through explicit communications
Works for distributed-memory as well as shared-memory parallel machines
User has full control (data partition, distribution): needs to identify parallelism and implement parallel algorithms using MPI function calls
Number of CPUs in a computation is static
  New tasks cannot be dynamically spawned during run time (MPI 1.1)
  MPI 2 specifies dynamic process creation and management, but it is not available in most implementations; not necessarily a disadvantage
General assumption: one-to-one mapping of MPI processes to processors (although not necessarily always true)
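Since the user controls data partition and distribution, a common first step in an MPI program is computing which slice of a global array each rank owns. The following plain-C sketch (not MPI code; the function name block_range is made up for illustration) shows one standard block decomposition, where the first n % p ranks get one extra element:

#include <stdio.h>

/* Compute the half-open index range [lo, hi) owned by process `rank`
   when n elements are block-partitioned among p processes.
   The first n % p processes each get one extra element. */
void block_range(int n, int p, int rank, int *lo, int *hi) {
    int base = n / p, extra = n % p;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

int main(void) {
    int lo, hi;
    /* 10 elements over 4 processes: sizes 3, 3, 2, 2 */
    for (int r = 0; r < 4; r++) {
        block_range(10, 4, r, &lo, &hi);
        printf("rank %d owns [%d, %d)\n", r, lo, hi);
    }
    return 0;
}

In a real MPI program, each rank would call block_range with its own rank from MPI_Comm_rank and work only on its slice.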
5
MPI 1.1 Overview
Point-to-point communications
Collective communications
Process groups and communicators
Process topologies
MPI environment management
Public domain (free) MPI implementations: MPICH and MPICH2 (from ANL), LAM MPI
8
General MPI Program Structure
9
Example
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, num_cpus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_cpus);
    printf("Hello, I am process %d among %d processes\n", my_rank, num_cpus);
    MPI_Finalize();
    return 0;
}

On 4 processors:
Hello, I am process 1 among 4 processes
Hello, I am process 2 among 4 processes
Hello, I am process 0 among 4 processes
Hello, I am process 3 among 4 processes
10
Example
program hello
implicit none
include 'mpif.h'
integer :: ierr, my_rank, num_cpus

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, num_cpus, ierr)
write(*,*) "Hello, I am process ", my_rank, " among ", &
           num_cpus, " processes"
call MPI_FINALIZE(ierr)
end program hello

On 4 processors:
Hello, I am process 1 among 4 processes
Hello, I am process 2 among 4 processes
Hello, I am process 0 among 4 processes
Hello, I am process 3 among 4 processes
11
MPI Header Files
In C/C++: #include <mpi.h>
In FORTRAN: include 'mpif.h', or (in FORTRAN 90 and later) use MPI
12
MPI Naming Conventions
All names have the MPI_ prefix.
In FORTRAN: all subroutine names are upper case, and the last argument is the return code.

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    integer COUNT, DATATYPE, DEST, TAG, COMM, IERROR

buf – memory address of start of message
count – number of data items
datatype – type of each data item (integer, character, double, float, ...)
dest – rank of receiving process
tag – additional identification of message
comm – communicator, usually MPI_COMM_WORLD
MPI_Recv arguments:
buf – initial address of receive buffer
count – number of elements in receive buffer (size of receive buffer); may not equal the count of items actually received. The actual number of data items received can be obtained by calling MPI_Get_count().
datatype – data type of entries in receive buffer
source – rank of sending process
tag – additional identification for message
comm – communicator, usually MPI_COMM_WORLD
status – object containing additional info about the received message
ierror – return code
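The source, tag, and communicator arguments, together with the destination, form what MPI calls the message envelope: a receive matches a message only when these fields agree. A minimal plain-C model of that matching rule (not MPI internals; the Msg type and matches function are invented for illustration):

#include <stdio.h>

/* Toy model: a message is data plus an "envelope" of
   (source, dest, tag, communicator).  A posted receive matches a
   message only when all envelope fields agree. */
typedef struct {
    int source, dest, tag, comm;
    const char *data;
} Msg;

int matches(const Msg *m, int dest, int source, int tag, int comm) {
    return m->dest == dest && m->source == source &&
           m->tag == tag && m->comm == comm;
}

int main(void) {
    Msg m = { /*source*/ 0, /*dest*/ 1, /*tag*/ 99, /*comm*/ 0, "hello" };
    printf("same envelope matches:   %d\n", matches(&m, 1, 0, 99, 0));
    printf("different tag matches:   %d\n", matches(&m, 1, 0, 42, 0));
    return 0;
}

Real MPI also allows wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) on the receive side, which this sketch omits.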
MPI_Recv Status
In C: MPI_Status structure with 3 members; declared as MPI_Status status
  status.MPI_TAG – tag of received message
  status.MPI_SOURCE – source rank of message
  status.MPI_ERROR – error code
In FORTRAN: integer array; declared as integer status(MPI_STATUS_SIZE)
  status(MPI_TAG) – tag of received message
  status(MPI_SOURCE) – source rank of message
  status(MPI_ERROR) – error code
Length of received message: MPI_Get_count()

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
    integer STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

MPI_Status status;
int count;
...
MPI_Recv(message, 256, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_CHAR, &count);  /* count contains actual length */
26
Message Data
A message consists of count successive entries of the type indicated by datatype, starting with the entry at address buf.
MPI data types:
  Basic data types: one for each data type in the host languages C/C++ and FORTRAN
  Derived data types: covered later
MPI_Send is blocking: it will not return until the message data and envelope are safely stored away. The message data might be delivered into the matching receive buffer, or copied into some temporary system buffer. After MPI_Send returns, the user can safely access or overwrite the send buffer.
MPI_Recv is blocking: it returns only after the receive buffer contains the received message. After it returns, the data is there and ready for use.
Non-blocking send/recv: discussed later. Non-blocking calls return immediately; however, it is not safe to access the send/receive buffers until other functions have been called to complete the send/recv.
33
Buffering
Send and matching receive operations may not be (and in reality are not) synchronized. The MPI implementation must decide what happens when send/recv are out of sync. Consider:
  A send occurs 5 seconds before the receive is ready; where is the message while the receive is pending?
  Multiple sends arrive at a receiving task that can accept only one send at a time; what happens to the messages that are backing up?
The MPI implementation (not the MPI standard) decides what happens in these cases. Typically, a system buffer is used to hold data in transit.
34
Buffering
System buffer:
  Invisible to users and managed by the MPI library
  A finite resource that can be easily exhausted
  May exist on the sending side, the receiving side, or both
  May improve performance
User can attach own buffer for MPI message buffering.
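Because the system buffer is a finite resource, a useful mental model is a fixed-capacity queue that can fill up. The plain-C sketch below (invented names; real MPI buffering is far more involved) models sends that succeed while buffer slots remain and fail once the buffer is exhausted, until a receive drains a slot:

#include <stdio.h>

#define BUF_SLOTS 4          /* pretend the system buffer holds 4 messages */

static int in_transit = 0;   /* messages buffered but not yet received */

/* Try to buffer one outgoing message: 1 on success, 0 if exhausted. */
int try_buffer_send(void) {
    if (in_transit >= BUF_SLOTS) return 0;
    in_transit++;
    return 1;
}

/* A matching receive drains one buffered message, freeing a slot. */
void drain_one(void) {
    if (in_transit > 0) in_transit--;
}

int main(void) {
    int sent = 0;
    for (int i = 0; i < 6; i++)   /* 6 sends before any receive */
        sent += try_buffer_send();
    printf("buffered %d of 6 sends\n", sent);
    drain_one();                  /* a receive frees one slot */
    printf("after one receive, next send succeeds: %d\n", try_buffer_send());
    return 0;
}

This is the intuition behind the later slides: a buffered send that finds no free slot is erroneous, while a standard send simply blocks until a slot (or a matching receive) appears.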
35
Communication Modes for Send
Standard mode: MPI_Send
  System decides whether the outgoing message will be buffered or not
  Typically: small messages are buffered; large messages are not buffered (synchronous mode)
Buffered mode: MPI_Bsend
  Message is copied to a buffer; the send call then returns. User can attach own buffer for this use.
Synchronous mode: MPI_Ssend
  No buffering. Blocks until a matching receive starts receiving data.
Ready mode: MPI_Rsend
  Can be used only if a matching receive has already been posted (avoids handshake, etc.); otherwise erroneous.
36
Communication Modes
int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Bsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
There is only one MPI_Recv; will match any send mode
37
Properties
Order: MPI messages are non-overtaking.
  If a sender sends two messages in succession to the same destination, and both match the same receive, then that receive will receive the first message, no matter which message physically arrives at the receiving end first.
  If a receiver posts two receives in succession and both match the same message, then the first receive will be satisfied.
  Note: if a receive matches two messages from two different senders, the receive may receive either one (implementation dependent).
Fairness: there is no guarantee of fairness.
  If a message is sent to a destination, and the destination process repeatedly posts a receive that matches this send, the message may still never be received, because each time it is overtaken by another message sent from another source.
  It is the user's responsibility to prevent starvation in such situations.
40
Properties
Resource limitation:
  Pending communications consume system resources (e.g. buffer space).
  Lack of resources may cause errors or prevent execution of an MPI call.
  E.g., an MPI_Bsend that cannot complete due to lack of buffer space is erroneous, whereas an MPI_Send that cannot complete due to lack of buffer space will only block, waiting for buffer space to become available or for a matching receive.
41
Deadlock
Deadlock is a state in which the program cannot proceed. Cyclic dependencies cause deadlock:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Recv(buf1, count, MPI_DOUBLE, 1, tag, comm, &status);
    MPI_Send(buf2, count, MPI_DOUBLE, 1, tag, comm);
} else if (rank == 1) {
    MPI_Recv(buf1, count, MPI_DOUBLE, 0, tag, comm, &status);
    MPI_Send(buf2, count, MPI_DOUBLE, 0, tag, comm);
}

Both processes block in MPI_Recv, each waiting for the other's send: deadlock.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Ssend(buf1, count, MPI_DOUBLE, 1, tag, comm);
    MPI_Recv (buf2, count, MPI_DOUBLE, 1, tag, comm, &status);
} else if (rank == 1) {
    MPI_Ssend(buf1, count, MPI_DOUBLE, 0, tag, comm);
    MPI_Recv (buf2, count, MPI_DOUBLE, 0, tag, comm, &status);
}

Both processes block in MPI_Ssend, each waiting for the other's receive: deadlock.
42
Deadlock
Lack of buffer space may also cause deadlock:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(buf1, count, MPI_DOUBLE, 1, tag, comm);
    MPI_Recv(buf2, count, MPI_DOUBLE, 1, tag, comm, &status);
} else if (rank == 1) {
    MPI_Send(buf1, count, MPI_DOUBLE, 0, tag, comm);
    MPI_Recv(buf2, count, MPI_DOUBLE, 0, tag, comm, &status);
}

Deadlock if there is not enough buffer space!
43
Send-receive
Two remedies: non-blocking communication, or send-recv.
MPI_SENDRECV combines send and recv in one call.
  Useful in shift operations; avoids the possible deadlock in circular shifts and similar operations.
  Equivalent to executing a nonblocking send and a nonblocking recv, and then waiting for both to complete.

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

sendbuf and recvbuf may not be the same memory address (the buffers must not overlap).
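The circular-shift pattern that MPI_Sendrecv supports relies on modular neighbor arithmetic. Note that in C, writing (my_rank - 1 + ncpus) % ncpus rather than (my_rank - 1) % ncpus matters: the latter yields -1 for rank 0, since the C % operator can return negative values. A small plain-C check of the arithmetic (the helper names left_of and right_of are made up for illustration):

#include <stdio.h>

/* Neighbors of `rank` on a ring of p processes. Adding p before
   taking % keeps the left neighbor non-negative for rank 0. */
int left_of (int rank, int p) { return (rank - 1 + p) % p; }
int right_of(int rank, int p) { return (rank + 1) % p; }

int main(void) {
    int p = 4;
    for (int r = 0; r < p; r++)
        printf("rank %d: left=%d right=%d\n",
               r, left_of(r, p), right_of(r, p));
    return 0;
}

These are exactly the dest and source arguments one would pass to MPI_Sendrecv in a circular shift.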
44
Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, ncpus;
    int left_neighbor, right_neighbor;
    int data_received = -1;
    int send_tag = 101, recv_tag = 101;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    left_neighbor = (my_rank - 1 + ncpus) % ncpus;
    right_neighbor = (my_rank + 1) % ncpus;

    /* circular shift: send own rank to left neighbor,
       receive from right neighbor */
    MPI_Sendrecv(&my_rank, 1, MPI_INT, left_neighbor, send_tag,
                 &data_received, 1, MPI_INT, right_neighbor, recv_tag,
                 MPI_COMM_WORLD, &status);
    printf("Among %d processes, process %d received from right neighbor: %d\n",
           ncpus, my_rank, data_received);

    // clean up
    MPI_Finalize();
    return 0;
}

mpirun -np 4 test_shift

Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2