1
What is MPI?
MPI = Message Passing Interface
Specification of message passing libraries for developers and users
Not a library by itself, but specifies what such a library should be
Specifies the application programming interface (API) for such libraries
Many libraries implement such APIs on different platforms – MPI libraries
Goal: provide a standard for writing message passing programs: portable, efficient, flexible
Language bindings: C, C++, FORTRAN
2
History & Evolution
1980s – 1990s: incompatible libraries and software tools; need for a standard
1994: MPI 1.0; 1995: MPI 1.1, revision and clarification of MPI 1.0
  Major milestone; C and FORTRAN bindings; fully implemented in all MPI libraries
1997: MPI 1.2 – corrections and clarifications to MPI 1.1
1997: MPI 2 – major extension (and clarifications) to MPI 1.1
  C++, C, FORTRAN bindings; partially implemented in most libraries; a few full implementations (e.g. ANL MPICH2)
MPI Evolution
3
Why Use MPI?
Standardization: the de facto standard for parallel computing
  Not an IEEE or ISO standard, but an “industry standard”
  Has practically replaced all previous message passing libraries
Portability: supported on virtually all HPC platforms
  No need to modify source code when migrating to different machines
Performance: so far the best; high performance and high scalability
Rich functionality: MPI 1.1 – 125 functions; MPI 2 – 152 functions
If you know 6 MPI functions, you can do almost everything in parallel.
4
Programming Model
Message passing model: data exchange through explicit communications
Works for distributed-memory as well as shared-memory parallel machines
User has full control (data partition, distribution): needs to identify parallelism and implement parallel algorithms using MPI function calls
Number of CPUs in a computation is static
  New tasks cannot be dynamically spawned during run time (MPI 1.1)
  MPI 2 specifies dynamic process creation and management, but it is not available in most implementations; not necessarily a disadvantage
General assumption: one-to-one mapping of MPI processes to processors (although not necessarily always true)
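Since the user controls data partition and distribution, a common first step in an MPI program is computing which slice of a global array each rank owns. The following plain-C sketch (not MPI code; the function name block_range is made up for illustration) shows one standard block decomposition, where the first n % p ranks get one extra element:

#include <stdio.h>

/* Compute the half-open index range [lo, hi) owned by process `rank`
   when n elements are block-partitioned among p processes.
   The first n % p processes each get one extra element. */
void block_range(int n, int p, int rank, int *lo, int *hi) {
    int base = n / p, extra = n % p;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

int main(void) {
    int lo, hi;
    /* 10 elements over 4 processes: sizes 3, 3, 2, 2 */
    for (int r = 0; r < 4; r++) {
        block_range(10, 4, r, &lo, &hi);
        printf("rank %d owns [%d, %d)\n", r, lo, hi);
    }
    return 0;
}

In a real MPI program, each rank would call block_range with its own rank from MPI_Comm_rank and work only on its slice.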
5
MPI 1.1 Overview
Point-to-point communications
Collective communications
Process groups and communicators
Process topologies
MPI environment management
Public domain (free) MPI implementations: MPICH and MPICH2 (from ANL), LAM MPI
8
General MPI Program Structure
9
Example
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, num_cpus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_cpus);
    printf("Hello, I am process %d among %d processes\n", my_rank, num_cpus);
    MPI_Finalize();
    return 0;
}

On 4 processors:
Hello, I am process 1 among 4 processes
Hello, I am process 2 among 4 processes
Hello, I am process 0 among 4 processes
Hello, I am process 3 among 4 processes
10
Example
program hello
implicit none
include 'mpif.h'
integer :: ierr, my_rank, num_cpus

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, num_cpus, ierr)
write(*,*) "Hello, I am process ", my_rank, " among ", &
           num_cpus, " processes"
call MPI_FINALIZE(ierr)
end program hello

On 4 processors:
Hello, I am process 1 among 4 processes
Hello, I am process 2 among 4 processes
Hello, I am process 0 among 4 processes
Hello, I am process 3 among 4 processes
11
MPI Header Files
In C/C++: #include <mpi.h>
In FORTRAN: include 'mpif.h', or (in FORTRAN 90 and later) use MPI
12
MPI Naming Conventions
All names have the MPI_ prefix.
In FORTRAN: all subroutine names are upper case, and the last argument is the return code.

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    integer COUNT, DATATYPE, DEST, TAG, COMM, IERROR

buf – memory address of start of message
count – number of data items
datatype – type of each data item (integer, character, double, float, ...)
dest – rank of receiving process
tag – additional identification of message
comm – communicator, usually MPI_COMM_WORLD
MPI_Recv arguments:
buf – initial address of receive buffer
count – number of elements in receive buffer (size of receive buffer); may not equal the count of items actually received. The actual number of data items received can be obtained by calling MPI_Get_count().
datatype – data type of entries in receive buffer
source – rank of sending process
tag – additional identification for message
comm – communicator, usually MPI_COMM_WORLD
status – object containing additional info about the received message
ierror – return code
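The source, tag, and communicator arguments, together with the destination, form what MPI calls the message envelope: a receive matches a message only when these fields agree. A minimal plain-C model of that matching rule (not MPI internals; the Msg type and matches function are invented for illustration):

#include <stdio.h>

/* Toy model: a message is data plus an "envelope" of
   (source, dest, tag, communicator).  A posted receive matches a
   message only when all envelope fields agree. */
typedef struct {
    int source, dest, tag, comm;
    const char *data;
} Msg;

int matches(const Msg *m, int dest, int source, int tag, int comm) {
    return m->dest == dest && m->source == source &&
           m->tag == tag && m->comm == comm;
}

int main(void) {
    Msg m = { /*source*/ 0, /*dest*/ 1, /*tag*/ 99, /*comm*/ 0, "hello" };
    printf("same envelope matches:   %d\n", matches(&m, 1, 0, 99, 0));
    printf("different tag matches:   %d\n", matches(&m, 1, 0, 42, 0));
    return 0;
}

Real MPI also allows wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) on the receive side, which this sketch omits.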
MPI_Recv Status
In C: MPI_Status structure with 3 members; declared as MPI_Status status
  status.MPI_TAG – tag of received message
  status.MPI_SOURCE – source rank of message
  status.MPI_ERROR – error code
In FORTRAN: integer array; declared as integer status(MPI_STATUS_SIZE)
  status(MPI_TAG) – tag of received message
  status(MPI_SOURCE) – source rank of message
  status(MPI_ERROR) – error code
Length of received message: MPI_Get_count()

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
    integer STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

MPI_Status status;
int count;
...
MPI_Recv(message, 256, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_CHAR, &count);  /* count contains actual length */
26
Message Data
A message consists of count successive entries of the type indicated by datatype, starting with the entry at address buf.
MPI data types:
  Basic data types: one for each data type in the host languages C/C++ and FORTRAN
  Derived data types: covered later
MPI_Send is blocking: it will not return until the message data and envelope are safely stored away. The message data might be delivered into the matching receive buffer, or copied into some temporary system buffer. After MPI_Send returns, the user can safely access or overwrite the send buffer.
MPI_Recv is blocking: it returns only after the receive buffer contains the received message. After it returns, the data is there and ready for use.
Non-blocking send/recv: discussed later. Non-blocking calls return immediately; however, it is not safe to access the send/receive buffers until other functions have been called to complete the send/recv.
33
Buffering
Send and matching receive operations may not be (and in reality are not) synchronized. The MPI implementation must decide what happens when send/recv are out of sync. Consider:
  A send occurs 5 seconds before the receive is ready; where is the message while the receive is pending?
  Multiple sends arrive at a receiving task that can accept only one send at a time; what happens to the messages that are backing up?
The MPI implementation (not the MPI standard) decides what happens in these cases. Typically, a system buffer is used to hold data in transit.
34
Buffering
System buffer:
  Invisible to users and managed by the MPI library
  A finite resource that can be easily exhausted
  May exist on the sending side, the receiving side, or both
  May improve performance
User can attach own buffer for MPI message buffering.
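Because the system buffer is a finite resource, a useful mental model is a fixed-capacity queue that can fill up. The plain-C sketch below (invented names; real MPI buffering is far more involved) models sends that succeed while buffer slots remain and fail once the buffer is exhausted, until a receive drains a slot:

#include <stdio.h>

#define BUF_SLOTS 4          /* pretend the system buffer holds 4 messages */

static int in_transit = 0;   /* messages buffered but not yet received */

/* Try to buffer one outgoing message: 1 on success, 0 if exhausted. */
int try_buffer_send(void) {
    if (in_transit >= BUF_SLOTS) return 0;
    in_transit++;
    return 1;
}

/* A matching receive drains one buffered message, freeing a slot. */
void drain_one(void) {
    if (in_transit > 0) in_transit--;
}

int main(void) {
    int sent = 0;
    for (int i = 0; i < 6; i++)   /* 6 sends before any receive */
        sent += try_buffer_send();
    printf("buffered %d of 6 sends\n", sent);
    drain_one();                  /* a receive frees one slot */
    printf("after one receive, next send succeeds: %d\n", try_buffer_send());
    return 0;
}

This is the intuition behind the later slides: a buffered send that finds no free slot is erroneous, while a standard send simply blocks until a slot (or a matching receive) appears.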
35
Communication Modes for Send
Standard mode: MPI_Send
  System decides whether the outgoing message will be buffered or not
  Typically: small messages are buffered; large messages are not buffered (synchronous mode)
Buffered mode: MPI_Bsend
  Message is copied to a buffer; the send call then returns. User can attach own buffer for this use.
Synchronous mode: MPI_Ssend
  No buffering. Blocks until a matching receive starts receiving data.
Ready mode: MPI_Rsend
  Can be used only if a matching receive has already been posted (avoids handshake, etc.); otherwise erroneous.
36
Communication Modes
int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Bsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
There is only one MPI_Recv; will match any send mode
37
Properties
Order: MPI messages are non-overtaking.
  If a sender sends two messages in succession to the same destination, and both match the same receive, then that receive will receive the first message, no matter which message physically arrives at the receiving end first.
  If a receiver posts two receives in succession and both match the same message, then the first receive will be satisfied.
  Note: if a receive matches two messages from two different senders, the receive may receive either one (implementation dependent).
Fairness: there is no guarantee of fairness.
  If a message is sent to a destination, and the destination process repeatedly posts a receive that matches this send, the message may still never be received, because each time it is overtaken by another message sent from another source.
  It is the user's responsibility to prevent starvation in such situations.
40
Properties
Resource limitation:
  Pending communications consume system resources (e.g. buffer space).
  Lack of resources may cause errors or prevent execution of an MPI call.
  E.g., an MPI_Bsend that cannot complete due to lack of buffer space is erroneous, whereas an MPI_Send that cannot complete due to lack of buffer space will only block, waiting for buffer space to become available or for a matching receive.
41
Deadlock
Deadlock is a state in which the program cannot proceed. Cyclic dependencies cause deadlock:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Recv(buf1, count, MPI_DOUBLE, 1, tag, comm, &status);
    MPI_Send(buf2, count, MPI_DOUBLE, 1, tag, comm);
} else if (rank == 1) {
    MPI_Recv(buf1, count, MPI_DOUBLE, 0, tag, comm, &status);
    MPI_Send(buf2, count, MPI_DOUBLE, 0, tag, comm);
}

Both processes block in MPI_Recv, each waiting for the other's send: deadlock.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Ssend(buf1, count, MPI_DOUBLE, 1, tag, comm);
    MPI_Recv (buf2, count, MPI_DOUBLE, 1, tag, comm, &status);
} else if (rank == 1) {
    MPI_Ssend(buf1, count, MPI_DOUBLE, 0, tag, comm);
    MPI_Recv (buf2, count, MPI_DOUBLE, 0, tag, comm, &status);
}

Both processes block in MPI_Ssend, each waiting for the other's receive: deadlock.
42
Deadlock
Lack of buffer space may also cause deadlock:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(buf1, count, MPI_DOUBLE, 1, tag, comm);
    MPI_Recv(buf2, count, MPI_DOUBLE, 1, tag, comm, &status);
} else if (rank == 1) {
    MPI_Send(buf1, count, MPI_DOUBLE, 0, tag, comm);
    MPI_Recv(buf2, count, MPI_DOUBLE, 0, tag, comm, &status);
}

Deadlock if there is not enough buffer space!
43
Send-receive
Two remedies: non-blocking communication, or send-recv.
MPI_SENDRECV combines send and recv in one call.
  Useful in shift operations; avoids the possible deadlock in circular shifts and similar operations.
  Equivalent to executing a nonblocking send and a nonblocking recv, and then waiting for both to complete.

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

sendbuf and recvbuf may not be the same memory address (the buffers must not overlap).
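The circular-shift pattern that MPI_Sendrecv supports relies on modular neighbor arithmetic. Note that in C, writing (my_rank - 1 + ncpus) % ncpus rather than (my_rank - 1) % ncpus matters: the latter yields -1 for rank 0, since the C % operator can return negative values. A small plain-C check of the arithmetic (the helper names left_of and right_of are made up for illustration):

#include <stdio.h>

/* Neighbors of `rank` on a ring of p processes. Adding p before
   taking % keeps the left neighbor non-negative for rank 0. */
int left_of (int rank, int p) { return (rank - 1 + p) % p; }
int right_of(int rank, int p) { return (rank + 1) % p; }

int main(void) {
    int p = 4;
    for (int r = 0; r < p; r++)
        printf("rank %d: left=%d right=%d\n",
               r, left_of(r, p), right_of(r, p));
    return 0;
}

These are exactly the dest and source arguments one would pass to MPI_Sendrecv in a circular shift.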
44
Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, ncpus;
    int left_neighbor, right_neighbor;
    int data_received = -1;
    int send_tag = 101, recv_tag = 101;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    left_neighbor = (my_rank - 1 + ncpus) % ncpus;
    right_neighbor = (my_rank + 1) % ncpus;

    /* circular shift: send own rank to left neighbor,
       receive from right neighbor */
    MPI_Sendrecv(&my_rank, 1, MPI_INT, left_neighbor, send_tag,
                 &data_received, 1, MPI_INT, right_neighbor, recv_tag,
                 MPI_COMM_WORLD, &status);
    printf("Among %d processes, process %d received from right neighbor: %d\n",
           ncpus, my_rank, data_received);

    // clean up
    MPI_Finalize();
    return 0;
}

mpirun -np 4 test_shift

Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2