An Introduction

Background
- The message-passing model
- Parallel computing model
- Communication between processes
- Sources of further MPI information

Basics of MPI message passing
- Fundamental concepts
- Simple examples in C
- Extended point-to-point operations
- Non-blocking communication modes
/* main part of the program: use MPI function calls depending on
   your data partitioning and parallelization architecture */

MPI_Finalize();
}
Initializing MPI

The first MPI routine called in any MPI program must be the initialization routine MPI_Init.
MPI_Init is called once by every process, before any other MPI routine:

int MPI_Init(int *argc, char ***argv);
A minimal MPI program (C)

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}
Commentary
- #include "mpi.h" provides basic MPI definitions and types.
- MPI_Init starts MPI; MPI_Finalize exits MPI.
- Note that all non-MPI routines are local; thus printf runs on each process.
Notes on C

In C:
- mpi.h must be included: #include "mpi.h"
- MPI functions return error codes or MPI_SUCCESS.
Error handling
- By default, an error causes all processes to abort.
- The user can cause routines to return (with an error code) instead.
- A user can also write and install custom error handlers.
- Libraries might want to handle errors differently from applications.
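As a minimal sketch of the "return instead of abort" behavior, assuming an MPI-1-era implementation such as the MPICH release used later in these notes (the deliberately invalid destination rank exists only to trigger an error):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int err;
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting.
       (MPI_Errhandler_set is the MPI-1 call; MPI-2 renamed it
       MPI_Comm_set_errhandler.) */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid rank: with MPI_ERRORS_RETURN this call
       fails and returns an error code rather than aborting. */
    err = MPI_Send(NULL, 0, MPI_INT, -99, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        printf("MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}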
Running MPI programs
- The MPI-1 standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program.
- In general, starting an MPI program depends on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables.
- A common way:

mpirun -np 2 hello
Finding out about the environment

Two important questions that arise early in a parallel program are:
- How many processes are participating in this computation?
- Which one am I?

MPI provides functions to answer these questions:
- MPI_Comm_size reports the number of processes.
- MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Some basic concepts
- Processes can be collected into groups.
- Each message is sent in a context, and must be received in the same context.
- A group and context together form a communicator.
- A process is identified by its rank in the group associated with a communicator.
- There is a default communicator, called MPI_COMM_WORLD, whose group contains all initial processes.
MPI datatypes

The data in a message to be sent or received is described by a triple (address, count, datatype).

An MPI datatype is recursively defined as:
- predefined, corresponding to a datatype from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
- a contiguous array of MPI datatypes
- a strided block of datatypes
- an indexed array of blocks of datatypes
- an arbitrary structure of datatypes

There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise (see the sketch below).
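As a sketch of the "row of a matrix stored columnwise" case: MPI_Type_vector describes a strided block, so one row of an N x N column-major matrix is N single elements, each N entries apart. (The function name send_row and the matrix size N are illustrative.)

#include "mpi.h"
#define N 4

/* Send row r of an N x N matrix stored columnwise (a[i + j*N] is
   element (i,j)).  Consecutive row elements are N entries apart,
   so the row is a strided block: N blocks of 1 element, stride N. */
void send_row(double a[N*N], int r, int dest, int tag)
{
    MPI_Datatype rowtype;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &rowtype);
    MPI_Type_commit(&rowtype);
    /* One "rowtype" element starting at &a[r] is the whole row r. */
    MPI_Send(&a[r], 1, rowtype, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&rowtype);
}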
Why datatypes?
- Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
- Specifying an application-oriented layout of data in memory
  - reduces memory-to-memory copies in the implementation
  - allows the use of special hardware (scatter/gather) when available.
MPI basic send/receive

We need to fill in the details in:

    Process 0: Send(data)
    Process 1: Receive(data)

Things that need specifying:
- How will "data" be described?
- How will processes be identified?
- How will the receiver recognize/screen messages?
- What will it mean for these operations to complete?
MPI basic (blocking) send

MPI_Send(void *start, int count, MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm)

- The message buffer is described by (start, count, datatype).
- The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
- When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process.
MPI_Send primitive
- MPI_Send performs a standard-mode, blocking send.
- The send buffer specified by MPI_Send consists of count successive entries of the type indicated by datatype, starting with the entry at address start.
- The count may be zero, in which case the data part of the message is empty.
Basic MPI datatypes

MPI datatype          C datatype
MPI_CHAR              char
MPI_SIGNED_CHAR       signed char
MPI_UNSIGNED_CHAR     unsigned char
MPI_SHORT             signed short
MPI_UNSIGNED_SHORT    unsigned short
MPI_INT               signed int
MPI_UNSIGNED          unsigned int
MPI_LONG              signed long
MPI_UNSIGNED_LONG     unsigned long
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI basic (blocking) receive

MPI_Recv(void *start, int count, MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm, MPI_Status *status)

- Waits until a matching (on source and tag) message is received from the system, and the buffer can be used.
- source is the rank in the communicator specified by comm, or MPI_ANY_SOURCE.
- status contains further information.
- Receiving fewer than count occurrences of datatype is OK, but receiving more is an error.
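Putting send and receive together, a minimal two-process exchange (a sketch; the message value, tag 99, and variable names are illustrative):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, n = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send one int to process 1, tag 99 */
        MPI_Send(&n, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive one int from process 0, matching tag 99 */
        MPI_Recv(&n, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", n);
    }

    MPI_Finalize();
    return 0;
}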
More comments on send and receive
- A receive operation may accept messages from an arbitrary sender, but a send operation must specify a unique receiver.
- Source equal to destination is allowed; that is, a process can send a message to itself.
- Nonblocking communication uses request objects to identify communication operations and link the posting operation with the completion operation.
- The request is a request handle that can be used to query the status of the communication or wait for its completion.
More on nonblocking send and receive
- A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender must not access any part of the send buffer after a nonblocking send operation is posted, until the complete-send returns.
- A nonblocking receive indicates that the system may start writing data into the receive buffer. The receiver must not access any part of the receive buffer after a nonblocking receive operation is posted, until the complete-receive returns.
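As a sketch of these rules (the function name exchange and the partner rank are illustrative): post the operations, optionally do unrelated work, then complete them with MPI_Wait before touching the buffers.

#include "mpi.h"

/* Sketch: pairwise exchange with a hypothetical partner rank.
   Neither buffer may be touched between the post and the MPI_Wait. */
void exchange(int partner, int *sendbuf, int *recvbuf)
{
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Isend(sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &rreq);

    /* ... unrelated computation can overlap the communication ... */

    MPI_Wait(&sreq, &status);  /* send complete: sendbuf reusable */
    MPI_Wait(&rreq, &status);  /* receive complete: recvbuf valid */
}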
MPI_Recv primitive
- MPI_Recv performs a standard-mode, blocking receive.
- The receive buffer consists of storage sufficient to contain count consecutive entries of the type specified by datatype, starting at address start.
- An overflow error occurs if all incoming data does not fit into the receive buffer.
- The receiver can specify a wildcard value for source (MPI_ANY_SOURCE) and/or a wildcard value for tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable.
MPI tags
- Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
- Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive (see the fragment below).
- Some non-MPI message-passing systems have called tags "message types". MPI calls them tags to avoid confusion with datatypes.
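For illustration (the tag value 7 is arbitrary, and an int n and MPI_Status status are assumed to be declared as in the earlier examples):

/* Screen on a specific tag: only messages sent with tag 7 match. */
MPI_Recv(&n, 1, MPI_INT, MPI_ANY_SOURCE, 7, MPI_COMM_WORLD, &status);

/* No screening: any tag (and here any source) is accepted. */
MPI_Recv(&n, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);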
Retrieving further information

status is a data structure allocated in the user's program. In C:

int recvd_tag, recvd_from, recvd_count;
MPI_Status status;
MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count(&status, datatype, &recvd_count);
Example (fragment; declarations not shown):

if (0 == myid) {
    /* open input file and initialize data */
    strcpy(fn, getenv("PWD"));
    strcat(fn, "/rand_data.txt");
    if (NULL == (fp = fopen(fn, "r"))) {
        printf("Can't open the input file: %s\n\n", fn);
        exit(1);
    }
    for (i = 0; i < MAXSIZE; i++) {
        fscanf(fp, "%d", &data[i]);
    }
}

/* broadcast data */
MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

/* add portion of data */
x = MAXSIZE / numprocs;   /* must be an integer */
low  = myid * x;
high = low + x;
for (i = low; i < high; i++) {
    myresult += data[i];
}
printf("I got %d from %d\n", myresult, myid);

/* compute global sum */
MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (0 == myid) {
    printf("The sum is %d.\n", result);
}
MPI_Finalize();
}
Communication modes

MPI provides multiple modes for sending messages:
- Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun. (Unsafe programs become incorrect and usually deadlock within an MPI_Ssend.)
- Buffered mode (MPI_Bsend): the user supplies a buffer to the system for its use. (The user supplies enough memory to make an unsafe program safe; see the sketch below.)
- Ready mode (MPI_Rsend): the user guarantees that a matching receive has already been posted.
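As a sketch of buffered mode (the function name bsend_example and its parameters are illustrative): the user attaches memory with MPI_Buffer_attach before calling MPI_Bsend.

#include "mpi.h"
#include <stdlib.h>

/* Sketch: buffered-mode send.  The user-supplied buffer lets the
   send complete locally even if no receive has been posted yet. */
void bsend_example(int dest, int *data, int count)
{
    int size;
    void *buf;

    /* Room for the message plus MPI's per-message overhead. */
    MPI_Pack_size(count, MPI_INT, MPI_COMM_WORLD, &size);
    size += MPI_BSEND_OVERHEAD;
    buf = malloc(size);

    MPI_Buffer_attach(buf, size);
    MPI_Bsend(data, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buf, &size);  /* blocks until the data is gone */
    free(buf);
}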
Deadlocks in blocking operations
- Send a large message from process 0 to process 1.
- If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
- What happens with:

    Process 0        Process 1
    Send(1)          Send(0)
    Recv(1)          Recv(0)

- This is called "unsafe" because it depends on the availability of system buffers.
Some solutions to the "unsafe" problem
- Order the operations more carefully:

    Process 0        Process 1
    Send(1)          Recv(0)
    Recv(1)          Send(0)

- Use non-blocking operations (see the sketch below):

    Process 0        Process 1
    Isend(1)         Isend(0)
    Irecv(1)         Irecv(0)
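A sketch of the non-blocking variant in C (rank, sendbuf, recvbuf, and COUNT are assumed to be defined elsewhere): both processes post both operations, then wait for both to finish, so neither send can block the matching receive.

/* Sketch: deadlock-free exchange between ranks 0 and 1. */
int other = (rank == 0) ? 1 : 0;
MPI_Request reqs[2];
MPI_Status  stats[2];

MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(2, reqs, stats);  /* completes both, in either order */

MPI also provides MPI_Sendrecv, which performs the paired send and receive in a single, deadlock-free call.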
When to use MPI
- Portability and performance
- Irregular data structures
- Building tools for others
- Need to manage memory on a per-processor basis
When not to use MPI
- A solution (e.g., a library) already exists: http://www.mcs.anl.gov/mpi/libraries.html
- You require fault tolerance (consider sockets instead)
Program with MPI and play with it
- MPICH-1.2.4 for Windows 2000 has been installed in ECE226.
- On every machine, please refer to c:\Program Files\MPICH\www\nt to find the HTML help page on how to run and program in the Visual C++ 6.0 environment.
- Examples have been installed under c:\Program Files\MPICH\SDK\Examples.

How to run the example
1. Open the MSDEV workspace file found in MPICH\SDK\Examples\nt\examples.dsw.
2. Build the Debug target of the cpi project.
3. Copy MPICH\SDK\Examples\nt\Debug\cpi.exe to a shared directory (use copy/paste to the \\pearl\files\mpi directory).
4. Open a command prompt and change to the directory where you placed cpi.exe.
5. Execute mpirun.exe -np 4 cpi.
6. To set the path in DOS, in this case, use the command:
   set PATH=%PATH%;c:\Program Files\MPICH\mpd\bin
Create your own project
1. Open MS Developer Studio - Visual C++.
2. Create a new project with whatever name you want in whatever directory you want. The easiest one is a Win32 console application with no files in it.
3. Finish the new project wizard.
4. Go to Project->Settings or hit Alt+F7 to bring up the project settings dialog box.
5. Change the settings to use the multithreaded libraries. Change the settings for both Debug and Release targets.
6. Set the include path for all target configurations: this should be c:\Program Files\MPICH\SDK\include.
7. Set the lib path for all target configurations: this should be c:\Program Files\MPICH\SDK\lib.
8. Add the ws2_32.lib library to all configurations (this is the Microsoft Winsock2 library; it's in your default library path). Add mpich.lib to the release target and mpichd.lib to the debug target.
9. Close the project settings dialog box.
10. Add your source files to the project.
Useful MPI function to test your program

MPI_Get_processor_name(name, resultlen)
- name is a unique specifier for the actual node (string).
- resultlen is the length of the result returned in name (integer).

This routine returns the name of the processor on which it was called at the moment of the call. The number of characters actually written is returned in the output argument resultlen.
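A small sketch of its use (the buffer size constant MPI_MAX_PROCESSOR_NAME is provided by mpi.h):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int rank, resultlen;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &resultlen);
    printf("Process %d is running on %s (%d chars)\n",
           rank, name, resultlen);
    MPI_Finalize();
    return 0;
}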