Page 1:

Parallel Programming with Message-Passing Interface (MPI)
An Introduction

Rajkumar Buyya
Grid Computing and Distributed Systems (GRIDS) Lab
The University of Melbourne, Melbourne, Australia
www.gridbus.org

WW Grid

Page 2:

Message-Passing Programming Paradigm

Each processor in a message-passing program runs a sub-program:
- written in a conventional sequential language
- all variables are private
- processes communicate via special subroutine calls

[Figure: each processor (P) has its own private memory (M); the processors are connected by an interconnection network.]

Page 3:

MPI Slides are Derived from

- Dirk van der Knijff, High Performance Parallel Programming, PPT slides
- MPI Notes, Maui HPC Centre
- Melbourne Advanced Research Computing Centre, http://www.hpc.unimelb.edu.au

Page 4:

Single Program Multiple Data

- Introduced in data-parallel programming (HPF)
- Same program runs everywhere
- A restriction on the general message-passing model
- Some vendors only support SPMD parallel programs
- Usual way of writing MPI programs
- The general message-passing model can be emulated

Page 5:

SPMD examples

main(int argc, char **argv)
{
    if (process is to become Master) {
        MasterRoutine(/* arguments */);
    } else {  /* it is a worker process */
        WorkerRoutine(/* arguments */);
    }
}
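In MPI the master/worker decision is typically made on the process rank, as the later slides show. A minimal concrete sketch of the same skeleton (the stub routines below stand in for the slide's placeholder MasterRoutine/WorkerRoutine and are not part of the original):

    #include <mpi.h>
    #include <stdio.h>

    static void MasterRoutine(void) { printf("master\n"); }   /* placeholder */
    static void WorkerRoutine(void) { printf("worker\n"); }   /* placeholder */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        if (rank == 0)
            MasterRoutine();                    /* rank 0 acts as the master */
        else
            WorkerRoutine();                    /* every other rank is a worker */
        MPI_Finalize();
        return 0;
    }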

Page 6:

Messages

Messages are packets of data moving between sub-programs.

The message-passing system has to be told the following information:
- sending processor
- source location
- data type
- data length
- receiving processor(s)
- destination location
- destination size

Page 7:

Messages

- Access: each sub-program needs to be connected to a message-passing system
- Addressing: messages need to have addresses to be sent to
- Reception: it is important that the receiving process is capable of dealing with the messages it is sent

A message-passing system is similar to: post office, phone line, fax, e-mail, etc.

Message types: point-to-point, collective, synchronous (telephone) / asynchronous (postal)

Page 8:

Point-to-Point Communication

- Simplest form of message passing
- One process sends a message to another
- Several variations on how sending a message can interact with execution of the sub-program

Page 9:

Point-to-Point variations

- Synchronous sends provide information about the completion of the message (e.g. fax machines)
- Asynchronous sends only know when the message has left (e.g. postcards)
- Blocking operations only return from the call when the operation has completed
- Non-blocking operations return straight away - you can test/wait later for completion

Page 10:

Collective Communications

- Collective communication routines are higher-level routines involving several processes at a time
- They can be built out of point-to-point communications
- Barriers synchronise processes
- Broadcast: one-to-many communication
- Reduction operations combine data from several processes to produce (usually) a single result

Page 11:

Message Passing Systems

- Initially each manufacturer developed their own
- Wide range of features, often incompatible
- Several groups developed systems for workstations
- PVM (Parallel Virtual Machine):
  - de facto standard before MPI
  - open source (NOT public domain!)
  - user interface to the system (daemons)
  - support for dynamic environments

Page 12:

MPI Forum - www.mpi-forum.org

- Sixty people from forty different organisations
- Both users and vendors, from the US and Europe
- Two-year process of proposals, meetings and review
- Produced a document defining a standard Message Passing Interface (MPI):
  - to provide source-code portability
  - to allow efficient implementation
  - it provides a high level of functionality
  - support for heterogeneous parallel architectures
  - parallel I/O (in MPI 2.0)
- MPI 1.0 contains over 115 routines/functions, which can be grouped into 8 categories.

Page 13:

General MPI Program Structure

MPI Include File

Initialise MPI Environment

Do work and perform message communication

Terminate MPI Environment

Page 14:

MPI programs

- MPI is a library - there are NO language changes
- Header files - C: #include <mpi.h>
- MPI function format - C:
    error = MPI_Xxxx(parameter, ...);
    MPI_Xxxx(parameter, ...);

Page 15:

MPI helloworld.c

#include <mpi.h>
#include <stdio.h>

main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, numtasks);

    MPI_Finalize();
}

Page 16:

Example - C

#include <mpi.h>

/* include other usual header files*/

main(int argc, char **argv)

{

/* initialize MPI */

MPI_Init(&argc, &argv);

/* main part of program */

/* terminate MPI */

MPI_Finalize();

exit(0);

}

Page 17:

Handles

MPI controls its own internal data structures

MPI releases ‘handles’ to allow programmers to refer to these

C handles are of distinct typedef'd types and arrays are indexed from 0

Some arguments can be of any type - in C these are declared as void *

Page 18:

Initializing MPI

The first MPI routine called in any MPI program must be MPI_Init.

The C version accepts the arguments to main

int MPI_Init(int *argc, char ***argv);

MPI_Init must be called by every MPI program

Making multiple MPI_Init calls is erroneous MPI_INITIALIZED is an exception to first

rule

Page 19:

MPI_COMM_WORLD

MPI_INIT defines a communicator called MPI_COMM_WORLD for every process that calls it.

All MPI communication calls require a communicator argument

MPI processes can only communicate if they share a communicator.

A communicator contains a group which is a list of processes

Each process has its rank within the communicator

A process can have several communicators

Page 20:

Communicators

- MPI uses objects called communicators, which define which collection of processes may communicate with each other.
- Every process has a unique integer identifier assigned by the system when the process initialises. This rank is sometimes called a process ID.
- Processes can request information from a communicator:
    MPI_Comm_rank(MPI_Comm comm, int *rank) - returns the rank of the calling process in comm
    MPI_Comm_size(MPI_Comm comm, int *size) - returns the size of the group in comm

Page 21:

Finishing up

- An MPI program should call MPI_Finalize when all communications have completed.
- Once it has been called, no other MPI calls can be made.
- Aborting: MPI_Abort(comm, errorcode)
- Attempts to abort all processes listed in comm; if comm = MPI_COMM_WORLD, the whole program terminates.

Page 22:

MPI Programs Compilation and Execution

Let us look into the MARC Alpha cluster

Page 23:

Manjra: GRIDS Lab Linux Cluster

Master node: manjra.cs.mu.oz.au
- Dual Xeon 2 GHz
- 512 MB memory
- 250 GB integrated storage
- Gigabit LAN
- CD-ROM & floppy drives
- Red Hat Linux release 7.3 (Valhalla)

Worker nodes (node1..node13) - each of the 13 worker nodes consists of the following:
- Pentium 4 2 GHz
- 512 MB memory
- 40 GB hard disk
- Gigabit LAN
- Red Hat Linux release 7.3 (Valhalla)

[Diagram: Manjra Linux cluster - master manjra.cs.mu.oz.au with internal worker nodes node1, node2, ..., node13]

Page 24:

How the legion cluster looks

[Photos: front view and back view]

Page 25:

A legion cluster view from an angle

Page 26:

Compile and Run Commands

Compile:
    manjra> mpicc helloworld.c -o helloworld

Run (-np gives the number of processes):
    manjra> mpirun -np 3 -machinefile machines.list helloworld

The file machines.list contains the node list:
    manjra.cs.mu.oz.au
    node1
    node2
    node3
    node4
    node6
(node5 and node7 are not working today!)

Page 27:

Sample Run and Output

A run with 3 processes:
    manjra> mpirun -np 3 -machinefile machines.list helloworld
    Hello World from process 0 of 3
    Hello World from process 1 of 3
    Hello World from process 2 of 3

A run by default:
    manjra> helloworld
    Hello World from process 0 of 1

Page 28:

Sample Run and Output

A run with 6 processes:
    manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
    Hello World from process 2 of 6

Note: Process execution need not be in process-number order.

Page 29:

Sample Run and Output

A run with 6 processes:
    manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 2 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6

Note: the process output order has changed. For each run the process mapping can be different, and the processes may run on machines with different load; hence the difference.

Page 30:

Hello World with Error Check
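The code for this slide did not survive the transcript. As a minimal sketch of what an error-checking hello world might look like, assuming only the MPI_SUCCESS return convention described earlier (the exact code on the original slide may differ):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int numtasks, rank, rc;

        rc = MPI_Init(&argc, &argv);
        if (rc != MPI_SUCCESS) {               /* every MPI call returns an error code in C */
            printf("Error starting MPI program. Terminating.\n");
            MPI_Abort(MPI_COMM_WORLD, rc);     /* abort all processes in the communicator */
        }

        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello World from process %d of %d\n", rank, numtasks);

        MPI_Finalize();
        return 0;
    }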

Page 31:

MPI Routines

Page 32:

MPI Routines – C and Fortran

- Environment management
- Point-to-point communication
- Collective communication
- Process group management
- Communicators
- Derived types
- Virtual topologies
- Miscellaneous routines

Page 33:

Environment Management Routines

Page 34:

Point-to-Point Communication

Page 35:

Collective Communication Routines

Page 36:

Process Group Management Routines

Page 37:

Communicators Routines

Page 38:

Derived Type Routines

Page 39:

Virtual Topologies Routines

Page 40:

Miscellaneous Routines

Page 41:

MPI Messages

- A message contains a number of elements of some particular datatype
- MPI datatypes: basic types and derived types
- Derived types can be built up from basic types
- C types are different from Fortran types

Page 42:

MPI Basic Datatypes - C

MPI datatype           C datatype
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE               (no C equivalent)
MPI_PACKED             (no C equivalent)

Page 43:

Point-to-Point Communication

- Communication between two processes
- Source process sends a message to destination process
- Communication takes place within a communicator
- Destination process is identified by its rank in the communicator
- MPI provides four communication modes for sending messages: standard, synchronous, buffered, and ready
- Only one mode for receiving

Page 44:

Standard Send

- Completes once the message has been sent
  - Note: it may or may not have been received
- Programs should obey the following rules:
  - do not assume the send will complete before the receive begins - can lead to deadlock
  - do not assume the send will complete after the receive begins - can lead to non-determinism
  - processes should be eager readers - they should guarantee to receive all messages sent to them - else network overload
- Can be implemented as either a buffered send or a synchronous send

Page 45:

Standard Send (cont.)

MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

buf       the address of the data to be sent
count     the number of elements of datatype that buf contains
datatype  the MPI datatype
dest      the rank of the destination in communicator comm
tag       a marker used to distinguish different message types
comm      the communicator shared by sender and receiver
ierror    the Fortran return value of the send

Page 46:

Standard Blocking Receive

- Note: all sends so far have been blocking (but this only makes a difference for synchronous sends)
- Completes when the message has been received:
    MPI_Recv(buf, count, datatype, source, tag, comm, status)
- source - rank of the source process in communicator comm
- status - returns information about the message

Synchronous blocking message passing:
- processes synchronise
- the sender process specifies the synchronous mode
- blocking - both processes wait until the transaction has completed
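For reference, the full C prototype of the MPI_Recv call named above (standard MPI-1 binding):

    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status);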

Page 47:

For a communication to succeed

- Sender must specify a valid destination rank
- Receiver must specify a valid source rank
- The communicator must be the same
- Tags must match
- Message types must match
- Receiver's buffer must be large enough
- Receiver can use wildcards (see the sketch below):
  - MPI_ANY_SOURCE
  - MPI_ANY_TAG
  - actual source and tag are returned in the status parameter
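A small sketch of a wildcard receive and how the envelope comes back in status. The payload values, tag choice and loop structure are illustrative assumptions, not taken from the slides:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, numtasks, i, buf;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        if (rank == 0) {
            /* accept messages from the other processes in whatever order they arrive */
            for (i = 1; i < numtasks; i++) {
                MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                /* the actual envelope is filled in by MPI */
                printf("got %d from rank %d with tag %d\n",
                       buf, status.MPI_SOURCE, status.MPI_TAG);
            }
        } else {
            buf = rank * 10;                           /* arbitrary payload */
            MPI_Send(&buf, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }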

Page 48:

Standard/Blocked Send/Receive

Page 49:

MPI Send/Receive a Character (cont...)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'X';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }

Page 50:

MPI Send/Receive a Character

    else if (rank == 1) {
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}

Page 51:

Synchronous Send

- Completes when the message has been received
- Effect is to synchronise the sender and receiver
- Deadlocks if there is no receiver
- Safer than standard send but may be slower

MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

- All parameters as for standard send
- Fortran equivalent as usual (plus ierror)

Page 52:

Buffered Send

- Guarantees to complete immediately
- Copies the message to a buffer if necessary
- To use buffered mode the user must explicitly attach buffer space:

MPI_Bsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_Buffer_attach(void *buf, int size)

- Only one buffer can be attached at any one time
- Buffers can be detached:

MPI_Buffer_detach(void *buf, int *size)
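A minimal sketch of the attach/send/detach sequence, assuming MPI has already been initialised by the caller. The payload size and destination are made up for illustration; MPI_BSEND_OVERHEAD is the extra space the standard requires per buffered message:

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send_example(int dest)
    {
        double data[1000] = {0};                      /* hypothetical payload */
        int size = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD;
        char *buffer = malloc(size);

        MPI_Buffer_attach(buffer, size);              /* give MPI the buffer space */
        MPI_Bsend(data, 1000, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buffer, &size);            /* waits until buffered messages are delivered */
        free(buffer);
    }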

Page 53:

Ready Send

- Completes immediately
- Guaranteed to succeed if a receive is already posted
- Outcome is undefined if no receive is posted
- May improve performance
- Requires careful attention to messaging patterns

MPI_Rsend(buf, count, datatype, dest, tag, comm)

[Diagram: process 0 posts a non-blocking receive with tag 0 and a blocking receive with tag 1; process 1 then issues a synchronous send with tag 1 and a ready send with tag 0; process 0 finally tests the non-blocking receive.]

Page 54:

Communication Envelope Information

- Envelope information is returned from MPI_Recv as status
- Information includes:
  - Source: status.MPI_SOURCE (C) or status(MPI_SOURCE) (Fortran)
  - Tag: status.MPI_TAG (C) or status(MPI_TAG) (Fortran)
  - Count: MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
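A short sketch of reading the envelope after a receive. This is a fragment: it assumes mpi.h and stdio.h are included, MPI is initialised, and a matching send exists; the buffer size is arbitrary:

    MPI_Status status;
    char buf[80];
    int count;

    MPI_Recv(buf, 80, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    MPI_Get_count(&status, MPI_CHAR, &count);   /* how many elements actually arrived */
    printf("received %d chars from rank %d (tag %d)\n",
           count, status.MPI_SOURCE, status.MPI_TAG);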

Page 55:

Point-to-Point Rules

Message order preservation
- Messages do not overtake each other
- This is true even for non-synchronous sends
- i.e. if process A posts two sends and process B posts matching receives, then they will complete in the order they were sent

Progress
- It is not possible for a matching send and receive pair to remain permanently outstanding.
- It is possible for a third process to match one of the pair.

Page 56:

Non Blocking Message Passing
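The body of this slide is missing from the transcript. As an illustrative sketch only (not the original slide content), MPI's non-blocking calls post a send or receive and return immediately; completion is established later with MPI_Wait/MPI_Test, so useful work can overlap the communication. The two-process exchange below assumes the program is run with exactly 2 processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, other, sendbuf, recvbuf;
        MPI_Request reqs[2];
        MPI_Status stats[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                     /* assumes exactly 2 processes */
        sendbuf = rank;

        /* post the communication: both calls return immediately */
        MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... useful computation could overlap the communication here ... */

        MPI_Waitall(2, reqs, stats);          /* block until both operations complete */
        printf("rank %d received %d\n", rank, recvbuf);

        MPI_Finalize();
        return 0;
    }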

Page 57:

Exercise: Ping Pong

1. Write a program in which two processes repeatedly pass a message back and forth.

2. Insert timing calls to measure the time taken for one message.

3. Investigate how the time taken varies with the size of the message.

Page 58:

A simple Ping Pong.c (cont..)

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char pingmsg[10];
    char pongmsg[10];
    char buff[100];
    MPI_Status Stat;

    strcpy(pingmsg, "ping");
    strcpy(pongmsg, "pong");

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { /* Send Ping, Receive Pong */
        dest = 1;
        source = 1;
        rc = MPI_Send(pingmsg, strlen(pingmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(buff, strlen(pongmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank0 Sent: %s & Received: %s\n", pingmsg, buff);
    }

Why + 1? To send the terminating '\0' character along with the string.

Page 59:

A simple Ping Pong.c

    else if (rank == 1) { /* Receive Ping, Send Pong */
        dest = 0;
        source = 0;
        rc = MPI_Recv(buff, strlen(pingmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %s & Sending: %s\n", buff, pongmsg);
        rc = MPI_Send(pongmsg, strlen(pongmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}

Page 60:

Timers

- C: double MPI_Wtime(void);
- Returns the elapsed wall-clock time in seconds (double precision) on the calling processor
- Time is measured in seconds
- The time to perform a task is measured by consulting the timer before and after
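For example, one round of the ping-pong exercise could be timed like this (a fragment; it assumes MPI is already initialised and stdio.h is included):

    double t1, t2;

    t1 = MPI_Wtime();          /* consult the timer before ... */
    /* ... the communication being measured, e.g. one ping-pong exchange ... */
    t2 = MPI_Wtime();          /* ... and after */

    printf("that took %f seconds\n", t2 - t1);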

Page 61:

mpich on legion cluster

- Compile with mpicc or mpif90 (you don't need -lmpi)
- Run with:
    qsub -q pque <jobscript>
  where jobscript is:
    #PBS -np=2
    mpirun <progname>

Page 62:

After the MPI standard was announced, a portable implementation, mpich, was produced by ANL (Argonne National Lab, Chicago, US). It consists of:

- libraries and include files - libmpi, mpi.h
- compilers - mpicc, mpif90. These know about things like where the relevant include and library files are:
    mpicc helloworld.c -o helloworld
- runtime loader - mpirun. It has the arguments -np <number of nodes> and -machinefile <file of node names>. It implements the SPMD paradigm by starting a copy of the program on each node; the program must therefore do any differentiation itself (using the MPI_Comm_size() and MPI_Comm_rank() functions).
    mpirun -np 3 -machinefile machines.list helloworld

NOTE: our version gets the number of CPUs and their addresses from PBS (i.e., don't use -np and/or -machinefile).

Page 63:

PBS

- PBS is a batch system - jobs get submitted to a queue
- The job is a shell script to execute your program
- The shell script can contain job-management instructions (note that these instructions can also be given on the command line)
- PBS will allocate your job to some other computer, log in as you, and execute your script; i.e. your script must contain cd's or absolute references to access files (or globus objects)
- Useful PBS commands (see the example session below):
  - qsub - submits a job
  - qstat - monitors status
  - qdel - deletes a job from a queue
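A hypothetical session using these commands; the script name and job id are made up, and the pque queue name is taken from the earlier legion slide:

    qsub -q pque pingpong.sh      # submit the job script; prints a job id such as 1234.legion
    qstat                         # check the state of your queued/running jobs
    qdel 1234                     # remove job 1234 from the queue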

Page 64:

PBS directives

Some PBS directives to insert at the start of your shell script:

    #PBS -q <queuename>
    #PBS -e <filename>            (stderr location)
    #PBS -o <filename>            (stdout location)
    #PBS -eo                      (combines stderr and stdout)
    #PBS -t <seconds>             (maximum time)
    #PBS -l <attribute>=<value>   (e.g. -l nodes=2)
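Putting the pieces together, a job script for the ping-pong program might look like the following sketch. The queue name comes from the earlier mpich slide; the program and file names, node count, and the PBS_O_WORKDIR step are assumptions for illustration:

    #!/bin/sh
    #PBS -q pque                 # queue to submit to
    #PBS -l nodes=2              # ask PBS for two nodes
    #PBS -o pingpong.out         # stdout location
    #PBS -e pingpong.err         # stderr location

    cd $PBS_O_WORKDIR            # change to the directory the job was submitted from
    mpirun pingpong              # our mpich gets the CPU list from PBS itself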

Page 65:

MPI Programs Compilation and Execution

Let us look into the MARC Alpha cluster

Page 66:

Melbourne Advanced Research Computing (MARC) Alpha Cluster

Exclusive nodes (cnet1..cnet16) (parallel jobs only):
- Compaq Personal Workstation 600au
- 600 MHz 21164AXP CPU - 96 KByte internal cache
- 2 MByte external cache
- 192 MByte memory
- 4.3 GByte Ultrawide SCSI disc
- 100 Mbps Ethernet

[Diagram: legion Alpha cluster - legion.hpc.unimelb.edu.au with nodes cnet1.hpc.unimelb.edu.au, cnet2.hpc.unimelb.edu.au, ..., cnet16.hpc.unimelb.edu.au]

Page 67:

How the legion cluster looks

[Photos: front view and back view]

Page 68:

A legion cluster view from an angle

Page 69:

Compile and Run Commands

Compile:
    legion> mpicc helloworld.c -o helloworld

Run (-np gives the number of processes):
    legion> mpirun -np 3 -machinefile machines.list helloworld

The file machines.list contains the node list:
    legion.hpc.unimelb.edu.au
    cnet1.hpc.unimelb.edu.au
    cnet2.hpc.unimelb.edu.au
    cnet3.hpc.unimelb.edu.au
    cnet4.hpc.unimelb.edu.au
    cnet5.hpc.unimelb.edu.au

Page 70:

Sample Run and Output

A run with 3 processes:
    legion> mpirun -np 3 -machinefile machines.list helloworld
    Hello World from process 0 of 3
    Hello World from process 1 of 3
    Hello World from process 2 of 3

A run by default:
    legion> helloworld
    Hello World from process 0 of 1

Page 71:

Sample Run and Output

A run with 6 processes:
    legion> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
    Hello World from process 2 of 6

Note: Process execution need not be in process-number order.

Page 72:

Sample Run and Output

A run with 6 processes:
    legion> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 2 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6

Note: the process output order has changed. For each run the process mapping can be different, and the processes may run on machines with different load; hence the difference.