Page 1: Parallel Programming on the SGI Origin2000

Parallel Programming on the SGI Origin2000

With thanks to Igor Zacharov / Benoit Marchand, SGI

Taub Computer Center, Technion

Moshe Goldberg, [email protected]

Mar 2004 (v1.2)

Page 2: Parallel Programming on the SGI Origin2000

Parallel Programming on the SGI Origin2000

1) Parallelization Concepts

2) SGI Computer Design

3) Efficient Scalar Design

4) Parallel Programming - OpenMP

5) Parallel Programming - MPI

Page 3: Parallel Programming on the SGI Origin2000

2) Parallel Programming - MPI

Page 4: Parallel Programming on the SGI Origin2000

Parallel classification

• Parallel architectures: Shared Memory / Distributed Memory

• Programming paradigms: Data parallel / Message passing

Page 5: Parallel Programming on the SGI Origin2000

Shared Memory

• Each processor can access any part of the memory

• Access times are uniform (in principle)

• Easier to program (no explicit message passing)

• Bottleneck when several tasks access same location

Page 6: Parallel Programming on the SGI Origin2000

Distributed Memory

• Processor can only access local memory

• Access times depend on location

• Processors must communicate via explicit message passing

Page 7: Parallel Programming on the SGI Origin2000

Distributed Memory

Interconnection network

Page 8: Parallel Programming on the SGI Origin2000

Message Passing Programming

• Separate program on each processor

• Local Memory

• Control over distribution and transfer of data

• Additional complexity of debugging due to communications

Page 9: Parallel Programming on the SGI Origin2000

Performance issues

• Concurrency – ability to perform actions simultaneously

• Scalability – performance is not impaired by increasing number of processors

• Locality – high ratio of local memory accesses to remote memory accesses (or low communication)

Page 10: Parallel Programming on the SGI Origin2000

SP2 Benchmark

• Goal: checking the performance of real-world applications on the SP2

• Execution time (seconds): CPU time for the applications

• Speedup = (Execution time for 1 processor) / (Execution time for p processors)

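As a worked illustration of the speedup formula (the numbers are made up, not taken from the benchmark): if a run takes 100 seconds on 1 processor and 30 seconds on 8 processors, the speedup is 100 / 30 ≈ 3.3, well below the ideal speedup of 8.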
Page 11: Parallel Programming on the SGI Origin2000
Page 12: Parallel Programming on the SGI Origin2000

WHAT is MPI?

• A message-passing library specification

• An extended message-passing model

• Not specific to an implementation or computer

Page 13: Parallel Programming on the SGI Origin2000

BASICS of MPI PROGRAMMING

• MPI is a message-passing library

• Assumes a distributed-memory architecture

• Includes routines for performing communication (exchange of data and synchronization) among the processors

Page 14: Parallel Programming on the SGI Origin2000

Message Passing

• Data transfer + synchronization

• Synchronization: the act of bringing one or more processes to known points in their execution

• Distributed memory: memory split up into segments, each of which may be accessed by only one process

Page 15: Parallel Programming on the SGI Origin2000
Page 16: Parallel Programming on the SGI Origin2000

MPI STANDARD

• Standard by consensus, designed in an open forum

• Introduced by the MPI FORUM in May 1994, updated in June 1995.

• MPI-2 (1998) provides extensions to the MPI standard

Page 17: Parallel Programming on the SGI Origin2000

Why use MPI ?

• Standardization

• Portability

• Performance

• Richness

• Designed to enable libraries

Page 18: Parallel Programming on the SGI Origin2000

Writing an MPI Program

• If there is a serial version, make sure it is debugged

• If not, try to write a serial version first

• When debugging in parallel, start with a few nodes first

Page 19: Parallel Programming on the SGI Origin2000

Format of MPI routines

C:        MPI_Xxx(parameters)
          #include "mpi.h"

Fortran:  call MPI_XXX(parameters, ierror)
          include 'mpif.h'

Page 20: Parallel Programming on the SGI Origin2000

Six useful MPI functions

MPI_INIT – initializes the MPI environment

MPI_COMM_SIZE – returns the number of processes

MPI_COMM_RANK – returns this process's number (rank)

Page 21: Parallel Programming on the SGI Origin2000

Communication routines

MPI_SEND – sends a message

MPI_RECV – receives a message

Page 22: Parallel Programming on the SGI Origin2000

End MPI part of program

MPI_FINALIZE – exits MPI in an orderly way

Page 23: Parallel Programming on the SGI Origin2000

      program hello
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      character*12 message
      integer rank, size, tag, i, ierror
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i,
     &                    tag, MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end

Page 24: Parallel Programming on the SGI Origin2000

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    strcpy(message, "Hello,world");
    if (rank == 0) {
        for (i = 1; i < size; i++) {
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }
    printf("node %d : %s\n", rank, message);
    MPI_Finalize();
    return 0;
}

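A typical way to build and launch the two "hello" programs above (the exact compiler-wrapper and launcher names vary between MPI installations, so treat these command lines as an illustration only):

    mpicc  hello.c -o hello      # C version
    mpif77 hello.f -o hello      # Fortran version
    mpirun -np 4 ./hello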
Page 25: Parallel Programming on the SGI Origin2000

MPI Messages

• DATA – the data to be sent

• ENVELOPE – information to route the data

Page 26: Parallel Programming on the SGI Origin2000

Description of MPI_Send (MPI_Recv)

Startbuf – the address where the data start

Count – the number of elements in the message

Datatype – the type of the elements

Destination/Source – rank in the communicator (0 .. size-1)

Page 27: Parallel Programming on the SGI Origin2000

Description of MPI_Send (MPI_Recv)

Tag – an arbitrary number to help distinguish between messages

Communicator – the communications universe

Status – for receive only: contains 3 fields (sender, tag and error code)

Page 28: Parallel Programming on the SGI Origin2000

Some useful remarks

• Source = MPI_ANY_SOURCE means that any source is acceptable

• Tags specified by sender and receiver must match, or use MPI_ANY_TAG: any tag is acceptable

• The communicator must be the same for send and receive. Usually: MPI_COMM_WORLD

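A small hedged sketch of these rules (not from the original slides): the receiver below accepts a message from any sender and any tag, then reads the actual source and tag back out of the status object.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* every non-root rank sends its own rank number, using tag = rank */
        value = rank;
        MPI_Send(&value, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    } else {
        for (i = 1; i < size; i++) {
            /* accept the messages in whatever order they arrive */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}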
Page 29: Parallel Programming on the SGI Origin2000

POINT-TO-POINT COMMUNICATION

• Transmission of a message between one pair of processes

• The programmer can choose the mode of transmission

Page 30: Parallel Programming on the SGI Origin2000

MODE of TRANSMISSION

• Can be chosen by the programmer

• ...or let the system decide

• Synchronous mode

• Ready mode

• Buffered mode

• Standard mode

Page 31: Parallel Programming on the SGI Origin2000

BLOCKING /NON-BLOCKING COMMUNICATIONS

Blocking – the send or receive suspends execution until the message buffer is safe to use.

Non-blocking – separates computation from communication. The send is initiated but not completed; a separate call is used to verify that the communication has completed.

Page 32: Parallel Programming on the SGI Origin2000

BLOCKING STANDARD SEND (message size > threshold)

[Timeline diagram, sender S and receiver R: S calls MPI_SEND and waits; the transfer begins once R has posted MPI_RECV; S continues when the data transfer from the source is complete; R continues when the data transfer into its buffer is complete.]

Page 33: Parallel Programming on the SGI Origin2000

NON-BLOCKING STANDARD SEND (message size > threshold)

[Timeline diagram, sender S and receiver R: S calls MPI_ISEND and keeps working; the transfer begins once R has posted MPI_IRECV; each side later calls MPI_WAIT, and there is no interruption if the wait is issued late enough.]

Page 34: Parallel Programming on the SGI Origin2000

BLOCKING STANDARD SEND (message size <= threshold)

[Timeline diagram, sender S and receiver R: the message is copied into a buffer on the receiver, so S continues as soon as the data transfer from the source is complete; R continues when the data transfer into the user's buffer is complete.]

Page 35: Parallel Programming on the SGI Origin2000

NON-BLOCKING STANDARD SEND (message size <= threshold)

[Timeline diagram, sender S and receiver R: MPI_ISEND returns with no delay even though the message is not yet in a buffer on R; the intermediate buffer copy can be avoided if MPI_IRECV is posted early enough; there is no delay at MPI_WAIT if the wait is issued late enough.]

Page 36: Parallel Programming on the SGI Origin2000

BLOCKING COMMUNICATION

      print *, "Task ", irank, " has sent the message"
      call MPI_Send(rmessage1, MSGLEN, MPI_REAL,
     &              idest, isend_tag, MPI_COMM_WORLD, ierr)
      call MPI_Recv(rmessage2, MSGLEN, MPI_REAL,
     &              isrc, irecv_tag, MPI_COMM_WORLD, status, ierr)

Page 37: Parallel Programming on the SGI Origin2000

NON-BLOCKING

      call MPI_ISend(rmessage1, MSGLEN, MPI_REAL,
     &               idest, isend_tag, MPI_COMM_WORLD,
     &               request_send, ierr)
      call MPI_IRecv(rmessage2, MSGLEN, MPI_REAL,
     &               isrc, irecv_tag, MPI_COMM_WORLD,
     &               request_rec, ierr)
      call MPI_WAIT(request_rec, istatus, ierr)

Page 38: Parallel Programming on the SGI Origin2000

      program deadlock
      implicit none
      include 'mpif.h'
      integer MSGLEN, ITAG_A, ITAG_B
      parameter ( MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200 )
      real rmessage1(MSGLEN),            ! message buffers
     .     rmessage2(MSGLEN)
      integer irank,                     ! rank of task in communicator
     .        idest, isrc,               ! ranks of destination and source tasks
     .        isend_tag, irecv_tag,      ! message tags
     .        istatus(MPI_STATUS_SIZE),  ! status of communication
     .        ierr,                      ! return status
     .        i

      call MPI_Init ( ierr )
      call MPI_Comm_Rank ( MPI_COMM_WORLD, irank, ierr )
      print *, " Task ", irank, " initialized"
C     initialize message buffers
      do i = 1, MSGLEN
         rmessage1(i) = 100
         rmessage2(i) = -100
      end do

Page 39: Parallel Programming on the SGI Origin2000

Deadlock program (cont.)

      if ( irank .EQ. 0 ) then
         idest = 1
         isrc = 1
         isend_tag = ITAG_A
         irecv_tag = ITAG_B
      else if ( irank .EQ. 1 ) then
         idest = 0
         isrc = 0
         isend_tag = ITAG_B
         irecv_tag = ITAG_A
      end if
C     ----------------------------------------------------------------
C     send and receive messages
C     ----------------------------------------------------------------
      print *, " Task ", irank, " has sent the message"
      call MPI_Send ( rmessage1, MSGLEN, MPI_REAL, idest, isend_tag,
     .                MPI_COMM_WORLD, ierr )
      call MPI_Recv ( rmessage2, MSGLEN, MPI_REAL, isrc, irecv_tag,
     .                MPI_COMM_WORLD, istatus, ierr )
      print *, " Task ", irank, " has received the message"

      call MPI_Finalize ( ierr )
      end

Page 40: Parallel Programming on the SGI Origin2000

DEADLOCK example

[Diagram: tasks A and B each call MPI_SEND first and MPI_RECV second, so both block in MPI_SEND.]

Page 41: Parallel Programming on the SGI Origin2000

Deadlock example

• SP2 implementation: no receive has been posted yet, so both processes block

• Solutions:

Different ordering of the calls

Non-blocking calls

MPI_Sendrecv (see the sketch below)

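A minimal sketch of the MPI_Sendrecv solution (assuming exactly two tasks exchanging fixed-size buffers; this code is not from the original slides): the combined call lets the library schedule the exchange, so neither task can block the other the way the back-to-back MPI_Send calls do.

#include <mpi.h>
#include <stdio.h>
#define MSGLEN 2048

int main(int argc, char *argv[])
{
    float sendbuf[MSGLEN], recvbuf[MSGLEN];
    int rank, other, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                      /* partner rank: 0 <-> 1 */
    for (i = 0; i < MSGLEN; i++) sendbuf[i] = (float) rank;

    /* combined send+receive: no ordering problem, no deadlock */
    MPI_Sendrecv(sendbuf, MSGLEN, MPI_FLOAT, other, 100,
                 recvbuf, MSGLEN, MPI_FLOAT, other, 100,
                 MPI_COMM_WORLD, &status);

    printf("Task %d received data from task %d\n", rank, other);
    MPI_Finalize();
    return 0;
}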
Page 42: Parallel Programming on the SGI Origin2000

Determining Information about Messages

• Wait

• Test

• Probe

Page 43: Parallel Programming on the SGI Origin2000

MPI_WAIT

• Useful for both sender and receiver of non-blocking communications

• Receiving process blocks until message is received, under programmer control

• Sending process blocks until send operation completes, at which time the message buffer is available for re-use

Page 44: Parallel Programming on the SGI Origin2000

MPI_WAIT

[Timeline diagram for MPI_WAIT: the sender S computes while the message is transmitted to the receiver R; the MPI_WAIT call blocks until the communication has completed.]

Page 45: Parallel Programming on the SGI Origin2000

MPI_TEST

[Timeline diagram for MPI_TEST: S issues MPI_Isend and continues computing while the message is transmitted to R; MPI_TEST checks, without blocking, whether the communication has completed.]

Page 46: Parallel Programming on the SGI Origin2000

MPI_TEST

• Used by both sender and receiver of non-blocking communication

• Non-blocking call

• The receiver checks whether a specific sender has sent a message that is waiting to be delivered; messages from all other senders are ignored

Page 47: Parallel Programming on the SGI Origin2000

MPI_TEST (cont.)

The sender can find out whether the message buffer can be reused; it has to wait until the operation is complete before doing so.

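A hedged sketch of the pattern described above (not from the slides, and do_some_work is a hypothetical stand-in for useful computation): start a non-blocking send, keep computing, and poll with MPI_Test until the buffer may be reused.

#include <mpi.h>
#include <stdio.h>

static void do_some_work(void) { /* stand-in for real computation */ }

int main(int argc, char *argv[])
{
    int rank, size, flag = 0, i;
    double buf[1024];
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 1024; i++) buf[i] = rank;

    if (rank == 0 && size > 1) {
        /* start the send, then overlap computation with communication */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            do_some_work();                     /* compute while the message is in flight */
            MPI_Test(&request, &flag, &status);
        }
        /* flag != 0: buf may now safely be reused */
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}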
Page 48: Parallel Programming on the SGI Origin2000

MPI_PROBE

• Receiver is notified when messages from potentially any sender arrive and are ready to be processed.

• Blocking call

Page 49: Parallel Programming on the SGI Origin2000

Programming recommendations

• Blocking calls are needed when:

Tasks must synchronize

MPI_Wait immediately follows the communication call

Page 50: Parallel Programming on the SGI Origin2000

Collective Communication

• Establish a communication pattern within a group of nodes.

• All processes in the group call the communication routine, with matching arguments.

• Collective routine calls can return when their participation in the collective communication is complete.

Page 51: Parallel Programming on the SGI Origin2000

Properties of collective calls

• On completion: the caller is free to access locations in the communication buffer

• Completion does NOT indicate that other processes in the group have completed

• Only MPI_BARRIER will synchronize all processes

Page 52: Parallel Programming on the SGI Origin2000

Properties

• MPI guarantees that a message generated by collective communication calls will not be confused with a message generated by point-to-point communication

• Communicator is the group identifier.

Page 53: Parallel Programming on the SGI Origin2000

Barrier

• Synchronization primitive. A node calling it will block until all the nodes within the group have called it.

• Syntax

MPI_Barrier(Comm, Ierr)

Page 54: Parallel Programming on the SGI Origin2000

Broadcast

• Send data from one node to all other nodes in the communicator.

• MPI_Bcast(buffer, count, datatype, root, comm, ierr)

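A minimal broadcast sketch (illustrative values, not from the slides): the root fills the buffer, and every rank, root included, passes the same arguments to MPI_Bcast.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i;
    int data[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                       /* only the root has the data initially */
        for (i = 0; i < 4; i++) data[i] = i + 1;

    /* after the call, every rank holds 1 2 3 4 */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}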
Page 55: Parallel Programming on the SGI Origin2000

Broadcast DATA

[Diagram: before the broadcast only P0 holds A0; after the broadcast P0, P1, P2 and P3 each hold A0.]

Page 56: Parallel Programming on the SGI Origin2000

Gather and Scatter DATA

[Diagram: scatter distributes A0, A1, A2, A3 from P0 to P0, P1, P2, P3; gather collects them back onto P0.]

Page 57: Parallel Programming on the SGI Origin2000

Allgather effect

[Diagram: before allgather, P0, P1, P2, P3 hold A0, B0, C0, D0 respectively; after allgather every process holds A0, B0, C0, D0.]

Page 58: Parallel Programming on the SGI Origin2000

Syntax for Scatter & Gather

MPI_Gather(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, root, comm, ierr)

MPI_Scatter(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, root, comm, ierr)

Page 59: Parallel Programming on the SGI Origin2000

Scatter and Gather

• Gather: Collect data from every member of the group (including the root) on the root node in linear order by the rank of the node.

• Scatter: Distribute data from the root to every member of the group in linear order by node.

Page 60: Parallel Programming on the SGI Origin2000

ALLGATHER

• All processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process

• Syntax :

MPI_Allgather(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, comm, ierr)

Page 61: Parallel Programming on the SGI Origin2000

Gather example

      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DATA root/0/

      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                root, MPI_COMM_WORLD, ierr)

Page 62: Parallel Programming on the SGI Origin2000

AllGather example

      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)

      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                   MPI_COMM_WORLD, ierr)

Page 63: Parallel Programming on the SGI Origin2000

Parallel matrix-vector multiplication

[Diagram: A * b = c. The 100 rows of A are split into four blocks of 25 rows, one per process (P1-P4); each process computes its own 25 elements of c.]

Page 64: Parallel Programming on the SGI Origin2000

Global Computations

• Reduction

• Scan

Page 65: Parallel Programming on the SGI Origin2000

Reduction

• The partial result in each process in the group is combined in one specified process

Page 66: Parallel Programming on the SGI Origin2000

Reduction

Dj – the jth item of data at the root process

* – the reduction operation (sum, max, min, ...)

Dj = D(0,j) * D(1,j) * ... * D(n-1,j)

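A minimal MPI_Reduce sketch (illustrative, not from the slides): every rank contributes one integer and the sum ends up on rank 0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    value = rank + 1;                      /* each rank contributes rank+1 */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                         /* sum = 1 + 2 + ... + size */
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}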
Page 67: Parallel Programming on the SGI Origin2000

Scan operation

• The scan (prefix-reduction) operation performs partial reductions on distributed data

• D(k,j) = D(0,j) * D(1,j) * ... * D(k,j),   k = 0, 1, ..., n-1

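A corresponding MPI_Scan sketch (illustrative, not from the slides): rank k receives the partial sum over ranks 0 .. k.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value, partial;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank + 1;                      /* contribution of this rank */
    /* partial = (0+1) + (1+1) + ... + (rank+1), i.e. the prefix sum */
    MPI_Scan(&value, &partial, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, partial);
    MPI_Finalize();
    return 0;
}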
Page 68: Parallel Programming on the SGI Origin2000

Varying size gather and scatter

• Both the size and the memory location of the messages vary

• More flexibility in writing code

• Less need to copy data into temporary buffers

• More compact final code

• The vendor implementation may be optimal

Page 69: Parallel Programming on the SGI Origin2000

Scatterv syntax

MPI_Scatterv(sendbuf, scounts, displs, stype, recvbuf, rcount, rtype, root, comm, ierr)

SCOUNTS(I) – number of items to send from process root to process I

DISPLS(I) – displacement from sendbuf to the beginning of the Ith message

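A hedged MPI_Scatterv sketch (illustrative values, not from the slides): the root sends i+1 elements to rank i, so both the counts and the displacements differ per rank.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i, rcount, total = 0;
    int *sendbuf = NULL, *scounts = NULL, *displs = NULL;
    int recvbuf[64];                         /* assumes at most 64 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* rank i gets i+1 elements, placed one after another in sendbuf */
        scounts = malloc(size * sizeof(int));
        displs  = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            scounts[i] = i + 1;
            displs[i]  = total;
            total += scounts[i];
        }
        sendbuf = malloc(total * sizeof(int));
        for (i = 0; i < total; i++) sendbuf[i] = i;
    }

    rcount = rank + 1;                       /* must match scounts[rank] on the root */
    MPI_Scatterv(sendbuf, scounts, displs, MPI_INT,
                 recvbuf, rcount, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d elements\n", rank, rcount);

    if (rank == 0) { free(sendbuf); free(scounts); free(displs); }
    MPI_Finalize();
    return 0;
}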
Page 70: Parallel Programming on the SGI Origin2000

SCATTER

[Diagram: SCATTER sends equal-sized pieces of P0's buffer to P0, P1, P2 and P3.]

Page 71: Parallel Programming on the SGI Origin2000

SCATTERV

[Diagram: SCATTERV sends pieces of different sizes, at different displacements in P0's buffer, to P0, P1, P2 and P3.]

Page 72: Parallel Programming on the SGI Origin2000

Advanced Datatypes

• Predefined basic datatypes -- contiguous data of the same type.

• We sometimes need:

non-contiguous data of a single type

contiguous data of mixed types

Page 73: Parallel Programming on the SGI Origin2000

Solutions

• multiple MPI calls to send and receive each data element

• copy the data to a buffer before sending it (MPI_PACK)

• use MPI_BYTE to get around the datatype-matching rules

Page 74: Parallel Programming on the SGI Origin2000

Drawback

• Slow, clumsy and wasteful of memory

• Using MPI_BYTE or MPI_PACKED can hamper portability

Page 75: Parallel Programming on the SGI Origin2000

General Datatypes and Typemaps

• a sequence of basic datatypes

• a sequence of integer (byte) displacements

Page 76: Parallel Programming on the SGI Origin2000

Typemaps

typemap = [(type_0, disp_0), (type_1, disp_1), ..., (type_n, disp_n)]

Displacements are relative to the start of the buffer.

Example:

Typemap(MPI_INT) = [(int, 0)]

Page 77: Parallel Programming on the SGI Origin2000

Extent of a Derived Datatype

Lb = min(disp_0, disp_1, ..., disp_n)

Ub = max(disp_0 + sizeof(type_0), ..., disp_n + sizeof(type_n))

Extent = Ub - Lb + pad

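As a worked example (a hypothetical typemap, assuming a 4-byte int and an 8-byte double): for typemap = [(int, 0), (double, 8)], Lb = min(0, 8) = 0 and Ub = max(0 + 4, 8 + 8) = 16, so the extent is 16 - 0 = 16 bytes, plus any padding required for alignment.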
Page 78: Parallel Programming on the SGI Origin2000

MPI_TYPE_EXTENT

• MPI_TYPE_EXTENT(datatype,extent,ierr)

Returns the distance (in bytes) from the start of the datatype to the start of the next datatype.

Page 79: Parallel Programming on the SGI Origin2000

How and When Do I Use Derived Datatypes?

• MPI derived datatypes are created at run-time through calls to MPI library routines.


Page 80: Parallel Programming on the SGI Origin2000

How to use

• Construct the datatype

• Allocate (commit) the datatype

• Use the datatype

• Deallocate (free) the datatype

Page 81: Parallel Programming on the SGI Origin2000

EXAMPLE

      integer oldtype, newtype, count, blocklength, stride
      integer ierr, n
      real buffer(n,n)

*** construct and commit the datatype ***
      call MPI_TYPE_VECTOR(count, blocklength, stride, oldtype,
     &                     newtype, ierr)
      call MPI_TYPE_COMMIT(newtype, ierr)

*** use it in a communication operation ***
      call MPI_SEND(buffer, 1, newtype, dest, tag, comm, ierr)

*** deallocate it ***
      call MPI_TYPE_FREE(newtype, ierr)

Page 82: Parallel Programming on the SGI Origin2000

Example on MPI_TYPE_VECTOR

[Diagram: newtype is built from oldtype with COUNT = 2, BLOCKLENGTH = 3 and STRIDE = 5: two blocks of 3 elements, with the starts of the blocks 5 elements apart.]

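A hedged C sketch of the same idea (not from the slides): in C's row-major layout, one column of an N x N matrix of doubles is N blocks of length 1 with a stride of N, so it can be described with MPI_Type_vector and sent as a single unit.

#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char *argv[])
{
    int rank, i, j;
    double a[N][N], col[N];
    MPI_Datatype coltype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one column = N blocks of 1 double, consecutive blocks N doubles apart */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 0) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 10 * i + j;
        /* send column 2 of the matrix as one message */
        MPI_Send(&a[0][2], 1, coltype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive it into a contiguous array of N doubles */
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < N; i++) printf("col[%d] = %g\n", i, col[i]);
    }

    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}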
Page 83: Parallel Programming on the SGI Origin2000

Summary

• Derived datatypes are datatypes that are built from the basic MPI datatypes

• Derived datatypes provide a portable and elegant way of communicating non-contiguous or mixed types in a message.

• Efficiency may depend on the implementation (see how it compares to MPI_BYTE)

Page 84: Parallel Programming on the SGI Origin2000

Several datatypes

MPI_TYPE_CONTIGUOUS – replicates the existing datatype into contiguous locations

MPI_TYPE_VECTOR – the same, but allows gaps in the displacements

MPI_TYPE_HVECTOR – same as the former, but the displacement is given in bytes

MPI_TYPE_INDEXED – replicates the datatype into a sequence of blocks

Page 85: Parallel Programming on the SGI Origin2000

Several datatypes

MPI_TYPE_HINDEXED – replicates the datatype into a sequence of different blocks (displacements in bytes)

MPI_TYPE_STRUCT – a mix of different datatypes

Page 86: Parallel Programming on the SGI Origin2000

GROUP

c     this is a program for testing MPI_Group
c
      program GROUP
      implicit none
      include 'mpif.h'
      INTEGER WCOMM, WGROUP, GROUP1, SUBCOMM, RANK, SIZE, IERR, I
      INTEGER SBUF(100), RBUF(100), count, count2, sbuf2(100),
     *        rbuf2(100)
      integer ranks(100)

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, SIZE, IERR)
c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
c     print*, 'rank =', rank, 'size =', size
      RANKS(1) = 0
      WCOMM = MPI_COMM_WORLD
c     WGROUP = MPI_COMM_GROUP
      CALL MPI_COMM_GROUP(WCOMM, WGROUP, IERR)

Page 87: Parallel Programming on the SGI Origin2000

Group (cont.)

c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
      CALL MPI_GROUP_EXCL(WGROUP, 1, RANKS, GROUP1, IERR)
      CALL MPI_COMM_CREATE(WCOMM, GROUP1, SUBCOMM, IERR)
c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
c     print*, 'group1 =', rank, group1
c     print*, 'subcomm =', rank, subcomm
c     print*, 'after creation of group1 & subcomm'
      IF (RANK .NE. 0) THEN
         COUNT = size
         do i = 1, COUNT
            SBUF(i) = rank
         enddo
         CALL MPI_REDUCE(SBUF, RBUF, COUNT, MPI_INTEGER,
     *                   MPI_SUM, 0, SUBCOMM, IERR)
c        print*, 'sum of group1 at rank', rank, (rbuf(i), i=1, count)
      ENDIF

Page 88: Parallel Programming on the SGI Origin2000

Group (cont.)

      if (rank .eq. 1) then
         print*, 'sum of group1', (rbuf(i), i=1, count)
c        print*, 'sum of group1', (sbuf(i), i=1, count)
      endif

      count2 = size
      do i = 1, count2
         sbuf2(i) = rank * rank
      enddo
      CALL MPI_REDUCE(SBUF2, RBUF2, COUNT2, MPI_INTEGER,
     *                MPI_SUM, 0, WCOMM, IERR)
      if (rank .eq. 0) then
         print*, 'sum of wgroup', (rbuf2(i), i=1, count2)
      else
         CALL MPI_COMM_FREE(SUBCOMM, IERR)
      endif

      CALL MPI_GROUP_FREE(GROUP1, IERR)
      CALL MPI_FINALIZE(IERR)
      stop
      end

Page 89: Parallel Programming on the SGI Origin2000

PERFORMANCE ISSUES

• Hidden communication takes place

• Performance depends on implementation of MPI

• Because of forced synchronization, it is not always best to use collective communication

Page 90: Parallel Programming on the SGI Origin2000

Example: simple broadcast

[Diagram: process 1 sends the message of size B to processes 2, 3, ..., 8, one at a time. Data sent: B*(P-1); steps: P-1.]

Page 91: Parallel Programming on the SGI Origin2000

Example : simple scatter

1

2

3

8

B

BB

Data:B*(P-1)Steps : P-1

Page 92: Parallel Programming on the SGI Origin2000

Example : better scatter

1

1 24*B

Data:B*p*logPSteps : log P

1 3 2 4

1 5 3 6 2 7 4 8

2*B 2*B

B BBB

Page 93: Parallel Programming on the SGI Origin2000

Timing for sending a message

The time is composed of the startup time (the time to send a zero-length message) and the transfer time (the time to transfer one byte of data):

Tcomm = Tstartup + B * Ttransfer

It may therefore be worthwhile to group several sends together.

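As a worked illustration (the numbers are made up, not measured on the Origin2000): with Tstartup = 50 microseconds and Ttransfer = 0.01 microseconds per byte, one 1000-byte message costs 50 + 1000*0.01 = 60 microseconds, while ten separate 100-byte messages cost 10*(50 + 100*0.01) = 510 microseconds, which is why grouping sends pays off.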
Page 94: Parallel Programming on the SGI Origin2000

Performance evaluation

Fortran:

      real*8 t1
      t1 = MPI_Wtime()    ! returns elapsed wall-clock time in seconds

C:

      double t1;
      t1 = MPI_Wtime();

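A small hedged sketch of how MPI_Wtime is typically used to time a code region (the loop is a stand-in for real work):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i;
    double t1, t2, s = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t1 = MPI_Wtime();                 /* start of the timed region */
    for (i = 0; i < 1000000; i++)     /* stand-in for real work */
        s += (double) i;
    t2 = MPI_Wtime();                 /* end of the timed region */

    printf("rank %d: elapsed = %f seconds (s = %g)\n", rank, t2 - t1, s);
    MPI_Finalize();
    return 0;
}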
Page 95: Parallel Programming on the SGI Origin2000

MPI References

• The MPI Standard :

www-unix.mcs.anl.gov/mpi/index.html

• Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997

• Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999

Page 96: Parallel Programming on the SGI Origin2000

Example: better broadcast

[Diagram: tree-based broadcast. Process 1 sends the message of size B to process 2; then processes 1 and 2 send to processes 3 and 4; then processes 1-4 send to processes 5-8. Data sent: B*(P-1); steps: log P.]