Page 1:

Other Means of Executing Parallel Programs

OpenMP and

Paraguin

(c) 2011 Clayton S. Ferner

Page 2:

OpenMP

• Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer

Page 3:

The Paraguin Compiler

• The Paraguin Compiler is a compiler written by me (no group, no funding – just me by myself) at UNCW

• The intent is to create an abstraction similar to OpenMP, but for use on a distributed-memory system

Page 5:

OpenMP

• MPI is a message-passing interface that provides a means to implement parallel algorithms on distributed-memory systems (such as clusters)

• The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.*

*The OpenMP® API specification for parallel programming (http://openmp.org/wp/about-openmp/)

Page 6:

OpenMP (cont)

• Parallelization is directed by the programmer through the use of pragmas

• Pragmas are used to pass information to the compiler, but are ignored (like comments) if the compiler does not recognize them

• Pragmas can be inserted for a particular compiler without “breaking” the code for other compilers.

Page 7:

OpenMP Pragmas

• #pragma omp parallel
  structured-block

• The block will be executed in parallel by all threads
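For illustration, here is a minimal complete program (my sketch, not from the original slides; compile with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main()
{
    /* The structured block is executed by every thread in the team */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}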

Page 8:

OpenMP Pragmas

• #pragma omp for
  for-loop

• The loop will be executed in parallel by all threads

• The iterations are divided into “chunks” which the threads execute (although the programmer can control this).

• There is a barrier at the end of the for loop (i.e. threads will synchronize at the end)
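As a sketch of that control (mine, not from the slides), schedule(static, 4) hands each thread chunks of four iterations, and the implicit barrier falls at the end of the loop:

#pragma omp parallel
{
    /* each thread receives chunks of 4 iterations of the loop */
    #pragma omp for schedule(static, 4)
    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
    /* implicit barrier: no thread proceeds past this point
       until every thread has finished its iterations */
}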

Page 9:

OpenMP Pragmas

• #pragma omp parallel for
  for-loop

• Equivalent to doing:

  #pragma omp parallel
  #pragma omp for
  for-loop

Page 10:

OpenMP Pragmas

• #pragma omp critical
  structured-block

• Defines a critical section

• Only one thread may be executing the block at any given time
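For example (my sketch), protecting a read-modify-write of a shared counter:

int count = 0;
#pragma omp parallel
{
    /* count++ is a read-modify-write; without the critical
       section, concurrent updates could be lost */
    #pragma omp critical
    count++;
}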

Page 11:

OpenMP Pragmas

• #pragma omp barrier

• All threads will wait at the barrier until all other threads have reached the same barrier
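For example (my sketch, with hypothetical phase functions), no thread starts phase 2 until every thread has finished phase 1:

#pragma omp parallel
{
    do_phase1(omp_get_thread_num());   /* hypothetical helper */
    #pragma omp barrier                /* all threads synchronize here */
    do_phase2(omp_get_thread_num());   /* hypothetical helper */
}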

Page 12:

OpenMP Examples

void simple(int n, float *a, float *b)
{
    int i;

#pragma omp parallel for
    for (i=0; i<n; i++)   /* i is private by default */
        b[i] = (a[i] + a[i-1]) / 2.0;   /* caution: when i == 0 this reads a[-1];
                                           the OpenMP spec's version starts at i = 1 */
}

Page 13:

OpenMP Examples

Assume n = 15 and the number of threads is 4:

Thread 0:  b[0] = …   b[1] = …   b[2] = …   b[3] = …
Thread 1:  b[4] = …   b[5] = …   b[6] = …   b[7] = …
Thread 2:  b[8] = …   b[9] = …   b[10] = …  b[11] = …
Thread 3:  b[12] = …  b[13] = …  b[14] = …

Page 14:

OpenMP Examples

#include <stdio.h>
#include <omp.h>

int main()
{
    int x = 2;

#pragma omp parallel num_threads(2) shared(x)
    {
        if (omp_get_thread_num() == 0)
            x = 5;
        else
            /* Print 1: the following read of x has a race */
            printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);

#pragma omp barrier

        if (omp_get_thread_num() == 0)
            printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
        else
            printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    }

    return 0;
}

Page 15:

OpenMP Examples

$ ./test

1: Thread# 3: x = 2

1: Thread# 2: x = 5

1: Thread# 1: x = 5

3: Thread# 2: x = 5

3: Thread# 1: x = 5

2: Thread# 0: x = 5

3: Thread# 3: x = 5

(Note: this sample run evidently used four threads rather than the two requested by num_threads(2); because of the race, the "1:" lines may show either 2 or 5.)

$

Page 17:

Paraguin Compiler

• The Paraguin Compiler is a parallelizing compiler that produces parallel code using MPI to run on a distributed-memory system (cluster)

• Based on SUIF Compiler System (suif.stanford.edu)

Page 18:

Pragma Directives

• Similar to OpenMP, the compiler is directed through the use of pragma statements

• The goal is to create an abstraction similar to OpenMP, but on a distributed-memory system

Page 19:

Parallel Region

• Defining a parallel region:
  – #pragma paraguin begin_parallel
  – #pragma paraguin end_parallel

• Statements between the begin and end parallel region are executed by all processors

• Statements outside the parallel region are executed by the master thread only (pid 0)

Page 20:

Hello World

int __guin_mypid = 0;

int main(int argc, char *argv[]) {

char hostname[256];

printf("Master process %d starting.\n", __guin_mypid);

;

#pragma paraguin begin_parallel

gethostname(hostname, 255);

printf("Hello world from process %3d on machine %s.\n", __guin_mypid, hostname);

;

#pragma paraguin end_parallel

printf("Goodbye world from process %d.\n", __guin_mypid);

}

Page 21:

Hello World Results

Compiling:

$ runparaguin hello.c
Processing file hello.spd
Parallelizing procedure: "main"

Running:

$ mpirun -nolocal -np 8 hello.out
Hello world from process 3.
Hello world from process 1.
Hello world from process 7.
Hello world from process 5.
Hello world from process 4.
Hello world from process 2.
Hello world from process 6.
Master process 0 starting.
Hello world from process 0.
Goodbye world from process 0.

Page 22:

Hello World (cont.)

• Notice the semicolons in front of the pragma statements.

• SUIF attaches pragmas to the most recently seen statement, which may be nested.

• In order to have them attach to a top-level statement, we introduce an empty statement (';') to which the pragma can be attached.

Page 23:

Paraguin Predefined Variables

• Notice the declaration and initialization of the variable __guin_mypid.

• The predefined variables of Paraguin may be declared, initialized, and referenced by the user program. They should not be modified beyond initialization.

• This is useful because it allows the same program to be compiled, unmodified, using gcc.
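For instance (my sketch; only __guin_mypid appears in the Hello World above, the second variable just illustrates the idea):

int __guin_mypid = 0;   /* set to the MPI rank by the generated code */
int __guin_NP = 1;      /* assumed default: a plain gcc build acts as one process */

Compiled with gcc, the paraguin pragmas are ignored and these initial values make the program run sequentially as process 0.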

Page 24:

Paraguin Predefined Variables

Identifier        Type         Description
__guin_NP         int          Number of processors
__guin_blksz      int          Block size (number of partitions per processor)
__guin_mypid      int          Current processor ID
__guin_pidr       int          Receiving thread's processor ID
__guin_pidw       int          Sending thread's processor ID
__guin_buffer     char []      Buffer of data to be transmitted
__guin_position   int          Number of bytes in the buffer
__guin_status     MPI_Status   Status of the message
__guin_p          int          Current partition number
__guin_pr         int          Receiving partition number
__guin_pw         int          Sending partition number

Page 25:

Parallel for

• #pragma paraguin forall C p i j k \
      -1 -1  1 -1 0x0 \
       1  1 -1  1 0x0

• The next for loop nest will be partitioned to run on multiple processors. The data that follows the "forall" is a matrix of inequalities to determine which iterations are mapped to partitions

• p stands for the partition number
• C stands for constant (or 1)
• 0x0 is hex for zero (to prevent SUIF from turning it into a string)

Page 26:

Parallel for (cont.)

// LU Decomposition
;
#pragma paraguin forall C p i j k \
    -1 -1  1 -1 0x0 \
     1  1 -1  1 0x0

for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }

Page 27:

Parallel for (cont.)

C p i j k \
-1 -1  1 -1 0x0 \
 1  1 -1  1 0x0

• This matrix represents the affine expressions of inequalities:

                        [ 1 ]
  [ -1 -1  1 -1  0 ]    [ p ]     [ 0 ]
  [  1  1 -1  1  0 ]  x [ i ]  ≤  [ 0 ]
                        [ j ]
                        [ k ]

Page 28:

Parallel for (cont.)

                        [ 1 ]
  [ -1 -1  1 -1  0 ]    [ p ]     [ 0 ]
  [  1  1 -1  1  0 ]  x [ i ]  ≤  [ 0 ]
                        [ j ]
                        [ k ]

  -1 - p + i - j ≤ 0        1 + p - i + j ≤ 0
     p ≥ i - j - 1             p ≤ i - j - 1

               p = i - j - 1
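(Check: the iteration i = 3, j = 1 satisfies -1 - p + 3 - 1 ≤ 0 and 1 + p - 3 + 1 ≤ 0 only for p = 1 = 3 - 1 - 1.)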

Page 29:

Parallel for (cont.)

p = i - j - 1

p = 0:  i = 1, j = 0;   i = 2, j = 1;   i = 3, j = 2;   …
p = 1:  i = 2, j = 0;   i = 3, j = 1;   i = 4, j = 2;   …
p = 2:  i = 3, j = 0;   i = 4, j = 1;   i = 5, j = 2;   …
p = 3:  i = 4, j = 0;   i = 5, j = 1;   i = 6, j = 2;   …
…

Page 30:

Parallel for (cont.)

[Figure: the (i, j) iteration space for 0 ≤ i, j ≤ 4; the diagonals p = i - j - 1 group the points into partitions p = -5, …, 3.]

p = i - j - 1

Page 31:

Matrix Multiplication Example

;
#pragma paraguin forall C p i j k \
    0x0 -1  1 0x0 0x0 \
    0x0  1 -1 0x0 0x0

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

Page 32:

Matrix Multiplication Example (cont.)

#pragma paraguin forall C p i j k \
    0x0 -1  1 0x0 0x0 \
    0x0  1 -1 0x0 0x0

[Figure: the (i, j) iteration space; each column i = p forms one partition, for p = 0, …, 4.]

p = i

Page 33:

Mapping Partitions to Physical Processors

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &__guin_NP);
MPI_Comm_rank(MPI_COMM_WORLD, &__guin_mypid);

__guin_blksz = ceil((ubp - lbp + 1) / __guin_NP);

if (0 <= __guin_mypid & __guin_mypid <= __guin_NP - 1) {
    for (__guin_p = __guin_blksz * __guin_mypid;
         __guin_p <= min(N, __guin_blksz * (1 + __guin_mypid) - 1);
         __guin_p++)
        ...
}

where lbp <= p <= ubp

Page 34:

Mapping Partitions to Physical Processors

Every processor (__guin_mypid = 0, 1, …, NP-1) executes the same block of partitions, expressed in terms of its own ID:

p = __guin_blksz * __guin_mypid + 0
p = __guin_blksz * __guin_mypid + 1
p = __guin_blksz * __guin_mypid + 2
…
p = __guin_blksz * (__guin_mypid + 1) - 1

__guin_blksz = ceil((ubp - lbp + 1) / NP)

This is a block assignment of partitions to processors (as opposed to cyclic assignment).
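For example (numbers mine): with lbp = 0, ubp = 99, and NP = 8, __guin_blksz = ceil(100 / 8) = 13, so processor 2 executes partitions p = 13 * 2 = 26 through 13 * 3 - 1 = 38.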

Page 35:

Mapping Partitions to Physical Processors

__guin_mypid = 0          __guin_mypid = 1          …   __guin_mypid = NP-1

p = blksz * 0 + 0         p = blksz * 1 + 0         …   p = blksz * (NP-1) + 0
p = blksz * 0 + 1         p = blksz * 1 + 1         …   p = blksz * (NP-1) + 1
p = blksz * 0 + 2         p = blksz * 1 + 2         …   …
…                         …                         …
p = blksz * (0 + 1) - 1   p = blksz * (1 + 1) - 1   …

blksz = ceil((ubp - lbp + 1) / NP)

The values of __guin_mypid have been substituted in, and __guin_blksz has been abbreviated as blksz.

Page 36:

Broadcasting Data

• Scatter is not implemented in Paraguin
• One has to use broadcast to get the input to the other processors
• This uses the broadcast operation of MPI, which is O(log2(NP)), not O(N)

#pragma paraguin bcast X

MPI_Bcast(X, ..., MPI_COMM_WORLD);
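Filled in, the generated call might look like the following (a sketch; the count and datatype depend on how X is declared, here assumed to be an N x N array of doubles):

MPI_Bcast(&X[0][0], N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* rank 0 is the root */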

Page 37:

Loop Carried Dependencies

• Consider the code for the elimination step of Gaussian Elimination:

for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

• There is a data dependence between the lhs of the assignment and the a[i][k] reference on the rhs such that iteration iw, jw, kw writes a value to a[jw][kw] that is used in iteration ir = iw + 1, jr = iw, kr = kw.

• There is also a data dependence between the lhs and a[i][i] on the rhs, but we will only consider one dependence here.

Output Dependence

Page 38:

Loop Carried Dependencies (cont)

i=0 j=1 k=3 : a[1][3] = … - a[0][3] * …
i=0 j=1 k=2 : a[1][2] = … - a[0][2] * …
i=0 j=1 k=1 : a[1][1] = … - a[0][1] * …
i=0 j=1 k=0 : a[1][0] = … - a[0][0] * …

i=1 j=2 k=3 : a[2][3] = … - a[1][3] * …
i=1 j=2 k=2 : a[2][2] = … - a[1][2] * …
i=1 j=2 k=1 : a[2][1] = … - a[1][1] * …
i=1 j=2 k=0 : a[2][0] = … - a[1][0] * …

Page 39:

Loop Carried Dependencies

• Below is the pragma to specify the data dependence

#pragma paraguin dep 0x0 2 C iw jw kw ir jr kr \

0x0 0x0 1 0x0 -1 0x0 0x0 \

0x0 0x0 -1 0x0 1 0x0 0x0 \

0x0 0x0 0x0 1 0x0 0x0 -1 \

0x0 0x0 0x0 -1 0x0 0x0 1 \

-1 -1 0x0 0x0 1 0x0 0x0 \

1 1 0x0 0x0 -1 0x0 0x0

• Paraguin will insert the code for the processor writing the data to pack it up and send it to the processor that needs it

• It also inserts the code for the processor that needs that data to receive the message and unpack the data
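Decoding the matrix as before (each row dotted with (1, iw, jw, kw, ir, jr, kr) must be ≤ 0): rows 1 and 2 give jw - ir ≤ 0 and ir - jw ≤ 0, i.e. jw = ir; rows 3 and 4 give kw = kr; rows 5 and 6 give -1 - iw + ir ≤ 0 and 1 + iw - ir ≤ 0, i.e. ir = iw + 1. These are exactly the constraints of the dependence described above: the value written to a[jw][kw] is read in iteration ir = iw + 1 with kr = kw.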

Page 40:

Gather

• Gathering is getting the partial results back from the various processors to the master process.

#pragma paraguin gather <write reference> <X> <A>

Page 41:

Gather

• Example: LU Decomposition

;
#pragma paraguin gather 3 C i j k \
     1  1 -1 0x0 \
    -1 -1  1 0x0

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }
}

Page 42:

Gather

#pragma paraguin gather 3 C i j k \

1 1 -1 0x0 \

-1 -1 1 0x0

• 3 indicates the 4th (starting at 0) array reference: X[j][k]

• The system of inequalities indicates which values of the loop variables produce the final values of that array: j = i + 1, for all k
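Checking the arithmetic: the row (1, 1, -1, 0x0) encodes 1 + i - j ≤ 0, i.e. j ≥ i + 1, and the row (-1, -1, 1, 0x0) encodes -1 - i + j ≤ 0, i.e. j ≤ i + 1; together they force j = i + 1.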

Page 43:

Gather

for (__guin_p = 1 + __guin_blksz * __guin_mypid;
     __guin_p <= __suif_min(N, __guin_blksz + __guin_blksz * __guin_mypid);
     __guin_p++) {

    i = __guin_p - 1;
    j = i + 1;

    for (k = 1 * __guin_p; k <= 100; k++)
        MPI_Pack(&X[j][k], ..., MPI_COMM_WORLD);
}

MPI_Send(__guin_buffer, ... 0, ..., MPI_COMM_WORLD);

Page 44:

Some Results

Page 45:

Gaussian Elimination

#pragma paraguin forall C p i j k \
    0x0 -1 0x0  1 0x0 \
    0x0  1 0x0 -1 0x0

#pragma paraguin dep 0x0 2 C iw jw kw ir jr kr \
    0x0 0x0   1 0x0  -1 0x0 0x0 \
    0x0 0x0  -1 0x0   1 0x0 0x0 \
    0x0 0x0 0x0   1 0x0 0x0  -1 \
    0x0 0x0 0x0  -1 0x0 0x0   1 \
     -1  -1 0x0 0x0   1 0x0 0x0 \
      1   1 0x0 0x0  -1 0x0 0x0

#pragma paraguin dep 0x0 4 C iw jw kw ir jr kr \
    0x0 0x0  -1 0x0   1 0x0 0x0 \
    0x0 0x0   1 0x0  -1 0x0 0x0 \
    0x0 0x0 0x0  -1   1 0x0 0x0 \
    0x0 0x0 0x0   1  -1 0x0 0x0 \
     -1  -1 0x0 0x0   1 0x0 0x0 \
      1   1 0x0 0x0  -1 0x0 0x0

Page 46:

Gaussian Elimination (cont.)

#pragma paraguin gather 0x0 C i j k \
    -1 -1  1 0x0 \
     1  1 -1 0x0

#pragma paraguin gather 0x0 C i j k \
    0x0 -1 0x0  1 \
    0x0  1 0x0 -1

for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

Page 47:

Gaussian Elimination (cont.)

Page 48:

LU Decomposition

#pragma paraguin forall C p i j k \
    -1 -1  1 -1 0x0 \
     1  1 -1  1 0x0

#pragma paraguin dep 3 2 C iw jw kw ir jr \
      1   1 0x0 0x0  -1 0x0 \
     -1  -1 0x0 0x0   1 0x0 \
    0x0 0x0   1 0x0  -1 0x0 \
    0x0 0x0  -1 0x0   1 0x0 \
    0x0 0x0 0x0   1  -1 0x0 \
    0x0 0x0 0x0  -1   1 0x0

#pragma paraguin dep 3 6 C iw jw kw ir jr kr \
     -1  -1 0x0 0x0   1 0x0 0x0 \
      1   1 0x0 0x0  -1 0x0 0x0 \
    0x0 0x0   1 0x0  -1 0x0 0x0 \
    0x0 0x0  -1 0x0   1 0x0 0x0 \
    0x0 0x0 0x0   1 0x0 0x0  -1 \
    0x0 0x0 0x0  -1 0x0 0x0   1

Page 49:

LU Decomposition (cont.)

#pragma paraguin gather 0 C i j \
    0x0 0x0 0x0

#pragma paraguin gather 3 C i j k \
     1  1 -1 0x0 \
    -1 -1  1 0x0

for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }

Page 50:

LU Decomposition (cont)

Page 51:

Redundant Data in Messages

• We discovered that the messages sent between processors for Gaussian Elimination contained redundant data

• Jerry Martin (MS student, 2010) studied detecting and reducing this redundant data

Page 52:

Redundant Data in Messages (cont)

...
<pid0, p2>: pack a[2][2] - Value: 63.000000
<pid0, p2>: pack a[2][3] - Value: 28.000000
<pid0, p2>: pack a[2][4] - Value: 91.000000
<pid0, p2>: pack a[2][5] - Value: 60.000000
<pid0, p2>: pack a[2][6] - Value: 64.000000
...
<pid0, p2>: pack a[2][2] - Value: 63.000000
<pid0, p2>: pack a[2][3] - Value: 28.000000
<pid0, p2>: pack a[2][4] - Value: 91.000000
<pid0, p2>: pack a[2][5] - Value: 60.000000
<pid0, p2>: pack a[2][6] - Value: 64.000000
...
<pid0>: send to <pid1>

Page 53:

Suppressing Redundant Data

With redundant data in messages

Without redundant data in messages

Page 54:

Suppressing Redundant Data (cont)

With redundant data in messages

Without redundant data in messages

Page 55:

Communication Pattern of Gaussian Elimination

p0 p1 p2 p3 p4 p5 p6 p7

p0 p1 p2 p3 p4 p5 p6 p7

p0 p1 p2 p3 p4 p5 p6 p7

p0 p1 p2 p3 p4 p5 p6 p7

…

Page 56:

Loop Carried Dependencies and Distributed-Memory Clusters

• Notice that both Gaussian Elimination and LU Decomposition do no better than sequential execution regardless of the number of processors

• In fact, the performance gets worse as the number of processors increases

• The issue is that we can’t expect to obtain speedup on a distributed-memory cluster when we have communication between processors.

• The communication is just too slow

Page 57:

Communication Pattern that works on a distributed-memory system

p0

p0 p1 p2 p3 p4 p5 p6 p7

p0

Pattern: Beyond scattering the input and gathering the results, processors work independently.

Page 58:

Matrix Multiplication

;
#pragma paraguin begin_parallel
#pragma paraguin forall C p i j k \
    0x0 -1  1 0x0 0x0 \
    0x0  1 -1 0x0 0x0

#pragma paraguin bcast a b

#pragma paraguin gather 1 C i j k \
    0x0 0x0 0x0  1 \
    0x0 0x0 0x0 -1

// We need to gather all c[i][j]. However, array reference
// one is inside the k loop. If we put in an empty gather
// then we'll have N copies of each c[i][j] sent to the
// master. To send just one, we use k = 0.

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

;
#pragma paraguin end_parallel

Page 59:

Matrix Multiplication

;

#pragma paraguin begin_parallel

#pragma paraguin forall C p i j k \

0x0 -1 1 0x0 0x0 \

0x0 1 -1 0x0 0x0

#pragma paraguin bcast a b

#pragma paraguin gather 1 C i j k \

0x0 0x0 0x0 1 \

0x0 0x0 0x0 -1

Page 60:

Matrix Multiplication

// We need to gather all c[i][j]. However, array reference
// one is inside the k loop. If we put in an empty gather
// then we'll have N copies of each c[i][j] sent to the
// master. To send just one, we use k = 0.

for (i = 0; i < N; i++) {

for (j = 0; j < N; j++) {

c[i][j] = 0.0;

for (k = 0; k < N; k++) {

c[i][j] = c[i][j] + a[i][k] * b[k][j];

}

}

}

;

#pragma paraguin end_parallel

Page 61:

Matrix Multiplication

Page 62:

Traveling Salesman Problem (TSP)

• The Traveling Salesman Problem is to find the Hamiltonian cycle of a set of cities that minimizes the distance traveled.

• Doing a brute force search of the solution space requires us to consider all permutations of N cities.

• That is N! permutations.

• We can fix the first and last city to be city 0, since that removes cyclic variations of the same solution.

• E.g. 0->1->2->3->4->0 is the same as 0->4->3->2->1->0.

Page 63:

Traveling Salesman Problem (TSP)

N = n*n - 3*n + 2; // (n-1)(n-2)
;
#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D

#pragma paraguin forall C N pid p \
    0x0 0x0 -1  1 \
    0x0 0x0  1 -1

for (p = 0; p < N; p++) {
    perm[1] = p / (n-2) + 1;
    perm[2] = p % (n-2) + 1;
    if (perm[2] >= perm[1])
        perm[2]++;

    initialize(perm, n, 3);

    do {
        dist = computeDist(D, n, perm);
        if (minDist < 0 || minDist > dist) {
            // … Details omitted.
            // Record the minimum distance and permutation
        }
    } while (increment(perm, n));
}

• Creating permutations does not lend itself to easy parallelization
• We can make a loop that iterates (n-1)(n-2) times and base the first two cities on the loop variable
• City 0 is fixed
• City 1 = p / (n - 2) + 1
• City 2 = p % (n - 2) + 1
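To make the indexing concrete (worked numbers, mine): with n = 5, N = 5*5 - 3*5 + 2 = 12. For p = 7, City 1 = 7 / 3 + 1 = 3 and City 2 = 7 % 3 + 1 = 2 (no adjustment, since 2 < 3), so that partition enumerates the tours beginning 0 -> 3 -> 2. Each of the 12 partitions then iterates over the (n-3)! = 2 orderings of the remaining cities.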

Page 64:

Traveling Salesman Problem (TSP)

N = n*n - 3*n + 2; // (n-1)(n-2)
;
#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D

#pragma paraguin forall C N pid p \
    0x0 0x0 -1  1 \
    0x0 0x0  1 -1

for (p = 0; p < N; p++) {
    perm[1] = p / (n-2) + 1;
    perm[2] = p % (n-2) + 1;

Page 65:

TSP

Page 66:

Hybrid

• Hybrid parallel programs are ones that make use of the distributed memory of a cluster as well as the multiple cores within each computer (node) of the cluster.

• We can use MPI to schedule processes to run on multiple nodes and then use OpenMP to schedule threads on the cores within a node.

• The threads on separate cores use a shared-memory model, whereas between nodes MPI uses a distributed-memory model.

Page 67:

Doing Hybrid in Paraguin

• The Paraguin compiler is a source-to-source compiler. It creates C code with MPI calls from C code. This new code is compiled using the mpicc script, which uses gcc.

• gcc also has OpenMP support.

• The Paraguin compiler will simply pass through pragmas that it does not recognize, creating a hybrid program.

Page 68:

Matrix Multiplication (Hybrid)

…
#pragma paraguin forall C p i j k \
    0x0 -1  1 0x0 0x0 \
    0x0  1 -1 0x0 0x0
…

#pragma omp parallel for private(i,j,k) schedule(static) \
        num_threads(NUM_THREADS)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
}
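Independent of Paraguin, a hand-written hybrid program is typically built and launched along these lines (a sketch; the filename is mine and exact flags depend on the MPI installation):

$ mpicc -fopenmp matmult_hybrid.c -o matmult_hybrid
$ mpirun -np 8 ./matmult_hybrid

Each of the 8 MPI processes then spawns NUM_THREADS OpenMP threads on the cores of its node.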

Page 69:

Matrix Multiplication (Hybrid)

Page 70:

Questions?
