CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 1
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10th, 2011
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional for non-embarrassingly-parallel (cooperative) workloads
• Detect "stop" condition
  – May be implicit, e.g. with a barrier
• Aggregate final results
  – Often with a reduction operator
• Output results and error code
• Terminate and return to OS
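A minimal sketch of these steps as a single MPI program (illustrative only, not from the lecture; the local computation is a placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, total = 0.0;

    MPI_Init(&argc, &argv);                  /* initialize execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank;                    /* placeholder for the distributed core computation */

    MPI_Barrier(MPI_COMM_WORLD);             /* synchronize; here the "stop" condition is implicit */

    /* aggregate final results with a reduction operator */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate result: %f\n", total);  /* output results */

    MPI_Finalize();                          /* terminate and return to OS */
    return 0;
}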
“embarrassingly parallel”
• Common phrase
  – poorly defined
  – widely used
• Suggests lots and lots of parallelism
  – with essentially no inter-task communication or coordination
  – a highly partitionable workload with minimal overhead
• "Almost embarrassingly parallel"
  – same as above, but
  – requires a master to launch many tasks
  – requires a master to collect the final results of the tasks
  – sometimes still referred to as "embarrassingly parallel"
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
Mandelbrot Set
Set of points in the complex plane that are quasi-stable (values increase and decrease but do not exceed some limit) when computed by iterating the function

z_{k+1} = z_k^2 + c

where z_{k+1} is the (k+1)th iteration of the complex number z = a + bi, and c is the complex number giving the position of the point in the complex plane. The initial value of z is zero.

Iteration continues until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by

z_length = sqrt(a^2 + b^2)
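The per-pixel computation referred to below as cal_pixel() can be sketched as follows (a sketch in the style of Wilkinson & Allen's version; the iteration limit of 256 is an assumption):

typedef struct { float real, imag; } Complex;

/* Iterate z = z*z + c until |z| > 2 or the iteration limit is reached;
   the returned count is used to color the pixel. */
int cal_pixel(Complex c)
{
    int   count = 0, max_iter = 256;   /* illustrative iteration limit */
    float z_real = 0.0f, z_imag = 0.0f, temp, lengthsq;

    do {
        temp   = z_real * z_real - z_imag * z_imag + c.real;
        z_imag = 2.0f * z_real * z_imag + c.imag;
        z_real = temp;
        lengthsq = z_real * z_real + z_imag * z_imag;
        count++;
    } while (lengthsq < 4.0f && count < max_iter);   /* |z| > 2 iff |z|^2 > 4 */

    return count;
}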
Mandelbrot Set (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
    FILE *file;
    int i, j;
    int tmp;
    Complex c;
    double *data_l, *data_l_tmp;
    int nx, ny;
    int mystrt, myend;
    int nrows_l;
    int nprocs, mype;
    MPI_Status status;
Determine the dimensions of the work to be performed by each concurrent task.

Each local task calculates the coordinates of every pixel in its region; for each pixel it calls cal_pixel(), which computes the corresponding value.
The master process opens an output file, stores its own values in it, and then waits to receive the values computed by each of the worker processes.

The worker processes send the Mandelbrot values computed for their regions to the master process.
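A sketch of the exchange these callouts describe (assumptions: each worker holds nrows_l rows of ny values, the output file name is hypothetical, and variable names follow the fragment above):

if (mype == MASTERPE) {
    /* master: open the output file (name is hypothetical) and write its own block */
    file = fopen("mandelbrot.out", "w");
    for (i = 0; i < nrows_l * ny; i++)
        fprintf(file, "%f\n", data_l[i]);
    /* then receive and append each worker's block */
    for (j = 1; j < nprocs; j++) {
        MPI_Recv(data_l_tmp, nrows_l * ny, MPI_DOUBLE, j, 0,
                 MPI_COMM_WORLD, &status);
        for (i = 0; i < nrows_l * ny; i++)
            fprintf(file, "%f\n", data_l_tmp[i]);
    }
    fclose(file);
} else {
    /* worker: ship the locally computed block to the master */
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
}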
Monte Carlo Simulation
• No single approach; a multitude of different methods
• Usually follows this pattern:
  – Define a domain of possible inputs
  – Generate inputs randomly from the domain
  – Perform a deterministic computation using the inputs
  – Aggregate the results
int main(int argc, char *argv[])
{
    int niter = 0;
    double x, y;
    int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
    double z;
    double pi;
    time_t rawtime;
    struct tm *timeinfo;

    printf("Enter the number of iterations used to estimate pi: ");
    scanf("%d", &niter);
    time(&rawtime);
    timeinfo = localtime(&rawtime);
Seed for generating random numbers
OpenMP Calculating Pi
printf ( "The current date/time is: %s", asctime (timeinfo) ); /* initialize random numbers */ srand(SEED);#pragma omp parallel for private(x,y,z,tid) reduction(+:count) for ( i=0; i<niter; i++) { x = (double)rand()/RAND_MAX; y = (double)rand()/RAND_MAX; z = (x*x+y*y); if (z<=1) count++; if (i==(niter/6)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } if (i==(niter/3)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } if (i==(niter/2)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Initialize the random number generator: srand() seeds the sequence of random numbers later produced by rand().
Calculating Pi
        if (i == (2*niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (5*niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == niter-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    printf("The current date/time is: %s", asctime(timeinfo));
    printf(" the total count is %i\n", count);
    pi = (double)count / niter * 4;
    printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
    return 0;
}
Demo: OpenMP Pi
[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
 thread 0 just did iteration 16665 the count is 13124
 thread 1 just did iteration 33332 the count is 6514
 thread 1 just did iteration 49999 the count is 19609
 thread 2 just did iteration 66665 the count is 13048
 thread 3 just did iteration 83332 the count is 6445
 thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
 the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$
MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status); if (request) {
for (i = 0; i < CHUNKSIZE; ) { rands[i] = random(); if (rands[i] <= INT_MAX) i++; }/* Send random number array*/MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world); }
request = 1; done = in = out = 0; max = INT_MAX; /* max int, for normalization */ MPI_Send( &request, 1, MPI_INT, server, REQUEST, world ); MPI_Comm_rank( workers, &workerid ); iter = 0;
Broadcast error bounds: epsilon

Create a custom communicator

Server process: (1) receives a request to generate a random number array, (2) computes the array, (3) sends it to the requestor

Worker process: requests that the server generate a random number array
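A hedged sketch of how a worker might consume the random array to accumulate hit counts (an assumption: this mirrors the classic MPICH "monte" pi example the fragments appear to come from; variable names follow the fragments and callouts above):

    while (!done) {
        iter++;
        /* receive a fresh chunk of random integers from the server */
        MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
        for (i = 0; i < CHUNKSIZE - 1; ) {
            x = (((double) rands[i++]) / max) * 2 - 1;   /* map to [-1, 1] */
            y = (((double) rands[i++]) / max) * 2 - 1;
            if (x * x + y * y < 1.0)
                in++;        /* point fell inside the unit circle */
            else
                out++;
        }
        /* combine counts across all workers over the custom communicator */
        MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
        MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
        Pi = (4.0 * totalin) / (totalin + totalout);
        /* stop when the estimate is within epsilon of pi (fabs needs <math.h>) */
        done = fabs(Pi - 3.141592653589793) < epsilon;
        request = done ? 0 : 1;   /* tell the server whether more randoms are needed */
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
    }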
MPI Dot Product
float Serial_dot(
        float x[] /* in */,
        float y[] /* in */,
        int   n   /* in */) {
    int i;
    float sum = 0.0;
    for (i = 0; i < n; i++)
        sum = sum + x[i] * y[i];
    return sum;
} /* Serial_dot */

float Parallel_dot(
        float local_x[] /* in */,
        float local_y[] /* in */,
        int   n_bar     /* in */) {
    float local_dot;
    float dot = 0.0;
    local_dot = Serial_dot(local_x, local_y, n_bar);
    MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return dot;
} /* Parallel_dot */
Serial_dot(): calculates the dot product of the local arrays

Parallel_dot(): calls Serial_dot() to compute the dot product of the local workload, then sums the local results with a collective MPI_Reduce call (MPI_SUM)
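A hypothetical driver showing how Parallel_dot() might be used (not from the slides; assumes a block distribution where p evenly divides n and n/p fits in MAX_LOCAL):

#define MAX_LOCAL 1024   /* hypothetical bound on the per-process block size */

/* Scatter x and y from process 0 in equal blocks, then form the global
   dot product; the returned value is valid on process 0. */
float dot_driver(float x[], float y[], int n, int p)
{
    float local_x[MAX_LOCAL], local_y[MAX_LOCAL];
    int   n_bar = n / p;

    MPI_Scatter(x, n_bar, MPI_FLOAT, local_x, n_bar, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    MPI_Scatter(y, n_bar, MPI_FLOAT, local_y, n_bar, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    return Parallel_dot(local_x, local_y, n_bar);
}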
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
Implementing Matrix Multiplication: Sequential Code
Assume throughout that the matrices are square (n × n). The sequential code to compute A × B could simply be:
for (i = 0; i < n; i++)for (j = 0; j < n; j++) {
c[i][j] = 0;for (k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
This algorithm requires n³ multiplications and n³ additions, for a sequential time complexity of O(n³). Very easy to parallelize.
Implementing Matrix Multiplication
• With n processors (and n × n matrices): time complexity of O(n²)
  – Each instance of the inner loop is independent and can be done by a separate processor
• With n² processors: time complexity of O(n)
  – One element of A and B assigned to each processor
  – Cost-optimal, since O(n³) = n × O(n²) = n² × O(n)
• With n³ processors: time complexity of O(log n)
  – By parallelizing the inner loop
  – Not cost-optimal, since O(n³) < n³ × O(log n)
• O(log n) is the lower bound for parallel matrix multiplication
Block Matrix Multiplication
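The slide's illustration is not reproduced here; as a rough sketch of the blocked algorithm (an assumption: the standard partition into s × s blocks of size m = n/s, with n divisible by s):

/* Multiply n x n matrices block by block: C[p][q] += A[p][r] * B[r][q],
   where each block is m x m with m = n/s. Illustrative only. */
for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        /* zero out block C[p][q] */
        for (i = p * m; i < (p + 1) * m; i++)
            for (j = q * m; j < (q + 1) * m; j++)
                c[i][j] = 0;
        /* accumulate the products of block A[p][r] and block B[r][q] */
        for (r = 0; r < s; r++)
            for (i = p * m; i < (p + 1) * m; i++)
                for (j = q * m; j < (q + 1) * m; j++)
                    for (k = r * m; k < (r + 1) * m; k++)
                        c[i][j] += a[i][k] * b[k][j];
    }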
Performance Improvement
Using tree construction, n numbers can be added in O(log n) steps (using n³ processors):
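A minimal sketch of this pairwise (tree) summation in OpenMP-style C (illustrative, not from the slides):

#include <omp.h>

/* Tree summation: log2(n) halving steps; each step's additions are
   independent, so with enough processors every step costs O(1) and the
   whole sum takes O(log n) steps. Assumes n is a power of two. */
double tree_sum(double x[], int n)
{
    int i, stride;
    for (stride = n / 2; stride >= 1; stride /= 2) {
#pragma omp parallel for private(i)
        for (i = 0; i < stride; i++)
            x[i] += x[i + stride];
    }
    return x[0];
}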
for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_A; j++) Matrix_A[i][j] = i + j; } /* Matrix_B Elements */ for (i = 0; i < NoofRows_B; i++) { for (j = 0; j < NoofCols_B; j++) Matrix_B[i][j] = i + j; } printf("The Matrix_A Is \n");
Initialize the two matrices A[][] and B[][] with the sum of their index values
OpenMP Matrix Multiplication
for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_A; j++) printf("%f \t", Matrix_A[i][j]); printf("\n"); } printf("The Matrix_B Is \n"); for (i = 0; i < NoofRows_B; i++) { for (j = 0; j < NoofCols_B; j++) printf("%f \t", Matrix_B[i][j]); printf("\n"); } for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_B; j++) { Result[i][j] = 0.0; } }#pragma omp parallel for private(j,k) for (i = 0; i < NoofRows_A; i = i + 1) for (j = 0; j < NoofCols_B; j = j + 1) for (k = 0; k < NoofCols_A; k = k + 1) Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j]; printf("\nThe Matrix Computation Result Is \n");
Initialize the result matrix to 0.0

Print the matrices for debugging purposes

Using the OpenMP parallel for directive, calculate the product of the two matrices; load balancing is determined by the OpenMP environment variables and the number of threads
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4           /* number of rows in matrix A */
#define NCA 4           /* number of columns in matrix A */
#define NCB 4           /* number of columns in matrix B */
#define MASTER 0        /* taskid of first task */
#define FROM_MASTER 1   /* setting a message type */
#define FROM_WORKER 2   /* setting a message type */

int main(int argc, char *argv[])
{
    int numtasks,              /* number of tasks in partition */
        taskid,                /* a task identifier */
        numworkers,            /* number of worker tasks */
        source,                /* task id of message source */
        dest,                  /* task id of message destination */
        mtype,                 /* message type */
        rows,                  /* rows of matrix A sent to each worker */
        averow, extra, offset, /* used to determine rows sent to each worker */
        i, j, k, rc;           /* misc */
    double a[NRA][NCA],        /* matrix A to be multiplied */
           b[NCA][NCB],        /* matrix B to be multiplied */
           c[NRA][NCB];        /* result matrix C */
    MPI_Status status;

    /* MPI setup as in the cited mpi_mm.c */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
Matrix Multiplication (source code)

    if (numtasks < 2) {
        printf("Need at least two MPI tasks. Quitting...\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
        exit(1);
    }
    numworkers = numtasks - 1;

    if (taskid == MASTER) {
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++) {
                a[i][j] = i + j + 1;
                b[i][j] = i + j + 1;
            }
        printf("Matrix A :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", a[i][j]);
        }
        printf("Matrix B :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", b[i][j]);
        }
        averow = NRA / numworkers;
        extra = NRA % numworkers;
        offset = 0;
        mtype = FROM_MASTER;
Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
MASTER: initialize matrices A and B

Print the two matrices for debugging purposes

Calculate the number of rows to be processed by each worker

Calculate the number of leftover (overflow) rows; one extra row goes to each of the first workers
Matrix Multiplication (source code)

        /* To each worker send: start point, number of rows to process,
           and sub-arrays to process */
        for (dest = 1; dest <= numworkers; dest++) {
            rows = (dest <= extra) ? averow + 1 : averow;
            printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
            MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&a[offset][0], rows * NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&b, NCA * NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            offset = offset + rows;
        }
        /* Receive results from worker tasks */
        mtype = FROM_WORKER;   /* message tag for messages sent by workers */
        for (i = 1; i <= numworkers; i++) {
            source = i;
            /* offset stores the (processing) starting point of the work chunk */
            MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows * NCB, MPI_DOUBLE, source, mtype,
                     MPI_COMM_WORLD, &status);
            printf("Received results from task %d\n", source);
        }
        printf("******************************************************\n");
        printf("Result Matrix:\n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", c[i][j]);
        }
        printf("\n******************************************************\n");
        printf("Done.\n");
    }
MASTER: send a workload chunk to each of the workers

MASTER: receive the computed chunks from the workers; c[][] holds the matrix product rows calculated by the corresponding workers
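The worker side of the program is not shown on these slides; a sketch following the structure of the cited mpi_mm.c example (the else branch pairs with the master's if (taskid == MASTER) block above):

    else {   /* worker task: receive a block of rows, multiply, send back */
        mtype = FROM_MASTER;
        MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows * NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, NCA * NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < NCA; j++)
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }

        mtype = FROM_WORKER;
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&c, rows * NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
    }
    MPI_Finalize();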