CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 1
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10th, 2011
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional for non-embarrassingly-parallel (cooperative) workloads
• Detect "stop" condition
  – May be implicit, e.g. with a barrier
• Aggregate final results
  – Often with a reduction operator
• Output results and error code
• Terminate and return to OS
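A minimal sketch of these steps as a single MPI program (illustrative only, not from the lecture; the local computation is a placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, total = 0.0;

    MPI_Init(&argc, &argv);                  /* initialize execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank;                    /* placeholder for the distributed core computation */

    MPI_Barrier(MPI_COMM_WORLD);             /* synchronize; here the "stop" condition is implicit */

    /* aggregate final results with a reduction operator */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate result: %f\n", total);  /* output results */

    MPI_Finalize();                          /* terminate and return to OS */
    return 0;
}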
“embarrassingly parallel”
• Common phrase
  – poorly defined
  – widely used
• Suggests lots and lots of parallelism
  – with essentially no inter-task communication or coordination
  – a highly partitionable workload with minimal overhead
• "Almost embarrassingly parallel"
  – same as above, but
  – requires a master to launch many tasks
  – requires a master to collect the final results of the tasks
  – sometimes still referred to as "embarrassingly parallel"
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
Mandelbrot Set
Set of points in the complex plane that are quasi-stable (values increase and decrease but do not exceed some limit) when computed by iterating the function

z_{k+1} = z_k^2 + c

where z_{k+1} is the (k+1)th iteration of the complex number z = a + bi, and c is the complex number giving the position of the point in the complex plane. The initial value of z is zero.

Iteration continues until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by

z_length = sqrt(a^2 + b^2)
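The per-pixel computation referred to below as cal_pixel() can be sketched as follows (a sketch in the style of Wilkinson & Allen's version; the iteration limit of 256 is an assumption):

typedef struct { float real, imag; } Complex;

/* Iterate z = z*z + c until |z| > 2 or the iteration limit is reached;
   the returned count is used to color the pixel. */
int cal_pixel(Complex c)
{
    int   count = 0, max_iter = 256;   /* illustrative iteration limit */
    float z_real = 0.0f, z_imag = 0.0f, temp, lengthsq;

    do {
        temp   = z_real * z_real - z_imag * z_imag + c.real;
        z_imag = 2.0f * z_real * z_imag + c.imag;
        z_real = temp;
        lengthsq = z_real * z_real + z_imag * z_imag;
        count++;
    } while (lengthsq < 4.0f && count < max_iter);   /* |z| > 2 iff |z|^2 > 4 */

    return count;
}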
Mandelbrot Set (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
    FILE *file;
    int i, j;
    int tmp;
    Complex c;
    double *data_l, *data_l_tmp;
    int nx, ny;
    int mystrt, myend;
    int nrows_l;
    int nprocs, mype;
    MPI_Status status;
Determine the dimensions of the work to be performed by each concurrent task.

Each local task calculates the coordinates of every pixel in its region; for each pixel it calls cal_pixel(), which computes the corresponding value.
The master process opens an output file, stores its own values in it, and then waits to receive the values computed by each of the worker processes.

The worker processes send the Mandelbrot values computed for their regions to the master process.
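A sketch of the exchange these callouts describe (assumptions: each worker holds nrows_l rows of ny values, the output file name is hypothetical, and variable names follow the fragment above):

if (mype == MASTERPE) {
    /* master: open the output file (name is hypothetical) and write its own block */
    file = fopen("mandelbrot.out", "w");
    for (i = 0; i < nrows_l * ny; i++)
        fprintf(file, "%f\n", data_l[i]);
    /* then receive and append each worker's block */
    for (j = 1; j < nprocs; j++) {
        MPI_Recv(data_l_tmp, nrows_l * ny, MPI_DOUBLE, j, 0,
                 MPI_COMM_WORLD, &status);
        for (i = 0; i < nrows_l * ny; i++)
            fprintf(file, "%f\n", data_l_tmp[i]);
    }
    fclose(file);
} else {
    /* worker: ship the locally computed block to the master */
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
}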
Monte Carlo Simulation
• No single approach; a multitude of different methods
• Usually follows this pattern:
  – Define a domain of possible inputs
  – Generate inputs randomly from the domain
  – Perform a deterministic computation using the inputs
  – Aggregate the results
int main(int argc, char *argv[])
{
    int niter = 0;
    double x, y;
    int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
    double z;
    double pi;
    time_t rawtime;
    struct tm *timeinfo;

    printf("Enter the number of iterations used to estimate pi: ");
    scanf("%d", &niter);
    time(&rawtime);
    timeinfo = localtime(&rawtime);
Seed for generating random numbers
OpenMP Calculating Pi
printf ( "The current date/time is: %s", asctime (timeinfo) ); /* initialize random numbers */ srand(SEED);#pragma omp parallel for private(x,y,z,tid) reduction(+:count) for ( i=0; i<niter; i++) { x = (double)rand()/RAND_MAX; y = (double)rand()/RAND_MAX; z = (x*x+y*y); if (z<=1) count++; if (i==(niter/6)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } if (i==(niter/3)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } if (i==(niter/2)-1) { tid = omp_get_thread_num(); printf(" thread %i just did iteration %i the count is %i\n",tid,i,count); } http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Initialize the random number generator: srand() seeds the sequence of random numbers later produced by rand().
Calculating Pi
        if (i == (2*niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (5*niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == niter-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    printf("The current date/time is: %s", asctime(timeinfo));
    printf(" the total count is %i\n", count);
    pi = (double)count / niter * 4;
    printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
    return 0;
}
Demo: OpenMP Pi
[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
 thread 0 just did iteration 16665 the count is 13124
 thread 1 just did iteration 33332 the count is 6514
 thread 1 just did iteration 49999 the count is 19609
 thread 2 just did iteration 66665 the count is 13048
 thread 3 just did iteration 83332 the count is 6445
 thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
 the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$
MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status); if (request) {
for (i = 0; i < CHUNKSIZE; ) { rands[i] = random(); if (rands[i] <= INT_MAX) i++; }/* Send random number array*/MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world); }
request = 1; done = in = out = 0; max = INT_MAX; /* max int, for normalization */ MPI_Send( &request, 1, MPI_INT, server, REQUEST, world ); MPI_Comm_rank( workers, &workerid ); iter = 0;
Broadcast error bounds: epsilon

Create a custom communicator

Server process: (1) receives a request to generate a random number array, (2) computes the array, (3) sends it to the requestor

Worker process: requests that the server generate a random number array
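A hedged sketch of how a worker might consume the random array to accumulate hit counts (an assumption: this mirrors the classic MPICH "monte" pi example the fragments appear to come from; variable names follow the fragments and callouts above):

    while (!done) {
        iter++;
        /* receive a fresh chunk of random integers from the server */
        MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
        for (i = 0; i < CHUNKSIZE - 1; ) {
            x = (((double) rands[i++]) / max) * 2 - 1;   /* map to [-1, 1] */
            y = (((double) rands[i++]) / max) * 2 - 1;
            if (x * x + y * y < 1.0)
                in++;        /* point fell inside the unit circle */
            else
                out++;
        }
        /* combine counts across all workers over the custom communicator */
        MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
        MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
        Pi = (4.0 * totalin) / (totalin + totalout);
        /* stop when the estimate is within epsilon of pi (fabs needs <math.h>) */
        done = fabs(Pi - 3.141592653589793) < epsilon;
        request = done ? 0 : 1;   /* tell the server whether more randoms are needed */
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
    }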
MPI Dot Product
float Serial_dot(
        float x[] /* in */,
        float y[] /* in */,
        int   n   /* in */) {
    int i;
    float sum = 0.0;
    for (i = 0; i < n; i++)
        sum = sum + x[i] * y[i];
    return sum;
} /* Serial_dot */

float Parallel_dot(
        float local_x[] /* in */,
        float local_y[] /* in */,
        int   n_bar     /* in */) {
    float local_dot;
    float dot = 0.0;
    local_dot = Serial_dot(local_x, local_y, n_bar);
    MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return dot;
} /* Parallel_dot */
Serial_dot(): calculates the dot product of the local arrays

Parallel_dot(): calls Serial_dot() to compute the dot product of the local workload, then sums the local results with a collective MPI_Reduce call (MPI_SUM)
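A hypothetical driver showing how Parallel_dot() might be used (not from the slides; assumes a block distribution where p evenly divides n and n/p fits in MAX_LOCAL):

#define MAX_LOCAL 1024   /* hypothetical bound on the per-process block size */

/* Scatter x and y from process 0 in equal blocks, then form the global
   dot product; the returned value is valid on process 0. */
float dot_driver(float x[], float y[], int n, int p)
{
    float local_x[MAX_LOCAL], local_y[MAX_LOCAL];
    int   n_bar = n / p;

    MPI_Scatter(x, n_bar, MPI_FLOAT, local_x, n_bar, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    MPI_Scatter(y, n_bar, MPI_FLOAT, local_y, n_bar, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    return Parallel_dot(local_x, local_y, n_bar);
}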
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
Implementing Matrix Multiplication: Sequential Code
Assume throughout that the matrices are square (n × n). The sequential code to compute A × B could simply be:
for (i = 0; i < n; i++)for (j = 0; j < n; j++) {
c[i][j] = 0;for (k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
This algorithm requires n³ multiplications and n³ additions, for a sequential time complexity of O(n³). Very easy to parallelize.
Implementing Matrix Multiplication
• With n processors (and n × n matrices): time complexity of O(n²)
  – Each instance of the inner loop is independent and can be done by a separate processor
• With n² processors: time complexity of O(n)
  – One element of A and B assigned to each processor
  – Cost-optimal, since O(n³) = n × O(n²) = n² × O(n)
• With n³ processors: time complexity of O(log n)
  – By parallelizing the inner loop
  – Not cost-optimal, since O(n³) < n³ × O(log n)
• O(log n) is the lower bound for parallel matrix multiplication
Block Matrix Multiplication
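The slide's illustration is not reproduced here; as a rough sketch of the blocked algorithm (an assumption: the standard partition into s × s blocks of size m = n/s, with n divisible by s):

/* Multiply n x n matrices block by block: C[p][q] += A[p][r] * B[r][q],
   where each block is m x m with m = n/s. Illustrative only. */
for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        /* zero out block C[p][q] */
        for (i = p * m; i < (p + 1) * m; i++)
            for (j = q * m; j < (q + 1) * m; j++)
                c[i][j] = 0;
        /* accumulate the products of block A[p][r] and block B[r][q] */
        for (r = 0; r < s; r++)
            for (i = p * m; i < (p + 1) * m; i++)
                for (j = q * m; j < (q + 1) * m; j++)
                    for (k = r * m; k < (r + 1) * m; k++)
                        c[i][j] += a[i][k] * b[k][j];
    }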
Performance Improvement
Using tree construction, n numbers can be added in O(log n) steps (using n³ processors):
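A minimal sketch of this pairwise (tree) summation in OpenMP-style C (illustrative, not from the slides):

#include <omp.h>

/* Tree summation: log2(n) halving steps; each step's additions are
   independent, so with enough processors every step costs O(1) and the
   whole sum takes O(log n) steps. Assumes n is a power of two. */
double tree_sum(double x[], int n)
{
    int i, stride;
    for (stride = n / 2; stride >= 1; stride /= 2) {
#pragma omp parallel for private(i)
        for (i = 0; i < stride; i++)
            x[i] += x[i + stride];
    }
    return x[0];
}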
for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_A; j++) Matrix_A[i][j] = i + j; } /* Matrix_B Elements */ for (i = 0; i < NoofRows_B; i++) { for (j = 0; j < NoofCols_B; j++) Matrix_B[i][j] = i + j; } printf("The Matrix_A Is \n");
Initialize the two matrices A[][] and B[][] with the sum of their index values
OpenMP Matrix Multiplication
for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_A; j++) printf("%f \t", Matrix_A[i][j]); printf("\n"); } printf("The Matrix_B Is \n"); for (i = 0; i < NoofRows_B; i++) { for (j = 0; j < NoofCols_B; j++) printf("%f \t", Matrix_B[i][j]); printf("\n"); } for (i = 0; i < NoofRows_A; i++) { for (j = 0; j < NoofCols_B; j++) { Result[i][j] = 0.0; } }#pragma omp parallel for private(j,k) for (i = 0; i < NoofRows_A; i = i + 1) for (j = 0; j < NoofCols_B; j = j + 1) for (k = 0; k < NoofCols_A; k = k + 1) Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j]; printf("\nThe Matrix Computation Result Is \n");
Initialize the result matrix to 0.0

Print the matrices for debugging purposes

Using the OpenMP parallel for directive, calculate the product of the two matrices; load balancing is determined by the OpenMP environment variables and the number of threads
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4           /* number of rows in matrix A */
#define NCA 4           /* number of columns in matrix A */
#define NCB 4           /* number of columns in matrix B */
#define MASTER 0        /* taskid of first task */
#define FROM_MASTER 1   /* setting a message type */
#define FROM_WORKER 2   /* setting a message type */

int main(int argc, char *argv[])
{
    int numtasks,              /* number of tasks in partition */
        taskid,                /* a task identifier */
        numworkers,            /* number of worker tasks */
        source,                /* task id of message source */
        dest,                  /* task id of message destination */
        mtype,                 /* message type */
        rows,                  /* rows of matrix A sent to each worker */
        averow, extra, offset, /* used to determine rows sent to each worker */
        i, j, k, rc;           /* misc */
    double a[NRA][NCA],        /* matrix A to be multiplied */
           b[NCA][NCB],        /* matrix B to be multiplied */
           c[NRA][NCB];        /* result matrix C */
    MPI_Status status;

    /* MPI setup as in the cited mpi_mm.c */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
Matrix Multiplication (source code)

    if (numtasks < 2) {
        printf("Need at least two MPI tasks. Quitting...\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
        exit(1);
    }
    numworkers = numtasks - 1;

    if (taskid == MASTER) {
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++) {
                a[i][j] = i + j + 1;
                b[i][j] = i + j + 1;
            }
        printf("Matrix A :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", a[i][j]);
        }
        printf("Matrix B :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", b[i][j]);
        }
        averow = NRA / numworkers;
        extra = NRA % numworkers;
        offset = 0;
        mtype = FROM_MASTER;
Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
MASTER: initialize matrices A and B

Print the two matrices for debugging purposes

Calculate the number of rows to be processed by each worker

Calculate the number of leftover (overflow) rows; one extra row goes to each of the first workers
Matrix Multiplication (source code)

        /* To each worker send: start point, number of rows to process,
           and sub-arrays to process */
        for (dest = 1; dest <= numworkers; dest++) {
            rows = (dest <= extra) ? averow + 1 : averow;
            printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
            MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&a[offset][0], rows * NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&b, NCA * NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            offset = offset + rows;
        }
        /* Receive results from worker tasks */
        mtype = FROM_WORKER;   /* message tag for messages sent by workers */
        for (i = 1; i <= numworkers; i++) {
            source = i;
            /* offset stores the (processing) starting point of the work chunk */
            MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows * NCB, MPI_DOUBLE, source, mtype,
                     MPI_COMM_WORLD, &status);
            printf("Received results from task %d\n", source);
        }
        printf("******************************************************\n");
        printf("Result Matrix:\n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", c[i][j]);
        }
        printf("\n******************************************************\n");
        printf("Done.\n");
    }
MASTER: send a workload chunk to each of the workers

MASTER: receive the computed chunks from the workers; c[][] holds the matrix product rows calculated by the corresponding workers
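The worker side of the program is not shown on these slides; a sketch following the structure of the cited mpi_mm.c example (the else branch pairs with the master's if (taskid == MASTER) block above):

    else {   /* worker task: receive a block of rows, multiply, send back */
        mtype = FROM_MASTER;
        MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows * NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, NCA * NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < NCA; j++)
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }

        mtype = FROM_WORKER;
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&c, rows * NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
    }
    MPI_Finalize();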