Introduction to Parallel Programming
• Language notation: message passing
• Distributed-memory machine
– (e.g., workstations on a network)
• 5 parallel algorithms of increasing complexity:
– Matrix multiplication
– Successive overrelaxation
– All-pairs shortest paths
– Linear equations
– Search problem
Message Passing
• SEND (destination, message)
– blocking: wait until message has arrived (like a fax)
– non-blocking: continue immediately (like a mailbox)
• RECEIVE (source, message)
• RECEIVE-FROM-ANY (message)
– blocking: wait until a message is available
– non-blocking: test if a message is available
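The blocking/non-blocking distinction can be sketched with Python's standard thread-safe queue; the SEND/RECEIVE names mirror the pseudo-code primitives above and are not any real message-passing library.

```python
import queue

mailbox = queue.Queue()            # one mailbox per destination

def SEND(q, message):
    q.put(message)                 # like the non-blocking "mailbox" SEND

def RECEIVE(q):
    return q.get()                 # blocking: waits until a message arrives

def NONBLOCKING_RECEIVE(q):
    try:
        return q.get_nowait()      # non-blocking: just test for a message
    except queue.Empty:
        return None

print(NONBLOCKING_RECEIVE(mailbox))   # -> None (nothing sent yet)
SEND(mailbox, "hello")
print(RECEIVE(mailbox))               # -> hello
```

In a real distributed-memory setting the two sides would run on different machines; the queue stands in for the network.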
Syntax
• Use pseudo-code with C-like syntax
• Use indentation instead of { ..} to indicate block structure
• Arrays can have user-defined index ranges
• Default: start at 1
– int A[10:100] runs from 10 to 100
– int A[N] runs from 1 to N
• Use array slices (sub-arrays)
– A[i..j] = elements A[i] to A[j]
– A[i, *] = elements A[i, 1] to A[i, N] i.e. row i of matrix A
– A[*, k] = elements A[1, k] to A[N, k] i.e. column k of A
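The slice notation has a rough analogue in Python, assuming a matrix stored as a list of lists; note that Python indices are 0-based and slices exclude the upper bound, unlike the inclusive 1-based ranges on the slide.

```python
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

row_i   = A[1]                      # A[i, *] -> row i     (slide index i = 2)
col_k   = [row[0] for row in A]     # A[*, k] -> column k  (slide index k = 1)
segment = A[0][0:2]                 # A[i..j] -> elements A[i] .. A[j] of a row

print(row_i, col_k, segment)        # -> [4, 5, 6] [1, 4, 7] [1, 2]
```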
Parallel Matrix Multiplication
• Given two N x N matrices A and B
• Compute C = A x B
• C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]
Sequential Matrix Multiplication
for (i = 1; i <= N; i++)
  for (j = 1; j <= N; j++)
    C[i,j] = 0;
    for (k = 1; k <= N; k++)
      C[i,j] += A[i,k] * B[k,j];
The order of the operations is overspecified
Everything can be computed in parallel
Parallel Algorithm 1
Each processor computes 1 element of C
Requires N² processors
Each processor needs 1 row of A and 1 column of B
Structure
• Master distributes the work and receives the results
• Slaves get work and execute it
• Slaves are numbered consecutively from 1 to P
• How to start up master/slave processes depends on Operating System (not discussed here)
(Figure: the master exchanges messages with slaves 1 .. N²; e.g. slave 1 receives A[1,*] and B[*,1] and returns C[1,1], and slave N² receives A[N,*] and B[*,N] and returns C[N,N])
Master (processor 0):

int proc = 1;
for (i = 1; i <= N; i++)
  for (j = 1; j <= N; j++)
    SEND(proc, A[i,*], B[*,j], i, j);
    proc++;
for (x = 1; x <= N*N; x++)
  RECEIVE-FROM-ANY(result, i, j);
  C[i,j] = result;
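A serial Python simulation of Algorithm 1 may make the message pattern concrete: the "master" hands each "slave" one row of A and one column of B, and the slave returns a single dot product. SEND/RECEIVE are modelled as plain function calls here, so this is a sketch of the data flow, not of real concurrency.

```python
def slave(row, col):
    # One slave computes one element of C: the dot product of its row and column.
    return sum(a * b for a, b in zip(row, col))

def master(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            col = [B[k][j] for k in range(n)]   # B[*, j]
            C[i][j] = slave(A[i], col)          # SEND work, RECEIVE result
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(master(A, B))    # -> [[19, 22], [43, 50]]
```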
During iteration k, CPUs P(k+1) … P(n-1) need part of row k
This row is stored on CPU P(k)
-> need partial broadcast (multicast)
Communication
Performance problems
• Communication overhead (multicast)
• Load imbalance:
– CPUs P(0) … P(k) are idle during iteration k
– Bad load balance means bad speedups, as some CPUs have too much work
• In general, number of CPUs is less than n
Choice between block-striped & cyclic-striped distribution
• Block-striped distribution has high load-imbalance
• Cyclic-striped distribution has less load-imbalance
Block-striped distribution
• CPU 0 gets first N/2 rows
• CPU 1 gets last N/2 rows
• CPU 0 has much less work to do
• CPU 1 becomes the bottleneck
Cyclic-striped distribution
• CPU 0 gets odd rows
• CPU 1 gets even rows
• CPU 0 and 1 have more or less the same amount of work
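A toy calculation illustrates the two distributions. Assume, as in the elimination slides, that row r stays active for roughly r iterations, so its owner spends about r units of work on it; with N = 8 rows and 2 CPUs the imbalance is easy to see. The numbers are illustrative, not from the original slides.

```python
N = 8
work = lambda rows: sum(rows)       # total work for a set of row indices

block_cpu0  = work(range(1, N // 2 + 1))        # first N/2 rows: 1..4
block_cpu1  = work(range(N // 2 + 1, N + 1))    # last N/2 rows:  5..8
cyclic_cpu0 = work(range(1, N + 1, 2))          # odd rows:  1, 3, 5, 7
cyclic_cpu1 = work(range(2, N + 1, 2))          # even rows: 2, 4, 6, 8

print(block_cpu0, block_cpu1)     # -> 10 26  (CPU 1 is the bottleneck)
print(cyclic_cpu0, cyclic_cpu1)   # -> 16 20  (nearly balanced)
```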
A Search Problem
Given an array A[1..N] and an item x, check if x is present in A
int present = false;
for (i = 1; !present && i <= N; i++)
  if (A[i] == x)
    present = true;
Don’t know in advance which data we need to access
Parallel Search on 2 CPUs
int lb, ub;      /* lower and upper bound of this CPU's part of A */
int A[lb:ub];

for (i = lb; i <= ub; i++)
  if (A[i] == x)
    print("Found item");
    SEND(1-cpuid);   /* send other CPU empty message */
    exit();
  /* check message from other CPU: */
  if (NONBLOCKING_RECEIVE(1-cpuid))
    exit();
Performance Analysis
How much faster is the parallel program than the sequential program for N=100 ?
1. if x not present => factor 2
2. if x present in A[1 .. 50] => factor 1
3. if A[51] = x => factor 51
4. if A[75] = x => factor 3
In case 2 the parallel program does more work than the sequential program => search overhead
In cases 3 and 4 the parallel program does less work => negative search overhead
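The four factors can be checked numerically, assuming one array probe per time step and the 2-CPU split A[1..50] / A[51..100] from the parallel-search slide; the helper names seq_steps and par_steps are introduced here for illustration.

```python
N = 100

def seq_steps(pos):
    """Probes by the sequential search; pos is x's index, or None if absent."""
    return pos if pos is not None else N

def par_steps(pos):
    """Parallel time: CPU 0 scans A[1..50], CPU 1 scans A[51..100]."""
    if pos is None:
        return N // 2               # both halves scanned fully, in parallel
    return pos if pos <= N // 2 else pos - N // 2

for pos, factor in [(None, 2), (30, 1), (51, 51), (75, 3)]:
    assert seq_steps(pos) / par_steps(pos) == factor
print("all four factors confirmed")
```

Case 2 shows the search overhead directly: both programs take pos steps, but the parallel one has kept a second CPU busy scanning A[51..100] for nothing.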
Discussion
Several kinds of performance overhead
• Communication overhead: communication/computation ratio must be low
• Load imbalance: all processors must do the same amount of work