Design of parallel algorithms: Matrix operations. J. Porras. Jan 08, 2016.
Page 1: Design of parallel algorithms

Design of parallel algorithms

Matrix operations

J. Porras

Page 2: Design of parallel algorithms

Matrix x vector

• Sequential approach MAT_VECT(A,x,y)

for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
        y[i] = y[i] + A[i,j] * x[j];
    }
}

• Work = Θ(n²)

Page 3: Design of parallel algorithms

Parallelization of matrix operations
Matrix x vector

• Three ways to implement
– rowwise striping
– columnwise striping
– checkerboarding

• DRAW each of these approaches !

Page 4: Design of parallel algorithms

Rowwise striping

• N x N matrix is distributed onto n processors (one row each)

• N x 1 vector is distributed into n processors (one element each)

• All processors need the whole vector so all-to-all broadcast is required
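The rowwise scheme can be sketched serially in C. This is a toy model, not message-passing code: the all-to-all broadcast is simulated by gathering every "processor's" vector element into a full copy of x; names like x_owned are illustrative.

```c
#define N 4   /* toy size: n = 4 rows, one per simulated processor */

/* Serial sketch of rowwise striping: "processor" i owns row i of A and
 * element i of x. The all-to-all broadcast is modeled by gathering all
 * owned elements into a full copy of x before the local multiply. */
void mat_vect_rowwise(double A[N][N], double x_owned[N], double y[N]) {
    double x_full[N];
    for (int p = 0; p < N; p++)        /* all-to-all broadcast */
        x_full[p] = x_owned[p];
    for (int i = 0; i < N; i++) {      /* each processor's local work */
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x_full[j];
    }
}
```

In a real implementation the gather loop would be a collective operation and the outer i loop would run concurrently, one iteration per processor.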

Page 5: Design of parallel algorithms
Page 6: Design of parallel algorithms

Rowwise striping

• All-to-all broadcast requires Θ(n) time.

• One row takes Θ(n) time for multiplications

• Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²).
– Algorithm is cost-optimal

Page 7: Design of parallel algorithms

Block striping

• Assume that p < n and the matrix is partitioned by using block striping

• All processors contain n/p rows and n/p elements of the vector

• All processors require the whole vector thus all-to-all broadcast is required (message size n/p)

Page 8: Design of parallel algorithms

Block striping in hypercube

• all-to-all broadcast in hypercube with n/p-sized message takes

ts log p + tw (n/p)(p - 1)

• If p is considered large enough:

ts log p + tw n

• Multiplication requires n²/p time (n/p rows to multiply with the vector)

Page 9: Design of parallel algorithms

Block striping in hypercube

• Parallel execution time TP = n²/p + ts log p + tw n

• Cost pTP = n² + ts p log p + tw n p

• Algorithm is cost-optimal if

p = O(n)

Page 10: Design of parallel algorithms

Block striping in mesh

• All-to-all broadcast in a mesh with wraparounds takes 2ts(√p - 1) + tw (n/p)(p - 1)

• Parallel execution requires TP = n²/p + 2ts(√p - 1) + tw n

Page 11: Design of parallel algorithms

Scalability of block striping

• Overhead (T0 = pTP - W)

T0 = ts p log p + tw n p

• Isoefficiency (W = K T0) for hypercube

W = K ts p log p

W = K tw n p

• Since W = n²: n = K tw p, so W = K² tw² p²

Page 12: Design of parallel algorithms

Scalability of block striping

• Because p = O(n): n = Ω(p), n² = Ω(p²), and thus W = Ω(p²)

• The equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency

Page 13: Design of parallel algorithms

Scalability of block striping

• Isoefficiency in hypercube is Θ(p²).

• Similar analysis can be done for the mesh architecture, and it gives the same value Θ(p²).

• Thus, with striped partitioning, the algorithm is no more scalable on a hypercube than on a mesh

Page 14: Design of parallel algorithms

Checkerboard

• N x N matrix is partitioned into N² processors (one element per processor)

• N x 1 vector is located on the last column (or on a diagonal)

• Vector is distributed into the corresponding processors

• Calculate multiplications in parallel and collect results with single-node accumulation into the last processor

Page 15: Design of parallel algorithms
Page 16: Design of parallel algorithms
Page 17: Design of parallel algorithms

Checkerboard

• Three communication steps are required
– One-to-one communication to send the vector onto the diagonal
– One-to-all broadcast to distribute the elements of the vector
– Single-node accumulation to sum the partial results

Page 18: Design of parallel algorithms

Checkerboard

• Mesh requires Θ(n) time for all the operations (SF) and hypercube Θ(log n)

• Multiplication happens in constant time

• Parallel execution time is Θ(n) in mesh and Θ(log n) in hypercube architecture

• Cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube

• Algorithms are not cost-optimal

Page 19: Design of parallel algorithms

Checkerboard, p < n²

• Cost-optimality can be achieved if the granularity is increased

• Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix

• Similarly, each processor stores n/√p elements of the vector

Page 20: Design of parallel algorithms

Checkerboard, p < n²

• Vector elements are sent to the diagonal

• Vector elements are distributed to the other processors

• Each processor performs n²/p multiplications, producing n/√p partial sums

• Partial sums are collected with single node accumulation
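The four steps above can be sketched serially in C. A toy model: CQ plays the role of √p, with a 4 x 4 matrix on a 2 x 2 grid of simulated processors; the communication steps are replaced by direct array accesses.

```c
#define CN 4            /* n */
#define CQ 2            /* sqrt(p): CQ x CQ processor grid */
#define CB (CN / CQ)    /* block size n / sqrt(p) */

/* Serial sketch of the coarse-grain checkerboard scheme: processor (r,c)
 * owns the CB x CB block of A starting at (r*CB, c*CB) and, after the
 * broadcast steps, elements x[c*CB .. c*CB+CB-1]. Each processor forms
 * CB partial sums; single-node accumulation is modeled by the final sum
 * over c. */
void mat_vect_checkerboard(double A[CN][CN], double x[CN], double y[CN]) {
    double partial[CQ][CQ][CB];
    for (int r = 0; r < CQ; r++)
        for (int c = 0; c < CQ; c++)
            for (int i = 0; i < CB; i++) {
                partial[r][c][i] = 0.0;
                for (int j = 0; j < CB; j++)     /* n^2/p local multiplies */
                    partial[r][c][i] += A[r*CB + i][c*CB + j] * x[c*CB + j];
            }
    for (int r = 0; r < CQ; r++)
        for (int i = 0; i < CB; i++) {
            y[r*CB + i] = 0.0;
            for (int c = 0; c < CQ; c++)         /* single-node accumulation */
                y[r*CB + i] += partial[r][c][i];
        }
}
```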

Page 21: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Assume that the processors are connected in a two-dimensional √p x √p mesh with cut-through routing (no wraparounds)

• Send to the diagonal takes

ts + tw n/√p + th √p

• One-to-all broadcast in columns takes

(ts + tw n/√p) log √p + th √p

Page 22: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Single-node accumulation takes

(ts + tw n/√p) log √p + th √p

• Multiplications in each processor take n²/p time.

• Thus

TP = n²/p + ts log p + (tw n/√p) log p + 3 th √p

• T0 = pTP - W gives for the overhead:

T0 = ts p log p + tw n √p log p + 3 th p^(3/2)

Page 23: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Isoefficiency for ts:

W = K ts p log p

• Isoefficiency for tw:

W = n² = K tw n √p log p
n = K tw √p log p
n² = K² tw² p log² p
W = K² tw² p log² p

• Isoefficiency for th:

W = 3 K th p^(3/2)

Page 24: Design of parallel algorithms

Scalability of checkerboard, p < n²

• If p = O(n²): n² = Ω(p), so W = Ω(p)

• tw and th dominate ts

Page 25: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Concentrate on th: Θ(p^(3/2)), and tw: Θ(p log² p)

• Because p^(3/2) > p log² p only for p > 65536, either of the terms can dominate

• Assume that the term Θ(p log² p) dominates

Page 26: Design of parallel algorithms

Scalability of checkerboard, p < n²

• The maximum number of processors that can be used cost-optimally for problem size W is determined by

p log² p = O(n²)

log p + 2 log log p = O(log n)

log p = O(log n)

Page 27: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Substitute log n for log p:

p log² n = O(n²), so p = O(n² / log² n)

• This p gives the upper limit for the number of processors that can be used cost-optimally

Page 28: Design of parallel algorithms

SF and CT

• Parallel execution takes n²/p + 2 ts √p + 3 tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw)

• CT routing performs much better

• Note that this is true for cases with several elements per processor

• HOW about fine-grain case ?

Page 29: Design of parallel algorithms

Striped and checkerboard

• Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors

• If p > n, the striped approach is not applicable

• How about the effect of architecture ?

• Scalability ?

• Isoefficiency ?

Page 30: Design of parallel algorithms

Sequential matrix multiplication

• Procedure MAT_MULT(A,B,C)

for i := 0 to n-1 do
  for j := 0 to n-1 do
    C[i,j] := 0;
    for k := 0 to n-1 do
      C[i,j] := C[i,j] + A[i,k] * B[k,j]

• Θ(n³) work (Strassen's algorithm has better complexity)
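A runnable C version of the procedure, with the size passed as a parameter (C99 variable-length array parameters assumed):

```c
/* Triple-loop matrix multiply, Theta(n^3) multiply-adds, as in MAT_MULT. */
void mat_mult(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
```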

Page 31: Design of parallel algorithms

Block approach

• n/q x n/q submatrices

• Procedure BLOCK_MAT_MULT(A,B,C)

for i := 0 to q-1 do
  for j := 0 to q-1 do
    Initialize Ci,j to zero
    for k := 0 to q-1 do
      Ci,j := Ci,j + Ai,k Bk,j

• Same complexity Θ(n³)

Page 32: Design of parallel algorithms

Simple parallel approach

• Matrices A and B partitioned into p blocks of size (n/√p) x (n/√p)

• Map onto a √p x √p mesh

• Processors P0,0 ... P√p-1,√p-1

• Pi,j stores Ai,j and Bi,j and computes Ci,j

• Ci,j requires Ai,k and Bk,j

• A needs to communicate within rows • B communicates within columns

Page 33: Design of parallel algorithms

Performance on hypercube

• Requires 2 broadcasts (rows and columns)

• message size n2/p

• tc = 2(ts log √p + tw (n²/p)(√p - 1))

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + ts log p + 2 tw n²/√p,  p » 1

Page 34: Design of parallel algorithms

Performance on mesh

• Store-and-forward routing

• tc = 2(ts √p + tw n²/√p)

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + 2 ts √p + 2 tw n²/√p

Page 35: Design of parallel algorithms

Cannon's algorithm

• Partition into blocks as usual

• Processors P0,0 ... P√p-1,√p-1

• Pi,j contains Ai,j and Bi,j

• Rotate blocks!

• A blocks to the left

• B blocks upwards
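The alignment and rotation steps can be sketched serially with one element per "processor" (a KQ x KQ grid, so block size 1). An illustration of the data movement, not a parallel implementation:

```c
#include <string.h>

#define KQ 3   /* KQ x KQ process grid, one matrix element per "processor" */

/* Serial sketch of Cannon's algorithm with block size 1. Alignment: row i
 * of A rotates left by i, column j of B rotates up by j; then KQ steps of
 * local multiply followed by a one-step rotation of A left and B up. */
void cannon(double A[KQ][KQ], double B[KQ][KQ], double C[KQ][KQ]) {
    double a[KQ][KQ], b[KQ][KQ], ta[KQ][KQ], tb[KQ][KQ];
    for (int i = 0; i < KQ; i++)
        for (int j = 0; j < KQ; j++) {
            a[i][j] = A[i][(j + i) % KQ];      /* initial alignment of A */
            b[i][j] = B[(i + j) % KQ][j];      /* initial alignment of B */
            C[i][j] = 0.0;
        }
    for (int step = 0; step < KQ; step++) {
        for (int i = 0; i < KQ; i++)
            for (int j = 0; j < KQ; j++)
                C[i][j] += a[i][j] * b[i][j];  /* local multiply */
        for (int i = 0; i < KQ; i++)
            for (int j = 0; j < KQ; j++) {
                ta[i][j] = a[i][(j + 1) % KQ]; /* rotate A blocks left */
                tb[i][j] = b[(i + 1) % KQ][j]; /* rotate B blocks up   */
            }
        memcpy(a, ta, sizeof a);
        memcpy(b, tb, sizeof b);
    }
}
```

After the initial alignment, processor (i,j) holds A[i][(i+j+s) mod KQ] and B[(i+j+s) mod KQ][j] at step s, so the KQ local products sum to exactly C[i][j].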

Page 36: Design of parallel algorithms
Page 37: Design of parallel algorithms
Page 38: Design of parallel algorithms
Page 39: Design of parallel algorithms

Fox’s algorithm

• Partition into blocks as usual

• Pi,j contains Ai,j and Bi,j

• Uses one-to-all broadcasts, √p iterations

• (1) broadcast the selected block of A to the row

• (2) multiply by B

• (3) send B upwards

• (4) select Ai,(j+1) mod √p
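The four steps above can be sketched serially with one element per "processor" (an FQ x FQ grid, block size 1); again an illustration of the data movement, not a parallel implementation:

```c
#include <string.h>

#define FQ 3   /* FQ x FQ process grid, one matrix element per "processor" */

/* Serial sketch of Fox's algorithm with block size 1. At step s, row i
 * broadcasts A[i][(i+s) % FQ] to the whole row; every processor multiplies
 * the broadcast value by its current B element, then B moves one step up. */
void fox(double A[FQ][FQ], double B[FQ][FQ], double C[FQ][FQ]) {
    double b[FQ][FQ], tb[FQ][FQ];
    memcpy(b, B, sizeof b);
    for (int i = 0; i < FQ; i++)
        for (int j = 0; j < FQ; j++)
            C[i][j] = 0.0;
    for (int s = 0; s < FQ; s++) {
        for (int i = 0; i < FQ; i++) {
            double abr = A[i][(i + s) % FQ];   /* (1) broadcast in row i */
            for (int j = 0; j < FQ; j++)
                C[i][j] += abr * b[i][j];      /* (2) multiply by B */
        }
        for (int i = 0; i < FQ; i++)
            for (int j = 0; j < FQ; j++)
                tb[i][j] = b[(i + 1) % FQ][j]; /* (3) send B upwards */
        memcpy(b, tb, sizeof b);
    }
}
```

At step s, processor (i,j) holds B[(i+s) mod FQ][j] and receives A[i][(i+s) mod FQ], so over the FQ steps each C[i][j] accumulates the full inner product.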

Page 40: Design of parallel algorithms
Page 41: Design of parallel algorithms
Page 42: Design of parallel algorithms

DNS

• Dekel, Nassimi and Sahni

• n³ processors available

• use 3D structure

• Pi,j,k computes A[i,k] x B[k,j]

• C[i,j] = Pi,j,0 + ... + Pi,j,n-1

Θ(log n) time

Page 43: Design of parallel algorithms

DNS for hypercube

• The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors

• Processor Pi,j,0 contains A[i,j] and B[i,j]

• 3 steps

• (1) move A & B to correct plane

• (2) replicate on each plane

• (3) single node accumulation
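The DNS idea can be sketched serially (DN plays the role of n, a power of two): every "processor" (i,j,k) forms one product, and the single-node accumulation along the k dimension is modeled as a log2(n)-step tree reduction.

```c
#define DN 4   /* n: matrix dimension, a power of two */

/* Serial sketch of DNS with n^3 "processors": processor (i,j,k) forms
 * A[i][k]*B[k][j]; partial products are then combined along the k
 * dimension by a log2(n)-step tree reduction into plane k = 0. */
void dns(double A[DN][DN], double B[DN][DN], double C[DN][DN]) {
    double P[DN][DN][DN];
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++)
            for (int k = 0; k < DN; k++)
                P[i][j][k] = A[i][k] * B[k][j];      /* one product each */
    for (int stride = 1; stride < DN; stride *= 2)   /* log2(n) rounds */
        for (int i = 0; i < DN; i++)
            for (int j = 0; j < DN; j++)
                for (int k = 0; k + stride < DN; k += 2 * stride)
                    P[i][j][k] += P[i][j][k + stride];
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++)
            C[i][j] = P[i][j][0];
}
```

The reduction loop runs log2(DN) rounds, which is where the Θ(log n) running time of the fine-grain DNS algorithm comes from.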

Page 44: Design of parallel algorithms
Page 45: Design of parallel algorithms
Page 46: Design of parallel algorithms

DNS with < n³ processors

• Processors p = q³, q < n

• Partition matrices into (n/q) x (n/q) blocks

• Matrices contain q x q submatrices

• Since 1 ≤ q ≤ n, p ranges from 1 to n³