Design of parallel algorithms: Matrix operations. J. Porras. Jan 08, 2016.
Page 1: Design of parallel algorithms

Design of parallel algorithms

Matrix operations

J. Porras

Page 2: Design of parallel algorithms

Matrix x vector

• Sequential approach MAT_VECT(A,x,y)

for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
        y[i] = y[i] + A[i,j] * x[j];
    }
}

• Work = Θ(n²)

Page 3: Design of parallel algorithms

Parallelization of matrix operations
Matrix x vector

• Three ways to implement
– rowwise striping
– columnwise striping
– checkerboarding

• DRAW each of these approaches !

Page 4: Design of parallel algorithms

Rowwise striping

• N x N matrix is distributed onto n processors (one row each)

• N x 1 vector is distributed into n processors (one element each)

• All processors need the whole vector so all-to-all broadcast is required
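The rowwise scheme can be sketched serially in C. This is a toy model, not message-passing code: the all-to-all broadcast is simulated by gathering every "processor's" vector element into a full copy of x; names like x_owned are illustrative.

```c
#define N 4   /* toy size: n = 4 rows, one per simulated processor */

/* Serial sketch of rowwise striping: "processor" i owns row i of A and
 * element i of x. The all-to-all broadcast is modeled by gathering all
 * owned elements into a full copy of x before the local multiply. */
void mat_vect_rowwise(double A[N][N], double x_owned[N], double y[N]) {
    double x_full[N];
    for (int p = 0; p < N; p++)        /* all-to-all broadcast */
        x_full[p] = x_owned[p];
    for (int i = 0; i < N; i++) {      /* each processor's local work */
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x_full[j];
    }
}
```

In a real implementation the gather loop would be a collective operation and the outer i loop would run concurrently, one iteration per processor.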

Page 5: Design of parallel algorithms
Page 6: Design of parallel algorithms

Rowwise striping

• All-to-all broadcast requires Θ(n) time.

• One row takes Θ(n) time for multiplications

• Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²).
– Algorithm is cost-optimal

Page 7: Design of parallel algorithms

Block striping

• Assume that p < n and the matrix is partitioned by using block striping

• All processors contain n/p rows and n/p elements of the vector

• All processors require the whole vector thus all-to-all broadcast is required (message size n/p)

Page 8: Design of parallel algorithms

Block striping in hypercube

• all-to-all broadcast in hypercube with n/p-sized message takes

ts log p + tw (n/p)(p - 1)

• If p is considered large enough:

ts log p + tw n

• Multiplication requires n²/p time (n/p rows to multiply with the vector)

Page 9: Design of parallel algorithms

Block striping in hypercube

• Parallel execution time TP = n²/p + ts log p + tw n

• Cost pTP = n² + ts p log p + tw n p

• Algorithm is cost-optimal if

p = O(n)

Page 10: Design of parallel algorithms

Block striping in mesh

• All-to-all broadcast in a mesh with wraparounds takes 2ts(√p - 1) + tw (n/p)(p - 1)

• Parallel execution requires TP = n²/p + 2ts(√p - 1) + tw n

Page 11: Design of parallel algorithms

Scalability of block striping

• Overhead (T0 = pTP - W)

T0 = ts p log p + tw n p

• Isoefficiency (W = K T0) for hypercube

W = K ts p log p

W = K tw n p

• Since W = n²: n = K tw p, so W = K² tw² p²

Page 12: Design of parallel algorithms

Scalability of block striping

• Because p = O(n): n = Ω(p), n² = Ω(p²), and thus W = Ω(p²)

• The equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency

Page 13: Design of parallel algorithms

Scalability of block striping

• Isoefficiency in hypercube is Θ(p²).

• Similar analysis can be done for the mesh architecture, and it gives the same value Θ(p²).

• Thus, with striped partitioning, the algorithm is no more scalable on a hypercube than on a mesh

Page 14: Design of parallel algorithms

Checkerboard

• N x N matrix is partitioned into N² processors (one element per processor)

• N x 1 vector is located on the last column (or on a diagonal)

• Vector is distributed into the corresponding processors

• Calculate multiplications in parallel and collect results with single-node accumulation into the last processor

Page 15: Design of parallel algorithms
Page 16: Design of parallel algorithms
Page 17: Design of parallel algorithms

Checkerboard

• Three communication steps are required
– One-to-one communication to send the vector onto the diagonal
– One-to-all broadcast to distribute the elements of the vector
– Single-node accumulation to sum the partial results

Page 18: Design of parallel algorithms

Checkerboard

• Mesh requires Θ(n) time for all the operations (SF) and hypercube Θ(log n)

• Multiplication happens in constant time

• Parallel execution time is Θ(n) in mesh and Θ(log n) in hypercube architecture

• Cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube

• Algorithms are not cost-optimal

Page 19: Design of parallel algorithms

Checkerboard, p < n²

• Cost-optimality can be achieved if the granularity is increased

• Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix

• Similarly, each processor stores n/√p elements of the vector

Page 20: Design of parallel algorithms

Checkerboard, p < n²

• Vector elements are sent to the diagonal

• Vector elements are distributed to the other processors

• Each processor performs n²/p multiplications, producing n/√p partial sums

• Partial sums are collected with single node accumulation
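The four steps above can be sketched serially in C. A toy model: CQ plays the role of √p, with a 4 x 4 matrix on a 2 x 2 grid of simulated processors; the communication steps are replaced by direct array accesses.

```c
#define CN 4            /* n */
#define CQ 2            /* sqrt(p): CQ x CQ processor grid */
#define CB (CN / CQ)    /* block size n / sqrt(p) */

/* Serial sketch of the coarse-grain checkerboard scheme: processor (r,c)
 * owns the CB x CB block of A starting at (r*CB, c*CB) and, after the
 * broadcast steps, elements x[c*CB .. c*CB+CB-1]. Each processor forms
 * CB partial sums; single-node accumulation is modeled by the final sum
 * over c. */
void mat_vect_checkerboard(double A[CN][CN], double x[CN], double y[CN]) {
    double partial[CQ][CQ][CB];
    for (int r = 0; r < CQ; r++)
        for (int c = 0; c < CQ; c++)
            for (int i = 0; i < CB; i++) {
                partial[r][c][i] = 0.0;
                for (int j = 0; j < CB; j++)     /* n^2/p local multiplies */
                    partial[r][c][i] += A[r*CB + i][c*CB + j] * x[c*CB + j];
            }
    for (int r = 0; r < CQ; r++)
        for (int i = 0; i < CB; i++) {
            y[r*CB + i] = 0.0;
            for (int c = 0; c < CQ; c++)         /* single-node accumulation */
                y[r*CB + i] += partial[r][c][i];
        }
}
```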

Page 21: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Assume that the processors are connected in a two-dimensional √p x √p mesh with cut-through routing (no wraparounds)

• Send to the diagonal takes

ts + tw n/√p + th √p

• One-to-all broadcast in columns takes

(ts + tw n/√p) log √p + th √p

Page 22: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Single-node accumulation takes

(ts + tw n/√p) log √p + th √p

• Multiplications in each processor take n²/p time.

• Thus

TP = n²/p + ts log p + (tw n/√p) log p + 3 th √p

• T0 = pTP - W gives for the overhead:

T0 = ts p log p + tw n √p log p + 3 th p^(3/2)

Page 23: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Isoefficiency for ts:

W = K ts p log p

• Isoefficiency for tw:

W = n² = K tw n √p log p
n = K tw √p log p
n² = K² tw² p log² p
W = K² tw² p log² p

• Isoefficiency for th:

W = 3 K th p^(3/2)

Page 24: Design of parallel algorithms

Scalability of checkerboard, p < n²

• If p = O(n²): n² = Ω(p), so W = Ω(p)

• tw and th dominate ts

Page 25: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Concentrate on th: Θ(p^(3/2)), and tw: Θ(p log² p)

• Because p^(3/2) > p log² p only for p > 65536, either of the terms can dominate

• Assume that the term Θ(p log² p) dominates

Page 26: Design of parallel algorithms

Scalability of checkerboard, p < n²

• The maximum number of processors that can be used cost-optimally for problem size W is determined by

p log² p = O(n²)

log p + 2 log log p = O(log n)

log p = O(log n)

Page 27: Design of parallel algorithms

Scalability of checkerboard, p < n²

• Substitute log n for log p:

p log² n = O(n²), so p = O(n² / log² n)

• This p gives the upper limit for the number of processors that can be used cost-optimally

Page 28: Design of parallel algorithms

SF and CT

• Parallel execution takes n²/p + 2 ts √p + 3 tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw)

• CT routing performs much better

• Note that this is true for cases with several elements per processor

• HOW about fine-grain case ?

Page 29: Design of parallel algorithms

Striped and checkerboard

• Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors

• If p > n, the striped approach is not applicable

• How about the effect of architecture ?

• Scalability ?

• Isoefficiency ?

Page 30: Design of parallel algorithms

Sequential matrix multiplication

• Procedure MAT_MULT(A,B,C)

for i := 0 to n-1 do
  for j := 0 to n-1 do
    C[i,j] := 0;
    for k := 0 to n-1 do
      C[i,j] := C[i,j] + A[i,k] * B[k,j]

• Θ(n³) work (Strassen's algorithm has better complexity)
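A runnable C version of the procedure, with the size passed as a parameter (C99 variable-length array parameters assumed):

```c
/* Triple-loop matrix multiply, Theta(n^3) multiply-adds, as in MAT_MULT. */
void mat_mult(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
```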

Page 31: Design of parallel algorithms

Block approach

• n/q x n/q submatrices

• Procedure BLOCK_MAT_MULT(A,B,C)

for i := 0 to q-1 do
  for j := 0 to q-1 do
    Initialize Ci,j to zero
    for k := 0 to q-1 do
      Ci,j := Ci,j + Ai,k Bk,j

• Same complexity Θ(n³)

Page 32: Design of parallel algorithms

Simple parallel approach

• Matrices A and B partitioned into p blocks of size (n/√p) x (n/√p)

• Map onto a √p x √p mesh

• Processors P0,0 ... P√p-1,√p-1

• Pi,j stores Ai,j and Bi,j and computes Ci,j

• Ci,j requires Ai,k and Bk,j

• A needs to communicate within rows • B communicates within columns

Page 33: Design of parallel algorithms

Performance on hypercube

• Requires 2 broadcasts (rows and columns)

• message size n2/p

• tc = 2(ts log √p + tw (n²/p)(√p - 1))

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + ts log p + 2 tw n²/√p,  p » 1

Page 34: Design of parallel algorithms

Performance on mesh

• Store-and-forward routing

• tc = 2(ts √p + tw n²/√p)

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + 2 ts √p + 2 tw n²/√p

Page 35: Design of parallel algorithms

Cannon's algorithm

• Partition into blocks as usual

• Processors P0,0 ... P√p-1,√p-1

• Pi,j contains Ai,j and Bi,j

• Rotate blocks!

• A blocks to the left

• B blocks upwards
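The alignment and rotation steps can be sketched serially with one element per "processor" (a KQ x KQ grid, so block size 1). An illustration of the data movement, not a parallel implementation:

```c
#include <string.h>

#define KQ 3   /* KQ x KQ process grid, one matrix element per "processor" */

/* Serial sketch of Cannon's algorithm with block size 1. Alignment: row i
 * of A rotates left by i, column j of B rotates up by j; then KQ steps of
 * local multiply followed by a one-step rotation of A left and B up. */
void cannon(double A[KQ][KQ], double B[KQ][KQ], double C[KQ][KQ]) {
    double a[KQ][KQ], b[KQ][KQ], ta[KQ][KQ], tb[KQ][KQ];
    for (int i = 0; i < KQ; i++)
        for (int j = 0; j < KQ; j++) {
            a[i][j] = A[i][(j + i) % KQ];      /* initial alignment of A */
            b[i][j] = B[(i + j) % KQ][j];      /* initial alignment of B */
            C[i][j] = 0.0;
        }
    for (int step = 0; step < KQ; step++) {
        for (int i = 0; i < KQ; i++)
            for (int j = 0; j < KQ; j++)
                C[i][j] += a[i][j] * b[i][j];  /* local multiply */
        for (int i = 0; i < KQ; i++)
            for (int j = 0; j < KQ; j++) {
                ta[i][j] = a[i][(j + 1) % KQ]; /* rotate A blocks left */
                tb[i][j] = b[(i + 1) % KQ][j]; /* rotate B blocks up   */
            }
        memcpy(a, ta, sizeof a);
        memcpy(b, tb, sizeof b);
    }
}
```

After the initial alignment, processor (i,j) holds A[i][(i+j+s) mod KQ] and B[(i+j+s) mod KQ][j] at step s, so the KQ local products sum to exactly C[i][j].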

Page 36: Design of parallel algorithms
Page 37: Design of parallel algorithms
Page 38: Design of parallel algorithms
Page 39: Design of parallel algorithms

Fox’s algorithm

• Partition into blocks as usual

• Pi,j contains Ai,j and Bi,j

• Uses one-to-all broadcasts, √p iterations

• (1) broadcast the selected block of A to the row

• (2) multiply by B

• (3) send B upwards

• (4) select Ai,(j+1) mod √p
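The four steps above can be sketched serially with one element per "processor" (an FQ x FQ grid, block size 1); again an illustration of the data movement, not a parallel implementation:

```c
#include <string.h>

#define FQ 3   /* FQ x FQ process grid, one matrix element per "processor" */

/* Serial sketch of Fox's algorithm with block size 1. At step s, row i
 * broadcasts A[i][(i+s) % FQ] to the whole row; every processor multiplies
 * the broadcast value by its current B element, then B moves one step up. */
void fox(double A[FQ][FQ], double B[FQ][FQ], double C[FQ][FQ]) {
    double b[FQ][FQ], tb[FQ][FQ];
    memcpy(b, B, sizeof b);
    for (int i = 0; i < FQ; i++)
        for (int j = 0; j < FQ; j++)
            C[i][j] = 0.0;
    for (int s = 0; s < FQ; s++) {
        for (int i = 0; i < FQ; i++) {
            double abr = A[i][(i + s) % FQ];   /* (1) broadcast in row i */
            for (int j = 0; j < FQ; j++)
                C[i][j] += abr * b[i][j];      /* (2) multiply by B */
        }
        for (int i = 0; i < FQ; i++)
            for (int j = 0; j < FQ; j++)
                tb[i][j] = b[(i + 1) % FQ][j]; /* (3) send B upwards */
        memcpy(b, tb, sizeof b);
    }
}
```

At step s, processor (i,j) holds B[(i+s) mod FQ][j] and receives A[i][(i+s) mod FQ], so over the FQ steps each C[i][j] accumulates the full inner product.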

Page 40: Design of parallel algorithms
Page 41: Design of parallel algorithms
Page 42: Design of parallel algorithms

DNS

• Dekel, Nassimi and Sahni

• n³ processors available

• use 3D structure

• Pi,j,k computes A[i,k] x B[k,j]

• C[i,j] = Pi,j,0 + ... + Pi,j,n-1

Θ(log n) time

Page 43: Design of parallel algorithms

DNS for hypercube

• The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors

• Processor Pi,j,0 contains A[i,j] and B[i,j]

• 3 steps

• (1) move A & B to correct plane

• (2) replicate on each plane

• (3) single node accumulation
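The DNS idea can be sketched serially (DN plays the role of n, a power of two): every "processor" (i,j,k) forms one product, and the single-node accumulation along the k dimension is modeled as a log2(n)-step tree reduction.

```c
#define DN 4   /* n: matrix dimension, a power of two */

/* Serial sketch of DNS with n^3 "processors": processor (i,j,k) forms
 * A[i][k]*B[k][j]; partial products are then combined along the k
 * dimension by a log2(n)-step tree reduction into plane k = 0. */
void dns(double A[DN][DN], double B[DN][DN], double C[DN][DN]) {
    double P[DN][DN][DN];
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++)
            for (int k = 0; k < DN; k++)
                P[i][j][k] = A[i][k] * B[k][j];      /* one product each */
    for (int stride = 1; stride < DN; stride *= 2)   /* log2(n) rounds */
        for (int i = 0; i < DN; i++)
            for (int j = 0; j < DN; j++)
                for (int k = 0; k + stride < DN; k += 2 * stride)
                    P[i][j][k] += P[i][j][k + stride];
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++)
            C[i][j] = P[i][j][0];
}
```

The reduction loop runs log2(DN) rounds, which is where the Θ(log n) running time of the fine-grain DNS algorithm comes from.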

Page 44: Design of parallel algorithms
Page 45: Design of parallel algorithms
Page 46: Design of parallel algorithms

DNS with < n³ processors

• Processors p = q³, q < n

• Partition matrices into (n/q) x (n/q) blocks

• Matrices contain q x q submatrices

• Since 1 ≤ q ≤ n, p ranges from 1 to n³