Page 1: Parallel Algorithms and Computing Selected topics

Parallel Algorithms and Computing

Selected topics

Parallel Architecture

Page 2: Parallel Algorithms and Computing Selected topics

References

An Introduction to Parallel Algorithms, Joseph JáJá

Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis

Parallel Sorting Algorithms, Selim G. Akl

Page 3: Parallel Algorithms and Computing Selected topics

Models

Three models:
– Graphs (DAG: Directed Acyclic Graph)
– Parallel Random Access Machine (PRAM)
– Network

Page 4: Parallel Algorithms and Computing Selected topics

Graphs: not studied here

Page 5: Parallel Algorithms and Computing Selected topics

Parallel Architecture

Parallel random access machine

Page 6: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Flynn classifies parallel machines based on:
– Data flow
– Instruction flow

Each flow can be:
– Single
– Multiple

Page 7: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine: Flynn classification

                              Data flow
                        SINGLE      MULTIPLE
Instruction   SINGLE    SISD        SIMD
flow          MULTIPLE  MISD        MIMD

Page 8: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Extends the traditional RAM (Random Access Machine) model with:
– multiple processors
– an interconnection network between the global memory and the processors

[Figure: processors P1, P2, …, Pp connected to a global (shared) memory]

Page 9: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Characteristics:
– Processors Pi (0 ≤ i ≤ p-1), each with a local memory
– i is a unique identity for processor Pi
– A global shared memory that can be accessed by all processors

Page 10: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Types of operations:

Synchronous
– processors work in lock step
– at each step, a processor is either active or idle
– suited for SIMD and MIMD architectures

Asynchronous
– processors have local clocks
– needs explicit synchronization of the processors
– suited for MIMD architectures

Page 11: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine: example of synchronous operation

Algorithm: processor i (i = 0 … 3)

Input: A, B; i is the processor id
Output: C

Begin
  if (B == 0) then C = A
  else C = A / B
End
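As a rough illustration (not from the slides), the lock-step behaviour can be simulated in plain C: phase 1 executes the if-branch on every processor whose B is zero while the others idle, then phase 2 executes the else-branch. The values follow the four-processor example on the next two slides.

#include <stdio.h>

#define P 4  /* number of simulated PRAM processors */

int main(void) {
    /* Per-processor values, taken from the example on the next slides. */
    int A[P] = {7, 2, 4, 5};
    int B[P] = {0, 1, 2, 0};
    int C[P] = {0, 0, 0, 0};

    /* Step 1: processors with B == 0 are active, the others idle. */
    for (int i = 0; i < P; i++)
        if (B[i] == 0) C[i] = A[i];

    /* Step 2: processors with B != 0 are active, the others idle. */
    for (int i = 0; i < P; i++)
        if (B[i] != 0) C[i] = A[i] / B[i];

    for (int i = 0; i < P; i++)
        printf("P%d: C = %d\n", i, C[i]);
    return 0;
}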

Page 12: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Initial state:

Processor   P0    P1    P2    P3
A           7     2     4     5
B           0     1     2     0
C           0     0     0     0

Step 1 (if-branch; a processor is active when B = 0):

Processor   P0      P1     P2     P3
State       active  idle   idle   active
C           7       0      0      5

Page 13: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Step 2 (else-branch; a processor is active when B ≠ 0):

Processor   P0     P1      P2      P3
State       idle   active  active  idle
C           7      2       2       5

Page 14: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Read/write conflicts:

EREW: Exclusive Read, Exclusive Write
– no concurrent operation (read or write) on a variable

CREW: Concurrent Read, Exclusive Write
– concurrent reads allowed on the same variable
– exclusive writes only

Page 15: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

ERCW: Exclusive Read, Concurrent Write

CRCW: Concurrent Read, Concurrent Write

Page 16: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Concurrent write on a variable X:

– Common CRCW: the write succeeds only if all processors write the same value to X
– SUM CRCW: the sum of all the written values is stored in X
– Random CRCW: one processor is chosen at random and its value is written to X
– Priority CRCW: the processor with the highest priority writes to X

Page 17: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X)

– Common CRCW or ERCW: failure
– SUM CRCW: X is the sum (180) of the written values
– Random CRCW: the final value of X ∈ {50, 60, 70}
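The write-resolution policies can be made concrete with a small C sketch (an illustration, not from the slides): given the values that the processors attempt to write in one step, each helper computes what ends up in X.

#include <stdio.h>
#include <stdlib.h>

/* Resolve one concurrent-write step on X under a CRCW policy.
   vals[i] is the value processor i attempts to write; n > 0.
   For Priority CRCW, assume lower index = higher priority. */

int crcw_common(const int *vals, int n, int *ok) {
    for (int i = 1; i < n; i++)
        if (vals[i] != vals[0]) { *ok = 0; return 0; } /* failure */
    *ok = 1;
    return vals[0];
}

int crcw_sum(const int *vals, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += vals[i];
    return s;
}

int crcw_random(const int *vals, int n) { return vals[rand() % n]; }

int crcw_priority(const int *vals, int n) { (void)n; return vals[0]; }

int main(void) {
    int vals[] = {50, 60, 70};  /* P1, P2, P3 from the slide */
    int ok;
    crcw_common(vals, 3, &ok);
    printf("Common  : %s\n", ok ? "ok" : "failure");
    printf("SUM     : %d\n", crcw_sum(vals, 3));     /* 180 */
    printf("Random  : %d\n", crcw_random(vals, 3));  /* 50, 60 or 70 */
    printf("Priority: %d\n", crcw_priority(vals, 3));
    return 0;
}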

Page 18: Parallel Algorithms and Computing Selected topics

Parallel Random Access Machine

Basic input/output operations:

On global memory
– global read (X, x)
– global write (Y, y)

On local memory
– read (X, x)
– write (Y, y)

Page 19: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Matrix-vector product Y = AX
– A is an n×n matrix
– X = [x1, x2, …, xn], a vector of n elements
– p processors (p ≤ n) and r = n/p

Each processor is assigned a block of r = n/p rows.

Page 20: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

[Figure: Y = A·X, with Y and X n-element vectors and A an n×n matrix, all stored in global memory; processors P1, P2, …, Pp access them through the interconnection network]

Page 21: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Partition A into p blocks Ai of r = n/p consecutive rows each:

A = [A1; A2; …; Ap], where Ai holds rows ((i-1)r + 1) … ir of A

Compute the p partial products in parallel: processor Pi computes the partial product Yi = Ai · X.

Page 22: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Processor Pi computes Yi = Ai · X:
– P1: rows 1 … r of A times X gives Y1 … Yr
– P2: rows r+1 … 2r of A times X gives Yr+1 … Y2r
– …
– Pp: rows (p-1)r+1 … pr of A times X gives Y(p-1)r+1 … Ypr

Page 23: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

The solution requires:
– p concurrent reads of the vector X
– each processor Pi makes an exclusive read of block Ai = A[((i-1)r + 1) : ir, 1:n]
– each processor Pi makes an exclusive write on block Yi = Y[((i-1)r + 1) : ir]

Required architecture: CREW PRAM

Page 24: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Algorithm: processor Pi (i = 1, 2, …, p)

Input:
  A: n×n matrix in global memory
  X: a vector in global memory
Output:
  Y = AX (Y is a vector in global memory)
Local variables:
  i: processor id of Pi
  p: number of processors
  n: dimension of A and X

Begin
1. global read (X, z)
2. global read (A((i-1)r + 1 : ir, 1:n), B)
3. compute W = Bz
4. global write (W, Y((i-1)r + 1 : ir))
End
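A minimal C sketch of the row-block algorithm (illustration only): the PRAM global reads and writes are simulated by copying between "global" arrays and per-processor locals, and the p processors are simulated by a plain loop. The values of N and P are assumptions for the sketch.

#include <stdio.h>

#define N 4   /* matrix dimension (assumed, for the sketch) */
#define P 2   /* number of processors, r = N/P rows each    */

/* Step of processor i (1-based): read its row block and X from
   "global memory", compute W = B*z locally, write W back to Y. */
static void processor(int i, const double A[N][N],
                      const double X[N], double Y[N]) {
    int r = N / P;
    double z[N], B[N / P][N], W[N / P];

    for (int j = 0; j < N; j++) z[j] = X[j];               /* global read (X, z)  */
    for (int k = 0; k < r; k++)                            /* global read (Ai, B) */
        for (int j = 0; j < N; j++) B[k][j] = A[(i - 1) * r + k][j];

    for (int k = 0; k < r; k++) {                          /* W = B z */
        W[k] = 0.0;
        for (int j = 0; j < N; j++) W[k] += B[k][j] * z[j];
    }
    for (int k = 0; k < r; k++) Y[(i - 1) * r + k] = W[k]; /* global write (W, Yi) */
}

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double X[N] = {1, 1, 1, 1}, Y[N];
    for (int i = 1; i <= P; i++) processor(i, A, X, Y);  /* "in parallel" */
    for (int j = 0; j < N; j++) printf("Y[%d] = %g\n", j, Y[j]);
    return 0;
}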

Page 25: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Analysis

Computation cost
– Line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations each, with r = n/p)

Communication cost
– Line 1: O(n) numbers transferred from global to local memory by Pi
– Line 2: O(n²/p) numbers transferred from global to local memory by Pi
– Line 4: O(n/p) numbers transferred from local to global memory by Pi

Overall: the algorithm runs in O(n²/p) time.

Page 26: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Another way to partition the matrix is vertically: A and X are split into blocks
– A1, A2, …, Ap (blocks of r = n/p columns)
– X1, X2, …, Xp

Solution in two phases:
– compute the partial products Z1 = A1X1, …, Zp = ApXp
– synchronize the processors
– add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp

Page 27: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

[Figure: processor P1 multiplies the first r columns of A by X1 … Xr; …; processor Pp multiplies the last r columns by X(p-1)r+1 … Xpr; a synchronization follows before the partial results are added into Y]

Page 28: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Algorithm: processor Pi (i = 1, 2, …, p)

Input:
  A: n×n matrix in global memory
  X: a vector in global memory
Output:
  Y = AX (* Y: vector in global memory *)
Local variables:
  i: processor id of Pi
  p: number of processors
  n: dimension of A and X

Begin
1. global read (X((i-1)r + 1 : ir), z)
2. global read (A(1:n, (i-1)r + 1 : ir), B)
3. compute W = Bz
4. synchronize processors Pi (i = 1, 2, …, p)
5. global write (W, Y((i-1)r + 1 : ir))
End

Page 29: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product

Analysis: work out the details.

Overall: the algorithm runs in O(n²/p) time.

Page 30: Parallel Algorithms and Computing Selected topics

Example 2: Sum on the PRAM model

– An array A of n = 2^k numbers
– A PRAM machine with n processors
– Compute S = A(1) + A(2) + … + A(n)

Construct a binary tree to compute the sum in log2 n time.

Page 31: Parallel Algorithms and Computing Selected topics

Example 2: Sum on the PRAM model

[Figure: binary summation tree over processors P1 … P8]
– Level 1: Pi sets B(i) = A(i)
– Level h > 1: Pi computes B(i) = B(2i-1) + B(2i)
– At the root, S = B(1)

Page 32: Parallel Algorithms and Computing Selected topics

Example 2: Sum on the PRAM model

Algorithm: processor Pi (i = 1, …, n)

Input:
  A: array of n = 2^k elements in global memory
Output:
  S, where S = A(1) + A(2) + … + A(n)
Local variables of Pi:
  n, and i: identity of processor Pi

Begin
1. global read (A(i), a)
2. global write (a, B(i))
3. for h = 1 to log n do
     if (i ≤ n / 2^h) then
       global read (B(2i-1), x)
       global read (B(2i), y)
       z = x + y
       global write (z, B(i))
4. if i = 1 then global write (z, S)
End
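A sequential C sketch of the same binary-tree computation (illustration only): each pass of the loop simulates one lock-step round in which only the processors with i ≤ n/2^h are active.

#include <stdio.h>

#define N 8  /* n = 2^k, here k = 3 */

int main(void) {
    int A[N + 1] = {0, 3, 1, 4, 1, 5, 9, 2, 6};  /* 1-based, A[1..N] */
    int B[N + 1];

    for (int i = 1; i <= N; i++) B[i] = A[i];    /* level 1: B(i) = A(i) */

    /* log2(N) rounds; in round h only processors i <= N/2^h are active */
    for (int half = N / 2; half >= 1; half /= 2)
        for (int i = 1; i <= half; i++)
            B[i] = B[2 * i - 1] + B[2 * i];      /* B(i) = B(2i-1) + B(2i) */

    printf("S = %d\n", B[1]);  /* 31 */
    return 0;
}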

Page 33: Parallel Algorithms and Computing Selected topics

Parallel Architecture

Network model

Page 34: Parallel Algorithms and Computing Selected topics

Network model

Characteristics:
– the communication structure is important
– the network can be seen as a graph G = (N, E):
  – each node i ∈ N is a processor
  – each edge (i, j) ∈ E represents a two-way communication link between processors i and j
– basic communication operations: send (X, Pi), receive (X, Pi)
– no global shared memory

Page 35: Parallel Algorithms and Computing Selected topics

Network model

[Figure: a linear array of n processors P1 - P2 - P3 - … - Pn, and an n-processor ring (the same chain with an extra link between Pn and P1)]

Page 36: Parallel Algorithms and Computing Selected topics

Network model

[Figure: an n×n grid of n² processors P11 … Pnn; a torus is the same grid with wrap-around links, so its columns and rows form n rings]

Page 37: Parallel Algorithms and Computing Selected topics

Network model

[Figure: an n = 2^3 hypercube with processors P0 … P7]


Page 39: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

A = [aij], an n×n matrix, i, j ∈ [1, n]; X = [xi], i ∈ [1, n]

Compute Y = A·X, where yi = Σ (j = 1 … n) aij · xj

Page 40: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

Systolic array algorithm for n = 4

[Figure: processors P1 P2 P3 P4 in a row; the vector values x4 x3 x2 x1 stream in from the left of P1, while each column of A enters its processor from the top, staggered so that aij arrives at Pi at step i+j-1 (P1 receives a11, a12, a13, a14; …; P4 receives a41, a42, a43, a44)]

Page 41: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

– At step j, xj enters processor P1. At each step, processor Pi receives (when possible) a value from its left and a value from the top, and updates its partial sum as follows: Yi = Yi + aij · xj, j = 1, 2, 3, …
– The values xj and aij reach processor Pi at the same time, at step (i + j - 1):
  – (x1, a11) reach P1 at step 1 = (1 + 1 - 1)
  – (x3, a13) reach P1 at step 3 = (1 + 3 - 1)
– In general, Yi is computed at step N + i - 1

Page 42: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

– The computation is completed when x4 and a44 reach processor P4, at step N + N - 1 = 2N - 1
– Conclusion: the algorithm requires 2N - 1 steps; at each step, every active processor performs an addition and a multiplication
– Complexity of the algorithm: O(N)
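A small C simulation of this systolic schedule (illustration only): at step t, processor i consumes aij and xj with j = t - i + 1, provided 1 ≤ j ≤ n, so after 2N - 1 steps every yi is complete.

#include <stdio.h>

#define N 4

int main(void) {
    /* a[i][j] with 1-based indices stored 0-based */
    double a[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N] = {1, 2, 3, 4};
    double y[N] = {0};

    /* 2N-1 systolic steps; at step t, processor i (1..N) sees
       x_j and a_ij with j = t - i + 1 when that index is valid */
    for (int t = 1; t <= 2 * N - 1; t++)
        for (int i = 1; i <= N; i++) {
            int j = t - i + 1;
            if (j >= 1 && j <= N)
                y[i - 1] += a[i - 1][j - 1] * x[j - 1];
        }

    for (int i = 0; i < N; i++) printf("y%d = %g\n", i + 1, y[i]);
    return 0;
}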

Page 43: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

[Figure: steps 1 through 7 of the systolic computation for n = 4; the xj values flow rightwards through P1 P2 P3 P4, and after step 7 each processor Pi holds yi = Σ (j = 1 … 4) aij · xj]

Page 44: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

Systolic array algorithm: time-cost analysis

Step   Work             Active           Idle
1      1 add, 1 mult    P1               P2, P3, P4
2      2 add, 2 mult    P1, P2           P3, P4
3      3 add, 3 mult    P1, P2, P3       P4
4      4 add, 4 mult    P1, P2, P3, P4   –
5      3 add, 3 mult    P2, P3, P4       P1
6      2 add, 2 mult    P3, P4           P1, P2
7      1 add, 1 mult    P4               P1, P2, P3

Page 45: Parallel Algorithms and Computing Selected topics

Example 1: Matrix-Vector product on a linear array

Systolic array algorithm: time-cost analysis

[Figure: the same steps 1-7, combining the flow of the xj values through P1 … P4 with the add/mult counts and active/idle processors of the previous slide]

Page 46: Parallel Algorithms and Computing Selected topics

Example 2: Matrix multiplication on a 2-D n×n mesh

Given two n×n matrices A = [aij] and B = [bij], i, j ∈ [1, n], compute the product C = AB, where C is given by:

cij = Σ (k = 1 … n) aik · bkj

Page 47: Parallel Algorithms and Computing Selected topics

Example 2: Matrix multiplication on a 2-D n×n mesh

– At step i, row i of A (starting with ai1) is entered from the top into column i (into processor P1i)
– At step j, column j of B (starting with b1j) is entered from the left into row j (into processor Pj1)
– The values aik and bkj reach processor Pji at step (i + j + k - 2); at the end of this step, aik is sent down and bkj is sent right
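A compact C check of this schedule (illustration only): at step t, the cell computing cij multiplies aik and bkj with k = t - i - j + 2 whenever that k is valid, so after 3n - 2 steps C = AB.

#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N] = {{0}};

    /* 3N-2 systolic steps; a_ik and b_kj meet at the (i,j) cell
       at step t = i + j + k - 2 (1-based indices) */
    for (int t = 1; t <= 3 * N - 2; t++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                int k = t - i - j + 2;
                if (k >= 1 && k <= N)
                    C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1];
            }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%6.1f ", C[i][j]);
        printf("\n");
    }
    return 0;
}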

Page 48: Parallel Algorithms and Computing Selected topics

Example 2: Matrix multiplication on a 2-D n×n mesh

Example: systolic mesh algorithm for n = 4, step 1

[Figure: a 4×4 mesh of cells (1,1) … (4,4); the staggered rows of A (a11 a12 a13 a14, …, a41 … a44) are queued above the columns, and the staggered columns of B (b11 b21 b31 b41, …, b14 … b44) are queued to the left of the rows]

Page 49: Parallel Algorithms and Computing Selected topics

Example 2: Matrix multiplication on a 2-D n×n mesh

Example: systolic mesh algorithm for n = 4, step 5

[Figure: mid-computation snapshot; the anti-diagonals of A and B have advanced into the mesh (for example a11 has met b11, while a24, a33, a42 are paired with b41, b31, b21), and the remaining values are still queued above and to the left]

Page 50: Parallel Algorithms and Computing Selected topics

Example 2: Matrix multiplication on a 2-D n×n mesh

Analysis

To determine the number of steps needed to complete the matrix multiplication, we must find the step at which the terms ann and bnn reach processor Pnn:
– values aik and bkj reach processor Pji at step i + j + k - 2
– substituting n for i, j, k yields: n + n + n - 2 = 3n - 2

Complexity of the solution: O(n)

Page 51: Parallel Algorithms and Computing Selected topics

Example 3: Matrix-Vector multiplication on a ring

[Figure: N = 4; ring of processors P1 P2 P3 P4; the values x4 x3 x2 x1 stream into P1 and circulate around the ring, while the entries aij of row i of A, cyclically rotated, enter processor Pi from the top]

This algorithm requires N steps for a matrix-vector multiplication.

Page 52: Parallel Algorithms and Computing Selected topics

Example 3: Matrix-Vector multiplication on a ring

Goal: pipeline data into the processors, so that n product terms are computed and added to partial sums at each step.

Distribution of X on the processors: Xj (1 ≤ j ≤ N) is assigned to processor P(N-j+1).

This algorithm requires N steps for a matrix-vector multiplication.

Page 53: Parallel Algorithms and Computing Selected topics

Example 3: Matrix-Vector multiplication on a ring

Another way to distribute the Xi over the processors and to input matrix A:
– row i of matrix A is shifted (rotated) down i (mod n) times and entered into processor Pi
– Xi is assigned to processor Pi; at each step the Xi are shifted right

Page 54: Parallel Algorithms and Computing Selected topics

Example 3: Matrix-Vector multiplication on a ring

[Figure: N = 4; X1 X2 X3 X4 sit on P1 P2 P3 P4; each processor's column of rotated A values starts on the diagonal (a11, a22, a33, a44), so that at each step processor Pi multiplies its current aij by the xj currently passing through it]
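A C sketch of this rotation scheme (illustration only): the x values rotate one position per step, and processor i always uses the diagonally pre-rotated entry that matches the x value it currently holds, so all N processors do useful work at every one of the N steps.

#include <stdio.h>

#define N 4

int main(void) {
    double a[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N] = {1, 2, 3, 4};
    double y[N] = {0};

    /* N steps; at step t processor i holds the x value of index
       (i - t) mod N (the xi are shifted right once per step) and
       multiplies it by the matching, diagonally pre-rotated a_ij */
    for (int t = 0; t < N; t++)
        for (int i = 0; i < N; i++) {
            int j = (i - t + N) % N;
            y[i] += a[i][j] * x[j];
        }

    for (int i = 0; i < N; i++) printf("y%d = %g\n", i + 1, y[i]);
    return 0;
}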

Page 55: Parallel Algorithms and Computing Selected topics

Example 4: Sum of n = 2^d numbers on a d-hypercube

Computation of S = Σ xi; assignment: xi is on processor Pi.

[Figure: a 3-hypercube with x0 … x7 on processors 0 … 7]

Page 56: Parallel Algorithms and Computing Selected topics

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 1: processors of sub-cube 1XX send their data to the corresponding processors in sub-cube 0XX.

[Figure: after step 1, sub-cube 0XX holds the partial sums (x0+x4), (x1+x5), (x2+x6), (x3+x7)]

Page 57: Parallel Algorithms and Computing Selected topics

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 2: processors of sub-cube 01X send their data to the corresponding processors in sub-cube 00X.

[Figure: after step 2, nodes 000 and 001 hold (x0+x4+x2+x6) and (x1+x5+x3+x7); active processors shown filled, idle processors hollow]

Page 58: Parallel Algorithms and Computing Selected topics

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 3: processor 001 sends its partial sum to processor 000.

[Figure: after step 3, S = (x0+x4+x2+x6+x1+x5+x3+x7)]

The sum of the n numbers is stored on node P0.

Page 59: Parallel Algorithms and Computing Selected topics

Example 4: Sum of n = 2^d numbers on a d-hypercube

Algorithm: processor Pi
Input: 1) an array X of n = 2^d numbers; X[i] is assigned to processor Pi
       2) the processor identity id
Output: S = X[0] + … + X[n-1], stored on processor P0

Processor Pi:
Begin
  My_id = id   /* My_id = i */
  S = X[i]
  for j = 0 to (d-1) do
    Partner = My_id XOR 2^j
    if (My_id AND 2^j) = 0 then
      receive (Si, Partner)
      S = S + Si
    else
      send (S, Partner)
      exit
End
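On a real machine the same pattern maps directly onto MPI; a sketch, assuming the number of processes is a power of two (in practice MPI_Reduce does this job):

#include <mpi.h>
#include <stdio.h>

/* Hypercube sum: in round j, the partner differs in bit j; the
   node whose bit j is 0 receives and accumulates, the other sends
   its partial sum and drops out. Assumes p = 2^d processes. */
int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double s = id + 1.0;  /* stand-in for X[i] */
    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = id ^ bit;
        if ((id & bit) == 0) {
            double si;
            MPI_Recv(&si, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            s += si;
        } else {
            MPI_Send(&s, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            break;  /* this node is done ("exit" in the slide) */
        }
    }
    if (id == 0) printf("S = %g\n", s);  /* p(p+1)/2 */
    MPI_Finalize();
    return 0;
}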

Page 60: Parallel Algorithms and Computing Selected topics

Parallel Architecture

Message broadcast on network model (ring, torus, hypercube)

Page 61: Parallel Algorithms and Computing Selected topics

Basic communication

Message broadcast:

One-to-all broadcast
– Ring
– Mesh (torus)
– Hypercube

All-to-all broadcast
– Ring
– Mesh (torus)
– Hypercube

Page 62: Parallel Algorithms and Computing Selected topics

Communication cost

Consider a message sent from Pi to Pj across l links:

Communication cost = ts + tw · m · l
– ts: message preparation (startup) time
– m: message length
– tw: transfer time per unit (byte)
– l: number of links traversed by the message

Page 63: Parallel Algorithms and Computing Selected topics

Communication cost

Communication time bounds:
– Ring: ts + tw · m · (p/2)
– Mesh: ts + tw · m · (√p / 2)
– Hypercube: ts + tw · m · log2 p

The bound depends on the maximum number of links traversed by the message.

Page 64: Parallel Algorithms and Computing Selected topics

One-to-all broadcast

Simple solution: P0 sends the message M0 successively to P1, P2, …, Pp-1:

P0 → P1 : M0
P0 → P2 : M0 (relayed along P0 P1 P2)
P0 → P3 : M0 (relayed along P0 P1 P2 P3)
…
P0 → Pp-1 : M0 (relayed along P0 P1 P2 … Pp-1)

Communication cost = Σ (i = 1 … p-1) (ts + tw · m0) · i = (ts + tw · m0) · p(p-1)/2

Page 65: Parallel Algorithms and Computing Selected topics

One-to-all broadcast

One processor sends a message M to all processors.

[Figure: the one-to-all broadcast takes M from processor 0 to all of 0, 1, …, p-1; the dual operation (accumulation) gathers values from all processors back to one]

Page 66: Parallel Algorithms and Computing Selected topics

All-to-all broadcast

All-to-all broadcast: several simultaneous one-to-all broadcasts, one initiated by each processor Pi.

[Figure: before, processor i holds Xi; after the all-to-all broadcast, every processor holds X0, X1, …, Xp-1; the dual operation is an accumulation to several nodes]

Page 67: Parallel Algorithms and Computing Selected topics

Parallel Architecture

Examples of message broadcasts

Page 68: Parallel Algorithms and Computing Selected topics

Example 1: One-to-all broadcast on a ring

Each processor forwards the message to the next processor. Initially, the message is sent in both directions.

[Figure: ring of processors 0 … 7; the message leaves processor 0 clockwise and counter-clockwise, with parallel steps 1, 2, 3, 4 reaching the opposite node 4 last]

Communication cost: T = (ts + tw · m) · p/2, where p is the number of processors.

Page 69: Parallel Algorithms and Computing Selected topics

Example 2: One-to-all broadcast on a torus

Two phases. Phase 1: one-to-all broadcast on the first row.

[Figure: 4×4 torus, processors 0-15; the message spreads from processor 0 along the first row (0, 4, 8, 12) in steps 1 and 2]

Page 70: Parallel Algorithms and Computing Selected topics

Example 2: One-to-all broadcast on a torus

Phase 2: parallel one-to-all broadcasts in the columns.

[Figure: each processor of the first row (0, 4, 8, 12) broadcasts in its own column in steps 3 and 4, after the row broadcast of steps 1 and 2]

Page 71: Parallel Algorithms and Computing Selected topics

Example 2: One-to-all broadcast on a torus

Communication cost:

T = 2 · (ts + tw · m) · (√p / 2), where p is the number of processors
– broadcast on the rows: Tcom = (ts + tw · m) · √p / 2
– broadcast on the columns: Tcom = (ts + tw · m) · √p / 2

Page 72: Parallel Algorithms and Computing Selected topics

Example 3: One-to-all broadcast on a hypercube

Requires d steps; each step doubles the number of active processors.

[Figure: 3-hypercube; the message goes 0 → 1 in step 1, {0,1} → {2,3} in step 2, {0,1,2,3} → {4,5,6,7} in step 3]

Communication cost: T = (ts + tw · m) · log2 p, where p is the number of processors.

Page 73: Parallel Algorithms and Computing Selected topics

Example 3: One-to-all broadcast on a hypercube

Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube. The broadcast can be performed in O(log n) steps, as follows.

[Figure: initial distribution of the data; X is on processor 0 of the 3-hypercube]

Page 74: Parallel Algorithms and Computing Selected topics

Example 3: One-to-all broadcast on a hypercube

– Step 1: processor P0 sends X to processor P1
– Step 2: processors P0 and P1 send X to P2 and P3 respectively
– Step 3: processors P0, P1, P2 and P3 send X to P4, P5, P6 and P7

[Figure: the three steps on the 3-hypercube; active processors shown filled, idle processors hollow]

Page 75: Parallel Algorithms and Computing Selected topics

Example 3: One-to-all broadcast on a hypercube

Algorithm for a broadcast of X on a d-hypercube

Input: 1) X assigned to processor P0
       2) the processor identity id
Output: every processor Pi contains X

Processor Pi:
Begin
  if i = 0 then B = X
  My_id = id   /* My_id = i */
  for j = 0 to (d-1) do
    if My_id < 2^(j+1) then
      Partner = My_id XOR 2^j
      if My_id > Partner then receive (B, Partner)
      if My_id < Partner then send (B, Partner)
End
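The same doubling pattern in MPI, as a sketch assuming p = 2^d processes with the data on rank 0 (in practice MPI_Bcast does this job):

#include <mpi.h>
#include <stdio.h>

/* Hypercube one-to-all broadcast from rank 0: in round j the
   ranks below 2^(j+1) pair up across bit j; the lower rank of
   each pair already has the data and sends it to the higher. */
int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = 0.0;
    if (id == 0) x = 42.0;  /* the element X to broadcast */

    for (int bit = 1; bit < p; bit <<= 1) {
        if (id < 2 * bit) {
            int partner = id ^ bit;
            if (id < partner)
                MPI_Send(&x, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(&x, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    printf("rank %d has X = %g\n", id, x);
    MPI_Finalize();
    return 0;
}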

Page 76: Parallel Algorithms and Computing Selected topics

All-to-all broadcast on a ring, step 1

[Figure: ring of processors 0 … 7, each initially holding its own message (0), (1), …, (7); in step 1, every processor sends its message to its neighbour; the label s(i) denotes the message originating at processor i being forwarded at step s]

Page 77: Parallel Algorithms and Computing Selected topics

All-to-all broadcast on a ring, step 2

[Figure: each processor forwards the message it received in step 1; processor 0 now holds (0,7), processor 1 holds (1,0), …, processor 7 holds (7,6)]

Page 78: Parallel Algorithms and Computing Selected topics

All-to-all broadcast on a ring, step 3

[Figure: after step 3, processor 0 holds (0,7,6), processor 1 holds (1,0,7), …, processor 7 holds (7,6,5)]

Page 79: Parallel Algorithms and Computing Selected topics

All-to-all broadcast on a ring, step 7

[Figure: at step 7 the last messages are forwarded; processor 0 holds (0,7,6,5,4,3,2), …, processor 7 holds (7,6,5,4,3,2,1); after this (p-1)-th step every processor holds all p messages]

Page 80: Parallel Algorithms and Computing Selected topics

All-to-all broadcast on a 2-dimensional torus

Two phases:
– Phase 1: all-to-all broadcast on each row; afterwards each processor Pi holds a message of size Mi = √p · m
– Phase 2: all-to-all broadcast in the columns

Page 81: Parallel Algorithms and Computing Selected topics

All-to-all broadcast

Start of phase 1: all-to-all broadcast on the rows.

[Figure: 3×3 torus; processors 0 … 8 each hold their own message (0), (1), …, (8)]

Page 82: Parallel Algorithms and Computing Selected topics

All-to-all broadcast

Start of phase 2: all-to-all broadcast on the columns.

[Figure: after phase 1, each processor of the first row holds (0,1,2), of the second row (3,4,5), of the third row (6,7,8)]

Page 83: Parallel Algorithms and Computing Selected topics

All-to-all broadcast

Communication cost = cost of phase 1 + cost of phase 2
= (√p - 1)(ts + tw · m) + (√p - 1)(ts + tw · √p · m)

Page 84: Parallel Algorithms and Computing Selected topics

Parallel Algorithms and Computing

Selected topics

Sorting in Parallel

Page 85: Parallel Algorithms and Computing Selected topics

Performance measures

– Speedup
– Efficiency
– Work-time
– Amdahl's law

Page 86: Parallel Algorithms and Computing Selected topics

Speedup

Speedup S(p), where p is the number of processors in the parallel solution:

S(p) = T(1) / T(p)
– T(1): sequential execution time
– T(p): parallel execution time with p processors

S(p) < 1: the parallel solution is worse (poor performance)
1 ≤ S(p) ≤ p: normal speedup
S(p) > p: hyper-speedup, not very frequent

[Figure: S(p) plotted against p, with the ideal line S(p) = p separating normal speedup from hyper-speedup]

Page 87: Parallel Algorithms and Computing Selected topics

Speedup

Is hyper-speedup normal? It can occur with:
– a poor, non-optimal sequential algorithm
– storage space as a factor (the data may fit entirely in the combined fast memory of the p processors)

Page 88: Parallel Algorithms and Computing Selected topics

Efficiency

Efficiency E(p):

E(p) = S(p) / p

0 < E(p) ≤ 1: normal
E(p) > 1: hyper-speedup

Interest:
– speedup: the user's point of view
– efficiency: the manager's point of view
– speedup and efficiency: the designer's point of view

Page 89: Parallel Algorithms and Computing Selected topics

Amdahl's law

A program consists of two parts, a sequential part and a parallel part:

T(1) = Tseq = Ts + Tp

T(p) = Ts + Tp / p

Page 90: Parallel Algorithms and Computing Selected topics

Amdahl's law: bound on speedup

S(p) = T(1) / T(p) = (Ts + Tp) / (Ts + Tp / p)

Dividing numerator and denominator by Tseq = Ts + Tp:

S(p) = 1 / ( Ts/Tseq + (1 - Ts/Tseq) / p )

Page 91: Parallel Algorithms and Computing Selected topics

Amdahl's law: bound on speedup

Define the sequential fraction fs and the parallel fraction fp:

fs = Ts / T(1), fs ∈ [0, 1]
fp = Tp / T(1), fp ∈ [0, 1]
fs + fp = 1

The speedup can then be rewritten as:

S(p) = 1 / ( fs + (1 - fs) / p )

Page 92: Parallel Algorithms and Computing Selected topics

Amdahl's law: bound on speedup

S(p) = 1 / ( fs + (1 - fs) / p )

lim (p → ∞) S(p) = 1 / fs

so S(p) ≤ 1 / fs for any number of processors.

Page 93: Parallel Algorithms and Computing Selected topics

Amdahl's law: bound on speedup

For example, if fs is equal to 1%, S(p) is less than 100.

[Figure: S(p) plotted against p; the curve rises from 1 and saturates at the asymptote 1/fs]
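A few lines of C make the bound concrete (illustration only): with fs = 1%, even 1024 processors yield a speedup of only about 91.

#include <stdio.h>

/* Amdahl's law: S(p) = 1 / (fs + (1 - fs) / p) */
static double speedup(double fs, double p) {
    return 1.0 / (fs + (1.0 - fs) / p);
}

int main(void) {
    double fs = 0.01;  /* 1% sequential fraction */
    for (int p = 2; p <= 1024; p *= 4)
        printf("p = %4d  S(p) = %6.2f\n", p, speedup(fs, p));
    /* S(p) approaches 1/fs = 100 but never reaches it */
    return 0;
}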

Page 94: Parallel Algorithms and Computing Selected topics

Amdahl's law

The above speedup bound does not take communication and synchronization overheads into account:

T(p) = Ts + Tp / p + Toverhead

so the real speedup satisfies Sreal(p) ≤ S(p).

Page 95: Parallel Algorithms and Computing Selected topics

Parallel sorting

Types of sorting algorithms and their properties:
– the processor ordering determines the order of the final result
– where input and output are stored
– the basic compare-exchange operation

Page 96: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Internal / external sort:

Internal: the data fits in processor memory (RAM)
– performance based on comparisons and basic operations
– complexity O(n log n)

External: the data resides in memory and on disk
– performance based on basic operations
– overlap of computing and I/O

Page 97: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Comparison-based
– executes comparisons and permutations

Non-comparison-based
– ordering based on properties of the keys

Page 98: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Internal sort (shared memory: PRAM)
– the processors share the data; each processor sorts part of the data in memory
– minimize memory access conflicts

Page 99: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Internal sort (distributed memory)
– each processor is assigned a block of N/P elements
– each processor locally sorts its assigned block (using any internal sorting algorithm)

Input: distributed among the processors
Output: stored on the processors
Final order: the processor order defines the final ordering of the list

[Figure: initial data split into blocks of N/P elements per processor, with P1 < P2 < P3]

Page 100: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Internal sort (distributed memory)

Example: the final order is defined by the Gray-code labelling of the processors.

[Figure: 3-hypercube with processors labelled (0) … (7); the sorted list follows the Gray-code path 1 → 2 → 3 → 4 → 5 through the cube]

Page 101: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Building block: the compare-exchange operation

Sequential: one CPU holds (ai, aj) in RAM, tests ai < aj, and swaps ai ↔ aj if they are out of order.

Parallel: ai is on processor P(i) and aj on processor P(i+1); the two processors exchange their values, then P(i) performs exchange-compare-min(P(i+1)) and keeps ai = min(ai, aj), while P(i+1) performs exchange-compare-max(P(i)) and keeps aj = max(ai, aj).

Page 102: Parallel Algorithms and Computing Selected topics

Issues in sorting algorithms

Compare-exchange with N/p elements per processor:
– P(i) holds a sorted block (here 1 6 8 11 13 62) and P(i+1) another (2 7 9 10 12 63)
– the processors exchange their blocks and merge them: 1 2 6 7 8 9 10 11 12 13 62 63
– P(i) keeps the N/p smallest elements: exchange-compare-min(P(i+1)) gives 1 2 6 7 8 9
– P(i+1) keeps the N/p largest elements: exchange-compare-max(P(i)) gives 10 11 12 13 62 63
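A C sketch of this compare-split step (illustration only): merge the two sorted blocks, then keep the lower half on P(i) and the upper half on P(i+1).

#include <stdio.h>

#define B 6  /* block size N/p */

/* Compare-split: a and b are the sorted blocks of two neighbouring
   processors; afterwards a holds the B smallest and b the B largest
   of the 2B elements, both still sorted. */
static void compare_split(int a[B], int b[B]) {
    int merged[2 * B];
    int i = 0, j = 0;
    for (int k = 0; k < 2 * B; k++) {   /* standard two-way merge */
        if (j >= B || (i < B && a[i] <= b[j])) merged[k] = a[i++];
        else                                   merged[k] = b[j++];
    }
    for (int k = 0; k < B; k++) { a[k] = merged[k]; b[k] = merged[B + k]; }
}

int main(void) {
    int a[B] = {1, 6, 8, 11, 13, 62};   /* block of P(i)   */
    int b[B] = {2, 7, 9, 10, 12, 63};   /* block of P(i+1) */
    compare_split(a, b);
    for (int k = 0; k < B; k++) printf("%d ", a[k]);
    printf("| ");
    for (int k = 0; k < B; k++) printf("%d ", b[k]);
    printf("\n");  /* 1 2 6 7 8 9 | 10 11 12 13 62 63 */
    return 0;
}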

Page 103: Parallel Algorithms and Computing Selected topics

Example: Odd-Even Merge Sort

– Start from an unsorted list of n elements
– Divide the list into two lists of n/2 elements and sort each sub-list: A0 A1 A2 … AM-1 and B0 B1 B2 … BM-1
– Divide each sorted list into its even- and odd-indexed sub-lists: (A0 A2 … AM-2), (A1 A3 … AM-1), and likewise for B
– Merge-sort the odd and even sub-lists into E0 E1 … EM-1 and O0 O1 … OM-1
– Interleave the two lists as E0 O0 E1 O1 … EM-1 OM-1 and exchange out-of-position elements

Page 104: Parallel Algorithms and Computing Selected topics

Where is the parallelism?

[Figure: the recursion produces independent sub-problems: one list of size N, then 2 of size N/2, then 4 of size N/4, …; the merges at each level, and the final interleave-and-exchange on E0 E1 … EM-1 and O0 O1 … OM-1, can all be done in parallel]

Page 105: Parallel Algorithms and Computing Selected topics

Example: Odd-Even Merge Sort

Key to the merge-sort algorithm: the method used to merge the sorted sub-lists. Consider 2 sorted lists of m = 2^k elements:

A = a0, a1, …, am-1 and B = b0, b1, …, bm-1

Even(A) = a0, a2, …, am-2; Odd(A) = a1, a3, …, am-1
Even(B) = b0, b2, …, bm-2; Odd(B) = b1, b3, …, bm-1

Page 106: Parallel Algorithms and Computing Selected topics

Example: Odd-Even Merge Sort

Create 2 merged lists:
– merge Even(A) and Odd(B) into E = E0 E1 … Em-1
– merge Even(B) and Odd(A) into O = O0 O1 … Om-1

Interleave E and O to create a list L': L' = E0 O0 E1 O1 … Em-1 Om-1

Exchange the out-of-order elements of L' to obtain the sorted list L.

Page 107: Parallel Algorithms and Computing Selected topics

Example: Odd-Even Merge Sort

A = 2, 3, 4, 8 and B = 1, 5, 6, 7
Even(A) = 2, 4 and Odd(A) = 3, 8
Even(B) = 1, 6 and Odd(B) = 5, 7

E = 2, 4, 5, 7 and O = 1, 3, 6, 8

L' = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8

L = 1, 2, 3, 4, 5, 6, 7, 8
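A C sketch of this odd-even merge for two sorted lists of equal length (illustration only): it merges Even(A) with Odd(B) and Even(B) with Odd(A), interleaves the results, then fixes adjacent out-of-order pairs, exactly as in the example above.

#include <stdio.h>

#define M 4  /* length of each sorted input list */

/* Two-way merge of sorted x[0..n-1] and y[0..n-1] into out. */
static void merge(const int *x, const int *y, int n, int *out) {
    int i = 0, j = 0;
    for (int k = 0; k < 2 * n; k++) {
        if (j >= n || (i < n && x[i] <= y[j])) out[k] = x[i++];
        else                                   out[k] = y[j++];
    }
}

int main(void) {
    int A[M] = {2, 3, 4, 8}, B[M] = {1, 5, 6, 7};
    int evenA[M/2], oddA[M/2], evenB[M/2], oddB[M/2];
    int E[M], O[M], L[2 * M];

    for (int i = 0; i < M / 2; i++) {
        evenA[i] = A[2*i]; oddA[i] = A[2*i + 1];
        evenB[i] = B[2*i]; oddB[i] = B[2*i + 1];
    }
    merge(evenA, oddB, M / 2, E);   /* E = 2 4 5 7 */
    merge(evenB, oddA, M / 2, O);   /* O = 1 3 6 8 */

    for (int i = 0; i < M; i++) {   /* L' = E0 O0 E1 O1 ... */
        L[2*i] = E[i]; L[2*i + 1] = O[i];
    }
    for (int i = 0; i + 1 < 2 * M; i += 2)  /* fix out-of-order pairs */
        if (L[i] > L[i + 1]) { int t = L[i]; L[i] = L[i + 1]; L[i + 1] = t; }

    for (int i = 0; i < 2 * M; i++) printf("%d ", L[i]);
    printf("\n");  /* 1 2 3 4 5 6 7 8 */
    return 0;
}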

Page 108: Parallel Algorithms and Computing Selected topics

Parallel sorting

Quicksort

Page 109: Parallel Algorithms and Computing Selected topics

Review: Quicksort

Recall: sequential quicksort, recursively:
– choose a pivot
– divide the list in two using the pivot
– sort the left and right sub-lists

Performance
– Best case: log n steps, O(n log n) comparisons
– Worst case: n(n-1)/2 comparisons, i.e. O(n²)

Page 110: Parallel Algorithms and Computing Selected topics

Review: Quicksort

Sequential quicksort:

void Quicksort(double *A, int q, int r)
{
  int s, i;
  double pivot;
  if (q < r) {
    /* divide A using the pivot */
    pivot = A[q];
    s = q;
    for (i = q + 1; i <= r; i++) {
      if (A[i] <= pivot) {
        s = s + 1;
        exchange(A, s, i);   /* swap A[s] and A[i] */
      }
    }
    exchange(A, q, s);       /* put the pivot in place */
    /* recursive calls to sort the new sublists */
    Quicksort(A, q, s - 1);
    Quicksort(A, s + 1, r);
  }
}

Page 111: Parallel Algorithms and Computing Selected topics

Review: Quicksort

Create a binary tree of processors, one new processor for each recursive call of Quicksort.

Easy to implement, but can be inefficient performance-wise.

Page 112: Parallel Algorithms and Computing Selected topics

Review: Quicksort

Shared-memory implementation (with the fork() primitive):

double A[nmax];

void quicksort(int q, int r)
{
  int s, i, n;
  double pivot;
  if (q < r) {
    /* partition */
    pivot = A[q];
    s = q;
    for (i = q + 1; i <= r; i++) {
      if (A[i] <= pivot) {
        s = s + 1;
        exchange(A, s, i);
      }
    }
    exchange(A, q, s);
    /* create a new process for the left sublist */
    n = fork();
    if (n == 0) exec("quicksort", q, s - 1);
    else quicksort(s + 1, r);
  }
}

Page 113: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

d steps, with all processors active in each step; each processor is assigned N/p elements (p = 2^d).

Steps of the solution:
– Initially (step 0), one pivot is chosen and broadcast to all processors
– Each processor partitions its elements into two sub-lists: one with elements less than the current pivot (inferior) and the other with elements greater than or equal to it (superior)
– Exchange the inferior and superior sub-lists along the highest dimension, creating two sub-cubes (one holding the inferior lists, the other the superior lists)
– Each processor merges the (inferior and superior) lists it holds
– Repeat within each sub-cube, along dimensions d-1, d-2, …

Page 114: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Example on a 3-hypercube, step 0: pivot P0

Division along dimension 3; two blocks of elements are created:
– one block of elements less than pivot P0 (sub-cube 0XX)
– one block of elements greater than or equal to P0 (sub-cube 1XX)

Page 115: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Example on a 3-hypercube, step 1: pivots P10 and P11

Division along dimension 2; each sub-cube is divided in two smaller sub-cubes:
– in 0XX: elements < P10 go to 00X, elements > P10 go to 01X
– in 1XX: elements < P11 go to 10X, elements > P11 go to 11X

Page 116: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Example on a 3-hypercube, step 2: pivots P20, P21, P22, P23

Division along dimension 1; the final order is defined by the label ordering of the processors:
– 00X splits on P20: 000 (< P20) and 001 (> P20)
– 01X splits on P21: 010 (< P21) and 011 (> P21)
– 10X splits on P22: 100 (< P22) and 101 (> P22)
– 11X splits on P23: 110 (< P23) and 111 (> P23)

Page 117: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Example on a 3-hypercube, final step: each processor sorts its final list, using for example a sequential quicksort.

[Figure: local sort on each of the 8 nodes; some lists may be empty ({})]

Page 118: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Data exchange at the initial step, between sub-cubes P0XX and P1XX:
– broadcast of pivot P0
– each side partitions its list into (< P0) and (> P0)
– the sub-cubes exchange sub-lists: the inferior parts move to P0XX, the superior parts to P1XX

Sort the sub-lists at the end of each step?

Page 119: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Algorithm: processor k (k = 0, …, p-1)

Hypercube-Quicksort(B, d)
{ /* B contains the elements assigned to processor k */
  /* d is the hypercube dimension */
  int i;
  double x;
  lists B1, B2, T;
  my-id = k;  /* processor id */
  for (i = d-1 downto 0) {
    x = pivot(my-id, i);
    partition(B, x, B1, B2);  /* B1 inferior sub-list, B2 superior sub-list */
    if ((my-id AND 2^i) == 0) {  /* i-th bit is 0 */
      send(B2, my neighbour in dimension i);
      receive(T, my neighbour in dimension i);
      B = B1 ∪ T;
    }
    else {
      send(B1, my neighbour in dimension i);
      receive(T, my neighbour in dimension i);
      B = B2 ∪ T;
    }
  }
  Sequential-Quicksort(B);
} /* end Hypercube-Quicksort */
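A C sketch of the partition helper used above (illustration only; the name and interface follow the pseudocode, not a library): it splits a local block B around the pivot x into the inferior list B1 and the superior list B2.

#include <stdio.h>

/* Split B[0..n-1] around pivot x: elements < x go to B1 (count
   *n1), elements >= x go to B2 (count *n2). B1 and B2 must each
   have room for n elements. */
static void partition(const double *B, int n, double x,
                      double *B1, int *n1, double *B2, int *n2) {
    *n1 = *n2 = 0;
    for (int i = 0; i < n; i++) {
        if (B[i] < x) B1[(*n1)++] = B[i];
        else          B2[(*n2)++] = B[i];
    }
}

int main(void) {
    double B[] = {5, 1, 9, 3, 7, 2}, B1[6], B2[6];
    int n1, n2;
    partition(B, 6, 5.0, B1, &n1, B2, &n2);  /* pivot x = 5 */
    printf("inferior:");
    for (int i = 0; i < n1; i++) printf(" %g", B1[i]);
    printf("\nsuperior:");
    for (int i = 0; i < n2; i++) printf(" %g", B2[i]);
    printf("\n");
    return 0;
}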

Page 120: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Choice of pivot: more important for performance than in the sequential case. It has a great impact on:
– the load balance between processors
– the performance of the algorithm (the performance degrades quickly with a bad pivot)

Page 121: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Worst case: at step 0, the largest element of the list is selected as the pivot, Pivot0 = x = max{xi}.

[Figure: one sub-cube receives everything and the other nothing; the foreground processors are overloaded while the background processors are idle (empty lists {})]

Page 122: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Choice of pivot, ideal case. In parallel do:
– sort the initial list assigned to each processor
– choose the median element of one of the processors of the cube
– assuming a uniform distribution of the elements, the median of that processor's list approximates the median element of the whole list

Page 123: Parallel Algorithms and Computing Selected topics

Quicksort on a d-hypercube

Steps of the algorithm (the loop is repeated d times, d = log p) and their time complexity:

– local sort of the assigned list: O((N/p) · log(N/p))
– selection of the pivot by a processor: O(1 · d)
– broadcast of the pivot in the sub-hypercube of dimension d-i: Σ (i = 1 … d) (d - i + 1) = O(d²) = O(log² p)
– division based on the pivot (binary search): O(d · log(N/p))
– exchange of sub-lists between neighbours: O(d · N/p)
– merge of the sorted sub-lists: O(d · N/p)

Page 124: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

Parallel quicksort algorithm. The solution constructs a binary tree of processors which is traversed in in-order to yield the sorted list.

Variables shared by all processors:
– root: root of the global binary tree
– A[n]: an array of n elements (1, 2, …, n)
– Leftchild[i]: the root of the left sub-tree of processor i (i = 1, 2, …)
– Rightchild[i]: the root of the right sub-tree of processor i (i = 1, 2, …)

Page 125: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

Process /* do in parallel for each processor i */
begin
  Root := i; Parent := i; Leftchild[i] := Rightchild[i] := n+1;
end

Repeat for each processor i ≠ root do
begin
  if (A[i] < A[Parent] …) then
  begin
    Leftchild[Parent] := i          /* concurrent write, one winner */
    if i = Leftchild[Parent] then exit
    else Parent := Leftchild[Parent]
  end
  else
  begin
    Rightchild[Parent] := i         /* concurrent write, one winner */
    if i = Rightchild[Parent] then exit
    else Parent := Rightchild[Parent]
  end
end repeat

Page 126: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

Example, step 0: the array 33 21 13 54 82 33 40 72 is assigned to processors 1 … 8; the competition for the root is won by processor 4, so the root is [4]{54}.

Leftchild and Rightchild of every processor are initialized to 9 (= n+1).

[Figure: root [4]{54}; processors 1, 2, 3, 5, 6, 7, 8 still to be inserted into the binary tree]

Page 127: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

Example, step 1: processor 1 wins the competition for the left subtree of 4 (33 < 54), and processor 5 wins at the right (82 ≥ 54).

[Figure: root [4]{54} with left child [1]{33} and right child [5]{82}; Leftchild[4] = 1, Rightchild[4] = 5; processors 2, 3, 6, 7, 8 continue competing]

Page 128: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

[Figure: the Leftchild/Rightchild arrays after step 1 (Leftchild[4] = 1, Rightchild[4] = 5); processors 2, 3, 6, 7, 8 move down to compete at the next level]

Page 129: Parallel Algorithms and Computing Selected topics

Parallel Quicksort on a PRAM

Example, next step:

[Figure: root [4]{54}; left subtree rooted at [1]{33} with children [2]{21} and [6]{33}; right subtree rooted at [5]{82} with child [8]{72}; processors 3 and 7 continue competing below [2] and [6]]