12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007
Dec 20, 2015

12e.1

More on Parallel Computing

UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007

12e.2

Block Mapping (Review)

blksz = (int)ceil((float)N / P);

for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}

(lb is the lower bound of the original loop)
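
The loop bounds above can be wrapped in a small helper. This is a minimal sketch rather than the course's own code: `ceil_div` and `block_range` are hypothetical names, and integer arithmetic stands in for the slide's `(int)ceil((float)N / P)` on the assumption that N and P are positive.

```c
#include <assert.h>

/* Integer ceiling of n / d (hypothetical helper, assumes d > 0). */
static int ceil_div(int n, int d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

/* Computes the half-open range [*start, *end) of iterations that my_rank
   owns under block mapping. If *start >= *end the rank has no work. */
static void block_range(int N, int P, int lb, int my_rank,
                        int *start, int *end) {
    int blksz = ceil_div(N, P);
    int hi = lb + (my_rank + 1) * blksz;
    *start = lb + my_rank * blksz;
    *end = hi < N ? hi : N;            /* min(N, hi) */
}
```

A rank would then run `for (i = start; i < end; i++) { ... }` over its own range.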

12e.3

Example

for (i = 1; i < N; i++) {
    for (j = 0; j < N; j++) {
        a[i][j] += f(a[i-1][j]);
    }
}

12e.4

Example

         j ->
  i   0,0    0,1    0,2    0,3    ...  0,N-1
  |   1,0    1,1    1,2    1,3    ...  1,N-1
  v   2,0    2,1    2,2    2,3    ...  2,N-1
      ...    ...    ...    ...    ...  ...
      N-1,0  N-1,1  N-1,2  N-1,3  ...  N-1,N-1

12e.5

Example

• If we mapped iterations of the i loop to processors, the dependencies would cross processor boundaries: each a[i][j] needs a[i-1][j], which belongs to the previous row

• Therefore interprocessor communication would be required

12e.6

Example

PE0:    0,0    0,1    0,2    0,3    ...  0,N-1
PE1:    1,0    1,1    1,2    1,3    ...  1,N-1
PE2:    2,0    2,1    2,2    2,3    ...  2,N-1
...     ...    ...    ...    ...    ...  ...
PEP-1:  N-1,0  N-1,1  N-1,2  N-1,3  ...  N-1,N-1

12e.7

Example

• A better solution would be to map iterations of the j loop to processors: every dependence stays within a single column, so no communication is needed

12e.8

Example

PE0:   PE1:   PE2:   PE3:
0,0    0,1    0,2    0,3    ...  0,N-1
1,0    1,1    1,2    1,3    ...  1,N-1
2,0    2,1    2,2    2,3    ...  2,N-1
...    ...    ...    ...    ...  ...
N-1,0  N-1,1  N-1,2  N-1,3  ...  N-1,N-1

12e.9

Example

for (i = 1; i < N; i++) {
    for (j = my_rank * blksz; j < min(N, (my_rank + 1) * blksz); j++) {
        a[i][j] += f(a[i-1][j]);
    }
}
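
One way to convince yourself that mapping the j loop needs no communication is to simulate it: run each rank's share of the work one rank at a time, in any order, and compare with the sequential result. The sketch below does that under several assumptions that are not in the slides: a fixed small `N`, an arbitrary placeholder `f`, and the hypothetical names `run_serial`, `run_rank`, and `mapping_matches_serial`.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* small problem size chosen only for illustration */

/* Placeholder for the slide's f(); any pure function would do. */
static int f(int x) { return 2 * x + 1; }

/* The original sequential loop nest. */
static void run_serial(int a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += f(a[i - 1][j]);
}

/* Work done by one rank when the j loop is block-mapped: the rank owns a
   band of columns and walks down the rows inside that band. */
static void run_rank(int a[N][N], int my_rank, int P) {
    int blksz = (N + P - 1) / P;            /* ceil(N / P) */
    int jend = (my_rank + 1) * blksz;
    if (jend > N) jend = N;                 /* min(N, ...) */
    for (int i = 1; i < N; i++)
        for (int j = my_rank * blksz; j < jend; j++)
            a[i][j] += f(a[i - 1][j]);
}

/* Returns 1 if running every rank to completion, one after another in
   forward or reverse rank order, reproduces the sequential result. */
static int mapping_matches_serial(int P, int reverse_order) {
    assert(P > 0);
    int s[N][N], p[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s[i][j] = p[i][j] = i * N + j;
    run_serial(s);
    for (int k = 0; k < P; k++)
        run_rank(p, reverse_order ? P - 1 - k : k, P);
    return memcmp(s, p, sizeof s) == 0;
}
```

Because each rank only ever reads and writes its own columns, the order in which the ranks run does not affect the result.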

12e.10

Block Mapping (Review)

blksz = (int)ceil((float)N / P);

for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}

(lb is the lower bound of the original loop)

12e.11

Block Mapping

PE0          PE1          ...  PEP-1
0            1*blksz+0    ...  (P-1)*blksz+0
1            1*blksz+1    ...  (P-1)*blksz+1
...          ...          ...  ...
1*blksz-1    2*blksz-1    ...  N-1

12e.12

Block Mapping

• The problem is that block mapping can lead to a load imbalance

• Example: let N=26, P=6

• blksz = ceiling(26/6) = 5

• (lb = 0)

12e.13

Block Mapping

PE0  PE1  PE2  PE3  PE4  PE5
 0    5   10   15   20   25
 1    6   11   16   21
 2    7   12   17   22
 3    8   13   18   23
 4    9   14   19   24

• Processors 0-4 have 5 iterations of work

• Processor 5 has 1 iteration

12e.14

Cyclic Mapping

• An alternative to block mapping is cyclic mapping

• This is where iterations are assigned to processors one at a time, in round-robin fashion

• This leads to a better load balance

12e.15

Cyclic Mapping

PE0  PE1  PE2  PE3  PE4  PE5
 0    1    2    3    4    5
 6    7    8    9   10   11
12   13   14   15   16   17
18   19   20   21   22   23
24   25

• Processors 0-1 have 5 iterations of work

• Processors 2-5 have only 4, but it is only 1 iteration fewer!

12e.16

Cyclic Mapping

for (i = lb + my_rank; i < N; i += P) { ...}

(lb is the lower bound of the original loop)
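
To see how evenly this loop spreads the work, one can simply count the iterations each rank executes. The following is a small sketch; `cyclic_count` is a hypothetical helper name, not part of the slides.

```c
#include <assert.h>

/* Number of iterations my_rank executes under cyclic mapping, counted by
   running the mapped loop itself. */
static int cyclic_count(int N, int P, int lb, int my_rank) {
    assert(P > 0 && my_rank >= 0 && my_rank < P);
    int count = 0;
    for (int i = lb + my_rank; i < N; i += P)
        count++;
    return count;
}
```

With N=26, P=6, and lb=0 this returns 5 for ranks 0 and 1 and 4 for ranks 2 through 5, so no two processors differ by more than one iteration.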

12e.17

Cyclic Mapping

• Conceptually, this is an easier mapping to implement than block mapping

• It leads to better load balancing

• However, it can (and often does) lead to more communication

• Suppose that each iteration in the above example is dependent on the previous iteration

12e.18

Cyclic Mapping

PE0  PE1  PE2  PE3  PE4  PE5
 0    1    2    3    4    5
 6    7    8    9   10   11
12   13   14   15   16   17
18   19   20   21   22   23
24   25

• A message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6, ...

12e.19

Block Mapping

• With block mapping, messages are sent only across block boundaries: from iteration 4 to 5, from 9 to 10, from 14 to 15, from 19 to 20, and from 24 to 25

PE0  PE1  PE2  PE3  PE4  PE5
 0    5   10   15   20   25
 1    6   11   16   21
 2    7   12   17   22
 3    8   13   18   23
 4    9   14   19   24

12e.20

Block vs Cyclic

• Block mapping increases the granularity and reduces overall communication (O(P) messages). However, it can lead to load imbalances (up to O(N/P) iterations).

• Cyclic mapping decreases granularity and increases overall communication (O(N) messages). However, it improves load balance (imbalance of O(1) iterations).

• Block-Cyclic is a combination of the two
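
The trade-off can be checked numerically for the running example. The sketch below assumes lb = 0 and a chain dependence (iteration i needs iteration i-1); the names `block_owner`, `cyclic_owner`, `count_messages`, and `load_imbalance` are hypothetical, and the fixed array bound of 64 processors is an arbitrary choice for illustration.

```c
#include <assert.h>

/* Processor that owns iteration i under each mapping (lb = 0 assumed). */
static int block_owner(int i, int N, int P) { return i / ((N + P - 1) / P); }
static int cyclic_owner(int i, int N, int P) { (void)N; return i % P; }

typedef int (*owner_fn)(int i, int N, int P);

/* Messages needed when iteration i depends on iteration i-1: one for each
   adjacent pair of iterations that live on different processors. */
static int count_messages(owner_fn owner, int N, int P) {
    int msgs = 0;
    for (int i = 1; i < N; i++)
        if (owner(i, N, P) != owner(i - 1, N, P))
            msgs++;
    return msgs;
}

/* Load imbalance: largest minus smallest per-processor iteration count. */
static int load_imbalance(owner_fn owner, int N, int P) {
    int count[64] = {0};
    assert(P > 0 && P <= 64);
    for (int i = 0; i < N; i++)
        count[owner(i, N, P)]++;
    int lo = count[0], hi = count[0];
    for (int p = 1; p < P; p++) {
        if (count[p] < lo) lo = count[p];
        if (count[p] > hi) hi = count[p];
    }
    return hi - lo;
}
```

For N=26 and P=6, block mapping needs 5 messages with an imbalance of 4 iterations, while cyclic mapping needs 25 messages with an imbalance of 1.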

12e.21

Block-Cyclic Mapping

• Block-cyclic with N=26, P=6, and blksz=2

• The load imbalance will be <= blksz

PE0  PE1  PE2  PE3  PE4  PE5
 0    2    4    6    8   10
 1    3    5    7    9   11
12   14   16   18   20   22
13   15   17   19   21   23
24
25

12e.22

Block-Cyclic Mapping

(N, P, and blksz are given)

nLayers = (int)ceil(((float)N)/(blksz*P));

for (layer = 0; layer < nLayers; layer++) {
    beginBlk = layer*blksz*P;
    for (i = beginBlk + mypid*blksz; i < min(N, beginBlk + (mypid + 1)*blksz); i++) {
        ...
    }
}
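
A quick sanity check for a mapping like this is that every iteration is claimed by exactly one processor. The sketch below runs the block-cyclic loop once per processor and records who owns what; it assumes lb = 0 and that each layer starts at `layer*blksz*P`, since one layer holds P blocks of blksz iterations. `block_cyclic_owners` is a hypothetical name.

```c
#include <assert.h>

/* Fills owner[0..N-1] by running the block-cyclic loop once per processor.
   Returns 1 if every iteration was claimed exactly once, otherwise 0. */
static int block_cyclic_owners(int N, int P, int blksz, int owner[]) {
    assert(N > 0 && P > 0 && blksz > 0);
    int nLayers = (N + blksz * P - 1) / (blksz * P);  /* ceil(N/(blksz*P)) */
    for (int i = 0; i < N; i++)
        owner[i] = -1;                                /* not yet claimed */
    for (int mypid = 0; mypid < P; mypid++) {
        for (int layer = 0; layer < nLayers; layer++) {
            int beginBlk = layer * blksz * P;         /* start of this layer */
            int end = beginBlk + (mypid + 1) * blksz;
            if (end > N) end = N;                     /* min(N, ...) */
            for (int i = beginBlk + mypid * blksz; i < end; i++) {
                if (owner[i] != -1) return 0;         /* claimed twice */
                owner[i] = mypid;
            }
        }
    }
    for (int i = 0; i < N; i++)
        if (owner[i] == -1) return 0;                 /* never claimed */
    return 1;
}
```

For N=26, P=6, blksz=2 this assigns iterations 0, 1, 12, 13, 24, and 25 to processor 0 and four iterations to each of the others.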

12e.23

Block vs Cyclic

• Block-Cyclic is in between Block and Cyclic in terms of granularity, communication, and load balancing.

• Block and Cyclic are special cases of Block-Cyclic

  – Block = Block-Cyclic with blksz = ceiling(N/P)

  – Cyclic = Block-Cyclic with blksz = 1
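
The two special cases can be checked with a closed-form owner function: iteration i falls in block i / blksz, and blocks are dealt to processors round-robin. The sketch below assumes lb = 0; `block_cyclic_owner`, `same_as_block`, and `same_as_cyclic` are hypothetical names.

```c
#include <assert.h>

/* Processor that owns iteration i under block-cyclic mapping (lb = 0). */
static int block_cyclic_owner(int i, int P, int blksz) {
    assert(i >= 0 && P > 0 && blksz > 0);
    return (i / blksz) % P;
}

/* Returns 1 if blksz = ceiling(N/P) gives the block-mapping owner i / blksz
   for every iteration. */
static int same_as_block(int N, int P) {
    int blksz = (N + P - 1) / P;
    for (int i = 0; i < N; i++)
        if (block_cyclic_owner(i, P, blksz) != i / blksz)
            return 0;
    return 1;
}

/* Returns 1 if blksz = 1 gives the cyclic-mapping owner i % P for every
   iteration. */
static int same_as_cyclic(int N, int P) {
    for (int i = 0; i < N; i++)
        if (block_cyclic_owner(i, P, 1) != i % P)
            return 0;
    return 1;
}
```

Both checks return 1 for the running example with N=26 and P=6.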