Source: Stanford University, courses/cs243/lectures/l11-handout.pdf


Carnegie Mellon

1. Fully permutable loop nests & pipelining
2. Example: Transforming for full permutability
3. Time Affine Partitioning: Problem
4. Time Affine Partitioning Algorithm
5. O(1) Synchronization problem

Readings: Chapter 11.8-11.9

Lecture 11: Pipelined Parallelism

M. Lam, CS243: Loop Transformations


1. Recall: Maximum Parallelism & No Communication

C: Space partitioning of Computation to Processor ID

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,

find $C_1, c_1, C_2, c_2$ such that

$$\forall\, i_1, i_2:\quad F_1 i_1 + f_1 = F_2 i_2 + f_2 \;\rightarrow\; C_1 i_1 + c_1 = C_2 i_2 + c_2$$

with the objective of maximizing the rank of $C_1$, $C_2$.

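[Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.]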



SOR (Successive Over-Relaxation): An Example

for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

[Figure: iteration space with axes i and j.]

Pipelineable Parallelism


for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

Processor ID: p
Synchronization variable: t[p], initialized to 0
WAIT: thread waits until the condition becomes true

for j = 1 to n
  if (p==1) or (WAIT(t[p-1]>=j))
    A[p,j] = c * (A[p-1,j] + A[p,j-1])
  t[p]++;
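The per-processor code above can be tried out with ordinary threads. The following is a minimal sketch, not the lecture's implementation: it assumes one Python thread per row p, uses a `threading.Condition` to stand in for WAIT, and the function names `sor_sequential` and `sor_pipelined` are hypothetical.

```python
import threading

def sor_sequential(A, c, m, n):
    """Reference version: the original doubly nested loop."""
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = c * (A[i - 1][j] + A[i][j - 1])

def sor_pipelined(A, c, m, n):
    """One thread per row p; t[p] counts the columns row p has finished."""
    t = [0] * (m + 1)
    cond = threading.Condition()   # assumption: stands in for the slide's WAIT

    def worker(p):
        for j in range(1, n + 1):
            if p > 1:
                with cond:
                    # Block until row p-1 has produced column j.
                    cond.wait_for(lambda: t[p - 1] >= j)
            A[p][j] = c * (A[p - 1][j] + A[p][j - 1])
            with cond:
                t[p] += 1
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(1, m + 1)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
```

Each row waits once per column on its predecessor row only, which is the O(n) synchronization pattern the slide describes.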


Fully Permutable Loop Nests

• Definition: A loop nest is fully permutable if all the loops can be permuted arbitrarily without changing the semantics of the program

• Example:


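for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

[Figure: iteration space with axes i and j, and the interchanged iteration space with axes i' and j'.]

$$\begin{bmatrix} j' \\ i' \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

for j = 1 to n
  for i = 1 to m
    A[i,j] = c * (A[i-1,j] + A[i,j-1])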

When is a Loop Fully Permutable?

• Sequential execution order:

• A loop nest is fully permutable if all the dependences are satisfied in the sequential execution order, under all loop permutations

• INTUITION: A loop nest is fully permutable if
  – its dependences do not point backwards along any axis

• Relationship between communication-free parallelism & full permutability?
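The condition above can be checked mechanically when dependences are summarized as distance vectors. The sketch below is an illustration under that assumption: it tries every loop permutation and requires each permuted distance vector to stay lexicographically non-negative. The distance vectors for the two examples are derived here by hand and do not appear on the slides; the function names are hypothetical.

```python
from itertools import permutations

def lex_nonnegative(v):
    """True if the first nonzero entry of v is positive (or v is all zeros)."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return True

def is_fully_permutable(distance_vectors):
    """Check the dependences under every permutation of the loops."""
    depth = len(distance_vectors[0])
    for order in permutations(range(depth)):
        for d in distance_vectors:
            if not lex_nonnegative([d[k] for k in order]):
                return False
    return True

# SOR: A[i,j] reads A[i-1,j] and A[i,j-1].
sor = [(1, 0), (0, 1)]
# Representative distances (an assumption) for X[j+1] = f(X[j], X[j+1], X[j+2]).
stencil = [(0, 1), (1, 0), (1, -1)]
```

With these inputs the SOR nest passes the check and the stencil nest does not, because (1, -1) points backwards along j once j becomes the outer loop.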



r-Dimensional Pipelineable Parallelism

• r-deep fully permutable loop nest, r > 1, with cross-iteration dependences
  – r choices of outermost loops
  – r-1 degrees of parallelism
  – O(n^(r-1)) parallelism
  – O(n) synchronization

• Code generation
  – r-1 outer loops: processor ID (p_1, p_2, …, p_{r-1})
  – Sequential r-th loop: i_r
  – Iteration i_r for processor (p_1, p_2, …, p_{r-1}) waits for iteration i_r for processors (p_1-1, p_2, …, p_{r-1}), (p_1, p_2-1, …, p_{r-1}), …, (p_1, p_2, …, p_{r-1}-1).
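The waiting rule in the code-generation bullets can be written as a small helper. This is a sketch only: the name `predecessors` is hypothetical, and it assumes processor coordinates start at 1 (as in the SOR example), so a coordinate equal to 1 has no neighbor to wait for.

```python
def predecessors(p):
    """Processors whose iteration i_r must finish before processor p runs i_r.

    p is a tuple (p_1, ..., p_{r-1}); assumption: coordinates start at 1.
    """
    preds = []
    for k in range(len(p)):
        if p[k] > 1:
            # Decrement exactly one coordinate, leave the others unchanged.
            preds.append(p[:k] + (p[k] - 1,) + p[k + 1:])
    return preds
```

For example, `predecessors((2, 3))` lists the two neighbors `(1, 3)` and `(2, 2)`, while the corner processor `(1, 1)` waits for nobody.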



Recall: Blocking for Matrix Multiplication

[Figure: matrix multiplication on 1000 × 1000 matrices, unblocked versus 32 × 32 blocks. Data accessed: 1,002,000 versus 65,024.]


Experimental Results

[Figure: speedup versus number of processors, with blocking and without blocking.]

Blocking with Matrix Multiplication

• Original program

  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < n; k++) {
        Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}

• Stripmine 2 outer loops

  for (ii = 0; ii < n; ii = ii+B) {
    for (i = ii; i < min(n,ii+B); i++) {
      for (jj = 0; jj < n; jj = jj+B) {
        for (j = jj; j < min(n,jj+B); j++) {
          for (k = 0; k < n; k++) {
            Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}}}

• Permute loops

  for (ii = 0; ii < n; ii = ii+B) {
    for (jj = 0; jj < n; jj = jj+B) {
      for (k = 0; k < n; k++) {
        for (i = ii; i < min(n,ii+B); i++) {
          for (j = jj; j < min(n,jj+B); j++) {
            Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}}}
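To see that the stripmine-and-permute steps preserve the result, the original and the final loop orders can be compared directly. This is a small sketch, assuming square list-of-lists matrices; the function names are hypothetical and the block size `B` need not divide `n`.

```python
def matmul(X, Y, n):
    """Original i, j, k loop order."""
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Z[i][j] += X[i][k] * Y[k][j]
    return Z

def matmul_blocked(X, Y, n, B):
    """Blocked order from the slide: ii, jj, k, then i and j inside the block."""
    Z = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for k in range(n):
                for i in range(ii, min(n, ii + B)):
                    for j in range(jj, min(n, jj + B)):
                        Z[i][j] += X[i][k] * Y[k][j]
    return Z
```

Each Z[i][j] still accumulates its products in increasing k, so the two versions agree element by element; only the order in which blocks of Z are visited changes.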



How to Block Loops?

• Fully permutable loop nests can be blocked

  1. Stripmine to create more fully permutable loops

     for (i = 0; i < n; i++) {
       <code>
     }

     =>

     for (ii = 0; ii < n; ii = ii+B) {
       for (i = ii; i < min(n,ii+B); i++) {
         <code>
       }
     }

  2. Permute the inner stripmined loops inward



Uses of Blocking

• Increase data locality
  – Block size can be chosen so data accessed in the block fits in the faster level of the hierarchy (virtual memory, cache, registers)

• Reduce synchronization overhead
  – By a factor of the block size
  – Consideration: startup latency, load balance for triangular loops

• SIMD instructions
  – To create contiguous vector access

  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}

  =>

  for (jj = 0; jj < n; jj += 4) {
    for (k = 0; k < n; k++) {
      for (j = jj; j < min(n, jj+4); j++) {
        Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}



2. How to Make Loops Fully Permutable? (example)

Example:

for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

[Figure: iteration space with axes i and j, showing the dependences.]

Transforming for Full Permutability

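for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

[Figure: original iteration space (axes i, j) and transformed iteration space (axes i', j').]

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$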


Code Generation

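for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

[Figure: original iteration space (axes i, j) and transformed iteration space (axes i', j').]

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

i' = i
j' = i + j, so j = j' - i'

Loop bounds:
0 <= i' <= m
0 <= j' - i' <= n

for i' = 0 to m
  for j' = i' to i'+n
    X[j'-i'+1] = (X[j'-i'] + X[j'-i'+1] + X[j'-i'+2])/3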
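The generated code can be checked against the original loop nest. The sketch below is illustrative only: it runs the original loops, the skewed loops with the bounds derived above, and a third version with the skewed loops interchanged. The interchanged bounds (j' from 0 to m+n, i' from max(0, j'-n) to min(m, j')) are worked out here from the same inequalities and are not on the slide; the function names are hypothetical.

```python
def original(X, m, n):
    for i in range(m + 1):
        for j in range(n + 1):
            X[j + 1] = (X[j] + X[j + 1] + X[j + 2]) / 3

def skewed(X, m, n):
    # i' = i, j' = i + j; bounds 0 <= i' <= m and i' <= j' <= i' + n.
    for ip in range(m + 1):
        for jp in range(ip, ip + n + 1):
            j = jp - ip
            X[j + 1] = (X[j] + X[j + 1] + X[j + 2]) / 3

def skewed_interchanged(X, m, n):
    # Same iterations with j' outermost (assumed bounds, derived by hand).
    for jp in range(m + n + 1):
        for ip in range(max(0, jp - n), min(m, jp) + 1):
            j = jp - ip
            X[j + 1] = (X[j] + X[j + 1] + X[j + 2]) / 3
```

All three versions execute the same set of iterations and respect every dependence, so they produce identical arrays. The original nest would not survive a plain interchange of i and j, which is why the skew is applied first.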


Is the Result Fully Permutable?

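[Figure: dependences in the original iteration space (axes i, j) and in the transformed iteration space (axes i', j').]

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$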


3. The Problem of Creating Fully Permutable Loops

• RECALL: r-deep fully permutable loop nest; r > 1
  – r choices of outermost loops
  – r-1 degrees of parallelism
  – O(n^(r-1)) parallelism
  – O(n) synchronization

• GOAL: Find transformation to maximize the degree of pipelining
  → Find all the possible outermost loops



Finding the Maximum Degree of Pipelining

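C: Time partitioning of Computation to Time

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, let $B_1 i_1 + b_1 \ge 0$, $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints.

Find $C_1, c_1, C_2, c_2$ such that

$$\forall\, i_1, i_2 \text{ with } B_1 i_1 + b_1 \ge 0,\; B_2 i_2 + b_2 \ge 0:\quad (i_1 \le i_2) \wedge (F_1 i_1 + f_1 = F_2 i_2 + f_2) \;\rightarrow\; C_1 i_1 + c_1 \le C_2 i_2 + c_2$$

[Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$, with $i_1 \le i_2$.]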



Solutions of Time Mapping


[Figure: two iteration spaces with axes i and j and their time-mapping solutions: [1 0], [1 1] for the first, and [1 0], [0 1] for the second.]
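For a single statement with constant dependence distances d = i2 - i1, the time-mapping constraint reduces to c · d ≥ 0 for every distance vector d. The sketch below checks candidate rows c under that simplification. The pairing of the two figures with the stencil and SOR examples, and the distance vectors themselves, are assumptions made here for illustration; the function name is hypothetical.

```python
def is_valid_time_mapping(c, distance_vectors):
    """True if c . d >= 0 for every dependence distance vector d."""
    return all(
        sum(ck * dk for ck, dk in zip(c, d)) >= 0
        for d in distance_vectors
    )

# Representative distances (assumed) for X[j+1] = f(X[j], X[j+1], X[j+2]).
stencil = [(0, 1), (1, 0), (1, -1)]
# SOR: A[i,j] reads A[i-1,j] and A[i,j-1].
sor = [(1, 0), (0, 1)]
```

Under these assumptions [1 0] and [1 1] are valid for the stencil while [0 1] is not, and both [1 0] and [0 1] are valid for SOR, which matches the solutions listed in the figure.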


Solutions to Loop Transforms


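[Figure: original iteration space (axes i, j) and two transformed iteration spaces (axes i', j').]

Time-mapping solutions: $\begin{bmatrix} 1 & 0 \end{bmatrix}$, $\begin{bmatrix} 1 & 1 \end{bmatrix}$

Loop transforms:

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}
\qquad
\begin{bmatrix} j' \\ i' \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$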


4. Time Partitioning Algorithm

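Compare:

[Figure: space partitioning (loops to array through $F_1 i_1 + f_1$, $F_2 i_2 + f_2$; loops to processor ID through $C_1 i_1 + c_1$, $C_2 i_2 + c_2$) next to time partitioning (same accesses; loops to time stage through $C_1 i_1 + c_1$, $C_2 i_2 + c_2$, with $i_1 \le i_2$).]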

Comparing the Two Problems

Pipelining Parallelism:

C: Time mapping of Computation to Time

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, let $B_1 i_1 + b_1 \ge 0$, $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints.

Find $C_1, c_1, C_2, c_2$ such that

$$\forall\, i_1, i_2 \text{ with } B_1 i_1 + b_1 \ge 0,\; B_2 i_2 + b_2 \ge 0:\quad (i_1 \le i_2) \wedge (F_1 i_1 + f_1 = F_2 i_2 + f_2) \;\rightarrow\; C_1 i_1 + c_1 \le C_2 i_2 + c_2$$

Communication-Free Parallelism:

C: Space partitioning of Computation to Processor ID

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,

find $C_1, c_1, C_2, c_2$ such that

$$\forall\, i_1, i_2:\quad F_1 i_1 + f_1 = F_2 i_2 + f_2 \;\rightarrow\; C_1 i_1 + c_1 = C_2 i_2 + c_2$$



Farkas Lemma

Finding the possible time dimensions c: given matrix A, find a vector c such that

  for all vectors x such that $Ax \ge 0$, $c^T x \ge 0$

Farkas Lemma, 1901 (real domain)

  The primal system of inequalities
    $Ax \ge 0,\; c^T x < 0$
  has a real-valued solution x,
  or the dual system
    $A^T y = c,\; y \ge 0$
  has a real-valued solution y, but never both.

Time partitioning: find c such that $A^T y = c$, $y \ge 0$

Note: Farkas Lemma is a theorem of the alternative (no intuitive proof exists)
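One direction of the lemma is easy to check numerically: if c = Aᵀy with y ≥ 0, then cᵀx = yᵀ(Ax) ≥ 0 whenever Ax ≥ 0. The sketch below builds such a c and samples integer vectors x looking for a violation. The matrix A, the multipliers y, and the helper names are hypothetical values chosen for illustration; random sampling is a sanity check, not a proof.

```python
import random

def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose_vec(A, y):
    """Compute A^T y."""
    return [sum(A[r][k] * y[r] for r in range(len(A))) for k in range(len(A[0]))]

def count_violations(A, c, trials=2000, seed=0):
    """Count sampled x with Ax >= 0 but c . x < 0."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        x = [rng.randint(-10, 10) for _ in c]
        feasible = all(v >= 0 for v in mat_vec(A, x))
        if feasible and sum(ck * xk for ck, xk in zip(c, x)) < 0:
            violations += 1
    return violations

A = [[1, 0], [1, 1]]      # hypothetical constraint matrix
y = [2, 3]                # any non-negative multipliers
c = transpose_vec(A, y)   # c = A^T y, so the dual system has a solution
```

For a vector outside the cone generated by the rows of A, such as `[0, -1]`, the same sampler does find points with Ax ≥ 0 and cᵀx < 0, which is the primal side of the alternative.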



Example: Cholesky Decomposition

for (i = 1; i <= N; i++) {
  for (j = 1; j <= i-1; j++) {
    for (k = 1; k <= j-1; k++)
      X[i][j] = X[i][j] - X[i][k]*X[j][k];
    X[i][j] = X[i][j]/X[j][j];
  }
  for (m = 1; m <= i-1; m++) {
    X[i][i] = X[i][i] - X[i][m]*X[i][m];
  }
  X[i][i] = sqrt(X[i][i]);
}

for (i = 1; i <= N; i++) {
  for (j = 1; j <= i; j++) {
    for (k = 1; k <= i; k++) {
      if (j<i && k<j)
        X[i][j] = X[i][j] - X[i][k]*X[j][k];
      if (j==k && j<i)
        X[i][j] = X[i][j]/X[j][j];
      if (i==j && k<i)
        X[i][i] = X[i][i] - X[i][k]*X[i][k];
      if (i==j && j==k)
        X[i][i] = sqrt(X[i][i]);
}}}
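The two loop nests above should compute the same factor. The following sketch transcribes both into Python with 1-based indexing (row and column 0 are unused padding) so they can be compared on a small symmetric positive definite matrix. The function names and the test matrix are hypothetical.

```python
import math

def cholesky_original(X, N):
    """Imperfectly nested version."""
    for i in range(1, N + 1):
        for j in range(1, i):
            for k in range(1, j):
                X[i][j] -= X[i][k] * X[j][k]
            X[i][j] /= X[j][j]
        for m in range(1, i):
            X[i][i] -= X[i][m] * X[i][m]
        X[i][i] = math.sqrt(X[i][i])

def cholesky_perfect(X, N):
    """Perfectly nested version with guards selecting each statement."""
    for i in range(1, N + 1):
        for j in range(1, i + 1):
            for k in range(1, i + 1):
                if j < i and k < j:
                    X[i][j] -= X[i][k] * X[j][k]
                if j == k and j < i:
                    X[i][j] /= X[j][j]
                if i == j and k < i:
                    X[i][i] -= X[i][k] * X[i][k]
                if i == j and j == k:
                    X[i][i] = math.sqrt(X[i][i])

def padded(M):
    """Copy a 0-based matrix into a 1-based one with an unused row/column 0."""
    n = len(M)
    return [[0.0] * (n + 1)] + [[0.0] + [float(v) for v in row] for row in M]
```

Only the lower triangle is read and written, so after either call the entries X[i][j] with j ≤ i hold the Cholesky factor.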

[Figure: transformed iteration space for i = 6, with axes j and k.]

5. Beyond Pipelined Parallelism

What if there is only 1 fully permutable outermost loop?

Example:

for (i=1; i<=n; i++) {
  X[i] = Y[i] + Z[i];   (s1)
  W[A[i]] = X[i];       (s2)
}



O(1) Synchronization

for (i=1; i<=n; i++) {
  X[i] = Y[i] + Z[i];   (s1)
  W[A[i]] = X[i];       (s2)
}

• Program dependence graph
  – Nodes: statements
  – Edges: data dependence

• Split the program into a sequence of strongly connected components separated by O(1) barriers

[Figure: program dependence graph with nodes s1 and s2.]

for (i=1; i<=n; i++) {
  X[i] = Y[i] + Z[i];   (s1)
}
for (i=1; i<=n; i++) {
  W[A[i]] = X[i];       (s2)
}
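The split can be simulated to confirm it matches the fused loop. This is a sketch under assumptions: a `ThreadPoolExecutor` runs the s1 iterations in parallel, leaving its `with` block serves as the single barrier, and s2 is kept sequential here because `A` may contain repeated indices. The function names and the `w_size` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fused(Y, Z, A, n, w_size):
    """Original loop: s1 and s2 in the same iteration (1-based indices)."""
    X = [0] * (n + 1)
    W = [0] * w_size
    for i in range(1, n + 1):
        X[i] = Y[i] + Z[i]      # s1
        W[A[i]] = X[i]          # s2
    return X, W

def split(Y, Z, A, n, w_size):
    """Two loops separated by one barrier."""
    X = [0] * (n + 1)
    W = [0] * w_size

    def s1(i):
        X[i] = Y[i] + Z[i]

    # All s1 iterations are independent; the pool is drained on exit (barrier).
    with ThreadPoolExecutor() as pool:
        list(pool.map(s1, range(1, n + 1)))

    # s2 runs after the barrier; sequential so repeated A[i] keep the last write.
    for i in range(1, n + 1):
        W[A[i]] = X[i]
    return X, W
```

The number of barriers is one regardless of n, which is the O(1) synchronization the slide refers to.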


Algorithm

1. Find the coarsest granularity of parallelism with minimum synchronization

   Find outermost communication-free parallelism
   Find outermost fully permutable loop nest
   If there are inner loops remaining
     Find program dependence graph
     Split the program into strongly connected components
     Repeat for each strongly connected component

2. Apply blocking to improve locality



Example: Neural Network

// 2D 3x3 convolution (stride = 1)
for i = 0 to channels-1
  for y = 2 to Sy-1
    for x = 2 to Sx-1
      B[i,y,x] = A[i,y-2,x-2]*W1[0,0] + A[i,y-2,x-1]*W1[0,1] + …
               + A[i,y-1,x-2]*W1[1,0] + …
               + A[i,y,x-2]*W1[2,0] + …

// ReLU (Rectified Linear Unit)
for i = 0 to channels-1
  for y = 2 to Sy-1
    for x = 2 to Sx-1
      B[i,y,x] = max(B[i,y,x], 0)

// 2D 3x3 convolution (stride = 2)
for i = 0 to channels-1
  for y = 2 to (Sy-1)/2
    for x = 2 to (Sx-1)/2
      C[i,y,x] = B[i,2*y-2,2*x-2]*W2[0,0] + …
               + B[i,2*y-1,2*x-2]*W2[1,0] + …

// Dense neural network layer
for i = 0 to channels-1
  for j = 0 to Sj-1
    for y = 2 to (Sy-1)/2
      for x = 2 to (Sx-1)/2
        D[i,j] += C[i,y,x]*W3[j,y,x]

// Softmax
for i = 0 to channels-1
  for j = 0 to Sj-1
    T[i,j] = exp(D[i,j]);
    E[i] += T[i,j];
  for j = 0 to Sj-1
    F[i,j] = T[i,j]/E[i]


Parallelization without Reduction Optimization

for i = 0 to channels-1                // Parallel loop
  for y = 2 to Sy-1                    // Permutable loop nest
    for x = 2 to Sx-1                  // Permutable loop nest
      // 2D convolution (stride = 1)
      B[i,y,x] += A[i,y-2,x-2]*W1[0,0] + A[i,y-2,x-1]*W1[0,1] + …
                + A[i,y-1,x-2]*W1[1,0] + …
                + A[i,y,x-2]*W1[2,0] + …

      // ReLU (Rectified Linear Unit)
      B[i,y,x] = max(B[i,y,x], 0)

      // 2D convolution (stride = 2)
      if (y >= 4) && (x >= 4) && (y mod 2 == 0) && (x mod 2 == 0)
        C[i,y/2,x/2] += B[i,y-2,x-2]*W2[0,0] + …
                      + B[i,y-1,x-2]*W2[1,0] + …

  // Dense neural network layer
  for j = 0 to Sj-1                    /* Parallel loop */
    for y = 2 to (Sy-1)/2
      for x = 2 to (Sx-1)/2
        D[i,j] += C[i,y,x]*W3[j,y,x]

  // Softmax
  for j = 0 to Sj-1
    T[i,j] = exp(D[i,j]);
    E[i] += T[i,j];

  for j = 0 to Sj-1                    /* Parallel loop */
    F[i,j] = T[i,j]/E[i]



Summary: Two Key Algorithms


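[Figure: the two partitioning problems side by side. Space partitioning maps loops to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$; time partitioning maps loops to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$ with $i_1 \le i_2$. In both, the array accesses are $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$.]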
