Lecture 11 Pipelined Parallelism - Stanford University, courses/cs243/lectures/l11-handout.pdf
Carnegie Mellon
1. Fully permutable loop nests & pipelining
2. Example: Transforming for full permutability
3. Time Affine Partitioning: Problem
4. Time Affine Partitioning Algorithm
5. O(1) Synchronization problem

Readings: Chapter 11.8-11.9

Lecture 11
Pipelined Parallelism

M. Lam, CS243: Loop Transformations
1. Recall: Maximum Parallelism & No Communication
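C: Space partitioning of Computation to Processor ID
For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,
find $C_1, c_1, C_2, c_2$ such that
$\forall i_1, i_2:\; F_1 i_1 + f_1 = F_2 i_2 + f_2 \rightarrow C_1 i_1 + c_1 = C_2 i_2 + c_2$
with the objective of maximizing the rank of $C_1$, $C_2$.

[Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.]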
SOR (Successive Over-Relaxation): An Example
for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

[Figure: iteration space with axes i and j.]
Pipelineable Parallelism
for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

Processor ID: p
Synchronization variable: t[p], initialized to 0
WAIT: thread waits until the condition becomes true

for j = 1 to n
  if (p==1) or (WAIT(t[p-1]>=j))
    A[p,j] = c * (A[p-1,j] + A[p,j-1])
  t[p]++;
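
The pipelined loop above can be tried out with ordinary threads. The following is a minimal sketch, not the lecture's implementation: it assumes one Python thread per row p, models WAIT with a shared `threading.Condition`, and uses the hypothetical helper names `sor_sequential`, `sor_pipelined`, and `make_grid`. Row 0 and column 0 are treated as fixed boundary values.

```python
import threading

def sor_sequential(A, c):
    """Reference version: the original doubly nested loop."""
    m, n = len(A) - 1, len(A[0]) - 1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = c * (A[i - 1][j] + A[i][j - 1])
    return A

def sor_pipelined(A, c):
    """One thread per row p; t[p] counts the columns row p has finished."""
    m, n = len(A) - 1, len(A[0]) - 1
    t = [0] * (m + 1)
    cond = threading.Condition()

    def worker(p):
        for j in range(1, n + 1):
            if p > 1:
                with cond:
                    # WAIT(t[p-1] >= j): block until row p-1 has done column j
                    cond.wait_for(lambda: t[p - 1] >= j)
            A[p][j] = c * (A[p - 1][j] + A[p][j - 1])
            with cond:
                t[p] += 1          # t[p]++
                cond.notify_all()  # wake any row waiting on this counter

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(1, m + 1)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return A

def make_grid(m, n):
    """Illustrative input: nonzero boundary, zero interior."""
    return [[float(i + j) if i == 0 or j == 0 else 0.0
             for j in range(n + 1)] for i in range(m + 1)]
```

Each element is computed from the same two operands in both versions, so the two results should agree exactly; only the interleaving across rows differs.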
Fully Permutable Loop Nests
• Definition: A loop nest is fully permutable if all the loops can be permuted arbitrarily without changing the semantics of the program
• Example:

for i = 1 to m
  for j = 1 to n
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

$\begin{bmatrix} j' \\ i' \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$

for j = 1 to n
  for i = 1 to m
    A[i,j] = c * (A[i-1,j] + A[i,j-1])

[Figure: iteration space with axes i and j, and the permuted space with axes j' and i'.]
When is a Loop Fully Permutable?
• Sequential execution order:
• A loop nest is fully permutable if all the dependences are satisfied in the sequential execution order, under all loop permutations
• INTUITION: A loop nest is fully permutable if
  – Its dependences do not point backwards along any axis
• Relationship between communication-free parallelism & full permutability?
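
The definition and the intuition above can be compared mechanically when dependences are given as constant distance vectors. The sketch below is illustrative only: the function names are hypothetical, and it assumes every dependence is a nonzero integer distance vector such as (1,0) and (0,1), which are the distances read off from `A[i-1,j]` and `A[i,j-1]` in the SOR loop.

```python
from itertools import permutations

def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not a cross-iteration dependence

def permutable_by_definition(distances):
    """Every loop order must keep every dependence lexicographically positive."""
    depth = len(distances[0])
    return all(
        lex_positive([d[k] for k in order])
        for order in permutations(range(depth))
        for d in distances
    )

def permutable_by_intuition(distances):
    """No dependence points backwards along any axis."""
    return all(x >= 0 for d in distances for x in d)

SOR_DISTANCES = [(1, 0), (0, 1)]
```

For nonzero distance vectors the two tests agree: a negative component can always be moved to the outermost position by some permutation, which makes the vector lexicographically negative.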
[Figure: iteration space with axes i and j.]
r-Dimensional Pipelineable Parallelism
• r-deep fully permutable loop nest, r > 1, with cross-iteration dependences
  – r choices of outermost loops
  – r-1 degrees of parallelism
  – $O(n^{r-1})$ parallelism
  – O(n) synchronization
• Code generation
  – r-1 outer loops: processor ID $(p_1, p_2, \ldots, p_{r-1})$
  – Sequential r-th loop: $i_r$
  – Iteration $i_r$ for processor $(p_1, p_2, \ldots, p_{r-1})$ waits for iteration $i_r$ for processors $(p_1 - 1, p_2, \ldots, p_{r-1})$, $(p_1, p_2 - 1, \ldots, p_{r-1})$, …, $(p_1, p_2, \ldots, p_{r-1} - 1)$.
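
The waiting rule above fixes an earliest time step for each iteration. The following sketch simulates that schedule under simplifying assumptions that are not in the slides: every loop runs from 0 to n-1, each iteration takes one time step, and there are enough processors. The name `pipeline_schedule` is hypothetical.

```python
from itertools import product

def pipeline_schedule(r, n):
    """Earliest time step of each iteration in an r-deep nest with n iterations per loop.

    An iteration waits for the previous iteration of the sequential loop on its
    own processor and for the same iteration i_r on each neighbouring processor
    (one coordinate of the processor ID lowered by 1).
    """
    step = {}
    for idx in product(range(n), repeat=r):  # lexicographic order, predecessors first
        preds = []
        for k in range(r):
            if idx[k] > 0:
                prev = idx[:k] + (idx[k] - 1,) + idx[k + 1:]
                preds.append(step[prev])
        step[idx] = 1 + max(preds, default=0)
    return step
```

Under these assumptions every iteration runs at step `sum(idx) + 1`, so the whole nest finishes in r(n-1)+1 steps, which is linear in n for a fixed depth r.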
Recall: Blocking for Matrix Multiplication
[Figure: matrix multiplication of 1000 x 1000 matrices, without blocking and with blocks of width 32. Data accessed: 1002000 without blocking, 65024 with blocking.]
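
The two "data accessed" figures can be reproduced with a short calculation. This is one reading of the numbers, not something the slide states: it assumes the unblocked count covers one row of the result (a row of Z, a row of X, and all of Y) and the blocked count covers one 32 x 32 block of the result (the block of Z, a 32 x 1000 strip of X, and a 1000 x 32 strip of Y). The function names are hypothetical.

```python
def footprint_row(n):
    """Elements touched to compute one row of Z: row of Z + row of X + all of Y."""
    return n + n + n * n

def footprint_block(n, b):
    """Elements touched to compute one b x b block of Z:
    block of Z + b x n strip of X + n x b strip of Y."""
    return b * b + b * n + n * b
```

With n = 1000 and b = 32 these give 1002000 and 65024, matching the figure.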
Experimental Results
[Figure: speedup versus number of processors, with blocking and without blocking.]
Blocking with Matrix Multiplication
• Original program
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < n; k++) {
        Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}
• Stripmine 2 outer loops
  for (ii = 0; ii < n; ii = ii+B) {
    for (i = ii; i < min(n,ii+B); i++) {
      for (jj = 0; jj < n; jj = jj+B) {
        for (j = jj; j < min(n,jj+B); j++) {
          for (k = 0; k < n; k++) {
            Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}}}
• Permute loops
  for (ii = 0; ii < n; ii = ii+B) {
    for (jj = 0; jj < n; jj = jj+B) {
      for (k = 0; k < n; k++) {
        for (i = ii; i < min(n,ii+B); i++) {
          for (j = jj; j < min(n,jj+B); j++) {
            Z[i][j] = Z[i][j] + X[i][k]*Y[k][j];
  }}}}}
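
To check that the permuted, blocked loop order computes the same product, the nest can be transcribed directly. The sketch below is a plain Python transcription under the assumption that the matrices are lists of lists; `matmul_naive` and `matmul_blocked` are illustrative names, and pure Python is used only for clarity, not speed.

```python
def matmul_naive(X, Y, n):
    """Original i, j, k loop order."""
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Z[i][j] += X[i][k] * Y[k][j]
    return Z

def matmul_blocked(X, Y, n, B):
    """Stripmined and permuted order: ii, jj, k, i, j with block size B."""
    Z = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for k in range(n):
                for i in range(ii, min(n, ii + B)):
                    for j in range(jj, min(n, jj + B)):
                        Z[i][j] += X[i][k] * Y[k][j]
    return Z
```

The `min(n, ii + B)` bound handles a block size that does not divide n, as in the loops above.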
How to Block Loops?
• Fully permutable loop nests can be blocked
  1. Stripmine to create more fully permutable loops

     for (i = 0; i < n; i++) {
       <code>
     }
     =>
     for (ii = 0; ii < n; ii = ii+B) {
       for (i = ii; i < min(n,ii+B); i++) {
         <code>
       }
     }

  2. Permute inner stripmined loop inside
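
Stripmining on its own does not change the order of iterations; only the later permutation does. A small sketch, with the hypothetical name `stripmined_indices`, records the order in which the stripmined loop visits `i`:

```python
def stripmined_indices(n, B):
    """Indices visited by the stripmined loop, in execution order."""
    visited = []
    for ii in range(0, n, B):                  # loop over strips
        for i in range(ii, min(n, ii + B)):    # loop within one strip
            visited.append(i)
    return visited
```

For any positive block size B the result equals `list(range(n))`, which is why step 1 is always legal and legality only has to be argued for step 2.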
Uses of Blocking
• Increase data locality
  – Block size can be chosen so data accessed in the block fits in the faster level of the memory hierarchy (virtual memory, cache, registers)
• Reduce synchronization overhead
  – By a factor of the block size
  – Consideration: startup latency, load balance for triangular loops
• SIMD instructions
  – To create contiguous vector access

for (j = 0; j < n; j++) {
  for (k = 0; k < n; k++) {
Example:

for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

[Figure: two iteration-space diagrams with axes i and j.]
Transforming for Full Permutability
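for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$

[Figure: iteration space with axes i and j, and the transformed space with axes i' and j'.]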
Code Generation
for i = 0 to m
  for j = 0 to n
    X[j+1] = (X[j] + X[j+1] + X[j+2])/3

$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$

i' = i
j' = i + j
j = j' - i'

Loop bounds:
0 <= i' <= m
0 <= j' - i' <= n

for i' = 0 to m
  for j' = i' to i'+n
    X[j'-i'+1] = (X[j'-i'] + X[j'-i'+1] + X[j'-i'+2])/3

[Figure: iteration space with axes i and j, and the transformed space with axes i' and j'.]
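
The generated code can be checked against the original by running both on the same input. The sketch below transcribes the two loop nests into Python; the function names are hypothetical, and it assumes X has at least n+3 elements so that X[j+2] stays in range.

```python
def stencil_original(X, m, n):
    """Original nest: i from 0 to m, j from 0 to n (both inclusive)."""
    X = list(X)
    for i in range(m + 1):
        for j in range(n + 1):
            X[j + 1] = (X[j] + X[j + 1] + X[j + 2]) / 3
    return X

def stencil_skewed(X, m, n):
    """Skewed nest: i' = i, j' = i + j, so j = j' - i' and j' runs from i' to i'+n."""
    X = list(X)
    for ip in range(m + 1):
        for jp in range(ip, ip + n + 1):
            j = jp - ip
            X[j + 1] = (X[j] + X[j + 1] + X[j + 2]) / 3
    return X
```

Skewing alone only renames iterations, so both versions perform the same operations in the same order and the results should be identical.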
Is the Result Fully Permutable?
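[Figure: the original iteration space (axes i and j) and the transformed space (axes i' and j'), related by $\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$.]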
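
One way to answer the question is to push the dependence distance vectors through the transformation matrix and look for negative components. The sketch below does this for the stencil loop; the distance vectors listed are representative ones read off from the accesses X[j], X[j+1], X[j+2] (they are an assumption of this example, not given on the slide), and the names are hypothetical.

```python
SKEW = [[1, 0],
        [1, 1]]  # i' = i, j' = i + j

# Representative (i, j) distances for X[j+1] = (X[j] + X[j+1] + X[j+2])/3
STENCIL_DISTANCES = [(0, 1), (1, -1), (1, 0), (1, 1)]

def transform(T, d):
    """Apply the matrix T to the distance vector d."""
    return tuple(sum(T[r][c] * d[c] for c in range(len(d))) for r in range(len(T)))

def no_backward_component(distances):
    """True if no dependence points backwards along any axis."""
    return all(x >= 0 for d in distances for x in d)

SKEWED_DISTANCES = [transform(SKEW, d) for d in STENCIL_DISTANCES]
```

Before skewing, the distance (1,-1) points backwards along j; after skewing it becomes (1,0), and all transformed distances are non-negative, which is the condition for full permutability stated earlier.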
3. The Problem of Creating Fully Permutable Loops
• RECALL: r-deep fully permutable loop nest; r > 1
  – r choices of outermost loops
  – r-1 degrees of parallelism
  – $O(n^{r-1})$ parallelism
  – O(n) synchronization
• GOAL: Find transformation to maximize the degree of pipelining
• → Find all the possible outermost loops
Finding the Maximum Degree of Pipelining
C: Time partitioning of Computation to Time
For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,
let $B_1 i_1 + b_1 \ge 0$, $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints,
with the objective of maximizing the rank of $C_1$, $C_2$
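
The constraint that belongs between the loop bounds and the objective is missing from the text above. A plausible form, assuming it parallels the space-partitioning constraint from section 1 with the equality between mappings relaxed to an ordering on time stages, would be:

```latex
\forall\, i_1, i_2 \text{ such that } i_1 \text{ executes before } i_2,\quad
B_1 i_1 + b_1 \ge 0,\quad B_2 i_2 + b_2 \ge 0,\quad
F_1 i_1 + f_1 = F_2 i_2 + f_2:
\qquad C_1 i_1 + c_1 \le C_2 i_2 + c_2
```

Read this as a sketch under that assumption: if two iterations touch the same data and the first runs earlier in the original program, the first must be mapped to a time stage no later than the second.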
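[Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$, annotated $i_1 \le i_2$.]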
Solutions of Time Mapping
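[Figure: two iteration spaces with axes i and j. Solutions for the first: $\begin{bmatrix} 1 & 0 \end{bmatrix}$, $\begin{bmatrix} 1 & 1 \end{bmatrix}$. Solutions for the second: $\begin{bmatrix} 1 & 0 \end{bmatrix}$, $\begin{bmatrix} 0 & 1 \end{bmatrix}$.]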
Solutions to Loop Transforms
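[Figure: iteration space with axes i and j, and two transformed spaces with axes i' and j'.]

$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$

$\begin{bmatrix} j' \\ i' \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$

$\begin{bmatrix} 1 & 0 \end{bmatrix}$, $\begin{bmatrix} 1 & 1 \end{bmatrix}$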
4. Time Partitioning Algorithm
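Compare:

[Figure: space partitioning. Loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.]

[Figure: time partitioning. Loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$, annotated $i_1 \le i_2$.]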
Comparing the Two Problems
Pipelining Parallelism:
C: Time mapping of Computation to Time
For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,
let $B_1 i_1 + b_1 \ge 0$, $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints,