Using the Iteration Space Visualizer in Loop Parallelization
Yijun YU
http://winpar.elis.rug.ac.be/ppt/isv
Overview
ISV – a 3D Iteration Space Visualizer:
- View the dependences in the iteration space (iteration: one instance of the loop body; space: the grid of all index values)
- Detect the parallelism
- Estimate the speedup
- Derive a loop transformation
- Find statement-level parallelism
- Future development
1. Dependence
Sequential loop:
DO I = 1,3
  A(I) = A(I-1)
ENDDO
Parallel version (DOALL):
DOALL I = 1,3
  A(I) = A(I-1)
ENDDO
The sequential execution trace:
A(1) = A(0)
A(2) = A(1)
A(3) = A(2)
One possible DOALL execution trace:
A(2) = A(1)
A(1) = A(0)
A(3) = A(2)
[Figure: successive contents of the shared-memory array A under each execution order]
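The effect of this flow dependence can be reproduced with a small simulation (Python here for illustration; the initial value A(0) = 10 is arbitrary):

```python
# Run A(I) = A(I-1) over the given iteration order and return the array.
def run(order):
    A = [10, 0, 0, 0]  # A(0..3); A(0) = 10 is an arbitrary initial value
    for i in order:
        A[i] = A[i - 1]
    return A

seq = run([1, 2, 3])  # sequential order: each read sees the prior write
par = run([2, 1, 3])  # one possible DOALL interleaving
print(seq)  # [10, 10, 10, 10]
print(par)  # [10, 10, 0, 0] -> a different result: the loop is not parallel
```

Because iteration I reads the value written by iteration I-1, any reordering can read a stale value, which is exactly what the dependence edges in ISV flag.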
1.1 Example 1
ISV directive: visualize
1.2 Visualize the Dependence
A dependence is visualized in the iteration space dependence graph:
- Node: an iteration
- Edge: the dependence order between nodes
- Color: the dependence type
  FLOW: write → read
  ANTI: read → write
  OUTPUT: write → write
1.3 Parallelism?
Stepwise view of the sequential execution.
No parallelism is found in this example. However, many programs do have parallelism…
2. Potential Parallelism
Time(sequential) = number of iterations
Dataflow execution: each iteration runs as soon as its data are ready.
Time(dataflow) = number of iterations on the longest critical path
The potential parallelism is measured by
speedup = Time(sequential) / Time(dataflow)
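Time(dataflow) is the longest-path length in the dependence graph; a minimal Python sketch (the graph encoding is hypothetical, not ISV's format):

```python
# Time(dataflow) = length of the longest chain in the dependence DAG.
def dataflow_time(iters, deps):
    """iters: list of iterations; deps: dict mapping an iteration to the
    iterations it depends on. Returns the number of dataflow steps."""
    step = {}
    def level(i):
        if i not in step:
            step[i] = 1 + max((level(p) for p in deps.get(i, [])), default=0)
        return step[i]
    return max(level(i) for i in iters)

# Example 1: A(I) = A(I-1), I = 1..3 -> a single chain, no parallelism.
iters = [1, 2, 3]
deps = {2: [1], 3: [2]}
t_seq = len(iters)
t_flow = dataflow_time(iters, deps)
print(t_seq / t_flow)  # 1.0 -> no potential speedup
```

With an empty dependence graph the same function returns 1 step, i.e. all iterations can run at once.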
2.1 Example 2
Diophantine equations + loop bounds (polytope) = iteration space dependences
2.2 Irregular dependence
The dependences have non-uniform distances.
Parallelism analysis: 200 iterations over 15 dataflow steps.
Speedup: 200/15 = 13.3
Problem: how to exploit it?
3. Visualize parallelism
Find answers to these questions:
- What is the dependence pattern?
- Is there a parallel loop? (How to find it?)
- What is the maximal parallelism? (How to exploit it?)
- Is the load of the parallel tasks balanced?
3.1 Example 3
3.2 3D Space
3.3 Loop parallelizable?
The I, J, K loops span a 3D iteration space of 32 iterations.
Simulate the sequential execution.
Which loop can be parallel?
Interactively try a parallelization and check the chosen parallel loop (here, loop I).
3.4 Loop parallelization
The blinking dependence edges prevent parallelization of the chosen loop I.
Let ISV find the correct parallelization: automatically check for a parallel loop.
Simulate the parallel execution.
3.5 Parallel execution
Parallel execution takes 16 time steps; sequential execution takes 32.
Simulate the dataflow execution.
3.6 Dataflow execution
Dataflow execution takes only 4 time steps.
Dataflow (potential) speedup = 32/4 = 8.
Iterate through the partitions: the connected components of the dependence graph.
3.7 Graph partitioning
All the partitions are load balanced
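The partitioning into connected components can be sketched as follows; this is a toy Python illustration over an explicit dependence graph, not ISV's implementation:

```python
# Partition the iteration space dependence graph into connected
# components; each component is an independent parallel task.
from collections import defaultdict

def components(nodes, edges):
    adj = defaultdict(set)
    for a, b in edges:               # treat dependences as undirected links
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:                 # depth-first traversal of one component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            seen.add(v)
            stack.extend(adj[v] - comp)
        parts.append(comp)
    return parts

# Two independent dependence chains -> two load-balanced partitions.
print(components([1, 2, 3, 4], [(1, 3), (2, 4)]))  # [{1, 3}, {2, 4}]
```

Equal-sized components correspond to the load-balanced partitions ISV reports.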
4. Loop Transformation
A transformation turns potential parallelism into real parallelism.
4.1 Example 4
4.2 The iteration space
Sequential execution takes 25 iterations.
4.3 Loop parallelizable?
Check loop I; check loop J.
4.4 Dataflow execution
In total 9 dataflow steps; potential speedup: 25/9 = 2.78.
Wavefront effect: all iterations in the same dataflow step lie on the same line.
4.5 Zoom-in on the I-space
4.6 Speedup vs program size
Zoom-in previews the parallelism in part of a loop without modifying the program.
Executing the program at different sizes N estimates a speedup of N^2/(2N-1):

Loop size | # iterations | # dataflow steps | Speedup
2         | 4            | 3                | 1.33
3         | 9            | 5                | 1.8
4         | 16           | 7                | 2.29
5         | 25           | 9                | 2.78
N         | N^2          | 2N-1             | O(N/2)
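The dataflow-step counts above can be reproduced by levelizing a wavefront recurrence such as A(I,J) = A(I-1,J) + A(I,J-1); this Python sketch assumes that loop body, which may differ in detail from the actual Example 4:

```python
# Count dataflow steps of the wavefront recurrence
# A(I,J) = A(I-1,J) + A(I,J-1), 1 <= I,J <= N.
def dataflow_steps(N):
    step = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            # an iteration runs one step after its latest predecessor
            step[i, j] = 1 + max(step.get((i - 1, j), 0),
                                 step.get((i, j - 1), 0))
    return max(step.values())

for N in (2, 3, 4, 5):
    print(N, N * N, dataflow_steps(N))  # N^2 iterations, 2N-1 steps
```

Iteration (I,J) lands on step I+J-1, so the iterations of one step form the anti-diagonal "wave" lines seen in the visualizer.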
4.7 How to obtain the potential parallelism
We already have these metrics:
Sequential time steps = N^2
Dataflow time steps = 2N-1
Potential speedup = N^2/(2N-1)
How can a transformation obtain this potential speedup for the loop?
4.8 Unimodular transformation (UT)
A unimodular matrix is a square integer matrix with determinant ±1. It is derived from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing.
The new loop execution order is determined by the transformed index: i' = U·i, where U is the unimodular matrix, i is the old loop index vector, and i' is the new one. The iteration space keeps its unit step size.
Finding a suitable UT reorders the iterations such that the new loop nest has a parallel loop.
2D examples:
reversal:
 1  0
 0 -1
interchange:
 0  1
 1  0
skewing:
 1  0
 2  1
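As a quick sanity check, the three elementary 2D matrices and the i' = U·i mapping can be exercised in a few lines of Python (the specific skew factor is illustrative):

```python
# A unimodular matrix has integer entries and determinant +/-1.
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

reversal    = [[1, 0], [0, -1]]
interchange = [[0, 1], [1, 0]]
skewing     = [[1, 0], [2, 1]]

for U in (reversal, interchange, skewing):
    assert abs(det2(U)) == 1  # all three elementary matrices are unimodular

# The transformed index is i' = U i; skewing maps (i, j) -> (i, 2i + j):
def apply(U, i):
    return tuple(sum(U[r][c] * i[c] for c in range(2)) for r in range(2))

print(apply(skewing, (1, 1)))  # (1, 3)
```

Because |det U| = 1, the mapping is a bijection on the integer grid, which is why the transformed loop still visits every iteration exactly once.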
4.9 Hyperplane transformation
Interactively define a hyperplane.
Observe that the plane iteration matches the dataflow simulation: plane = dataflow.
Based on the plane, ISV calculates a unimodular transformation.
View the transformed iteration space and the generated loop.
4.10 The derived UT
4.11 Verify the UT
ISV checks whether the transformation is valid.
Observe that the parallel loop execution of the transformed loop matches the plane execution: parallel = plane.
5. Statement-level parallelism
Unimodular transformations work at the iteration level; statement dependences within the loop body are hidden in the iteration space graph.
How can parallelism be exploited at the statement level? By mapping statements to iterations.
5.1 Example 5
SSV: statement space visualization
5.2 Iteration-level parallelism
The iteration space is 2D.
There are N^2 = 16 iterations (N = 4).
The dataflow execution takes 2N-1 = 7 time steps.
The potential speedup is 16/7 = 2.29.
5.3 Parallelism in statements
The (statement) iteration space is 3D.
There are 2N^2 = 32 statement instances.
The dataflow execution still takes 2N-1 = 7 time steps.
The potential speedup is 32/7 = 4.57.
5.4 Comparison
Statement-level analysis doubles the potential speedup obtained at the iteration level:

Loop size | # iterations | # steps | # statements | # steps
N         | N^2          | 2N-1    | 2N^2         | 2N-1
5.5 Define the partition planes
Partitions are defined by hyperplanes.
What makes a partitioning valid?
Show the execution order on top of the dependence arrows (for one plane or all planes together, depending on the density of the slide).
5.6 Invalid UT
The invalid unimodular transformation derived from the hyperplane is refused by ISV.
Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph.
6. Pseudo distance method
The pseudo distance method:
- Extract base vectors from the dependent iterations
- Check whether the base vectors generate all the distances
- Calculate the unimodular transformation based on the base vectors
It is another way to find parallelism automatically. The iteration space is a grid: non-uniform dependences are members of a uniform dependence grid with unknown base vectors. Finding these base vectors allows us to extend existing parallelization to the non-uniform case.
6.1 Dependence distance
The distance vectors: (1,0,-1) and (0,1,1)
6.2 The transformation
The transforming matrix discovered by the pseudo distance method:
 1  1  0
-1  0  1
 1  0  0
The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1).
Dependent iterations now share the same first index, which implies that the outermost loop is parallel.
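This can be verified numerically: under the convention that a distance vector d transforms as d' = d·U (an assumption consistent with the vectors shown here), both transformed distances get a zero first component:

```python
# Pseudo distance method result: transformed distances with a zero first
# component mean the outermost loop carries no dependence.
U = [[ 1, 1, 0],
     [-1, 0, 1],
     [ 1, 0, 0]]

def transform(d, U):
    # row vector d times matrix U
    return tuple(sum(d[r] * U[r][c] for r in range(3)) for c in range(3))

d1 = transform((1, 0, -1), U)
d2 = transform((0, 1, 1), U)
print(d1, d2)  # (0, 1, 0) (0, 0, 1)
assert d1[0] == d2[0] == 0  # outermost loop is parallel
```

Since every dependence distance starts with 0 after the transformation, no dependence crosses outermost-loop iterations, so that loop can run in parallel.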
6.3 Compare the UT matrices
The transforming matrix discovered by the pseudo distance method:
 1  1  0
-1  0  1
 1  0  0
The invalid transforming matrix discovered by the hyperplane method:
 1  0  0
-1  1  0
 1  0  1
The same first column means that the transformed outermost loops have the same index.
6.4 The transformed space
The outermost loop is parallel
There are 8 parallel tasks
The load of tasks is not balanced
The longest task takes 7 time steps
7. Non-perfectly nested loops
Unimodular transformations only work for perfectly nested loops.
For a non-perfectly nested loop, the iteration space is constructed with extended indices: an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop.
7.1 Perfectly nested loop?
Non-perfectly nested loop:
DO I1 = 1,3
  A(I1) = A(I1-1)
  DO I2 = 1,4
    B(I1,I2) = B(I1-1,I2)+B(I1,I2-1)
  ENDDO
ENDDO
Equivalent perfectly nested loop:
DO I1 = 1,3
  DO I2 = 1,5
    DO I3 = 0,1
      IF (I2.EQ.1 .AND. I3.EQ.0) THEN
        A(I1) = A(I1-1)
      ELSE IF (I3.EQ.1) THEN
        B(I1-1,I2) = B(I1-2,I2)+B(I1-1,I2-1)
      ENDIF
    ENDDO
  ENDDO
ENDDO
7.2 Exploit parallelism with UT
8. Applications

Program               | Category  | Depth | Form        | Pattern     | Transformation
Example 1             | Tutorial  | 1     | Perfect     | Uniform     | N/A
Example 2             | Tutorial  | 2     | Perfect     | Non-uniform | N/A
Example 3             | Tutorial  | 3     | Perfect     | Uniform     | Wavefront UT
Example 4             | Tutorial  | 2     | Perfect     | Uniform     | Wavefront UT
Example 5             | Tutorial  | 2+1   | Perfect     | Uniform     | Stmt Partitioning UT
Example 6             | Tutorial  | 2+1   | Non-perfect | Uniform     | Wavefront UT
Matrix multiplication | Algorithm | 3     | Perfect     | Uniform     | Parallelization
Gauss-Jordan          | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
FFT                   | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
Cholesky              | Benchmark | 4     | Non-perfect | Non-uniform | Partitioning UT
TOMCATV               | Benchmark | 3     | Non-perfect | Uniform     | Parallelization
Flow3D                | CFD App.  | 3     | Perfect     | Uniform     | Wavefront UT
9. Future considerations
- Weighted dependence graphs
- More semantics on data locality: data space graph, data communication graph, data-reuse iteration space graph
- More loop transformations: affine (statement) iteration space mappings
- Automatic statement distribution
- Integration with the Omega library