Using the Iteration Space Visualizer in Loop Parallelization
Yijun YU
http://winpar.elis.rug.ac.be/ppt/isv
Overview
ISV – a 3D Iteration Space Visualizer:
- View the dependences in the iteration space (iteration: one instance of the loop body; space: the grid of all index values)
- Detect the parallelism
- Estimate the speedup
- Derive a loop transformation
- Find statement-level parallelism
- Future development
1. Dependence
Sequential loop:
DO I = 1,3
  A(I) = A(I-1)
ENDDO
Parallel version (DOALL):
DOALL I = 1,3
  A(I) = A(I-1)
ENDDO
The sequential execution trace:
A(1) = A(0)
A(2) = A(1)
A(3) = A(2)
One possible DOALL execution trace:
A(2) = A(1)
A(1) = A(0)
A(3) = A(2)
[Figure: successive contents of the shared-memory array A under each execution order]
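The effect of this flow dependence can be reproduced with a small simulation (Python here for illustration; the initial value A(0) = 10 is arbitrary):

```python
# Run A(I) = A(I-1) over the given iteration order and return the array.
def run(order):
    A = [10, 0, 0, 0]  # A(0..3); A(0) = 10 is an arbitrary initial value
    for i in order:
        A[i] = A[i - 1]
    return A

seq = run([1, 2, 3])  # sequential order: each read sees the prior write
par = run([2, 1, 3])  # one possible DOALL interleaving
print(seq)  # [10, 10, 10, 10]
print(par)  # [10, 10, 0, 0] -> a different result: the loop is not parallel
```

Because iteration I reads the value written by iteration I-1, any reordering can read a stale value, which is exactly what the dependence edges in ISV flag.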
1.1 Example 1
ISV directive: visualize
1.2 Visualize the Dependence
A dependence is visualized in the iteration space dependence graph:
- Node: an iteration
- Edge: the dependence order between nodes
- Color: the dependence type
  FLOW: write → read
  ANTI: read → write
  OUTPUT: write → write
1.3 Parallelism?
Stepwise view of the sequential execution.
No parallelism is found in this example. However, many programs do have parallelism…
2. Potential Parallelism
Time(sequential) = number of iterations
Dataflow execution: each iteration runs as soon as its data are ready.
Time(dataflow) = number of iterations on the longest critical path
The potential parallelism is measured by
speedup = Time(sequential) / Time(dataflow)
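Time(dataflow) is the longest-path length in the dependence graph; a minimal Python sketch (the graph encoding is hypothetical, not ISV's format):

```python
# Time(dataflow) = length of the longest chain in the dependence DAG.
def dataflow_time(iters, deps):
    """iters: list of iterations; deps: dict mapping an iteration to the
    iterations it depends on. Returns the number of dataflow steps."""
    step = {}
    def level(i):
        if i not in step:
            step[i] = 1 + max((level(p) for p in deps.get(i, [])), default=0)
        return step[i]
    return max(level(i) for i in iters)

# Example 1: A(I) = A(I-1), I = 1..3 -> a single chain, no parallelism.
iters = [1, 2, 3]
deps = {2: [1], 3: [2]}
t_seq = len(iters)
t_flow = dataflow_time(iters, deps)
print(t_seq / t_flow)  # 1.0 -> no potential speedup
```

With an empty dependence graph the same function returns 1 step, i.e. all iterations can run at once.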
2.1 Example 2
Diophantine equations + loop bounds (polytope) = iteration space dependences
2.2 Irregular dependence
The dependences have non-uniform distances.
Parallelism analysis: 200 iterations over 15 dataflow steps.
Speedup: 200/15 = 13.3
Problem: how to exploit it?
3. Visualize parallelism
Find answers to these questions:
- What is the dependence pattern?
- Is there a parallel loop? (How to find it?)
- What is the maximal parallelism? (How to exploit it?)
- Is the load of the parallel tasks balanced?
3.1 Example 3
3.2 3D Space
3.3 Loop parallelizable?
The I, J, K loops span a 3D iteration space of 32 iterations.
Simulate the sequential execution.
Which loop can be parallel?
Interactively try a parallelization and check the chosen parallel loop (here, loop I).
3.4 Loop parallelization
The blinking dependence edges prevent parallelization of the chosen loop I.
Let ISV find the correct parallelization: automatically check for a parallel loop.
Simulate the parallel execution.
3.5 Parallel execution
Parallel execution takes 16 time steps; sequential execution takes 32.
Simulate the dataflow execution.
3.6 Dataflow execution
Dataflow execution takes only 4 time steps.
Dataflow (potential) speedup = 32/4 = 8.
Iterate through the partitions: the connected components of the dependence graph.
3.7 Graph partitioning
All the partitions are load balanced
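The partitioning into connected components can be sketched as follows; this is a toy Python illustration over an explicit dependence graph, not ISV's implementation:

```python
# Partition the iteration space dependence graph into connected
# components; each component is an independent parallel task.
from collections import defaultdict

def components(nodes, edges):
    adj = defaultdict(set)
    for a, b in edges:               # treat dependences as undirected links
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:                 # depth-first traversal of one component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            seen.add(v)
            stack.extend(adj[v] - comp)
        parts.append(comp)
    return parts

# Two independent dependence chains -> two load-balanced partitions.
print(components([1, 2, 3, 4], [(1, 3), (2, 4)]))  # [{1, 3}, {2, 4}]
```

Equal-sized components correspond to the load-balanced partitions ISV reports.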
4. Loop Transformation
A transformation turns potential parallelism into real parallelism.
4.1 Example 4
4.2 The iteration space
Sequential execution takes 25 iterations.
4.3 Loop parallelizable?
Check loop I; check loop J.
4.4 Dataflow execution
In total 9 dataflow steps; potential speedup: 25/9 = 2.78.
Wavefront effect: all iterations in the same dataflow step lie on the same line.
4.5 Zoom-in on the I-space
4.6 Speedup vs program size
Zoom-in previews the parallelism in part of a loop without modifying the program.
Executing the program at different sizes N estimates a speedup of N^2/(2N-1):

Loop size | # iterations | # dataflow steps | Speedup
2         | 4            | 3                | 1.33
3         | 9            | 5                | 1.8
4         | 16           | 7                | 2.29
5         | 25           | 9                | 2.78
N         | N^2          | 2N-1             | O(N/2)
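The dataflow-step counts above can be reproduced by levelizing a wavefront recurrence such as A(I,J) = A(I-1,J) + A(I,J-1); this Python sketch assumes that loop body, which may differ in detail from the actual Example 4:

```python
# Count dataflow steps of the wavefront recurrence
# A(I,J) = A(I-1,J) + A(I,J-1), 1 <= I,J <= N.
def dataflow_steps(N):
    step = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            # an iteration runs one step after its latest predecessor
            step[i, j] = 1 + max(step.get((i - 1, j), 0),
                                 step.get((i, j - 1), 0))
    return max(step.values())

for N in (2, 3, 4, 5):
    print(N, N * N, dataflow_steps(N))  # N^2 iterations, 2N-1 steps
```

Iteration (I,J) lands on step I+J-1, so the iterations of one step form the anti-diagonal "wave" lines seen in the visualizer.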
4.7 How to obtain the potential parallelism
We already have these metrics:
Sequential time steps = N^2
Dataflow time steps = 2N-1
Potential speedup = N^2/(2N-1)
How can a transformation obtain this potential speedup for the loop?
4.8 Unimodular transformation (UT)
A unimodular matrix is a square integer matrix with determinant ±1. It is derived from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing.
The new loop execution order is determined by the transformed index: i' = U·i, where U is the unimodular matrix, i is the old loop index vector, and i' is the new one. The iteration space keeps its unit step size.
Finding a suitable UT reorders the iterations such that the new loop nest has a parallel loop.
2D examples:
reversal:
 1  0
 0 -1
interchange:
 0  1
 1  0
skewing:
 1  0
 2  1
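As a quick sanity check, the three elementary 2D matrices and the i' = U·i mapping can be exercised in a few lines of Python (the specific skew factor is illustrative):

```python
# A unimodular matrix has integer entries and determinant +/-1.
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

reversal    = [[1, 0], [0, -1]]
interchange = [[0, 1], [1, 0]]
skewing     = [[1, 0], [2, 1]]

for U in (reversal, interchange, skewing):
    assert abs(det2(U)) == 1  # all three elementary matrices are unimodular

# The transformed index is i' = U i; skewing maps (i, j) -> (i, 2i + j):
def apply(U, i):
    return tuple(sum(U[r][c] * i[c] for c in range(2)) for r in range(2))

print(apply(skewing, (1, 1)))  # (1, 3)
```

Because |det U| = 1, the mapping is a bijection on the integer grid, which is why the transformed loop still visits every iteration exactly once.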
4.9 Hyperplane transformation
Interactively define a hyperplane.
Observe that the plane iteration matches the dataflow simulation: plane = dataflow.
Based on the plane, ISV calculates a unimodular transformation.
View the transformed iteration space and the generated loop.
4.10 The derived UT
4.11 Verify the UT
ISV checks whether the transformation is valid.
Observe that the parallel loop execution of the transformed loop matches the plane execution: parallel = plane.
5. Statement-level parallelism
Unimodular transformations work at the iteration level; statement dependences within the loop body are hidden in the iteration space graph.
How can parallelism be exploited at the statement level? By mapping statements to iterations.
5.1 Example 5
SSV: statement space visualization
5.2 Iteration-level parallelism
The iteration space is 2D.
There are N^2 = 16 iterations (N = 4).
The dataflow execution takes 2N-1 = 7 time steps.
The potential speedup is 16/7 = 2.29.
5.3 Parallelism in statements
The (statement) iteration space is 3D.
There are 2N^2 = 32 statement instances.
The dataflow execution still takes 2N-1 = 7 time steps.
The potential speedup is 32/7 = 4.57.
5.4 Comparison
Statement-level analysis doubles the potential speedup obtained at the iteration level:

Loop size | # iterations | # steps | # statements | # steps
N         | N^2          | 2N-1    | 2N^2         | 2N-1
5.5 Define the partition planes
Partitions are defined by hyperplanes.
What makes a partitioning valid?
Show the execution order on top of the dependence arrows (for one plane or all planes together, depending on the density of the slide).
5.6 Invalid UT
The invalid unimodular transformation derived from the hyperplane is refused by ISV.
Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph.
6. Pseudo distance method
The pseudo distance method:
- Extract base vectors from the dependent iterations
- Check whether the base vectors generate all the distances
- Calculate the unimodular transformation based on the base vectors
It is another way to find parallelism automatically. The iteration space is a grid: non-uniform dependences are members of a uniform dependence grid with unknown base vectors. Finding these base vectors allows us to extend existing parallelization to the non-uniform case.
6.1 Dependence distance
The distance vectors: (1,0,-1) and (0,1,1)
6.2 The transformation
The transforming matrix discovered by the pseudo distance method:
 1  1  0
-1  0  1
 1  0  0
The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1).
Dependent iterations now share the same first index, which implies that the outermost loop is parallel.
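This can be verified numerically: under the convention that a distance vector d transforms as d' = d·U (an assumption consistent with the vectors shown here), both transformed distances get a zero first component:

```python
# Pseudo distance method result: transformed distances with a zero first
# component mean the outermost loop carries no dependence.
U = [[ 1, 1, 0],
     [-1, 0, 1],
     [ 1, 0, 0]]

def transform(d, U):
    # row vector d times matrix U
    return tuple(sum(d[r] * U[r][c] for r in range(3)) for c in range(3))

d1 = transform((1, 0, -1), U)
d2 = transform((0, 1, 1), U)
print(d1, d2)  # (0, 1, 0) (0, 0, 1)
assert d1[0] == d2[0] == 0  # outermost loop is parallel
```

Since every dependence distance starts with 0 after the transformation, no dependence crosses outermost-loop iterations, so that loop can run in parallel.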
6.3 Compare the UT matrices
The transforming matrix discovered by the pseudo distance method:
 1  1  0
-1  0  1
 1  0  0
The invalid transforming matrix discovered by the hyperplane method:
 1  0  0
-1  1  0
 1  0  1
The same first column means that the transformed outermost loops have the same index.
6.4 The transformed space
The outermost loop is parallel
There are 8 parallel tasks
The load of tasks is not balanced
The longest task takes 7 time steps
7. Non-perfectly nested loops
Unimodular transformations only work for perfectly nested loops.
For a non-perfectly nested loop, the iteration space is constructed with extended indices: an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop.
7.1 Perfectly nested loop?
Non-perfectly nested loop:
DO I1 = 1,3
  A(I1) = A(I1-1)
  DO I2 = 1,4
    B(I1,I2) = B(I1-1,I2)+B(I1,I2-1)
  ENDDO
ENDDO
Equivalent perfectly nested loop:
DO I1 = 1,3
  DO I2 = 1,5
    DO I3 = 0,1
      IF (I2.EQ.1 .AND. I3.EQ.0) THEN
        A(I1) = A(I1-1)
      ELSE IF (I3.EQ.1) THEN
        B(I1-1,I2) = B(I1-2,I2)+B(I1-1,I2-1)
      ENDIF
    ENDDO
  ENDDO
ENDDO
7.2 Exploit parallelism with UT
8. Applications

Program               | Category  | Depth | Form        | Pattern     | Transformation
Example 1             | Tutorial  | 1     | Perfect     | Uniform     | N/A
Example 2             | Tutorial  | 2     | Perfect     | Non-uniform | N/A
Example 3             | Tutorial  | 3     | Perfect     | Uniform     | Wavefront UT
Example 4             | Tutorial  | 2     | Perfect     | Uniform     | Wavefront UT
Example 5             | Tutorial  | 2+1   | Perfect     | Uniform     | Stmt Partitioning UT
Example 6             | Tutorial  | 2+1   | Non-perfect | Uniform     | Wavefront UT
Matrix multiplication | Algorithm | 3     | Perfect     | Uniform     | Parallelization
Gauss-Jordan          | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
FFT                   | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
Cholesky              | Benchmark | 4     | Non-perfect | Non-uniform | Partitioning UT
TOMCATV               | Benchmark | 3     | Non-perfect | Uniform     | Parallelization
Flow3D                | CFD App.  | 3     | Perfect     | Uniform     | Wavefront UT
9. Future considerations
- Weighted dependence graphs
- More semantics on data locality: data space graph, data communication graph, data-reuse iteration space graph
- More loop transformations: affine (statement) iteration space mappings
- Automatic statement distribution
- Integration with the Omega library