Parametric Tiling of Affine Loop Nests*

Sanket Tavarageri 1, Albert Hartono 1, Muthu Baskaran 1, Louis-Noel Pouchet 1, J. “Ram” Ramanujam 2, P. “Saday” Sadayappan 1

1 Ohio State University 2 Louisiana State University

*Supported by US NSF

Dec 14, 2015


Page 2

Loop Tiling

A key loop transformation for:
◦ Efficient coarse-grained parallel execution
◦ Data locality optimization

Original loop:

for (i=1; i<=7; i++)
  for (j=1; j<=6; j++)
    S(i,j);

Tiled loop:

/* Inter-tile loops */
for (it=1; it<=7; it+=Ti)
  for (jt=1; jt<=6; jt+=Tj)
    /* Intra-tile loops */
    for (i=it; i<=min(7,it+Ti-1); i++)
      for (j=jt; j<=min(6,jt+Tj-1); j++)
        S(i,j);

[Figure: iteration space (i, j) before and after tiling]
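A minimal C sketch of the tiled loop nest above, with the tile sizes Ti and Tj passed as run-time arguments. The function name `tiled_visits_each_point_once` and the visit-count grid are illustrative assumptions, not part of the slides; the grid simply checks that the tiled nest executes every point of the 7x6 iteration space exactly once.

```c
#include <assert.h>
#include <string.h>

#define NI 7   /* extent of loop i in the slide's example */
#define NJ 6   /* extent of loop j in the slide's example */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Runs the tiled loop nest and records how often each point (i,j) is
   visited. Returns 1 if every point was visited exactly once. */
int tiled_visits_each_point_once(int Ti, int Tj) {
    int visits[NI + 1][NJ + 1];
    int it, jt, i, j;
    assert(Ti > 0 && Tj > 0);
    memset(visits, 0, sizeof visits);
    for (it = 1; it <= NI; it += Ti)                 /* inter-tile loops */
        for (jt = 1; jt <= NJ; jt += Tj)
            for (i = it; i <= min_int(NI, it + Ti - 1); i++)   /* intra-tile loops */
                for (j = jt; j <= min_int(NJ, jt + Tj - 1); j++)
                    visits[i][j]++;                  /* stands in for S(i,j) */
    for (i = 1; i <= NI; i++)
        for (j = 1; j <= NJ; j++)
            if (visits[i][j] != 1) return 0;
    return 1;
}
```

Calling it with tile sizes that do not divide the loop extents (for example 3 and 4) exercises the `min` clamps on the partial tiles.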

Page 3

Rectangular Tileability

Legality of rectangular tiling:
◦ Atomic execution of each tile
◦ No cyclic dependence between tiles

Data dependences must be lexicographically positive in all space dimensions.

Unimodular transformations (e.g., skewing) are used as a pre-processing step to make rectangular tiling valid.

Skewing:

$\begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ 1 \end{pmatrix}$

[Figure: iteration space (i, j) before skewing and (i', j') after skewing]
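The effect of the skewing matrix can be sketched in C on dependence distance vectors. This is an illustration under two assumptions that are not in the slides: the example distance vectors are invented, and legality is tested with the usual sufficient condition that every component of every distance vector is non-negative. The names `Dep`, `skew` and `rect_tileable` are hypothetical.

```c
#include <assert.h>

typedef struct { int di, dj; } Dep;   /* dependence distance vector */

/* Applies the skewing (i', j') = (i, i + j) to a distance vector. */
Dep skew(Dep d) {
    Dep r;
    r.di = d.di;
    r.dj = d.di + d.dj;
    return r;
}

/* Returns 1 if all n distance vectors have only non-negative
   components, so that rectangular tiles can execute atomically. */
int rect_tileable(const Dep *deps, int n) {
    int k;
    assert(n >= 0);
    for (k = 0; k < n; k++)
        if (deps[k].di < 0 || deps[k].dj < 0) return 0;
    return 1;
}
```

For instance, the distance vector (1, -1) blocks rectangular tiling, while its skewed image (1, 0) does not.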

Page 4

Parametric Tiling

Original loops:

for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    S(i,j);

Tile loop i with tile size Ti and loop j with tile size Tj:

for (it=1; it<=N; it+=Ti)
  for (jt=1; jt<=N; jt+=Tj)
    for (i=it; i<=min(N,it+Ti-1); i++)
      for (j=jt; j<=min(N,jt+Tj-1); j++)
        S(i,j);

Performance of tiled code can vary greatly with the choice of tile sizes
→ Model-driven and/or empirical search for the best tile sizes

Parametric tile sizes:
◦ Not fixed at compile time
◦ Runtime parameters
◦ Valuable for:
  - Auto-tuning systems
  - Generalized “ATLAS”

Page 5

Approaches to Loop Tiling

TLOG and HiTLOG
◦ Handle only perfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Do not address parallelism

Pluto
◦ Handles imperfectly nested loops
◦ Tile sizes must be fixed at compile time
◦ Addresses parallelism

PrimeTile
◦ Handles imperfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Does not address parallelism

DynTile and PTile (this work): systems with all positive features of existing tiling tools
◦ Handle imperfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Address parallelism
◦ Support multilevel tiling

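Page 6

Tiled Code Generation with Polyhedral Model

Tile sizes = 32 x 32. Assume: rectangular tiling is valid.

Original loop:

for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    S(i,j);

Statement domain (1 ≤ i, i ≤ N, 1 ≤ j, j ≤ N):

$\begin{pmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & -1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ N \\ 1 \end{pmatrix} \ge \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$

Affine schedule (i' = i, j' = j):

$\begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ N \\ 1 \end{pmatrix}$

[Figure: iteration domain in the (i, j) plane, bounded by i ≥ 1, i ≤ N, j ≥ 1, j ≤ N]

Tiled loop:

for (it=0; it<=floord(N,32); it++)
  for (jt=0; jt<=floord(N,32); jt++)
    for (i=max(1,32*it); i<=min(N,32*it+31); i++)
      for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
        S(i,j);

Statement domain (original constraints plus 0 ≤ i-32∙it, i-32∙it ≤ 31, 0 ≤ j-32∙jt, j-32∙jt ≤ 31):

$\begin{pmatrix} 0 & 0 & 1 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & -1 \\ 0 & 0 & 0 & -1 & 1 & 0 \\ -32 & 0 & 1 & 0 & 0 & 0 \\ 32 & 0 & -1 & 0 & 0 & 31 \\ 0 & -32 & 0 & 1 & 0 & 0 \\ 0 & 32 & 0 & -1 & 0 & 31 \end{pmatrix} \cdot \begin{pmatrix} it \\ jt \\ i \\ j \\ N \\ 1 \end{pmatrix} \ge 0$

Affine schedule (it' = it, jt' = jt, i' = i, j' = j):

$\begin{pmatrix} it' \\ jt' \\ i' \\ j' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} it \\ jt \\ i \\ j \\ N \\ 1 \end{pmatrix}$

Constraint of the polyhedral model: inequalities of the loop bounds must be linear in terms of loop iterators and problem sizes. With a parametric tile size T, a term such as T∙it is a product of a parameter and an iterator and is no longer linear.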
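The fixed-size tiled loop above can be made runnable with a small `floord` helper. The sketch below is an assumption-laden illustration: `floord` is written as a function in the spirit of the floor-division macro that CLooG-generated code relies on, `count_tiled_32` is a hypothetical name, and the statement S(i,j) is replaced by a counter so the result can be compared with the N*N instances of the original loop.

```c
#include <assert.h>

/* Floor division for a positive divisor; also correct for negative n. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Counts the statement instances executed by the 32x32 tiled loop nest. */
long count_tiled_32(int N) {
    long count = 0;
    int it, jt, i, j;
    for (it = 0; it <= floord(N, 32); it++)
        for (jt = 0; jt <= floord(N, 32); jt++)
            for (i = max_int(1, 32 * it); i <= min_int(N, 32 * it + 31); i++)
                for (j = max_int(1, 32 * jt); j <= min_int(N, 32 * jt + 31); j++)
                    count++;   /* stands in for S(i,j) */
    return count;
}
```

Note that the constant 32 appears only as a literal multiplier of `it` and `jt`, which is what keeps the bounds linear.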

Page 7

PrimeTile: Approach to Sequential Parametric Tiling

• Recursive level-by-level generation of tiling loops by non-polyhedral AST processing

Input:

for (i=lbi; i<=ubi; i++)
  for (j=lbj(i); j<=ubj(i); j++)
    S(i,j);

Output pseudocode:

for it {                    /* full tiles (loop i) */
  [compute lbv]
  [compute ubv]
  if (lbv<ubv) {            /* full tiles */
    [prolog j]
    [full tiles j]
    [epilog j]
  } else {                  /* no full tiles */
    [untiled j]
  }
}
[epilog i]                  /* partial tile (loop i) */

[Figure: (i, j) iteration space showing full tiles and partial tiles along loop i]
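The output pseudocode can be instantiated by hand for one concrete loop nest. The sketch below is not PrimeTile output; it is a hand-written illustration for the assumed triangular nest `for (i=1..N) for (j=i..N)`, where `lbv` is taken as the largest lower bound and `ubv` as the smallest upper bound of j over one full tile of i. The function name `count_full_partial` is hypothetical, and S(i,j) is replaced by a counter.

```c
#include <assert.h>

/* Counts iterations of: for (i=1..N) for (j=i..N) S(i,j), executed with
   full tiles separated from partial tiles. */
long count_full_partial(int N, int Ti, int Tj) {
    long count = 0;
    int it, jt, i, j;
    assert(N >= 0 && Ti > 0 && Tj > 0);
    for (it = 1; it + Ti - 1 <= N; it += Ti) {        /* full tiles (loop i) */
        int lbv = it + Ti - 1;  /* largest lower bound of j within the tile */
        int ubv = N;            /* smallest upper bound of j within the tile */
        if (lbv < ubv) {
            for (i = it; i < it + Ti; i++)            /* prolog j */
                for (j = i; j < lbv; j++) count++;
            for (jt = lbv; jt + Tj - 1 <= ubv; jt += Tj)   /* full tiles j */
                for (i = it; i < it + Ti; i++)
                    for (j = jt; j < jt + Tj; j++) count++;
            for (i = it; i < it + Ti; i++)            /* epilog j */
                for (j = jt; j <= ubv; j++) count++;
        } else {
            for (i = it; i < it + Ti; i++)            /* untiled j */
                for (j = i; j <= N; j++) count++;
        }
    }
    for (i = it; i <= N; i++)                         /* epilog i */
        for (j = i; j <= N; j++) count++;
    return count;
}
```

The full-tile loops carry no `min`/`max` in their bounds, which is the source of the simple loop bounds (and the larger code size) attributed to PrimeTile later in the deck.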

Page 8

PrimeTile: Multi-Level Tiling

Essential for:
◦ Exploiting data locality in deep multi-level memory hierarchies

Approach:
◦ Boundary tiles can be recursively tiled using smaller tile sizes

[Figure panels over the (i, j) iteration space: 1 level of tiling, 2 levels of tiling, 3 levels of tiling]

Page 9

DynTile: Parametric Tiling (Multi-Statement Domains)

Original loops:

for i
  S1(i)
  for j2=l2,u2
    S2(i,j2)
  for j3
    S3(i,j3)

Pre-processing to embed in a common space:

for i
  for j1=l2-1,l2-1          /* one-trip loop */
    S1(i,j1)
  for j2=l2,u2
    S2(i,j2)
  for j3
    S3(i,j3)

Page 10

DynTile: Parametric Tiling (Multiple Statement Domains)

Convex Hull

for i
  for j1
    S1(i,j1)
  for j2
    S2(i,j2)
  for j3
    S3(i,j3)

[Figure: domains of S1, S2 and S3 in the (i, j) plane, enclosed by their convex hull]

/* Inter-tile loops */
for it {
  for jt {
    /* Intra-tile loops */
  }
}

Page 11

DynTile: Wave-front Parallelism

$\begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ 1 \end{pmatrix}$

[Figure: tile space (i, j) and skewed tile space (i', j'), with diagonal wavefronts k, k+1, k+2, k+3, k+4]

After sequential tiling:
1. If no loop-carried dependences exist, then each tiling loop is directly parallelizable
2. If none of the tiling loops is parallel, then wave-front parallelization is always possible (all points in the same wavefront are independent of each other)

Page 12

DynTile: Inspector Code for Dynamic Scheduled Parallel Execution

Sequential tiled code:

/* Inter-tile loops */
for it {
  for jt {
    /* Intra-tile loops (treated as a black box) */
  }
}

[Figure: tile iteration space (it, jt) partitioned into wavefronts w1, w2, w3, w4, w5]

Step 1: Count #wavefronts and #tiles in each wavefront
Step 2: Allocate bins to store wavefronts
Step 3: Fill each bin with its corresponding tiles
Step 4: Execute in parallel all tiles in each bin

Example: #wavefronts = 5
w1 has 2 tiles
w2 has 3 tiles
w3 has 4 tiles
w4 has 3 tiles
w5 has 4 tiles

Parallel code:

for each bin w {
  #pragma omp parallel for
  for each tile in w {
    /* Intra-tile loops (treated as a black box) */
  }
}
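The four inspector steps can be sketched in C as follows. This is a simplified illustration, not DynTile's generated code: it assumes a rectangular tile space of `nti` by `ntj` tiles and the wavefront number w = it + jt, whereas the slide's example has an irregular tile space. The types `Tile` and `Schedule` and the functions `inspect`, `execute` and `free_schedule` are hypothetical names, and the intra-tile loops are replaced by a counter.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int it, jt; } Tile;

typedef struct {
    int num_wavefronts;
    int *count;     /* count[w]: number of tiles in wavefront w */
    Tile **bins;    /* bins[w]: the tiles of wavefront w */
} Schedule;

Schedule inspect(int nti, int ntj) {
    Schedule s;
    int it, jt, w;
    int *fill;
    assert(nti > 0 && ntj > 0);
    /* Step 1: count wavefronts and tiles in each wavefront. */
    s.num_wavefronts = nti + ntj - 1;
    s.count = calloc((size_t)s.num_wavefronts, sizeof *s.count);
    s.bins = malloc((size_t)s.num_wavefronts * sizeof *s.bins);
    fill = calloc((size_t)s.num_wavefronts, sizeof *fill);
    assert(s.count != NULL && s.bins != NULL && fill != NULL);
    for (it = 0; it < nti; it++)
        for (jt = 0; jt < ntj; jt++)
            s.count[it + jt]++;
    /* Step 2: allocate one bin per wavefront. */
    for (w = 0; w < s.num_wavefronts; w++) {
        s.bins[w] = malloc((size_t)s.count[w] * sizeof **s.bins);
        assert(s.bins[w] != NULL);
    }
    /* Step 3: fill each bin with its tiles. */
    for (it = 0; it < nti; it++)
        for (jt = 0; jt < ntj; jt++) {
            w = it + jt;
            s.bins[w][fill[w]].it = it;
            s.bins[w][fill[w]].jt = jt;
            fill[w]++;
        }
    free(fill);
    return s;
}

/* Step 4: wavefronts run in order; tiles of one wavefront run in parallel.
   Returns the number of tiles executed. */
long execute(const Schedule *s) {
    long done = 0;
    int w, k;
    for (w = 0; w < s->num_wavefronts; w++) {
#ifdef _OPENMP
#pragma omp parallel for reduction(+:done)
#endif
        for (k = 0; k < s->count[w]; k++) {
            /* intra-tile loops for tile s->bins[w][k] would run here */
            done++;
        }
    }
    return done;
}

void free_schedule(Schedule *s) {
    int w;
    for (w = 0; w < s->num_wavefronts; w++) free(s->bins[w]);
    free(s->bins);
    free(s->count);
}
```

Without OpenMP enabled the pragma is skipped and the tiles of each bin simply run one after another, which keeps the sketch usable for checking the schedule itself.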

Page 13

DynTile: Implementation

[Figure: DynTile tool flow]

Components: Pluto; Modified CLooG; Clan; Parser + AST Generator; Convex Hull Generator (using ISL); Tiling Transformer; Inspector Code Specifier; Code Generator

Labels on the flow: Sequence of loop nests; Pre-process; Statement polyhedra + Affine transforms (for rectangular tileability); Tileable loop code (with preserved embedding information); Loop ASTs; Statement polyhedra; Convex-hull loop AST; Tiled loop ASTs; Parallel tiled loop ASTs; Parallel tiled loop code

Page 14

PTile: Loop Generation

Representation of Statement Domains
◦ Set of affine inequalities

S: $\sum_{j=1}^{n} B_{ij} v_j + \sum_{j=1}^{k} P_{ij} p_j + c_i \ge 0$

v1, v2, …, vn are loop variables (v1 outermost and vn innermost)

p1, p2, …, pk are program parameters

Bounds of vr, 1 ≤ r ≤ n:

max(f1(v1, v2, …, vr-1, p1, p2, …, pk, c), …, ft(v1, v2, …, vr-1, p1, p2, …, pk, c)) ≤ vr ≤ min(g1(v1, v2, …, vr-1, p1, p2, …, pk, c), …, gs(v1, v2, …, vr-1, p1, p2, …, pk, c))

Bounds are dependent on outer loop variables and parameters (row echelon form)

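Page 15

Loop Generation (cont.)

$\begin{pmatrix} B & P & C \end{pmatrix} \cdot \begin{pmatrix} v \\ p \\ 1 \end{pmatrix} \ge 0$

where

$B = \begin{pmatrix} B_{11} & 0 & 0 & \dots & 0 \\ B_{21} & B_{22} & 0 & \dots & 0 \\ \vdots & & & & \vdots \\ B_{n1} & B_{n2} & & \dots & B_{nn} \end{pmatrix}, \quad P = \begin{pmatrix} P_{11} & P_{12} & \dots & P_{1k} \\ P_{21} & P_{22} & \dots & P_{2k} \\ \vdots & & & \vdots \\ P_{n1} & P_{n2} & \dots & P_{nk} \end{pmatrix}, \quad C = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}, \quad v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}, \quad p = \begin{pmatrix} p_1 \\ \vdots \\ p_k \end{pmatrix}$

Row echelon form: suitable for generating loop code to scan the iteration points represented by the system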

Page 16

Parametric Sequential Tiling

Tiling transformation
◦ Express each variable vj in terms of inter-tile (tile) co-ordinates tj, intra-tile co-ordinates uj and tile sizes sj

vj = sj.tj + uj and 0 ≤ uj ≤ sj - 1

S': $\sum_{j=1}^{n} B_{ij} (s_j t_j + u_j) + \sum_{j=1}^{k} P_{ij} p_j + c_i \ge 0, \quad 0 \le u_j \le s_j - 1$

S' is equivalent to S

$\begin{pmatrix} B.s & B & P & 0 & C \\ 0 & I & 0 & 0 & 0 \\ 0 & -I & 0 & I & -1 \end{pmatrix} \cdot \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0$

Not in row echelon form for t, but in row echelon form for u

I: identity matrix
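The decomposition vj = sj.tj + uj with 0 ≤ uj ≤ sj - 1 determines tj and uj uniquely once floor division is used, including for negative coordinates. A minimal C sketch, assuming the hypothetical names `TileCoord` and `to_tile_coord`:

```c
#include <assert.h>

/* Floor division for a positive divisor; also correct for negative n. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

typedef struct { int t, u; } TileCoord;   /* inter-tile and intra-tile parts */

/* Splits v into v = s*t + u with 0 <= u <= s-1. */
TileCoord to_tile_coord(int v, int s) {
    TileCoord c;
    assert(s > 0);
    c.t = floord(v, s);
    c.u = v - s * c.t;
    return c;
}
```

With s = 8, the point v = 13 lands in tile t = 1 at offset u = 5, and v = -3 lands in tile t = -1, also at offset u = 5. Plain C integer division would truncate toward zero and put -3 in tile 0 with a negative offset, which violates 0 ≤ uj.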

Page 17

Parametric Sequential Tiling (cont.)

◦ To derive a system in row echelon form for all variables:
  - Create a system ST with only tile variables, program parameters and tile sizes (also parameters)
  - Relaxed projection to eliminate intra-tile variables uj
  - In ST, Bij.uj is replaced by 0 if Bij ≤ 0, and by Bij.(sj - 1) if Bij > 0
  - All solutions to S' also satisfy ST

ST: $\sum_{j=1}^{n} B_{ij} s_j t_j + \sum_{j=1}^{n} B^{+}_{ij} (s_j - 1) + \sum_{j=1}^{k} P_{ij} p_j + c_i \ge 0$, where $B^{+}_{ij} = \max(B_{ij}, 0)$

$\begin{pmatrix} B.s & P & B^{+} & C' \end{pmatrix} \cdot \begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0$, where $C' = C - B^{+} \cdot \mathbf{1}$ collects the constant terms

ST: B.s has the same nonzero structure as B => row echelon form for t, where s is a diagonal matrix of parametric tile sizes
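The claim that all solutions to S' also satisfy ST can be checked numerically on a small domain. The sketch below is an illustration under stated assumptions: the example domain (1 ≤ i ≤ N, i ≤ j ≤ N) is invented, the parameter N is folded into the constant column instead of a separate P matrix, and all function names are hypothetical.

```c
#include <assert.h>

#define NROWS 4   /* constraints of the example domain */
#define NVARS 2   /* loop variables (i, j) */

/* Floor division for a positive divisor; also correct for negative n. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

/* Example domain S written as B.v + c >= 0:
   i - 1 >= 0, -i + N >= 0, -i + j >= 0, -j + N >= 0. */
static const int B[NROWS][NVARS] = { {1, 0}, {-1, 0}, {-1, 1}, {0, -1} };

static void constants(int N, int c[NROWS]) {
    c[0] = -1; c[1] = N; c[2] = 0; c[3] = N;
}

/* Returns 1 if the point v satisfies every constraint of S. */
int point_in_S(int N, const int v[NVARS]) {
    int c[NROWS], r, j;
    constants(N, c);
    for (r = 0; r < NROWS; r++) {
        long lhs = c[r];
        for (j = 0; j < NVARS; j++) lhs += (long)B[r][j] * v[j];
        if (lhs < 0) return 0;
    }
    return 1;
}

/* Relaxed system ST: each B[r][j]*u_j is replaced by 0 when B[r][j] <= 0
   and by B[r][j]*(s_j - 1) when B[r][j] > 0. */
int tile_in_ST(int N, const int s[NVARS], const int t[NVARS]) {
    int c[NROWS], r, j;
    constants(N, c);
    for (r = 0; r < NROWS; r++) {
        long lhs = c[r];
        for (j = 0; j < NVARS; j++) {
            lhs += (long)B[r][j] * s[j] * t[j];
            if (B[r][j] > 0) lhs += (long)B[r][j] * (s[j] - 1);
        }
        if (lhs < 0) return 0;
    }
    return 1;
}

/* Returns 1 if every point of S lies in a tile that ST accepts. */
int relaxation_is_safe(int N, int s0, int s1) {
    int v[NVARS], s[NVARS], t[NVARS];
    assert(s0 > 0 && s1 > 0);
    s[0] = s0; s[1] = s1;
    for (v[0] = 1; v[0] <= N; v[0]++)
        for (v[1] = 1; v[1] <= N; v[1]++) {
            if (!point_in_S(N, v)) continue;
            t[0] = floord(v[0], s0);
            t[1] = floord(v[1], s1);
            if (!tile_in_ST(N, s, t)) return 0;
        }
    return 1;
}
```

The relaxation can only over-approximate: ST may accept a tile that contains no point of S, but it never rejects a tile that contains one.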

Page 18

Parametric Sequential Tiling (cont.)

S': in row echelon form for u; used to generate intra-tile loops

$\begin{pmatrix} B.s & B & P & 0 & C \\ 0 & I & 0 & 0 & 0 \\ 0 & -I & 0 & I & -1 \end{pmatrix} \cdot \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0$

ST: in row echelon form for t; used to generate tile loops

$\begin{pmatrix} B.s & P & B^{+} & C' \end{pmatrix} \cdot \begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0$

ST|S': in row echelon form for t and u; used to generate tile loops and intra-tile loops

$\begin{pmatrix} B.s & B & P & 0 & C \\ 0 & I & 0 & 0 & 0 \\ 0 & -I & 0 & I & -1 \\ B.s & 0 & P & B^{+} & C' \end{pmatrix} \cdot \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0$

Page 19

Parallel Non-parameterized Tiling

/* Original loops */
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    for (k=i; k<=N; k++)
      S(i,j,k);

Tiling (8x8x8 tile sizes)

/* Sequential tiled loops */
for (it=⌈-6/8⌉; it<=⌊N/8⌋; it++)
  for (jt=⌈-6/8⌉; jt<=⌊N/8⌋; jt++)
    for (kt=⌈(it*8-7)/8⌉; kt<=⌊N/8⌋; kt++)
      // intra-tile loops i,j,k

Original loop constraints:

Lower-bound constraints        Upper-bound constraints
(1a): -6/8<=it                 (1b): it<=N/8
(2a): -6/8<=jt                 (2b): jt<=N/8
(3a): (it*8-7)/8<=kt           (3b): kt<=N/8

Introduce new wavefront constraints (for loop kt), with w = it+jt+kt:

(4a): w-it-jt<=kt              (4b): kt<=w-it-jt

Use Fourier-Motzkin Elimination to derive new wavefront constraints (for loops w, it, jt):

(5a): Combine (4a) and (3b): w-it-jt<=N/8, hence (8*w-8*it-N)/8<=jt
(5b): Combine (4b) and (3a): (it*8-7)/8<=w-it-jt, hence jt<=(8*w-16*it+7)/8
(6a): Combine (5a) and (2b): (8*w-8*it-N)/8<=N/8, hence (4*w-N)/4<=it
(6b): Combine (5b) and (2a), with (2a) tightened to 0<=jt because jt is an integer: 0<=(8*w-16*it+7)/8, hence it<=(8*w+7)/16
(7a): Combine (6b) and (1a), with (1a) tightened to 0<=it: 0<=(8*w+7)/16, hence -7/8<=w
(7b): Combine (6a) and (1b): (4*w-N)/4<=N/8, hence w<=3*N/8

Page 20

Parallel Non-parameterized Tiling (cont.)

/* Parallel tiled loops */
for (w=⌈-7/8⌉; w<=⌊3*N/8⌋; w++) /* sequential */
  for (it=max(⌈-6/8⌉, ⌈(4*w-N)/4⌉); it<=min(⌊N/8⌋, ⌊(8*w+7)/16⌋); it++) /* parallel */
    for (jt=max(⌈-6/8⌉, ⌈(8*w-8*it-N)/8⌉); jt<=min(⌊N/8⌋, ⌊(8*w-16*it+7)/8⌋); jt++) /* parallel */
      for (kt=max(⌈(it*8-7)/8⌉, w-it-jt); kt<=min(⌊N/8⌋, w-it-jt); kt++) /* one-trip-count */
        // intra-tile loops i,j,k

This works when tile sizes are fixed.

When tile sizes are parametric, Fourier-Motzkin Elimination becomes problematic:
◦ The sign of a coefficient in the combined inequalities can be indeterminate, so it is impossible to determine whether the new inequality is a lower-bound or an upper-bound inequality
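For the fixed 8x8x8 case, the wavefront loop nest should enumerate exactly the tiles of the sequential tiled nest, only in a different order. The C sketch below checks this by counting; `floord`, `ceild` and the two counting functions are hypothetical helper names, and the intra-tile loops are replaced by a counter.

```c
#include <assert.h>

/* Floor and ceiling division for a positive divisor. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}
static int ceild(int n, int d) { return -floord(-n, d); }

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiles visited by the sequential tiled loops. */
long count_tiles_sequential(int N) {
    long tiles = 0;
    int it, jt, kt;
    assert(N >= 1);
    for (it = ceild(-6, 8); it <= floord(N, 8); it++)
        for (jt = ceild(-6, 8); jt <= floord(N, 8); jt++)
            for (kt = ceild(it * 8 - 7, 8); kt <= floord(N, 8); kt++)
                tiles++;
    return tiles;
}

/* Tiles visited by the wavefront (parallel) tiled loops. */
long count_tiles_wavefront(int N) {
    long tiles = 0;
    int w, it, jt, kt;
    assert(N >= 1);
    for (w = ceild(-7, 8); w <= floord(3 * N, 8); w++)
        for (it = max_int(ceild(-6, 8), ceild(4 * w - N, 4));
             it <= min_int(floord(N, 8), floord(8 * w + 7, 16)); it++)
            for (jt = max_int(ceild(-6, 8), ceild(8 * w - 8 * it - N, 8));
                 jt <= min_int(floord(N, 8), floord(8 * w - 16 * it + 7, 8)); jt++)
                for (kt = max_int(ceild(it * 8 - 7, 8), w - it - jt);
                     kt <= min_int(floord(N, 8), w - it - jt); kt++)
                    tiles++;   /* one-trip-count loop: kt = w-it-jt */
    return tiles;
}
```

Every bound here is a constant or an affine expression with numeric coefficients, which is why the sign of each coefficient was known during elimination.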

Page 21

Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
2. Optimize the innermost iterator using wavefront inequalities $w - t_1 - \dots - t_{n-1} \le t_n \le w - t_1 - \dots - t_{n-1}$

/* Parallel tiled loops */
for (w=wmin; w<=wmax; w++) /* sequential */
  for (it=lbit; it<=ubit; it++) /* parallel */
    for (jt=lbjt; jt<=ubjt; jt++) /* parallel */
      for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */
        // intra-tile loops i,j,k

Page 22

Static Determination of Lowest and Highest Wavefront Numbers

The outermost tiling loop enumerates the wavefront numbers from lowest (wmin) to highest (wmax)

The values of wmin and wmax can be determined at compile time using ILP solvers such as PIP/PipLib

Similarly, parametric bound values of each tiling loop variable ($t_j^{min}$ and $t_j^{max}$ for 1 ≤ j ≤ n) can also be computed using an ILP solver.

[Figure: flow through the ILP solver]

Inputs to the ILP solver:
◦ Original point loops (affine inequalities)
◦ Global parameter values (affine inequalities)

Outputs of the ILP solver:
◦ Lexicographic minimum point in each loop level, e.g., 1, 1
◦ Lexicographic maximal point in each loop level, e.g., 200, 2*N

Derived wavefront numbers:
◦ Lowest wavefront number, e.g., wmin=⌊1/Ti⌋+⌊1/Tj⌋
◦ Highest wavefront number, e.g., wmax=⌊200/Ti⌋+⌊(2*N)/Tj⌋
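Once the extreme points are known, the wavefront range follows from floor divisions by the tile sizes, as in the slide's example. A minimal C sketch, assuming the extreme points are already available as plain integers (in the slides they come from an ILP solver at compile time and may be parametric); `WaveRange` and `wavefront_range` are hypothetical names:

```c
#include <assert.h>

/* Floor division for a positive divisor; also correct for negative n. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

typedef struct { int wmin, wmax; } WaveRange;

/* (i_min, j_min) is the lexicographic minimum point and (i_max, j_max)
   the lexicographic maximum point of a two-deep loop nest. */
WaveRange wavefront_range(int i_min, int j_min, int i_max, int j_max,
                          int Ti, int Tj) {
    WaveRange r;
    assert(Ti > 0 && Tj > 0);
    r.wmin = floord(i_min, Ti) + floord(j_min, Tj);
    r.wmax = floord(i_max, Ti) + floord(j_max, Tj);
    return r;
}
```

For the slide's example points (1, 1) and (200, 2*N) with N = 100 and 32x32 tiles, this gives wmin = 0 and wmax = 12.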

Page 23

Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
   Utilize an ILP solver to derive wmin and wmax
2. Optimize the innermost iterator using wavefront inequalities $w - t_1 - \dots - t_{n-1} \le t_n \le w - t_1 - \dots - t_{n-1}$

/* Parallel tiled loops */
for (w=wmin; w<=wmax; w++) /* sequential */
  for (it=lbit; it<=ubit; it++) /* parallel */
    for (jt=lbjt; jt<=ubjt; jt++) /* parallel */
      for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */
        // intra-tile loops i,j,k

Correct code, but may visit many empty tiles

Page 24

Parallel Parametric Tiling (cont.)

3. Optimize using bounded wavefront inequalities
   Utilize an ILP solver to derive parametric bound values $t_j^{min}$, $t_j^{max}$ for 1 ≤ j ≤ n

/* Parallel tiled loops */
for (w=wmin; w<=wmax; w++) /* sequential */
  for (it=max(lbit, w-jtmax-ktmax); it<=min(ubit, w-jtmin-ktmin); it++) /* parallel */
    for (jt=max(lbjt, w-it-ktmax); jt<=min(ubjt, w-it-ktmin); jt++) /* parallel */
      for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */
        // intra-tile loops i,j,k

Tighter loop bounds, but may still visit empty tiles
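The difference between the plain wavefront loops (steps 1 and 2) and the bounded wavefront loops (step 3) can be measured by counting how many (w, it, jt) iterations each version scans. The C sketch below does this for the assumed domain 1 ≤ i, j ≤ N, i ≤ k ≤ N with run-time tile sizes. It is an illustration only: the minimum and maximum tile coordinates are computed directly from the loop bounds rather than by an ILP solver, and `Stats`, `run_wavefront` and `count_tiles_sequential` are hypothetical names.

```c
#include <assert.h>

/* Floor and ceiling division for a positive divisor. */
static int floord(int n, int d) {
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}
static int ceild(int n, int d) { return -floord(-n, d); }
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

typedef struct {
    long visited;   /* (w, it, jt) iterations scanned */
    long tiles;     /* tiles actually executed */
} Stats;

/* bounded == 0: steps 1-2 only; bounded != 0: step 3 bounds added. */
Stats run_wavefront(int N, int Ti, int Tj, int Tk, int bounded) {
    Stats s = {0, 0};
    int itmin = ceild(2 - Ti, Ti), itmax = floord(N, Ti);
    int jtmin = ceild(2 - Tj, Tj), jtmax = floord(N, Tj);
    int ktmin = ceild(itmin * Ti - Tk + 1, Tk), ktmax = floord(N, Tk);
    int wmin = itmin + jtmin + ktmin, wmax = itmax + jtmax + ktmax;
    int w, it, jt, kt;
    for (w = wmin; w <= wmax; w++) {
        int it_lo = itmin, it_hi = itmax;
        if (bounded) {
            it_lo = max_int(it_lo, w - jtmax - ktmax);
            it_hi = min_int(it_hi, w - jtmin - ktmin);
        }
        for (it = it_lo; it <= it_hi; it++) {
            int lbkt = ceild(it * Ti - Tk + 1, Tk);
            int jt_lo = jtmin, jt_hi = jtmax;
            if (bounded) {
                jt_lo = max_int(jt_lo, w - it - ktmax);
                jt_hi = min_int(jt_hi, w - it - ktmin);
            }
            for (jt = jt_lo; jt <= jt_hi; jt++) {
                s.visited++;
                for (kt = max_int(lbkt, w - it - jt);
                     kt <= min_int(ktmax, w - it - jt); kt++)
                    s.tiles++;   /* intra-tile loops would run here */
            }
        }
    }
    return s;
}

/* Tiles executed by the sequential tiled loops, for comparison. */
long count_tiles_sequential(int N, int Ti, int Tj, int Tk) {
    long tiles = 0;
    int it, jt, kt;
    for (it = ceild(2 - Ti, Ti); it <= floord(N, Ti); it++)
        for (jt = ceild(2 - Tj, Tj); jt <= floord(N, Tj); jt++)
            for (kt = ceild(it * Ti - Tk + 1, Tk); kt <= floord(N, Tk); kt++)
                tiles++;
    return tiles;
}
```

Both variants execute the same tiles as the sequential nest; the bounded variant reaches them while scanning fewer empty (w, it, jt) combinations.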

Page 25

Parallel Parametric Tiling (cont.)

4. Optimize using Relaxed Symbolic Fourier-Motzkin Elimination (RSFME)

/* Original loops */
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    for (k=i; k<=N; k++)
      S(i,j,k);

/* Sequential tiled loops */
for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
  for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊N/Tj⌋; jt++)
    for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
      // intra-tile loops i,j,k

Lower-bound constraints        Upper-bound constraints
(1a): wmin<=w                  (1b): w<=wmax
(2a): (1-Ti+1)/Ti<=it          (2b): it<=N/Ti
(3a): (1-Tj+1)/Tj<=jt          (3b): jt<=N/Tj
(4a): (it*Ti-Tk+1)/Tk<=kt      (4b): kt<=N/Tk
(5a): w-it-jt<=kt              (5b): kt<=w-it-jt

(6a): Combine (5a) and (4b): w-it-jt<=N/Tk, hence w-it-N/Tk<=jt, i.e. (w*Tk-it*Tk-N)/Tk<=jt
(6b): Combine (5b) and (4a): (it*Ti-Tk+1)/Tk<=w-it-jt, hence jt<=w-it-it*Ti/Tk+1-1/Tk, i.e. jt<=(w*Tk-it*Tk-it*Ti+Tk-1)/Tk
(7a): Combine (6a) and (3b): w-it-N/Tk<=N/Tj, hence w-N/Tj-N/Tk<=it, i.e. (w*Tj*Tk-N*Tk-N*Tj)/(Tj*Tk)<=it
(7b): Combine (6b) and (3a): 2/Tj-1<=w-it-it*Ti/Tk+1-1/Tk, hence it+it*Ti/Tk<=w+2-2/Tj-1/Tk, i.e. it<=(w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2)/(Ti*Tj*Tk+Tj*Tk^2)

No ambiguous signs encountered

Very tight loop bounds, with negligible overhead of scanning empty tiles

Page 26

Ambiguous Sign Resolution

Resolving ambiguous signs in RSFME: relaxation step
◦ Replace the tile loop variables with their parametric bounded values ($t_j^{min}$ and $t_j^{max}$)

/* Original loops */
for (i=1; i<=N; i++)
  for (j=1; j<=N-i; j++)
    for (k=i; k<=N; k++)
      S(i,j,k);

/* Sequential tiled loops */
for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
  for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊(N-it*Ti)/Tj⌋; jt++)
    for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
      // intra-tile loops i,j,k

Lower-bound constraints        Upper-bound constraints
(1a): wmin<=w                  (1b): w<=wmax
(2a): (1-Ti+1)/Ti<=it          (2b): it<=N/Ti
(3a): (1-Tj+1)/Tj<=jt          (3b): jt<=(N-it*Ti)/Tj
(4a): (it*Ti-Tk+1)/Tk<=kt      (4b): kt<=N/Tk
(5a): w-it-jt<=kt              (5b): kt<=w-it-jt

(6a): Combine (5a) and (4b): w-it-jt<=N/Tk, hence w-it-N/Tk<=jt, i.e. (w*Tk-it*Tk-N)/Tk<=jt
(6b): Combine (5b) and (4a): (it*Ti-Tk+1)/Tk<=w-it-jt, hence jt<=w-it-it*Ti/Tk+1-1/Tk, i.e. jt<=(w*Tk-it*Tk-it*Ti+Tk-1)/Tk
(7a): Combine (6a) and (3b): w-it-N/Tk<=N/Tj-it*Ti/Tj, hence w-N/Tj-N/Tk<=it*(1-Ti/Tj)
(7b): Combine (6b) and (3a): 2/Tj-1<=w-it-it*Ti/Tk+1-1/Tk, hence it+it*Ti/Tk<=w+2-2/Tj-1/Tk, i.e. it<=(w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2)/(Ti*Tj*Tk+Tj*Tk^2)

Ambiguous sign encountered in (7a): the sign of (1-Ti/Tj) depends on the tile sizes

Use itmin and itmax to resolve the sign ambiguity, starting from w-N/Tj-N/Tk<=it-it*Ti/Tj:

(7a.1): w-N/Tj-N/Tk+itmin*Ti/Tj<=it, i.e. (w*Tj*Tk-N*Tj-N*Tk+itmin*Ti*Tk)/(Tj*Tk)<=it
(7a.2): it*Ti/Tj<=itmax-w+N/Tj+N/Tk, i.e. it<=(itmax*Tj*Tk-w*Tj*Tk+N*Tj+N*Tk)/(Ti*Tk)

Page 27

PTile: Prototype Implementation

[Figure: PTile tool flow]

Components: Pluto; Modified CLooG; Clan; Parser + AST Generator; Convex Hull Generator (using ISL); Tiling Transformer; Wavefront Parallelizer + RSFME; Code Generator

Labels on the flow: Sequence of loop nests; Pre-process; Statement polyhedra + Affine transforms (for rectangular tileability); Tileable loop code (with preserved embedding information); Loop ASTs; Statement polyhedra; Convex-hull loop AST; Sequential tiled loop ASTs; Parallel tiled loop ASTs; Parallel tiled loop code

Page 28

PTile, DynTile, PrimeTile: Experiments

Main comparison:
◦ PTile, DynTile and PrimeTile

AMD Opteron 2380:
◦ Dual-socket quad-core AMD Opteron 2380 processors running at 2.6 GHz with 256+256KB L1 cache, 2 MB of L2 cache

Compilers:
◦ GCC version 4.4.0
◦ ICC version 11.0

Experiments:
◦ With and without vectorization
◦ For parallel runs, used OpenMP

Benchmarks: 2-D FDTD, Cholesky, DTRMM, LU

Page 29

Results - 1

◦ PTile: the RSFME relaxation step was never needed in these and other benchmarks that we have tested
◦ Control overhead:
  - PrimeTile has simple loop bounds but larger code size
  - PTile and DynTile generate more complex loop bounds
  - For 2D-FDTD, there is a 20% to 40% difference in execution time due to control overhead; for the other benchmarks, no significant difference

Page 30

Results - 2

                         Sequential                        Parallel
Bench      Compiler      PrimeTile   DynTile    PTile      DynTile    PTile
2d-fdtd    gcc-novec     43.84s      49.19s     56.78s     9.32s      10.98s
2d-fdtd    gcc-vec       43.82s      49.22s     56.85s     9.37s      10.98s
2d-fdtd    icc-novec     40.27s      48.12s     54.29s     13.30s     12.96s
2d-fdtd    icc-vec       40.52s      49.61s     54.63s     13.03s     13.18s
cholesky   gcc-novec     6.13s       10.50s     13.43s     1.91s      2.81s
cholesky   gcc-vec       6.08s       10.46s     13.45s     1.89s      2.82s
cholesky   icc-novec     5.63s       5.86s      8.19s      1.21s      2.40s
cholesky   icc-vec       5.36s       5.74s      8.22s      1.27s      2.61s
dtrmm      gcc-novec     9.29s       14.34s     18.99s     2.55s      4.50s
dtrmm      gcc-vec       9.25s       14.57s     18.99s     2.54s      3.69s
dtrmm      icc-novec     9.84s       9.19s      13.27s     2.17s      3.22s
dtrmm      icc-vec       9.91s       9.12s      13.44s     2.33s      3.27s
lu         gcc-novec     8.30s       9.15s      10.98s     2.56s      2.94s
lu         gcc-vec       8.29s       9.15s      10.98s     2.98s      2.43s
lu         icc-novec     6.30s       5.63s      7.49s      6.18s      1.60s
lu         icc-vec       6.36s       5.58s      6.52s      6.36s      1.62s

Page 31

Results - 3

Sequential:
◦ PrimeTile performs best; DynTile is close
◦ gcc has more trouble optimizing code from DynTile than code from PrimeTile (difference between icc and gcc)
◦ PTile is slower because the order of execution of tiles impacts locality

Parallel:
◦ DynTile performs better than PTile (except for LU; we need to understand this better)
◦ All tiles in a wavefront are executed in parallel with DynTile, whereas the OpenMP parallel pragma works only with the outermost tiled parallel loop in PTile

Vectorization:
◦ Complexity of loop bounds in generated code appears to make it difficult for the compiler to vectorize

Page 32

Summary

Developed DynTile and PTile, two parametric tiling systems with the following features:
◦ Handle imperfectly nested loops
◦ Allow tile sizes to be runtime parameters
◦ Address parallelism
◦ Support multi-level tiling

Ongoing: a much more extensive set of experiments to understand and improve the efficiency of the approaches for generation of parallel parametrically tiled code