Parametric Tiling of Affine Loop Nests*
Sanket Tavarageri 1
Albert Hartono 1
Muthu Baskaran 1
Louis-Noel Pouchet 1
J. “Ram” Ramanujam 2
P. “Saday” Sadayappan 1
1 Ohio State University 2 Louisiana State University
*Supported by US NSF
Loop Tiling
A key loop transformation for:
◦ Efficient coarse-grained parallel execution
◦ Data locality optimization
[Figure: the (i, j) iteration space, shown before and after tiling]
Original loop:
  for (i=1; i<=7; i++)
    for (j=1; j<=6; j++)
      S(i,j);

Tiled loop:
  /* Inter-tile loops */
  for (it=1; it<=7; it+=Ti)
    for (jt=1; jt<=6; jt+=Tj)
      /* Intra-tile loops */
      for (i=it; i<=min(7,it+Ti-1); i++)
        for (j=jt; j<=min(6,jt+Tj-1); j++)
          S(i,j);
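A quick way to see that the tiled nest is a reordering of the original one is to count visits per point. The sketch below is an illustration only: `run_tiled`, `covers_exactly_once`, and the `visits` array are hypothetical names, and the increment stands in for `S(i,j)`. It uses inclusive upper bounds (`<=`) in the intra-tile loops so that the last point of each tile is executed.

```c
#include <assert.h>
#include <string.h>

#define NI 7   /* i runs over 1..7, as in the slide example */
#define NJ 6   /* j runs over 1..6 */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Runs the tiled nest with tile sizes Ti x Tj and records in visits[i][j]
 * how many times each point (i, j) is executed.
 * Returns the total number of executed iterations. */
int run_tiled(int Ti, int Tj, int visits[NI + 1][NJ + 1])
{
    assert(Ti > 0 && Tj > 0);
    memset(visits, 0, sizeof(int) * (NI + 1) * (NJ + 1));
    int total = 0;
    for (int it = 1; it <= NI; it += Ti)            /* inter-tile loops */
        for (int jt = 1; jt <= NJ; jt += Tj)
            for (int i = it; i <= min_int(NI, it + Ti - 1); i++)   /* intra-tile loops */
                for (int j = jt; j <= min_int(NJ, jt + Tj - 1); j++) {
                    visits[i][j]++;                 /* stands in for S(i,j) */
                    total++;
                }
    return total;
}

/* Returns 1 if every point of the 7x6 space is visited exactly once. */
int covers_exactly_once(int Ti, int Tj)
{
    int visits[NI + 1][NJ + 1];
    if (run_tiled(Ti, Tj, visits) != NI * NJ)
        return 0;
    for (int i = 1; i <= NI; i++)
        for (int j = 1; j <= NJ; j++)
            if (visits[i][j] != 1)
                return 0;
    return 1;
}
```

Calling `covers_exactly_once(3, 2)` or `covers_exactly_once(4, 4)` exercises both full and partial boundary tiles; tile sizes larger than the loop extents collapse to a single tile.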
Rectangular Tileability
Legality of rectangular tiling:
◦ Atomic execution of each tile
◦ No cyclic dependence between tiles
Data dependences must have non-negative components in all space dimensions
Unimodular transformations (e.g., skewing) used as a pre-processing step to make rectangular tiling valid
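[Figure: iteration space before skewing (axes i, j) and after skewing (axes i', j')]

Skewing:

\[
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}
\cdot
\begin{pmatrix} i \\ j \\ 1 \end{pmatrix}
\]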
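The skewing transform on this slide maps (i, j) to (i', j') = (i, i + j). The sketch below applies it to a dependence distance vector to show why it helps: a dependence such as (1, -1), chosen here purely as an example and not taken from the slides, has a negative component before skewing and none afterwards. The function names are hypothetical.

```c
#include <assert.h>

/* Applies the skewing transform (i', j') = (i, i + j) to a point or to a
 * dependence distance vector. The constant column of the matrix is zero,
 * so it does not contribute. */
void skew(const int in[2], int out[2])
{
    assert(in != out);          /* out is written before in[1] is read */
    out[0] = in[0];             /* row (1 0 0) */
    out[1] = in[0] + in[1];     /* row (1 1 0) */
}

/* Returns 1 if all components of a distance vector are non-negative,
 * which is the condition rectangular tiling needs. */
int all_nonnegative(const int d[2])
{
    return d[0] >= 0 && d[1] >= 0;
}
```

For the example vector (1, -1), `skew` yields (1, 0), so `all_nonnegative` changes from 0 to 1; a vector such as (0, 1) is left as (0, 1).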
Parametric Tiling
Original loop:
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      S(i,j);

Tile loop i with tile size Ti
Tile loop j with tile size Tj

Tiled loop:
  for (it=1; it<=N; it+=Ti)
    for (jt=1; jt<=N; jt+=Tj)
      for (i=it; i<=min(N,it+Ti-1); i++)
        for (j=jt; j<=min(N,jt+Tj-1); j++)
          S(i,j);
Performance of tiled code can vary greatly with choice of tile sizes
→ Model-driven and/or empirical search for best tile sizes
Parametric tile sizes
◦ Not fixed at compile time
◦ Runtime parameters
◦ Valuable for:
  - Auto-tuning systems
  - Generalized "ATLAS"
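Because the tile sizes are ordinary runtime arguments, an empirical search can time the same compiled kernel under several candidates without recompiling. The following is a minimal sketch under stated assumptions: the kernel (a sum of `A[i][j] * B[j][i]`), the matrix size `DIM`, and the names `tiled_kernel` and `pick_best_tile` are all hypothetical and not from the slides, and `clock()` is only a coarse timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define DIM 256
static double A[DIM][DIM], B[DIM][DIM];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Fills A and B with small integer values so that sums are exact in
 * double precision regardless of the order of accumulation. */
void init_matrices(void)
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            A[i][j] = (double)((i + j) % 7);
            B[i][j] = (double)((i * j) % 5);
        }
}

/* Parametrically tiled kernel: Ti and Tj are runtime parameters. */
double tiled_kernel(int Ti, int Tj)
{
    assert(Ti > 0 && Tj > 0);
    double sum = 0.0;
    for (int it = 0; it < DIM; it += Ti)
        for (int jt = 0; jt < DIM; jt += Tj)
            for (int i = it; i <= min_int(DIM - 1, it + Ti - 1); i++)
                for (int j = jt; j <= min_int(DIM - 1, jt + Tj - 1); j++)
                    sum += A[i][j] * B[j][i];
    return sum;
}

/* Times square tiles of each candidate size and returns the fastest one. */
int pick_best_tile(const int *candidates, size_t n)
{
    assert(n > 0);
    int best = candidates[0];
    double best_time = -1.0;
    for (size_t k = 0; k < n; k++) {
        clock_t start = clock();
        volatile double sink = tiled_kernel(candidates[k], candidates[k]);
        (void)sink;   /* keep the call from being optimized away */
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = candidates[k];
        }
    }
    return best;
}
```

A driver would call `init_matrices()` once and then `pick_best_tile` with a list such as {8, 16, 32, 64}. A real auto-tuner would repeat each measurement and could search rectangular tiles as well; this sketch times each candidate once.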
Approaches to Loop Tiling
TLOG and HiTLOG
◦ Handles only perfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Does not address parallelism

Pluto
◦ Handles imperfectly nested loops
◦ Tile sizes must be fixed at compile time
◦ Addresses parallelism

PrimeTile
◦ Handles imperfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Does not address parallelism

DynTile and PTile (this work): Systems with all positive features of existing tiling tools:
◦ Handle imperfectly nested loops
◦ Tile sizes can be runtime parameters
◦ Address parallelism
◦ Support multilevel tiling
Tiled Code Generation with Polyhedral Model
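Original loop:
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      S(i,j);

Iteration domain of S:

\[
\begin{pmatrix}
 1 &  0 & 0 & -1 \\
-1 &  0 & 1 &  0 \\
 0 &  1 & 0 & -1 \\
 0 & -1 & 1 &  0
\end{pmatrix}
\begin{pmatrix} i \\ j \\ N \\ 1 \end{pmatrix}
\ge 0
\]

Tiled loop:
  for (it=0; it<=floord(N,32); it++)
    for (jt=0; jt<=floord(N,32); jt++)
      for (i=max(1,32*it); i<=min(N,32*it+31); i++)
        for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
          S(i,j);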
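The constraint matrix on this slide can be read row by row as the loop bounds: i - 1 >= 0, N - i >= 0, j - 1 >= 0, N - j >= 0. The sketch below encodes those rows, assuming the column order (i, j, N, 1), and checks that the fixed-size tiled loop only visits points inside that domain. `floord_int` is a hypothetical stand-in for the `floord` helper used in the generated code, written here as floor division for a positive divisor.

```c
#include <assert.h>

/* Rows of the constraint matrix; columns multiply (i, j, N, 1). */
static const int domain_matrix[4][4] = {
    { 1,  0, 0, -1},   /*  i - 1 >= 0 */
    {-1,  0, 1,  0},   /* -i + N >= 0 */
    { 0,  1, 0, -1},   /*  j - 1 >= 0 */
    { 0, -1, 1,  0},   /* -j + N >= 0 */
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Returns 1 if (i, j) satisfies every inequality for the given n. */
int in_domain(int i, int j, int n)
{
    const int v[4] = {i, j, n, 1};
    for (int r = 0; r < 4; r++) {
        int dot = 0;
        for (int c = 0; c < 4; c++)
            dot += domain_matrix[r][c] * v[c];
        if (dot < 0)
            return 0;
    }
    return 1;
}

/* Floor of n / d for d > 0 (rounds toward negative infinity). */
int floord_int(int n, int d)
{
    assert(d > 0);
    return n >= 0 ? n / d : -((-n + d - 1) / d);
}

/* Executes the 32x32 tiled nest and counts the points it visits. */
long count_tiled_points(int n)
{
    long count = 0;
    for (int it = 0; it <= floord_int(n, 32); it++)
        for (int jt = 0; jt <= floord_int(n, 32); jt++)
            for (int i = max_int(1, 32 * it); i <= min_int(n, 32 * it + 31); i++)
                for (int j = max_int(1, 32 * jt); j <= min_int(n, 32 * jt + 31); j++) {
                    assert(in_domain(i, j, n));   /* tiling never leaves the domain */
                    count++;
                }
    return count;
}
```

For any n >= 1, `count_tiled_points(n)` should equal n * n, since the tiled nest only reorders the original iterations.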
After sequential tiling:
1. If no loop-carried dependences exist, then each tiling loop is directly parallelizable
2. If none of the tiling loops is parallel, then wavefront parallelization is always possible (all points in the same wavefront are independent of each other)
for each bin w {
  #pragma omp parallel for
  for each tile in w {
    /** Intra-tile loops (treated as a black box) */
  }
}

[Figure: tiles grouped into wavefronts w1 to w5]
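The wavefront idea can be sketched on a small recurrence whose dependences point in the +i and +j directions. Everything in this example is an assumption made for illustration: the recurrence `g[i][j] = g[i-1][j] + g[i][j-1]`, the grid size `DIM`, and the choice of `it + jt` as the wavefront number of tile (it, jt). If the compiler is not run with OpenMP enabled, the pragma is ignored and the code runs sequentially with the same result.

```c
#include <assert.h>
#include <string.h>

#define DIM 65          /* grid is DIM x DIM; row 0 and column 0 are boundary */
#define MOD 1000003L    /* keeps values small enough to avoid overflow */

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

static void init_grid(long g[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            g[i][j] = (i == 0 || j == 0) ? 1 : 0;
}

/* Reference: untiled sequential execution. */
void run_sequential(long g[DIM][DIM])
{
    init_grid(g);
    for (int i = 1; i < DIM; i++)
        for (int j = 1; j < DIM; j++)
            g[i][j] = (g[i - 1][j] + g[i][j - 1]) % MOD;
}

/* Intra-tile loops for tile (it, jt) of size T x T. */
static void run_tile(long g[DIM][DIM], int it, int jt, int T)
{
    int i0 = 1 + it * T, j0 = 1 + jt * T;
    for (int i = i0; i <= min_int(DIM - 1, i0 + T - 1); i++)
        for (int j = j0; j <= min_int(DIM - 1, j0 + T - 1); j++)
            g[i][j] = (g[i - 1][j] + g[i][j - 1]) % MOD;
}

/* Wavefront execution: tiles with the same it + jt run in parallel. */
void run_wavefront(long g[DIM][DIM], int T)
{
    assert(T > 0);
    init_grid(g);
    int nt = (DIM - 1 + T - 1) / T;              /* tiles per dimension */
    for (int w = 0; w <= 2 * (nt - 1); w++) {    /* wavefronts in order */
        int lo = max_int(0, w - (nt - 1));
        int hi = min_int(w, nt - 1);
        #pragma omp parallel for
        for (int it = lo; it <= hi; it++)
            run_tile(g, it, w - it, T);
    }
}

/* Returns 1 if wavefront execution with tile size T matches the reference. */
int grids_match(int T)
{
    static long a[DIM][DIM], b[DIM][DIM];
    run_sequential(a);
    run_wavefront(b, T);
    return memcmp(a, b, sizeof a) == 0;
}
```

Each tile on wavefront w reads only values produced by tiles on wavefront w - 1 or by the boundary, which is why the tiles inside one wavefront can safely run concurrently.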
DynTile: Inspector Code for Dynamic Scheduled Parallel Execution
[Figure: tile iteration space over axes it and jt, with tiles grouped into wavefronts w1 to w5]

/** Inter-tile loops */
for it {
  for jt {
    /** Intra-tile loops (treated as a black box) */
  }
}
Step 1: Count #wavefronts and #tiles in each wavefront
Step 2: Allocate bins to store wavefronts
Step 3: Fill each bin with its corresponding tiles
Step 4: Execute in parallel all tiles in each bin

#wavefronts = 5
w1 has 2 tiles
w2 has 3 tiles
w3 has 4 tiles
w4 has 3 tiles
w5 has 4 tiles
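The four inspector steps can be sketched in C as follows. This is a minimal illustration under assumptions, not DynTile's generated code: the tile space is a plain `nti` x `ntj` rectangle, the wavefront number of tile (it, jt) is assumed to be `it + jt`, and `Schedule`, `inspect`, `execute`, and `stamp_body` are hypothetical names. The tile counts it produces therefore do not match the figure's example.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_T 8   /* upper bound on tiles per dimension for the sample body */

typedef struct { int it, jt; } Tile;

typedef struct {
    int    num_wavefronts;  /* number of bins */
    int   *count;           /* count[w]: number of tiles in wavefront w */
    Tile **bins;            /* bins[w][k]: k-th tile of wavefront w */
} Schedule;

/* Assumed wavefront number of a tile. */
static int wavefront_of(int it, int jt) { return it + jt; }

Schedule inspect(int nti, int ntj)
{
    assert(nti > 0 && ntj > 0);
    Schedule s;

    /* Step 1: count wavefronts and tiles per wavefront. */
    int max_w = 0;
    for (int it = 0; it < nti; it++)
        for (int jt = 0; jt < ntj; jt++)
            if (wavefront_of(it, jt) > max_w)
                max_w = wavefront_of(it, jt);
    s.num_wavefronts = max_w + 1;
    s.count = calloc((size_t)s.num_wavefronts, sizeof *s.count);
    assert(s.count != NULL);
    for (int it = 0; it < nti; it++)
        for (int jt = 0; jt < ntj; jt++)
            s.count[wavefront_of(it, jt)]++;

    /* Step 2: allocate one bin per wavefront. */
    s.bins = malloc((size_t)s.num_wavefronts * sizeof *s.bins);
    assert(s.bins != NULL);
    for (int w = 0; w < s.num_wavefronts; w++) {
        s.bins[w] = malloc((size_t)s.count[w] * sizeof **s.bins);
        assert(s.bins[w] != NULL);
    }

    /* Step 3: fill each bin with its tiles. */
    int *fill = calloc((size_t)s.num_wavefronts, sizeof *fill);
    assert(fill != NULL);
    for (int it = 0; it < nti; it++)
        for (int jt = 0; jt < ntj; jt++) {
            int w = wavefront_of(it, jt);
            s.bins[w][fill[w]].it = it;
            s.bins[w][fill[w]].jt = jt;
            fill[w]++;
        }
    free(fill);
    return s;
}

/* Step 4: bins run in order; tiles inside a bin run in parallel. */
void execute(const Schedule *s, void (*tile_body)(Tile, int, void *), void *ctx)
{
    for (int w = 0; w < s->num_wavefronts; w++) {
        #pragma omp parallel for
        for (int k = 0; k < s->count[w]; k++)
            tile_body(s->bins[w][k], w, ctx);   /* intra-tile loops: black box */
    }
}

void free_schedule(Schedule *s)
{
    for (int w = 0; w < s->num_wavefronts; w++)
        free(s->bins[w]);
    free(s->bins);
    free(s->count);
}

/* Sample tile body: records the (1-based) wavefront in which a tile ran. */
void stamp_body(Tile t, int w, void *ctx)
{
    int (*stamp)[MAX_T] = ctx;
    stamp[t.it][t.jt] = w + 1;
}
```

For a 3 x 3 tile space, `inspect(3, 3)` yields five wavefronts holding 1, 2, 3, 2, and 1 tiles. Passing `stamp_body` and an `int[MAX_T][MAX_T]` array to `execute` records the wavefront in which each tile ran.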
DynTile: Implementation
[Figure: DynTile tool flow]
Stages and tools: Pre-process, Clan, Pluto, Modified CLooG, Parser + AST Generator, Convex Hull Generator (using ISL), Tiling Transformer, Inspector Code Specifier, Code Generator
Data passed between stages: sequence of loop nests; statement polyhedra + affine transforms (for rectangular tileability); tileable loop code (with preserved embedding information); loop ASTs; statement polyhedra; convex-hull loop AST; tiled loop ASTs; parallel tiled loop ASTs; parallel tiled loop code
PTile: Loop Generation
Representation of Statement Domains
◦ Set of affine inequalities
S:
v1, v2, …, vn are loop variables (v1 outermost and vn innermost)
lu  gcc-novec  8.30s  9.15s  10.98s  2.56s  2.94s
lu  gcc-vec    8.29s  9.15s  10.98s  2.98s  2.43s
lu  icc-novec  6.30s  5.63s   7.49s  6.18s  1.60s
lu  icc-vec    6.36s  5.58s   6.52s  6.36s  1.62s
Results - 3
Sequential:
◦ PrimeTile performs best; DynTile is close
◦ gcc has more trouble optimizing code from DynTile than code from PrimeTile (difference between icc and gcc)
◦ PTile is slower because the order of execution of tiles impacts locality

Parallel:
◦ DynTile performs better than PTile (except for LU; we need to understand this better)
◦ All tiles in a wavefront are executed in parallel with DynTile, whereas the OpenMP parallel pragma works only with the outermost tiled parallel loop in PTile

Vectorization:
◦ Complexity of loop bounds in generated code appears to make it difficult for the compiler to vectorize
Summary
Developed DynTile and PTile, two parametric tiling systems with the following features:
◦ Handle imperfectly nested loops
◦ Allow tile sizes to be runtime parameters
◦ Address parallelism
◦ Support multi-level tiling
Ongoing: Much more extensive set of experiments to understand and improve the efficiency of the approaches for generation of parallel parametrically tiled code