University of Amsterdam, CSA (Computer Systems Architecture)
Introduction to Compiler Design: optimization and backend issues
Andy Pimentel, Computer Systems Architecture group, [email protected]
Reducible flow graphs
A flow graph is reducible when its edges can be partitioned into forward edges and back edges (edges whose head dominates their tail)
The forward edges must form an acyclic graph in which every node can be reached from the initial node
Exclusive use of structured control-flow statements such as if-then-else, while and break produces reducible control flow
Irreducible control flow can create loops that cannot be optimized
Reducible flow graphs (cont’d)
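Irreducible control-flow graphs can always be made reducible
This usually involves some duplication of code
[Figure: an irreducible flow graph over nodes a, b and c, and the reducible version obtained by duplicating node c as c']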
Dataflow analysis
Dataflow analysis is needed for global code optimization, e.g.:
  – Is a variable live on exit from a block?
  – Does a definition reach a certain point in the code?
Dataflow equations are used to collect dataflow information
A typical dataflow equation has the form
  $\mathit{out}[S] = \mathit{gen}[S] \cup (\mathit{in}[S] - \mathit{kill}[S])$
The notion of generation and killing depends on the dataflow analysis problem to be solved
Let's first consider Reaching Definitions analysis for structured programs
Reaching definitions
A definition of a variable x is a statement that assigns or may assign a value to x
An assignment to x is an unambiguous definition of x
An ambiguous definition of x can be an assignment through a pointer or a function call where x is passed by reference
Reaching definitions (cont’d)
When x is defined, we say the definition is generated
An unambiguous definition of x kills all other definitions of x
When all definitions of x that reach a certain point are the same, we can use this information to do some optimizations
Example: all definitions of x define x to be 1. Now, by performing constant folding (substituting 1 for x), we can do strength reduction if x is used in z = y * x: the multiplication reduces to z = y
Dataflow analysis for reaching definitions
During dataflow analysis we have to examine every path that can be taken to see which definitions reach a point in the code
Sometimes a certain path will never be taken, even if it is part of the flow graph
Since it is undecidable whether a path can be taken, we simply examine all paths
This won't cause false assumptions to be made for the code: it is a conservative simplification
It merely causes optimizations not to be performed
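The equation out[S] = gen[S] ∪ (in[S] − kill[S]) can also be solved iteratively over a general flow graph, taking in[B] as the union of out[P] over all predecessors P. The following is a minimal sketch of that standard iterative scheme, not necessarily the method used in the lecture. The FlowGraph type, its field names and the bit-set encoding (one bit per definition, one bit per predecessor block) are assumptions made for illustration.

```c
#include <stdint.h>

#define MAX_BLOCKS 8

/* Hypothetical flow-graph encoding: one bit per definition (at most 32)
   and one bit per predecessor block. */
typedef struct {
    int      nblocks;
    uint32_t gen[MAX_BLOCKS];   /* definitions generated in the block  */
    uint32_t kill[MAX_BLOCKS];  /* definitions killed by the block     */
    uint32_t pred[MAX_BLOCKS];  /* bit p set: block p is a predecessor */
    uint32_t in[MAX_BLOCKS];
    uint32_t out[MAX_BLOCKS];
} FlowGraph;

/* Repeat in[B] = union of out[P], out[B] = gen[B] | (in[B] & ~kill[B])
   until no out set changes. Every path in the graph is considered,
   which is the conservative choice described above. */
void reaching_definitions(FlowGraph *g)
{
    for (int b = 0; b < g->nblocks; b++) {
        g->in[b] = 0;
        g->out[b] = g->gen[b];
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 0; b < g->nblocks; b++) {
            uint32_t in = 0;
            for (int p = 0; p < g->nblocks; p++)
                if (g->pred[b] & (1u << p))
                    in |= g->out[p];
            uint32_t out = g->gen[b] | (in & ~g->kill[b]);
            if (out != g->out[b])
                changed = 1;
            g->in[b] = in;
            g->out[b] = out;
        }
    }
}
```

Because the sets only grow and are bounded by the number of definitions, the loop terminates.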
Available expressions
An expression e is available at a point p if every path from the initial node to p evaluates e, and the variables used by e are not changed after the last evaluation
An available expression e is killed if one of the variables used by e is assigned to
An available expression e is generated if it is evaluated
Note that if an expression e is assigned to a variable used by e (e.g. x = x + 1), this expression will not be generated
Available expressions (cont’d)
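Available expressions are mainly used to find common subexpressions
[Figure: two flow graphs over blocks B1, B2 and B3. In both, B1 contains t1 = 4 * i and B3 contains t2 = 4 * i. In the first graph the contents of B2 are left open (?); in the second graph B2 contains the statements i = ... and t0 = 4 * i.]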
Available expressions (cont’d)
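Dataflow equations:
  $\mathit{out}[B] = \mathit{e\_gen}[B] \cup (\mathit{in}[B] - \mathit{e\_kill}[B])$
  $\mathit{in}[B] = \bigcap_{P \in \mathit{pred}(B)} \mathit{out}[P]$ for B not initial
  $\mathit{in}[B_1] = \emptyset$ where $B_1$ is the initial block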
The confluence operator is intersection instead of the union!
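A minimal sketch of how these equations might be iterated, to show the effect of the intersection. The ExprGraph type, its fields and the bit-set encoding are illustrative assumptions. The sketch also assumes that block 0 is the initial block and that every other block has at least one predecessor. Starting the non-initial out sets at "all expressions except those killed" is the usual optimistic initialization for an intersection problem.

```c
#include <stdint.h>

#define MAX_BLOCKS 8

/* Hypothetical encoding: one bit per expression, one bit per predecessor. */
typedef struct {
    int      nblocks;             /* block 0 is the initial block   */
    uint32_t universe;            /* all expressions in the program */
    uint32_t e_gen[MAX_BLOCKS];
    uint32_t e_kill[MAX_BLOCKS];
    uint32_t pred[MAX_BLOCKS];
    uint32_t in[MAX_BLOCKS];
    uint32_t out[MAX_BLOCKS];
} ExprGraph;

void available_expressions(ExprGraph *g)
{
    g->in[0] = 0;                                  /* in[B1] is empty */
    g->out[0] = g->e_gen[0];
    for (int b = 1; b < g->nblocks; b++)
        g->out[b] = g->universe & ~g->e_kill[b];   /* optimistic start */

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < g->nblocks; b++) {
            uint32_t in = g->universe;
            for (int p = 0; p < g->nblocks; p++)
                if (g->pred[b] & (1u << p))
                    in &= g->out[p];               /* intersection, not union */
            uint32_t out = g->e_gen[b] | (in & ~g->e_kill[b]);
            if (out != g->out[b])
                changed = 1;
            g->in[b] = in;
            g->out[b] = out;
        }
    }
}
```

With the three-block shape of the earlier figure (B1 to B2 to B3, plus B1 to B3), 4 * i is available at the entry of the last block only if the middle block recomputes it after assigning i.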
Liveness analysis
A variable is live at a certain point in the code if it holds a value that may be needed in the future
Solve backwards:
  – Find a use of a variable
  – This variable is live between statements that have the found use as next statement
  – Recurse until you find a definition of the variable
Dataflow for liveness
Using the sets use[B] and def[B]
  – def[B] is the set of variables assigned values in B prior to any use of that variable in B
  – use[B] is the set of variables whose values may be used in B prior to any definition of the variable
A variable comes live into a block (in in[B]) if it is either used before redefinition, or it is live coming out of the block and is not redefined in the block
A variable comes live out of a block (in out[B]) if and only if it is live coming into one of its successors
Dataflow equations for liveness
  $\mathit{in}[B] = \mathit{use}[B] \cup (\mathit{out}[B] - \mathit{def}[B])$
  $\mathit{out}[B] = \bigcup_{S \in \mathit{succ}(B)} \mathit{in}[S]$
Note the relation to the reaching-definitions equations: the roles of in and out are interchanged
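The backward direction is easiest to see in code. The sketch below iterates the two liveness equations until they stabilise; the LiveGraph type and the bit-set encoding (one bit per variable, one bit per successor block) are assumptions for illustration only. Blocks are visited in reverse order purely because that tends to converge faster for a backward problem.

```c
#include <stdint.h>

#define MAX_BLOCKS 8

/* Hypothetical encoding: one bit per variable, one bit per successor. */
typedef struct {
    int      nblocks;
    uint32_t use[MAX_BLOCKS];   /* used in B before any definition in B */
    uint32_t def[MAX_BLOCKS];   /* defined in B before any use in B     */
    uint32_t succ[MAX_BLOCKS];  /* bit s set: block s is a successor    */
    uint32_t in[MAX_BLOCKS];
    uint32_t out[MAX_BLOCKS];
} LiveGraph;

void liveness(LiveGraph *g)
{
    for (int b = 0; b < g->nblocks; b++)
        g->in[b] = g->out[b] = 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = g->nblocks - 1; b >= 0; b--) {
            uint32_t out = 0;
            for (int s = 0; s < g->nblocks; s++)
                if (g->succ[b] & (1u << s))
                    out |= g->in[s];               /* out[B] = union of in[S] */
            uint32_t in = g->use[b] | (out & ~g->def[b]);
            if (in != g->in[b])
                changed = 1;
            g->out[b] = out;
            g->in[b] = in;
        }
    }
}
```

For a loop that sums i into s while i < n, only n is live on entry to the first block, while i, n and s are all live around the loop body.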
Algorithms for global optimizations
Global common subexpression elimination
First calculate the sets of available expressions
For every statement s of the form x = y + z where y + z is available do the following
  – Search backwards in the graph for the evaluations of y + z
  – Create a new variable u
  – Replace statements w = y + z by u = y + z; w = u
  – Replace statement s by x = u
Copy propagation
Suppose a copy statement s of the form x = y is encountered. We may now substitute a use of x by a use of y if
  – Statement s is the only definition of x reaching the use
  – On every path from statement s to the use, there are no assignments to y
Copy propagation (cont’d)
To find the set of copy statements we can use, we define a new dataflow problem
An occurrence of a copy statement generates this statement
An assignment to x or y kills the copy statement x = y
Dataflow equations:
  $\mathit{out}[B] = \mathit{c\_gen}[B] \cup (\mathit{in}[B] - \mathit{c\_kill}[B])$
  $\mathit{in}[B] = \bigcap_{P \in \mathit{pred}(B)} \mathit{out}[P]$ for B not initial
  $\mathit{in}[B_1] = \emptyset$ where $B_1$ is the initial block
Copy propagation (cont’d)
For each copy statement s: x = y do
  – Determine the uses of x reached by this definition of x
  – Determine if for each of those uses this is the only definition reaching it (that is, $s \in \mathit{in}[B_{use}]$, where $B_{use}$ is the block containing the use)
  – If so, remove s and replace the uses of x by uses of y
Detection of loop-invariant computations
1. Mark invariant those statements whose operands are constant or have all their reaching definitions outside the loop
2. Repeat step 3 until no new statements are marked invariant
3. Mark invariant those statements whose operands either are constant, have all their reaching definitions outside the loop, or have exactly one reaching definition, which is marked invariant
Code motion
1. Create a pre-header for the loop
2. Find loop-invariant statements
3. For each statement s defining x found in step 2, check that
   (a) it is in a block that dominates all exits of the loop
   (b) x is not defined elsewhere in the loop
   (c) all uses of x in the loop can only be reached from this statement s
4. Move the statements that satisfy these conditions to the pre-header
Code motion (cont’d)
[Figure: three flow graphs over blocks B1 to B5, labelled Condition (a), Condition (b) and Condition (c), one for each of the conditions on the previous slide. The statements shown are i = 1; if u < v goto B3; i = 2; u = u + 1; v = v − 1; if v <= 20 goto B5; j = i. The third graph additionally contains i = 3 and k = i.]
Detection of induction variables
A basic induction variable i is a variable that only has assignments of the form i = i ± c
Associated with each induction variable j is a triple (i, c, d) where i is a basic induction variable and c and d are constants such that j = c * i + d
In this case j belongs to the family of i
The basic induction variable i belongs to its own family, with the associated triple (i, 1, 0)
Detection of induction variables (cont’d)
Find all basic induction variables in the loop
Find variables k with a single assignment in the loop with one of the following forms:
  k = j * b, k = b * j, k = j / b, k = j + b, k = b + j, where b is a constant and j is an induction variable
If j is not basic and in the family of i then there must be
  – No assignment of i between the assignment of j and k
  – No definition of j outside the loop that reaches k
Strength reduction for induction variables
Consider each basic induction variable i in turn. For each variable j in the family of i with triple (i, c, d):
  – Create a new variable s
  – Replace the assignment to j by j = s
  – Immediately after each assignment i = i ± n append s = s ± c * n
  – Place s in the family of i with triple (i, c, d)
  – Initialize s in the preheader: s = c * i + d
Strength reduction for induction variables (cont’d)
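Before strength reduction:
  B1: i = m − 1; t1 = 4 * n; v = a[t1]
  B2: i = i + 1; t2 = 4 * i; t3 = a[t2]; if t3 < v goto B2
  followed by the test: if i < n goto B5
After strength reduction:
  B1: i = m − 1; t1 = 4 * n; v = a[t1]; s2 = 4 * i
  B2: i = i + 1; s2 = s2 + 4; t2 = s2; t3 = a[t2]; if t3 < v goto B2
  followed by the test: if i < n goto B5
[Figure: both versions drawn as flow graphs over blocks B1 to B5]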
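The same transformation can be written out at source level. The two functions below are a hand-made sketch and not compiler output: j has the triple (i, 4, 2), and the second version follows the steps listed earlier, with a new variable s initialised before the loop and bumped by c * n = 4 right after i is incremented. The function names and the choice of c = 4, d = 2 are illustrative assumptions.

```c
/* Original loop: j = 4 * i + 2 costs a multiplication per iteration. */
long sum_original(int n)
{
    long sum = 0;
    int i = 0;
    while (i < n) {
        int j = 4 * i + 2;   /* j is in the family of i, triple (i, 4, 2) */
        sum += j;
        i = i + 1;
    }
    return sum;
}

/* After strength reduction: s shadows j and is updated by addition only. */
long sum_reduced(int n)
{
    long sum = 0;
    int i = 0;
    int s = 4 * i + 2;       /* preheader: s = c * i + d */
    while (i < n) {
        int j = s;           /* the assignment to j becomes j = s */
        sum += j;
        i = i + 1;
        s = s + 4;           /* appended after i = i + 1: s = s + c * 1 */
    }
    return sum;
}
```

Both functions return the same value for any n; only the operation performed in each iteration differs.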
Elimination of induction variables
Consider each basic induction variable i only used to compute other induction variables and tests
Take some j in i's family such that c and d from the triple (i, c, d) are simple
Rewrite tests if (i relop x) to r = c * x + d; if (j relop r) (this assumes c is positive)
Delete assignments to i from the loop
Do some copy propagation to eliminate j = s assignments formed during strength reduction
Alias Analysis
Aliases, e.g. caused by pointers, make dataflow analysis more complex (uncertainty regarding what is defined and used: x = *p might use any variable)
Use dataflow analysis to determine what a pointer might point to
  – in[B] contains for each pointer p the set of variables to which p could point at the beginning of block B
  – Elements of in[B] are pairs (p, a) where p is a pointer and a a variable, meaning that p might point to a
  – out[B] is defined similarly for the end of B
Alias Analysis (cont’d)
Define a function $\mathit{trans}_B$ such that $\mathit{trans}_B(\mathit{in}[B]) = \mathit{out}[B]$
$\mathit{trans}_B$ is composed of $\mathit{trans}_s$, for each stmt s of block B
  – If s is p = &a or p = &a ± c in case a is an array, then
      $\mathit{trans}_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, a)\}$
  – If s is p = q ± c for pointer q and nonzero integer c, then
      $\mathit{trans}_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, b) \mid (q, b) \in S \text{ and } b \text{ is an array variable}\}$
  – If s is p = q, then
      $\mathit{trans}_s(S) = (S - \{(p, b) \mid \text{any variable } b\}) \cup \{(p, b) \mid (q, b) \in S\}$
Alias Analysis (cont’d)
  – If s assigns to pointer p any other expression, then
      $\mathit{trans}_s(S) = S - \{(p, b) \mid \text{any variable } b\}$
  – If s is not an assignment to a pointer, then $\mathit{trans}_s(S) = S$
Dataflow equations for alias analysis:
  $\mathit{out}[B] = \mathit{trans}_B(\mathit{in}[B])$
  $\mathit{in}[B] = \bigcup_{P \in \mathit{pred}(B)} \mathit{out}[P]$
  where $\mathit{trans}_B(S) = \mathit{trans}_{s_k}(\mathit{trans}_{s_{k-1}}(\cdots(\mathit{trans}_{s_1}(S))\cdots))$
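The transfer functions are small enough to write down directly. In the sketch below the set S of pairs (p, b) is stored as one bit mask per pointer, so removing all pairs (p, b) is simply overwriting pts[p]. The AliasSet, Stmt and StmtKind types, and the arrays mask that marks which variables are arrays, are assumptions introduced for this illustration.

```c
#include <stdint.h>

#define MAX_PTRS 8

/* Points-to state S: bit b of pts[p] is set when the pair (p, b) is in S. */
typedef struct { uint32_t pts[MAX_PTRS]; } AliasSet;

typedef enum { ADDR_OF, PTR_PLUS_CONST, PTR_COPY, PTR_OTHER, NOT_PTR } StmtKind;

typedef struct {
    StmtKind kind;
    int p;   /* pointer being assigned                      */
    int a;   /* variable whose address is taken (ADDR_OF)   */
    int q;   /* source pointer (PTR_PLUS_CONST, PTR_COPY)   */
} Stmt;

/* trans_s for a single statement; arrays has a bit set per array variable. */
AliasSet trans_s(AliasSet S, Stmt s, uint32_t arrays)
{
    switch (s.kind) {
    case ADDR_OF:        S.pts[s.p] = 1u << s.a;           break; /* p = &a, p = &a +/- c */
    case PTR_PLUS_CONST: S.pts[s.p] = S.pts[s.q] & arrays; break; /* p = q +/- c          */
    case PTR_COPY:       S.pts[s.p] = S.pts[s.q];          break; /* p = q                */
    case PTR_OTHER:      S.pts[s.p] = 0;                   break; /* any other expression */
    case NOT_PTR:                                          break; /* S unchanged          */
    }
    return S;
}

/* trans_B: apply trans_s for statements s_1 .. s_k of the block in order. */
AliasSet trans_B(AliasSet S, const Stmt *stmts, int k, uint32_t arrays)
{
    for (int i = 0; i < k; i++)
        S = trans_s(S, stmts[i], arrays);
    return S;
}
```

For example, after p = &a; q = p + 1 the pointer q may point to a only if a is an array; if a is a scalar, q ends up with an empty points-to set.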
Alias Analysis (cont’d)
How to use the alias dataflow information? Examples:
In reaching definitions analysis (to determine gen and kill)
  – statement *p = a generates a definition of every variable b such that p could point to b
  – *p = a kills definition of b only if b is not an array and is the only variable p could possibly point to (to be conservative)
In liveness analysis (to determine def and use)
  – *p = a uses p and a. It defines b only if b is the unique variable that p might point to (to be conservative)
  – a = *p defines a, and represents the use of p and a use of any variable that p could point to
Code generation
Instruction selection
  – Was a problem in the CISC era (e.g., lots of addressing modes)
  – RISC instructions mean simpler instruction selection
  – However, new instruction sets introduce new, complicated instructions (e.g., multimedia instruction sets)
Instruction selection methods
Tree-based methods (IR is a tree)
  – Maximal Munch
  – Dynamic programming
  – Tree grammars
      Input tree treated as string using prefix notation
      Rewrite string using an LR parser and generate instructions as side effect of rewriting rules
If the DAG is not a tree, then it can be partitioned into multiple trees
Tree pattern based selection
Every target instruction is represented by a tree pattern
Such a tree pattern often has an associated cost
Instruction selection is done by tiling the IR tree with the instruction tree patterns
There may be many different ways an IR tree can be tiled, depending on the instruction set
Tree pattern based selection (cont’d)
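[Figure: example IR tree built from move, mem, + and * nodes with leaves temp 1, temp 2, const a, const b, const c and const d]

Tree patterns are written in prefix form; _ stands for an arbitrary subtree.

Name    Effect             Trees                                                                       Cycles
-       ri                 temp                                                                        0
ADD     ri <- rj + rk      +(_, _)                                                                     1
MUL     ri <- rj * rk      *(_, _)                                                                     1
ADDI    ri <- rj + c       +(_, const); +(const, _); const                                             1
LOAD    ri <- M[rj + c]    mem(+(_, const)); mem(+(const, _)); mem(const); mem(_)                      3
STORE   M[rj + c] <- ri    move(mem(+(_, const)), _); move(mem(+(const, _)), _); move(mem(const), _); move(mem(_), _)    3
MOVEM   M[rj] <- M[ri]     move(mem(_), mem(_))                                                        6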
Optimal and optimum tilings
The cost of a tiling is the sum of the costs of the tree patterns
An optimal tiling is one where no two adjacent tiles can be combined into a single tile of lower cost
An optimum tiling is a tiling with lowest possible cost
An optimum tiling is also optimal, but not vice-versa
Maximal Munch
Maximal Munch is an algorithm for optimal tiling
  – Start at the root of the tree
  – Find the largest pattern that fits
  – Cover the root node plus the other nodes in the pattern; the instruction corresponding to the tile is generated
  – Do the same for the resulting subtrees
Maximal Munch generates the instructions in reverse order!
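A compact sketch of Maximal Munch over a reduced version of the instruction table (temp, ADD, MUL, ADDI, LOAD only; STORE and MOVEM are left out to keep it short). The Node type and the munch function are illustrative assumptions. At each node the patterns are tried from largest to smallest. Unlike the plain top-down description, this sketch recurses into the uncovered subtrees before appending the instruction name, so the output comes out in execution order rather than reversed.

```c
#include <string.h>

typedef enum { TEMP, CONST, PLUS, TIMES, MEM } Op;

typedef struct Node {
    Op op;
    const struct Node *l, *r;   /* children; MEM uses only l */
} Node;

static int is(const Node *n, Op op) { return n && n->op == op; }

/* Tiles the tree rooted at n, appending instruction names to out.
   Returns the number of instructions emitted. */
int munch(const Node *n, char *out)
{
    int k = 0;
    switch (n->op) {
    case TEMP:
        return 0;                               /* value already in a register */
    case CONST:
        strcat(out, "ADDI ");                   /* lone constant */
        return 1;
    case PLUS:
        if (is(n->r, CONST))      k = munch(n->l, out);   /* +(_, const) */
        else if (is(n->l, CONST)) k = munch(n->r, out);   /* +(const, _) */
        else {
            k = munch(n->l, out);
            k += munch(n->r, out);
            strcat(out, "ADD ");
            return k + 1;
        }
        strcat(out, "ADDI ");
        return k + 1;
    case TIMES:
        k = munch(n->l, out);
        k += munch(n->r, out);
        strcat(out, "MUL ");
        return k + 1;
    case MEM:
        if (is(n->l, PLUS) && is(n->l->r, CONST))      k = munch(n->l->l, out);
        else if (is(n->l, PLUS) && is(n->l->l, CONST)) k = munch(n->l->r, out);
        else if (is(n->l, CONST))                      k = 0;
        else                                           k = munch(n->l, out);
        strcat(out, "LOAD ");
        return k + 1;
    }
    return 0;
}
```

The caller supplies an empty, sufficiently large string buffer for out; a real code generator would of course emit operands and allocate result registers as well.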
Dynamic programming
Dynamic programming is a technique for finding optimum solutions
Bottom up approach
  – For each node n the costs of all children are found recursively.
  – Then the minimum cost for node n is determined.
After cost assignment of the entire tree, instruction emission follows:
  – Emission(node n): for each leaf l_i of the tile selected at node n, perform Emission(l_i). Then emit the instruction matched at node n
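The cost-assignment phase might look as follows for the same reduced instruction set (temp 0 cycles, ADD, MUL and ADDI 1 cycle, LOAD 3 cycles). This is only a sketch: the Node type with its cost field and the function names are assumptions, and the emission phase is omitted. Each node is visited once, after its children, and keeps the cheapest total over all patterns that match at that node.

```c
typedef enum { TEMP, CONST, PLUS, TIMES, MEM } Op;

typedef struct Node {
    Op op;
    struct Node *l, *r;   /* children; MEM uses only l     */
    int cost;             /* minimum cost, set bottom-up   */
} Node;

static int is(const Node *n, Op op) { return n && n->op == op; }
static int min_int(int a, int b)    { return a < b ? a : b; }

void assign_costs(Node *n)
{
    if (!n) return;
    assign_costs(n->l);               /* children first: bottom-up */
    assign_costs(n->r);
    switch (n->op) {
    case TEMP:  n->cost = 0; break;
    case CONST: n->cost = 1; break;                               /* ADDI */
    case PLUS:
        n->cost = n->l->cost + n->r->cost + 1;                    /* ADD  */
        if (is(n->r, CONST)) n->cost = min_int(n->cost, n->l->cost + 1);  /* ADDI */
        if (is(n->l, CONST)) n->cost = min_int(n->cost, n->r->cost + 1);
        break;
    case TIMES:
        n->cost = n->l->cost + n->r->cost + 1;                    /* MUL  */
        break;
    case MEM:
        n->cost = n->l->cost + 3;                                 /* LOAD, pattern mem(_) */
        if (is(n->l, CONST))
            n->cost = min_int(n->cost, 3);
        if (is(n->l, PLUS) && is(n->l->r, CONST))
            n->cost = min_int(n->cost, n->l->l->cost + 3);
        if (is(n->l, PLUS) && is(n->l->l, CONST))
            n->cost = min_int(n->cost, n->l->r->cost + 3);
        break;
    }
}
```

A full implementation would also record which pattern achieved the minimum at each node, so that the emission step can replay that choice.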
Register allocation...a graph coloring problem
First do instruction selection assuming an infinite number of symbolic registers
Build an interference graph
  – Each node is a symbolic register
  – Two nodes are connected when they are live at the same time
Color the interference graph
  – Connected nodes cannot have the same color
  – Minimize the number of colors (maximum is the number of actual registers)
Coloring by simplification
Simplify interference graph G using heuristic method (K-coloring a graph is NP-complete)
  – Find a node m with fewer than K neighbors
  – Remove node m and its edges from G, resulting in G'. Store m on a stack
  – Color the graph G'
  – If G' can be colored, then G can be colored too, since m has fewer than K neighbors and at least one color is left for it
Coloring by simplification (cont’d)
Spill
  – If a node with fewer than K neighbors cannot be found in G
  – Mark a node n to be spilled, remove n and its edges from G (and stack n) and continue simplification
Select
  – Assign colors by popping the stack
  – Arriving at a spill node, check whether it can be colored. If not:
      The variable represented by this node will reside in memory (i.e. is spilled to memory)
      Actual spill code is inserted in the program
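The simplify, spill and select phases can be sketched as one function over an adjacency matrix. This is a simplified illustration, not a production allocator: the function name, the matrix representation, the limit MAX_NODES and the choice of the highest-degree node as spill candidate are all assumptions, and no spill code is generated; spilled nodes are only reported with color -1. It assumes k is at most MAX_NODES and that adj[i][i] is 0.

```c
#define MAX_NODES 16

/* adj[i][j] != 0 when symbolic registers i and j interfere.
   color[i] receives a register number 0..k-1, or -1 if node i is spilled.
   Returns the number of spilled nodes. */
int color_graph(int n, int adj[MAX_NODES][MAX_NODES], int k, int color[MAX_NODES])
{
    int removed[MAX_NODES] = {0};
    int stack[MAX_NODES];
    int top = 0;

    /* Simplify: remove and stack nodes until the graph is empty. */
    while (top < n) {
        int pick = -1, max_deg = -1, max_node = -1;
        for (int i = 0; i < n && pick < 0; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < n; j++)
                if (!removed[j] && adj[i][j]) deg++;
            if (deg < k) pick = i;                  /* fewer than K neighbors */
            else if (deg > max_deg) { max_deg = deg; max_node = i; }
        }
        if (pick < 0) pick = max_node;              /* spill candidate */
        removed[pick] = 1;
        stack[top++] = pick;
    }

    /* Select: pop nodes, giving each the lowest color no neighbor uses. */
    int spills = 0;
    for (int i = 0; i < n; i++) color[i] = -1;
    while (top > 0) {
        int m = stack[--top];
        int used[MAX_NODES] = {0};
        for (int j = 0; j < n; j++)
            if (adj[m][j] && color[j] >= 0) used[color[j]] = 1;
        int c = 0;
        while (c < k && used[c]) c++;
        if (c < k) color[m] = c;
        else spills++;                              /* actual spill */
    }
    return spills;
}
```

A cycle of four nodes with K = 2 shows why a spill candidate is re-checked during select: no node has fewer than two neighbors, yet all four nodes receive a color. A triangle with K = 2 needs one real spill.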
Coalescing
If there is no interference edge between the source and destination of a move, the move is redundant
Removing the move and joining the nodes is called coalescing
Coalescing can increase the degree of a node
A graph that was K-colorable before coalescing might not be afterwards
Sketch of the algorithm with coalescing
Label move-related nodes in interference graph
While interference graph is nonempty
  – Simplify, using non-move-related nodes
  – Coalesce move-related nodes using conservative coalescing
      Coalesce only when the resulting node has fewer than K neighbors with a significant degree (degree of at least K)
  – No simplifications/coalescings: “freeze” a move-related node of a low degree, i.e. do not consider its moves for coalescing anymore
  – Spill
Select
Register allocation: an example
Live in: k, j
  g = mem[j+12]
  h = k − 1
  f = g * h
  e = mem[j+8]
  m = mem[j+16]
  b = mem[f]
  c = e + 8
  d = c
  k = m + 4
  j = b
  goto d
Live out: d, k, j
[Figure: interference graph with nodes j, k, g, h, f, e, m, b, c and d]
Assume a 4-coloring (K = 4)
Simplify by removing and stacking nodes with < 4 neighbors (g,h,k,f,e,m)
Register allocation: an example (cont’d)
After removing and stacking the nodes g,h,k,f,e,m:
[Figure: after simplification the graph consists of nodes j, b, d and c; after coalescing it consists of the merged nodes j&b and d&c]
Coalesce now and simplify again
Register allocation: an example (cont’d)
4 registers available: R0 R1 R2 R3
Stacked elements: d&c, j&b, m, e, f, k, g, h
[Figure: two copies of the interference graph with nodes j, k, g, h, f, e, m, b, c and d, showing a coloring step]
Register allocation: an example (cont’d)
4 registers available: R0 R1 R2 R3
Stacked elements: m, e, f, k, g, h
[Figure: two copies of the interference graph with nodes j, k, g, h, f, e, m, b, c and d, showing the next coloring step]
ETC., ETC.
No spills are required and both moves were optimized away
Instruction scheduling
Increase ILP (e.g., by avoiding pipeline hazards)
  – Essential for VLIW processors
Scheduling at basic block level: list scheduling
  – System resources represented by matrix Resources × Time
  – Position in matrix is true or false, indicating whether the resource is in use at that time
  – Instructions represented by matrices Resources × Instruction duration
  – Using dependency analysis, the schedule is made by fitting instructions as tightly as possible
List scheduling (cont’d)
Finding an optimal schedule is an NP-complete problem, so use heuristics, e.g. at an operation conflict schedule the most time-critical operation first
For a VLIW processor, the maximum instruction duration is used for scheduling, which is painful for memory loads!
Basic blocks usually are small (5 operations on average), so the benefit of scheduling within a block is limited; this motivates Trace Scheduling
Trace scheduling
Schedule instructions over code sections larger than basic blocks, so-called traces
A trace is a series of basic blocks that does not extend beyond loop boundaries
Apply list scheduling to whole trace
Scheduling code inside a trace can move code beyond basic block boundaries; compensate for this by adding code to the off-trace edges
Trace scheduling (cont’d)
[Figure: flow graph with basic blocks BB1, BB2, BB3 and BB4]
Trace scheduling (cont’d)
[Figure: three cases, (a), (b) and (c), of moving an operation (among Op A, Op B and Op C) across basic block boundaries in a trace, each showing the in-trace and off-trace basic blocks. An operation moved below a branch in the trace requires copied code in the off-trace basic block. An operation moved above a branch is moved code that is only allowed if it has no side effects in the off-trace code. An operation moved before Op A requires copied code in the off-trace basic block.]
Trace scheduling (cont’d)
Trace selection
Because of the code copies, the trace that is most often executed has to be scheduled first
A longer trace brings more opportunities for ILP (loop unrolling!)
Use heuristics about how often a basic block is executed and which paths to and from a block have the most chance of being taken (e.g. inner loops) or use profiling (input dependent)
Other methods to increase ILP
Loop unrolling
  – Technique for increasing the amount of code available inside a loop: make several copies of the loop body
  – Reduces loop control overhead and increases ILP (more instructions to schedule)
  – When using trace scheduling this results in longer traces and thus more opportunities for better schedules
  – In general, the more copies, the better the job the scheduler can do, but the additional gain from each extra copy diminishes
Loop unrolling (cont’d)
Example
for (i = 0; i < 100; i++)
    a[i] = a[i] + b[i];
becomes
for (i = 0; i < 100; i += 4) {
    a[i] = a[i] + b[i];
    a[i+1] = a[i+1] + b[i+1];
    a[i+2] = a[i+2] + b[i+2];
    a[i+3] = a[i+3] + b[i+3];
}
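The example above works because 100 is a multiple of 4. When the trip count is not known to be a multiple of the unroll factor, a cleanup loop is needed for the leftover iterations. The function below is a sketch of that general case; the name add_unrolled and the parameter n are illustrative assumptions.

```c
/* a[i] = a[i] + b[i] for 0 <= i < n, unrolled by a factor of 4. */
void add_unrolled(int *a, const int *b, int n)
{
    int i = 0;
    /* Main unrolled loop: runs while four full iterations remain. */
    for (; i + 3 < n; i += 4) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Cleanup loop: the remaining n % 4 iterations. */
    for (; i < n; i++)
        a[i] = a[i] + b[i];
}
```

With n = 10 the unrolled loop handles elements 0 to 7 and the cleanup loop handles elements 8 and 9.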
Software pipelining
Also a technique for using the parallelism available in several loop iterations
Software pipelining simulates a hardware pipeline, hence its name
[Figure: iterations 0 to 4 drawn overlapping in time; one software-pipelined iteration combines operations from several of them]
There are three phases: Prologue, Steady state and Epilogue
Software pipelining (cont’d)
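Loop: LD    F0,0(R1)      (body)
      ADDD  F4,F0,F2      (body)
      SD    0(R1),F4      (body)
      SBGEZ R1, Loop      (loop control)
[Figure: schedule over time steps T0, T1, T2, T..., Tn, Tn+1 and Tn+2, divided into Prologue, Steady state and Epilogue. The prologue starts the first LD and ADDD operations, the steady-state loop issues SD, ADDD, LD and SBGEZ Loop together for different iterations, and the epilogue completes the remaining ADDD and SD operations.]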
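The three phases can be mimicked at source level. The sketch below adds a scalar s to every element of an array, which is what the LD / ADDD / SD loop body computes, but restructures the loop so that each steady-state iteration stores the result of iteration i, adds for iteration i + 1 and loads for iteration i + 2. The function name, the upward-counting index and the fallback for very short arrays are assumptions made for illustration; a compiler would do this on the instruction level.

```c
/* x[i] = x[i] + s for 0 <= i < n, hand-pipelined over three stages. */
void add_scalar_pipelined(double *x, int n, double s)
{
    if (n < 3) {                       /* too short to fill the pipeline */
        for (int i = 0; i < n; i++)
            x[i] += s;
        return;
    }

    /* Prologue: start iterations 0 and 1. */
    double loaded = x[0];              /* LD   iteration 0 */
    double summed = loaded + s;        /* ADDD iteration 0 */
    loaded = x[1];                     /* LD   iteration 1 */

    /* Steady state: one stage from each of three iterations. */
    int i;
    for (i = 0; i < n - 2; i++) {
        x[i] = summed;                 /* SD   iteration i     */
        summed = loaded + s;           /* ADDD iteration i + 1 */
        loaded = x[i + 2];             /* LD   iteration i + 2 */
    }

    /* Epilogue: drain iterations n - 2 and n - 1. */
    x[i] = summed;                     /* SD   iteration n - 2 */
    summed = loaded + s;               /* ADDD iteration n - 1 */
    x[i + 1] = summed;                 /* SD   iteration n - 1 */
}
```

Inside the steady-state loop the three statements belong to different iterations, so none of them has to wait for a result produced in the same pass through the loop.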
Modulo scheduling
Scheduling multiple loop iterations using software pipelining can create false dependencies between variables used in different iterations
Renaming the variables used in different iterations is called modulo scheduling
When using n variables for representing the same variable, the steady state of the loop has to be unrolled n times
Compiler optimizations for cache performance
Merging arrays (better spatial locality)
int val[SIZE];
int key[SIZE];
becomes
struct merge { int val, key; };
struct merge m_array[SIZE];
Loop interchange
Loop fusion and fission
Blocking (better temporal locality)
Loop interchange
Exchanging of nested loops to change the memory footprint
  – Better spatial locality
for (i = 0; i < 50; i++)
    for (j = 0; j < 100; j++)
        a[j][i] = b[j][i] * c[j][i];
becomes
for (j = 0; j < 100; j++)
    for (i = 0; i < 50; i++)
        a[j][i] = b[j][i] * c[j][i];
Loop fission
Split a loop with independent statements into multiple loops
  – Enables other transformations (e.g. vectorization)
  – Results in smaller cache footprint (better temporal locality)
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i];
d[i] = e[i] * f[i];
}
becomes
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}
for (i = 0; i < n; i++) {
d[i] = e[i] * f[i];
}
Blocking
Perform computations on sub-matrices (blocks), e.g. when multiple matrices are accessed both row by row and column by column
Matrix multiplication x = y*z
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++) {
            r = r + y[i][k]*z[k][j];
        };
        x[i][j] = r;
    };
[Figure: access patterns of X, Y and Z during the multiplication, marking elements as not touched, older access or recent access, shown without blocking and with blocking]
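A blocked version of the multiplication above might look like the sketch below. The block size BLK, the matrix size N and the function name are illustrative assumptions; in practice the block size is chosen so that the sub-matrices being worked on fit in the cache. Because each (jj, kk) block contributes only a partial sum, x must be cleared first and then accumulated into.

```c
#define N   8
#define BLK 4   /* block size; assumed small enough for the cache */

/* x = y * z, computed block by block over the j and k dimensions. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0;

    for (int jj = 0; jj < N; jj += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + BLK && j < N; j++) {
                    double r = 0;
                    /* Partial dot product over one block of k. */
                    for (int k = kk; k < kk + BLK && k < N; k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
}
```

The inner loops touch only a BLK by BLK piece of z and a BLK-wide strip of y at a time, so those elements are reused while they are still in the cache instead of being evicted between uses.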