Page 1: Code Optimization

CS502: Compiler Design
Manas Thakur, Fall 2020

Page 2: Fast. Faster. Fastest?

(Figure: the compiler pipeline; the Symbol Table is shared by all phases.)

Front end:
    Character stream → Lexical Analyzer → Token stream
    → Syntax Analyzer → Syntax tree
    → Semantic Analyzer → Syntax tree
    → Intermediate Code Generator → Intermediate representation

Back end:
    → Machine-Independent Code Optimizer → Intermediate representation
    → Code Generator → Target machine code
    → Machine-Dependent Code Optimizer → Target machine code

Page 3: Role of Code Optimizer

● Make the program better

– time, memory, energy, ...

● No guarantees in this land!

– Will a particular optimization for sure improve something?

– Will performing an optimization affect something else?

– In what order should I perform the optimizations?

– At what “scope” should a certain optimization be performed?

– Is the optimizer fast enough?

● Can an optimized program be optimized further?

Page 4: Full employment theorem for compiler writers

● Statement: There is no fully optimizing compiler.

● Assume it exists:

– such that it transforms a program P to the smallest program Opt(P) that has the same behaviour as P.

– The halting problem comes to the rescue:
  ● The smallest program that never halts is:

        L1: goto L1

– Thus, a fully optimizing compiler could solve the halting problem: to decide whether a given program never halts, just check whether its optimized form is L1: goto L1!

– But the halting problem is undecidable.

– Hence, a fully optimizing compiler can’t exist!

● Therefore we talk just about an optimizing compiler.

– and keep working without worrying about future prospects!

Page 5: How to perform optimizations?

● Analysis

– Go over the program

– Identify some (potentially useful) properties

● Transformation

– Use the information computed by the analysis to transform the program

● without affecting the semantics

● An example that we have (not literally) seen:

– Compute liveness information

– Delete assignments to variables that are dead

Page 6: Classifying optimizations

● Based on scope:

– Local to basic blocks

– Intraprocedural

– Interprocedural

● Based on positioning:

– High-level (transform source code or high-level IR)

– Low-level (transform mid/low-level IR)

● Based on (in)dependence w.r.t. target machine:

– Machine independent (general enough)

– Machine dependent (specific to the architecture)

Page 7: May versus Must information

● Consider the program:

    if (c) {
        a = ...
        b = ...
    } else {
        a = ...
        c = ...
    }

● Which variables may be assigned?

– a, b, c

● Which variables must be assigned?

– a

● May analysis:

– the computed information may hold in at least one execution of the program.

● Must analysis:

– the computed information must hold every time the program is executed.

Page 8: Many, many optimizations

● Constant folding, constant propagation, tail-call elimination, redundancy elimination, dead code elimination, loop-invariant code motion, loop splitting, loop fusion, strength reduction, array scalarization, inlining, synchronization elision, cloning, data prefetching, parallelization, etc.

● How do they interact?

– Optimist: we get the sum of all improvements.

– Realist: many are in direct opposition.

● Let us study some of them!

Page 9: Constant propagation

● Idea:

– If the value of a variable is known to be a constant at compile-time, replace the use of the variable with the constant.

Before:
    n = 10;
    c = 2;
    for (i = 0; i < n; ++i)
        s = s + i * c;

After:
    n = 10;
    c = 2;
    for (i = 0; i < 10; ++i)
        s = s + i * 2;

– Usually a very helpful optimization

– e.g., Can we now unroll the loop?
  ● Why is it good?
  ● Why could it be bad?

– When can we eliminate n and c themselves?
  ● Now you know how well different optimizations might interact!

Page 10: Constant folding

● Idea:

– If the operands are known at compile-time, evaluate the expression at compile-time:

    r = 3.141 * 10;    becomes    r = 31.41;

– What if the code was?

    PI = 3.141;
    r = PI * 10;

  ● Constant propagation of PI followed by constant folding still yields r = 31.41.

– And what now?

    PI = 3.141;
    r = PI * 10;
    d = 2 * r;

  ● Propagation and folding applied repeatedly; this combination is called partial evaluation.
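As a sketch of how folding might be implemented, here is a minimal bottom-up constant folder over a hypothetical toy expression AST (the Expr type and fold routine are illustrative, not the course's code):

    /* Toy AST (hypothetical): a node is either a constant or a binary op. */
    typedef enum { CONST, BINOP } Kind;

    typedef struct Expr {
        Kind kind;
        double value;            /* valid when kind == CONST          */
        char op;                 /* '+' or '*' when kind == BINOP     */
        struct Expr *lhs, *rhs;  /* operands when kind == BINOP       */
    } Expr;

    /* Fold bottom-up: once both operands of a BINOP are constants,
       evaluate the operation at compile time (only + and * shown). */
    Expr *fold(Expr *e) {
        if (e->kind == BINOP) {
            e->lhs = fold(e->lhs);
            e->rhs = fold(e->rhs);
            if (e->lhs->kind == CONST && e->rhs->kind == CONST) {
                e->kind  = CONST;
                e->value = (e->op == '+') ? e->lhs->value + e->rhs->value
                                          : e->lhs->value * e->rhs->value;
            }
        }
        return e;
    }

On the tree for 3.141 * 10, fold rewrites the node to the constant 31.41, which is exactly the r = 31.41 transformation above.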

Page 11: Common sub-expression elimination

● Idea:

– If a program computes the same value multiple times, reuse the value.

– Subexpressions can be reused until operands are redefined.

Before:
    a = b + c;
    c = b + c;
    d = b + c;

After:
    t = b + c;
    a = t;
    c = t;
    d = b + c;    (cannot reuse t here: c was redefined)

Page 12: Copy propagation

● Idea:

– After an assignment x = y, replace the uses of x with y.

– Can only apply up to another assignment to x, or

... another assignment to y!

– What if there was an assignment y = z earlier?
  ● Apply transitively to all assignments.

Before:
    x = y;
    if (x > 1)
        s = x + f(x);

After:
    x = y;
    if (y > 1)
        s = y + f(y);
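For the transitive case, a small made-up illustration with an earlier copy y = z:

    y = z;                    y = z;
    x = y;          ==>       x = z;
    s = x + f(x);             s = z + f(z);

Both copies may then become dead and can be removed by dead-code elimination (next slide).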

Page 13: Dead-code elimination

● Idea:

– If the result of a computation is never used, remove the computation.

– Remove code that assigns to dead variables.
  ● Liveness analysis done before would help!

– This may, in turn, create more dead code.
  ● Dead-code elimination usually works transitively.

Before:
    x = y + 1;
    y = 1;
    x = 2 * z;

After:
    y = 1;
    x = 2 * z;

Page 14: Unreachable-code elimination

● Idea:

– Eliminate code that can never be executed

– High-level: look for if (false) or while (false)
  ● perhaps after constant folding!

– Low-level: more difficult
  ● Code is just labels and gotos
  ● Traverse the CFG, marking reachable blocks (a sketch follows below)

    #define DEBUG 0
    if (DEBUG)
        print("Current value = ", v);
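A minimal sketch of the low-level approach, assuming a simple array-based CFG (names and sizes are illustrative):

    #define MAXN 1024

    int nsucc[MAXN];        /* number of successors of block n        */
    int succ[MAXN][2];      /* successor block ids (at most two here) */
    int reachable[MAXN];    /* set to 1 for every block we can reach  */

    /* Depth-first traversal from the entry block; any block left
       unmarked afterwards is unreachable and can be deleted. */
    void mark(int n) {
        if (reachable[n]) return;
        reachable[n] = 1;
        for (int i = 0; i < nsucc[n]; i++)
            mark(succ[n][i]);
    }

Calling mark(entry) and then deleting every block b with reachable[b] == 0 implements the "traverse the CFG, marking reachable blocks" step.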

Page 15: Next class

● Next class:

– How to perform the optimizations that we have seen using a dataflow analysis?

● Starting with:

– The back-end full form of CFG (Control-Flow Graph)!

● Only about 10 more classes left.

– Hope this course is being successful in making (y)our hectic days a bit more exciting :-)

Page 16: Code Optimization (Cont.)

CS502: Compiler Design
Manas Thakur, Fall 2020

Page 17: Recall A2

● Is ‘a’ initialized in this program?

    int a;
    if (*) {
        a = 10;
    } else {
        // something that doesn’t touch ‘a’
    }
    x = a;

– Reality during run-time: depends

– What to tell at compile-time?
  ● Is this a ‘must’ question or a ‘may’ question?
  ● Correct answer: No

– How do we obtain such answers?
  ● Need to model the control-flow

Page 18: Control-Flow Graph (CFG)

● Nodes represent instructions; edges represent flow of control

    a = 0
L1: b = a + 1
    c = c + b
    a = b * 2
    if a < N goto L1
    return c

(CFG: a = 0 → b = a + 1 → c = c + b → a = b * 2 → a < N; the comparison branches back to b = a + 1 or on to return c.)

Page 19: Some CFG terminology

● pred[n] gives predecessors of n

– pred[1]? pred[4]? pred[2]?

● succ[n] gives successors of n

– succ[2]? succ[5]?

● def(n) gives variables defined by n

– def(3) = {c}

● use(n) gives variables used by n

– use(3) = {b, c}

(CFG nodes, numbered:)

    1: a = 0
    2: b = a + 1
    3: c = c + b
    4: a = b * 2
    5: a < N
    6: return c

(Edges: 1→2, 2→3, 3→4, 4→5, 5→2, 5→6.)
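In code, one plausible way to keep these per-node facts (a sketch; the bitset layout and field names are assumptions, not something prescribed by the course):

    #define MAXV 64                      /* assume at most 64 variables */
    typedef unsigned long long VarSet;   /* one bit per variable        */

    typedef struct Node {
        int id;
        int npred, nsucc;
        struct Node **pred, **succ;      /* pred[n] and succ[n]          */
        VarSet def, use;                 /* def(n) and use(n) as bitsets */
    } Node;

For node 3 (c = c + b), def would have only c's bit set, while use would have the bits for b and c, matching def(3) = {c} and use(3) = {b, c}.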

Page 20: Live ranges revisited

● A variable is live if its current value may be used in the future.

– Insight:
  ● work from future to past
  ● backward over the CFG

● Live ranges:

– a: {1->2, 4->5->2}

– b: {2->3, 3->4}

– c: All edges except 1->2

(Same CFG as before, nodes 1–6.)

Page 21: Liveness

● A variable v is live on an edge if there is a directed path from that edge to a use of v that does not go through any def of v.

● A variable is live-in at a node if it is live on any of the in-edges of that node.

● A variable is live-out at a node if it is live on any of the out-edges of that node.

● Verify:

– a: {1->2, 4->5->2}

– b: {2->4}

(Same CFG as before, nodes 1–6.)

Page 22: Computation of liveness

● Say live-in of n is in[n], and live-out of n is out[n].

● We can compute in[n] and out[n] for any n as follows:

    in[n]  = use[n] ∪ (out[n] – def[n])
    out[n] = ∪_{s ∈ succ[n]} in[s]

These are called dataflow equations; the functions on their right-hand sides are called flow functions.

Page 23: Liveness as an iterative dataflow analysis (IDFA)

    for each n                                      // Initialize
        in[n] = {}; out[n] = {}
    repeat
        for each n
            in’[n] = in[n]; out’[n] = out[n]        // Save previous values
            in[n]  = use[n] ∪ (out[n] – def[n])     // Compute new values
            out[n] = ∪_{s ∈ succ[n]} in[s]
    until in’[n] == in[n] and out’[n] == out[n] ∀n  // Repeat till fixed point
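A runnable C sketch of this algorithm, under the assumption that variables are numbered and each set is a 64-bit bitset (so ∪ becomes |, set difference becomes & ~, and set comparison becomes ==):

    #include <string.h>

    #define MAXN 1024
    typedef unsigned long long VarSet;   /* one bit per variable (≤64) */

    int N;                               /* number of CFG nodes        */
    int nsucc[MAXN], succ[MAXN][2];
    VarSet use[MAXN], def[MAXN];
    VarSet in[MAXN], out[MAXN];

    void liveness(void) {
        memset(in, 0, sizeof in);        /* in[n]  = {} */
        memset(out, 0, sizeof out);      /* out[n] = {} */
        int changed = 1;
        while (changed) {                /* repeat ... until fixed point */
            changed = 0;
            for (int n = 0; n < N; n++) {
                VarSet in0 = in[n], out0 = out[n];    /* save previous  */
                out[n] = 0;
                for (int i = 0; i < nsucc[n]; i++)
                    out[n] |= in[succ[n][i]];         /* ∪ over succ[n] */
                in[n] = use[n] | (out[n] & ~def[n]);  /* use ∪ (out–def) */
                if (in[n] != in0 || out[n] != out0)
                    changed = 1;
            }
        }
    }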

Page 24: Liveness analysis example

(Worked on the CFG with nodes 1–6 from before, repeatedly applying

    in[n]  = use[n] ∪ (out[n] – def[n])
    out[n] = ∪_{s ∈ succ[n]} in[s]

until the sets stop changing; the slide's iteration table reaches a fixed point.)

Page 25: In backward order

(Same CFG, nodes 1–6, but now the nodes are processed in backward order.)

● Fixed point in only 3 iterations!

● Thus, the order of processing statements is important for efficiency.

Page 26: Complexity of our liveness computation algorithm

● For an input program of size N:

– ≤ N nodes in the CFG, and ≤ N variables
  ⇒ ≤ N elements per in/out set
  ⇒ O(N) time per set union

– The for loop performs a constant number of set operations per node
  ⇒ O(N²) time per pass of the for loop

– Each iteration of the for loop can only add to each set (monotonicity)

– The sizes of all in and out sets sum to at most 2N², bounding the number of iterations of the repeat loop
  ⇒ worst-case complexity of O(N⁴)

– Much less in practice (usually O(N) or O(N²)) if the nodes are ordered properly.

    repeat
        for each n
            in’[n] = in[n]; out’[n] = out[n]
            in[n]  = use[n] ∪ (out[n] – def[n])
            out[n] = ∪_{s ∈ succ[n]} in[s]
    until in’[n] == in[n] and out’[n] == out[n] ∀n

Page 27: Least fixed points

● There is often more than one solution for a given dataflow problem.

– Any solution to dataflow equations is a conservative approximation.

● Conservatively assuming a variable is live does not break the program:

– Just means more registers may be needed.

● Assuming a variable is dead when it is really live will break things.

● Many possible solutions; but we want the smallest: the least fixed point.

● The iterative algorithm computes this least fixed point.

Page 28: Confused!?

● Is compilers a theoretical topic or a practical one?

● Recall:

– “A sangam (confluence) of theory and practice.”

● Next class:

– We are not leaving a topic as important as IDFA so soon!

Page 29: Code Optimization (Cont.)

CS502: Compiler Design
Manas Thakur, Fall 2020

Page 30: Recall our IDFA algorithm

    for each n                                      // Initialize
        in[n] = ...; out[n] = ...
    repeat
        for each n
            in’[n] = in[n]; out’[n] = out[n]        // Save previous values
            in[n]  = ...                            // Compute new values
            out[n] = ...
    until in’[n] = in[n] and out’[n] = out[n] for all n   // Repeat till fixed point

Do we need to process all the nodes in each iteration?

Page 31: Worklist-based Implementation of IDFA

● Initialize a worklist of statements

● Forward analysis:

– Start with the entry node

– If OUT(n) changes, then add succ(n) to the worklist

● Backward analysis:

– Start with the exit node

– If IN(n) changes, then add pred(n) to the worklist

● In both cases, iterate till a fixed point.
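A C sketch of the backward case (liveness), reusing the bitset representation from earlier; the stack-based worklist and the onlist de-duplication flag are implementation choices, not part of the slide:

    #define MAXN 1024
    typedef unsigned long long VarSet;

    int N;
    int npred[MAXN], pred[MAXN][4];   /* predecessor ids (capped for the sketch) */
    int nsucc[MAXN], succ[MAXN][2];
    VarSet use[MAXN], def[MAXN], in[MAXN], out[MAXN];

    int worklist[MAXN], top;   /* stack of node ids awaiting processing */
    int onlist[MAXN];          /* avoids duplicate worklist entries     */

    static void push(int n) {
        if (!onlist[n]) { onlist[n] = 1; worklist[top++] = n; }
    }

    void liveness_worklist(int exit_node) {
        push(exit_node);                       /* start with the exit node */
        while (top > 0) {
            int n = worklist[--top];
            onlist[n] = 0;
            out[n] = 0;
            for (int i = 0; i < nsucc[n]; i++)
                out[n] |= in[succ[n][i]];
            VarSet in0 = in[n];
            in[n] = use[n] | (out[n] & ~def[n]);
            if (in[n] != in0)                  /* IN(n) changed:           */
                for (int i = 0; i < npred[n]; i++)
                    push(pred[n][i]);          /* ... re-queue pred(n)     */
        }
    }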

Page 32: Writing an IDFA (Cont.)

● Initialization of IN and OUT sets depends on the analysis:

– empty if the information grows

– all the nodes if the information shrinks

● Requirement for termination:

– unidirectional growth/shrinkage

– Called monotonicity

● Confluence/Meet operation (at control-flow merges):

– Union, or

– Intersection (which one depends on the analysis)

Page 33: Live-variable analysis revisited

● Direction:

– Backward

● Initialization:

– Empty sets

● Flow functions:

– out[n] = ∪_{s ∈ succ[n]} in[s]
– in[n]  = use[n] ∪ (out[n] – def[n])

● Confluence operation:

– Union

Page 34: Common sub-expressions revisited

● Idea:

– If a program computes the same value multiple times, reuse the value.

– Subexpressions can be reused until operands are redefined.

– Given a node n, the expressions computed at n are denoted gen(n), and the ones killed at n (their operands redefined) are denoted kill(n).

Before:
    a = b + c;
    c = b + c;
    d = b + c;

After:
    t = b + c;
    a = t;
    c = t;
    d = b + c;

Page 35: Common subexpressions as an IDFA

● Direction:

– Forward

● Initialization:

– Empty set at the entry; by the rule on the previous slide, the remaining sets start as the full set of expressions, since available-expression information shrinks under intersection.

● Flow functions:

– in[n]  = ∩_{p ∈ pred[n]} out[p]
– out[n] = gen[n] ∪ (in[n] – kill[n])

● Confluence operation:

– Intersection
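A C sketch mirroring the earlier liveness code, but forward and with intersection at merges; the full-set initialization for non-entry nodes follows the shrinking-information rule above (all names are illustrative):

    #define MAXN 1024
    typedef unsigned long long ExprSet;   /* one bit per expression (≤64) */

    int N;                                /* node 0 is the entry          */
    int npred[MAXN], pred[MAXN][4];
    ExprSet gen[MAXN], kill[MAXN];
    ExprSet in[MAXN], out[MAXN];

    void available_expressions(void) {
        ExprSet all = ~0ULL;              /* full set: information shrinks */
        for (int n = 0; n < N; n++) { in[n] = all; out[n] = all; }
        in[0] = 0;                        /* nothing available at entry    */
        out[0] = gen[0];
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int n = 1; n < N; n++) {
                ExprSet i = all;
                for (int p = 0; p < npred[n]; p++)
                    i &= out[pred[n][p]];             /* ∩ over pred[n]    */
                ExprSet o = gen[n] | (i & ~kill[n]);  /* gen ∪ (in – kill) */
                if (i != in[n] || o != out[n]) {
                    in[n] = i; out[n] = o; changed = 1;
                }
            }
        }
    }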

Page 36: Are we efficient enough?

● When can IDFAs take a lot of time?

● Which operations could be expensive?

– Confluence

– Equality

● Compilers may have to perform several IDFAs.

● How can we make an IDFA more efficient (perhaps with some loss of precision)?

    repeat
        for each n
            in’[n] = in[n]; out’[n] = out[n]
            in[n]  = use[n] ∪ (out[n] – def[n])
            out[n] = ∪_{s ∈ succ[n]} in[s]
    until in’[n] == in[n] and out’[n] == out[n] ∀n

Page 37: Basic Blocks

    a = 0
L1: b = a + 1
    c = c + b
    a = b * 2
    if a < N goto L1
    return c

Each instruction as a node: six CFG nodes, one per statement.

Using basic blocks: three nodes:

    [ a = 0 ]
    [ b = a + 1 ; c = c + b ; a = b * 2 ; a < N ]
    [ return c ]

Page 38: Basic Blocks (Cont.)

● Idea:

– Once execution enters a basic block, all statements are executed in sequence.

– Single-entry, single-exit region

● Details:

– Starts with a label

– Ends with one or more branches

– Edges may be labeled with predicates
  ● True/false
  ● Exceptions

● Key: Improve efficiency, with reasonable precision.
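A small C sketch of carving blocks out of a straight-line instruction array using the classic "leader" rule (array names are illustrative):

    #define MAXI 4096

    int M;                    /* number of instructions               */
    int is_branch[MAXI];      /* 1 for jumps and conditional branches */
    int target[MAXI];         /* branch-target index, or -1           */
    int leader[MAXI];         /* 1 if a basic block starts here       */

    /* A block starts at the first instruction, at every branch target,
       and right after every branch; it runs up to the next leader. */
    void find_leaders(void) {
        leader[0] = 1;
        for (int i = 0; i < M; i++) {
            if (is_branch[i]) {
                if (target[i] >= 0) leader[target[i]] = 1;
                if (i + 1 < M)      leader[i + 1] = 1;
            }
        }
    }

On the previous slide's example this yields exactly the three blocks shown there.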

Page 39: Have you got a compiler’s eyes yet?

● What properties can you identify about this program?

● What’s the advantage if it was rewritten as follows?

● Def-use becomes explicit.

Before:
    S1: y = 1;
    S2: y = 2;
    S3: x = y;

After renaming:
    S1: y1 = 1;
    S2: y2 = 2;
    S3: x = y2;

Page 40: Static Single Assignment (SSA)

● A form of IR in which each use can be mapped to a single definition.

– Achieved using variable renaming and phi nodes.

● Many compilers use SSA form in their IRs.

Before:
    if (flag) x = -1;
    else x = 1;
    y = x * a;

After SSA conversion:
    if (flag) x1 = -1;
    else x2 = 1;
    x3 = Φ(x1, x2);
    y = x3 * a;

Page 41: SSA Classwork

● Convert the following program to SSA form:

– (Hint: First convert to 3AC)

Original program:
    x = 0;
    for (i = 0; i < N; ++i) {
        x += i;
        i = i + 1;
        x--;
    }
    x = x + i;

SSA form (after conversion to 3AC):
    x1 = 0;
    i1 = 0;
    L1: i13 = Φ(i1, i3);
        if (i13 < N) {
            x13 = Φ(x1, x3);
            x2 = x13 + i13;
            i2 = i13 + 1;
            x3 = x2 – 1;
            i3 = i2 + 1;
            goto L1;
        }
    x4 = Φ(x1, x3);
    x5 = x4 + i13;

Page 42: Effect of SSA on Register Allocation!?

● What is the effect of SSA form on liveness?

● What does SSA do?

– Breaks a single variable into multiple instances

– Instances represent distinct, non-overlapping uses

● Effect:

– Breaks up live ranges; often improves register allocation

(Figure: the single live range of x splits into the shorter live ranges of x1 and x2.)

Page 43: Featuring Next in Code Optimization

● Heard of the 80-20 or 90-10 rule?

– X% of time is spent in executing y% of the code, where X >> y.

● Which kinds of code portions tend to form the region ‘y’ in typical programs?

– Loops

– Methods

● Tomorrow: Loop optimizations

Page 44: Loop Optimizations

CS502: Compiler Design
Manas Thakur, Fall 2020

Page 45: Why optimize loops?

● Loops form a significant portion of the time spent in executing programs.

    for (i = 0; i < N; i++) {
        S1;
        S2;
    }

– If N is just 10000 (not uncommon), we have too many instructions!
  ● How many in the loop above?

– What if S1/S2 is/are function calls?

● Loops involve costly instructions in each iteration:

– Comparisons

– Jumps

Page 46: What is a loop?

● A loop in a CFG is a set of nodes S such that:

– There is a designated header node h in S

– There is a path from each node in S to h

– There is a path from h to each node in S

– h is the only node in S with an incoming edge from outside S

Page 47: Are all these loops?

(Figure: several candidate CFGs.)

Page 48: What about these?

(Figure: more candidate CFGs.)

Page 49: Identifying loops using dominators

● A node d dominates a node n if every path from entry to n goes through d.

● Compute dominators of each node:

Page 50: Flow function for computing dominators

● Assuming D[i] is the set of dominators of node i:

    D[entry] = {entry}
    D[n] = {n} ∪ (∩_{p ∈ pred[n]} D[p])
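A C sketch that iterates these equations to a fixed point; following the initialization rule from the "Writing an IDFA" slide, every D[n] except the entry starts as the full node set, since dominator information shrinks:

    #define MAXN 64                      /* ≤64 nodes: a set fits a word */
    typedef unsigned long long NodeSet;

    int N;
    int npred[MAXN], pred[MAXN][4];
    NodeSet D[MAXN];                     /* D[n] = dominators of node n  */

    void dominators(int entry) {
        NodeSet all = (N == 64) ? ~0ULL : (1ULL << N) - 1;
        for (int n = 0; n < N; n++) D[n] = all;
        D[entry] = 1ULL << entry;        /* D[entry] = {entry}           */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int n = 0; n < N; n++) {
                if (n == entry) continue;
                NodeSet d = all;
                for (int p = 0; p < npred[n]; p++)
                    d &= D[pred[n][p]];  /* ∩ over pred[n]               */
                d |= 1ULL << n;          /* ∪ {n}                        */
                if (d != D[n]) { D[n] = d; changed = 1; }
            }
        }
    }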

Page 51: Identifying loops using dominators (Cont.)

● First, identify a back edge:

– An edge from a node n to another node h, where h dominates n

● Each back edge leads to a loop:

– The set X of nodes such that for each x ∈ X, h dominates x and there is a path from x to n not containing h

– h is the header

● Verify:

Page 52: Loop-Invariant Code Motion (LICM)

● Loop-invariant code:

– d: t = a OP b is loop-invariant if:
  ● a and b are constants; or
  ● all the definitions of a and b that reach d are outside the loop; or
  ● only one definition each of a and b reaches d, and that definition is loop-invariant.

● Example:

    L0: t = 0
    L1: i = i + 1
        t = a * b      (loop-invariant)
        M[i] = t
        if i < N goto L1
    L2: x = t

Page 53: LICM: Get ready for code hoisting

● Can we always hoist loop-invariant code?

● Criteria for hoisting d: t = a OP b:

– d dominates all loop exits at which t is live-out, and

– there is only one definition of t in the loop, and

– t is not live-out of the loop preheader

● How can we hoist code in the two variants below (shown in pink and orange on the slide)?

Original loop:

    L0: t = 0
    L1: i = i + 1
        t = a * b
        M[i] = t
        if i < N goto L1
    L2: x = t

Variant 1 (the loop may execute zero times, so t = a * b does not dominate the exit at which t is live-out):

    L0: t = 0
    L1: if i >= N goto L2
        i = i + 1
        t = a * b
        M[i] = t
        goto L1
    L2: x = t

Variant 2 (M[j] = t uses t before it is redefined, i.e., t is live-out of the loop preheader):

    L0: t = 0
    L1: M[j] = t
        i = i + 1
        t = a * b
        M[i] = t
        if i < N goto L1
    L2: x = t

Page 54: Induction-variable optimization

● Induction variables:

– Variables whose value depends on iteration variable

● Optimization:

– Compute them efficiently, if possible

Before:

    s = 0
    i = 0
    L1: if i >= N goto L2
        j = i * 4
        k = j + a
        x = M[k]
        s = s + x
        i = i + 1
        goto L1
    L2:

After:

    s = 0
    k’ = a
    b = N * 4
    c = a + b
    L1: if k’ >= c goto L2
        x = M[k’]
        s = s + x
        k’ = k’ + 4
        goto L1
    L2:

Page 55: Loop unrolling

● Minimize the number of increments and condition-checks

● Be careful about the increase in code size (I-cache misses!)

Original:

    L1: x = M[i]
        s = s + x
        i = i + 4
        if i < N goto L1
    L2:

Unrolled by a factor of 2 (correct only for an even number of iterations):

    L1: x = M[i]
        s = s + x
        x = M[i+4]
        s = s + x
        i = i + 8
        if i < N goto L1
    L2:

Unrolled, handling any number of iterations (an epilogue finishes the leftovers):

        if i < N-8 goto L1
        goto L2
    L1: x = M[i]
        s = s + x
        x = M[i+4]
        s = s + x
        i = i + 8
        if i < N-8 goto L1
    L2: x = M[i]
        s = s + x
        i = i + 4
        if i < N goto L2
    L3:

Page 56: Loop interchange

● A C/Java programmer starting with MATLAB:

    for i=1:1000,
        for j=1:1000,
            a(i) = a(i) + b(i,j)*c(i)
        end
    end

● But MATLAB stores matrices in column-major order!

● Implication?

– Cache misses (perhaps in each iteration)!

● Solution (interchange the loops!):

    for j=1:1000,
        for i=1:1000,
            a(i) = a(i) + b(i,j)*c(i)
        end
    end
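The same reasoning applies to C, mirrored: C stores arrays in row-major order, so in C the cache-friendly nesting keeps the column index j innermost. A sketch:

    /* C is row-major: b[i][j] and b[i][j+1] are adjacent in memory,
       so keeping j innermost walks each row sequentially. */
    void sum_rows(double a[1000], double b[1000][1000], double c[1000]) {
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                a[i] += b[i][j] * c[i];
    }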

Page 57: Many more loop optimizations

● Loop fusion
● Loop fission
● Loop inversion
● Loop tiling
● Loop unswitching
● . . .

(Next class!)

● Vectorization
● Parallelization

(Some other time!)