F001 CS/EE 5810 CS/EE 6810 Chapter 2-3: Basic Program Transformations
Jan 22, 2016
Optimizing your code
Transforming the code into something different from what the programmer wrote, but that still does the same thing, can have a huge impact on performance
Is this a compiler subject or a computer architecture subject? Yes!
Many architectural details are driven by, or have an effect on, code optimization
Major Types of Optimization
See Chapter 2, Fig 2-19, p. 93
High level (at or near source level): procedure integration
Local (within a basic block): common subexpression elimination, constant propagation, stack height reduction
Global (across a branch): copy propagation, code motion, induction variable elimination
Machine-dependent: strength reduction, pipeline scheduling (more about this later…)
Strength Reduction
Substitute a simpler operation when equivalent
Multiply => shifts and adds is a popular area
Y = X ** 2; replace with Y = X * X;
J = K * 2; replace with J = K + K;
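As a small C sketch of the multiply => shifts-and-adds case (function names are mine, not from the slides): a multiply by a constant such as 10 reduces to two shifts and an add, since 10 = 8 + 2:

```c
#include <stdint.h>

/* Original form: a general multiply. */
uint32_t mul10_slow(uint32_t x) {
    return x * 10;
}

/* Strength-reduced form: x*10 = x*8 + x*2 = (x << 3) + (x << 1). */
uint32_t mul10_fast(uint32_t x) {
    return (x << 3) + (x << 1);
}
```

Modern compilers perform this substitution automatically for constant multipliers whose bit patterns make shift/add sequences cheaper than the multiplier hardware.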
Variable Renaming
Use distinct names for each unrelated use of the same variable to simplify later optimizations
X = Y * Z;
Q = R + X + X;
X = A + B; (second use of X is unrelated; replace with X1 = A + B;)
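A minimal C sketch of the renaming (all names illustrative): the second assignment to x is unrelated to the first, so giving it a fresh name x1 removes the false dependence between the statements:

```c
/* After renaming: the two uses of x are independent, which frees the
 * compiler to reorder or schedule the statements separately. */
int renamed(int y, int z, int r, int a, int b) {
    int x  = y * z;        /* first, unrelated use of x       */
    int q  = r + x + x;
    int x1 = a + b;        /* was "x = a + b" before renaming */
    return q + x1;
}
```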
Common Subexpression Elimination
Avoid recalculating the same expression
In this code, you would hope the compiler would compute the address of a[j][k] only once for both statements…
a[j][k] = b[j][k] + x * b[j][j-1];
sum = length[j] * a[j][k];
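A hedged C sketch of what the compiler aims for here (array sizes and values are made up for illustration): the address of a[j][k] is formed once and reused by both statements:

```c
/* Illustrative only: after common subexpression elimination, the
 * address arithmetic for a[j][k] is done once, held in a pointer,
 * and reused by the store and the following multiply. */
double cse_demo(void) {
    double a[3][3] = {{0}};
    double b[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double length[3] = {10, 20, 30};
    double x = 2.0;
    int j = 1, k = 2;

    double *ajk = &a[j][k];              /* address computed once     */
    *ajk = b[j][k] + x * b[j][j - 1];    /* a[1][2] = 6 + 2*4 = 14    */
    return length[j] * *ajk;             /* reuses the same address   */
}
```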
Loop Invariant Code Motion
Avoid operations in loops that are the same in each iteration
Original:
for (j = 0; j < max; j++) {
    a[j] = b[j] + c * d;
    e = g[k];
}

Revised:
tmp = c * d;
for (j = 0; j < max; j++)
    a[j] = b[j] + tmp;
e = g[k];
Copy Propagation
Propagate the original instead of the copy
In this example, y is still copied into x, but all subsequent uses of x are replaced with y

Original:
x = y;
z = 2 * x;
q = x + 15;

Revised:
x = y;
z = 2 * y;
q = y + 15;
We may find that x is never used again…
Constant Folding
If the value of a variable is really a constant that can be determined at compile time, replace it with the constant
int j = 0;
int k = 1;
m = j + k;
Dead Code Removal
Eliminate instructions whose results are never used
update() {
    int j, k;
    j = k = 1;
    j += 1;
    k += 2;
    printf("J is %d\n", j);
}
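A checkable C sketch of the elimination (returning j stands in for the printf, so the result can be compared): k is written but never read afterwards, so both writes to k can be dropped:

```c
/* Before dead code removal. */
int update_original(void) {
    int j, k;
    j = k = 1;
    j += 1;
    k += 2;     /* dead: k is never read after this point */
    return j;   /* stands in for printf("J is %d\n", j)   */
}

/* After dead code removal: all writes to k eliminated. */
int update_cleaned(void) {
    int j = 1;
    j += 1;
    return j;
}
```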
Branch Delay Slots
Some machines (like DLX) always execute instructions in the Branch Delay Slot(s)
Challenge is for the compiler to find code to put in those slots (See Fig 3.28, P 169)
Three places to find such code:
An independent instruction from before the branch (best choice)
From the branch target (risky; may need to copy the instruction, and it can't cause a problem if executed incorrectly!)
From the fall-through code (risky; same problems as above…)
Compiler can hide ~70% of branch hazards on DLX running Spec92 codes.
Chapter 4: Pipeline Scheduling and ILP
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
LW  Rb, b
LW  Rc, c
ADD Ra, Rb, Rc
SW  a, Ra
LW  Re, e
LW  Rf, f
SUB Rd, Re, Rf
SW  d, Rd

Fast code:
LW  Rb, b
LW  Rc, c
LW  Re, e
ADD Ra, Rb, Rc
LW  Rf, f
SW  a, Ra
SUB Rd, Re, Rf
SW  d, Rd
Instruction Level Parallelism (ILP)
Pipelining supports a limited sense of ILP
E.g. overlapped instructions, hazard issues, forwarding logic, etc.
Remember:
Pipeline CPI = Ideal CPI + Structural Stalls + Data Stalls + Control Stalls
So, let’s try to be more aggressive about reducing the stalls to improve the CPI…
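As a tiny numeric sketch of the formula above (the stall contributions below are invented for illustration, not measurements):

```c
/* Pipeline CPI = ideal CPI + structural + data + control stalls,
 * each stall term expressed as average stall cycles per instruction. */
double pipeline_cpi(double ideal, double structural,
                    double data, double control) {
    return ideal + structural + data + control;
}
```

For example, an ideal CPI of 1.0 with 0.5 structural, 0.25 data, and 0.25 control stall cycles per instruction gives a pipeline CPI of 2.0; every stall term the compiler removes moves the machine back toward its ideal CPI.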
Software Techniques
Loop unrolling: bigger basic blocks; attempts to reduce control stalls
Basic pipeline scheduling: reduces RAW stalls
Lots of other hardware techniques to talk about later…
ILP Within a Basic Block
Basic block definition:
Straight-line code, no branches out
Single entry point at the top
Real code is a bunch of basic blocks connected by branches

Notice:
Branch frequency is approx 15% of the total mix (for integer programs)
This implies that basic block size is between 6 and 7 instructions
Machine instructions don't do much
So there's probably little in the way of ILP available

Easiest target is the loop
Already exploited by vector processors, but using different mechanisms
Loop Level Parallelism
Consider adding two 1000 element arrays
for (I = 1; I <= 1000; I = I + 1)
    x[I] = x[I] + y[I];
Sure it's trivial, but it illustrates the point:
There is no dependence between data values produced in any iteration j and those needed in iteration j+n, for any j and n
Truly independent, hence could be 1000-way parallel
Independence means no stalls due to data hazards
Problem is that we have to use that pesky branch instruction

Vector processor model:
Load vectors X and Y (up to some machine-dependent max)
Then do result-vec = xvec + yvec in a single instruction
Assumptions About Timing
Default DLX pipeline timings for this chapter
Producing instruction    Consuming instruction    Cycles to avoid stall
FP ALU op                FP ALU op                3
FP ALU op                Store double             2
Load double              FP ALU op                1
Load double              Store double             0
Integer load             Integer ALU op           1
Integer ALU op           Integer ALU op           0
Branch delay slot        Anything                 1
Loop Unrolling
Consider adding a scalar s to a vector (assume the lowest array element is in location 0)

for (I = 1; I <= 1000; I++) x[I] = x[I] + s;

Loop: LD   F0, 0(R1)    ; R1 = array ptr
      ADDD F4, F0, F2   ; add scalar in F2
      SD   0(R1), F4    ; store result
      SUBI R1, R1, #8   ; decrement ptr by 8 bytes
      BNEZ R1, Loop     ; branch if R1 != 0

How does it run without scheduling? 9 cycles per iteration:
LD, LD stall, ADDD, 2 RAW stalls, SD, SUBI, BNEZ, branch-delay control stall
Loop Without and With Scheduling

Unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
      stall

Scheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      SUBI R1, R1, #8
      BNEZ R1, Loop
      SD   8(R1), F4

Note that this is non-trivial, and many compilers don't even try:
Move SD into the branch delay slot
But SUBI changes a register that SD needs! Since we moved SD past the SUBI, we need to adjust the offset
Down to 6 cycles/loop, but still has a 3-cycle loop + stall overhead
Loop Unrolling
Basic idea: take n loop bodies and concatenate them into one basic block
Will need to adjust the termination code
Let's say n is 4
Then the R1 adjustment in the example becomes 4x what it was before => 32

Savings: 4 BNEZs + 4 SUBIs => just one of each in the new unrolled loop
Hence 75% savings

Problem: still have 4 load stalls per loop
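The same 4-way unroll can be sketched at the C level (assuming, as with the slides' 1000-element array, that the trip count divides evenly by 4; the function name is mine):

```c
/* One loop test and one index update now cover four elements,
 * matching the one SUBI + one BNEZ of the unrolled DLX loop. */
void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = 0; i < n; i += 4) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```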
Unrolled Loop Example

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4    ; drop SUBI and BNEZ
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8   ; drop SUBI and BNEZ
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12 ; drop SUBI and BNEZ
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop
Unrolling With Scheduling
Don't concatenate the unrolled segments; shuffle them instead:
4 LDs, then 4 ADDDs, then 4 SDs
No more stalls, since the LD -> ADDD dependent path now has 3 instructions in it…
Result is 14 cycles for 4 elements => 3.5 cycles/element
Compare with 9 cycles with no scheduling, or 6 cycles with scheduling but no unrolling
Loop Unrolling With Scheduling
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SD   -16(R1), F12
      SUBI R1, R1, #32
      BNEZ R1, Loop
      SD   8(R1), F16   ; note 8 - 32 = -24
Things to Notice
We had 8 more unused register pairs
We could have gone to an 8-block unroll without register conflict
No problem, since the 1000-element array would still have divided cleanly (1000/8 = 125)
What if it had not? Suppose the division has a remainder R?
Just put R blocks (shuffled, of course) in front of the loop, then start for real
Even if you run out of registers, you can still cycle names and remove stalls.

Most compilers unroll early to expose code for later optimizations
This one had a tricky step => the SD/SUBI swap
Key was the independent nature of each loop body
What if they're not independent?
Data Dependency Analysis
Three types: data, name, and control
i is data dependent on j if:
i uses a result produced by j
Or, i uses a result produced by k, and k depends on j

Dependence indicates a possible RAW hazard
Does it induce a stall? Depends on pipeline structure and forwarding capability

Compiler dataflow analysis creates a graph that makes these dependences explicit as directed paths
Data Dependency
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
Name Dependence
Occurs when a second instruction uses the same register name without a data dependence
E.g. the unrolled loop without changing register names

Let i precede j in program order
i is antidependent on j when j writes a register that i reads
Essentially the same as a WAR hazard
Hence ordering must be preserved to avoid the hazard

i is output dependent on j if they both write the same register
Essentially a WAW hazard, so we have to avoid that too

Otherwise, no real data dependence, just name
So registers can be renamed statically by the compiler or dynamically by the hardware
Control Dependence
Since branches are conditional:
Some instructions will be executed, and others will not
Must maintain order due to branches

Two obvious constraints to maintain control dependences:
Instructions controlled by the branch can't be moved before the branch (or they would become unconditional)
Instructions not controlled by the branch can't be moved after the branch (or they would become conditional)

Simple pipelines preserve this, so it's not a big deal.
Loop-Carried Dependence
Consider the following code:
for (I = 1; I <= 1000; I++) {
    A[I+1] = A[I] + C[I];    /* S1 */
    B[I+1] = B[I] + A[I+1];  /* S2 */
}

S1 uses an S1 value produced in a previous iteration
S2 uses an S2 value produced in a previous iteration
S2 uses an S1 value produced in the same iteration
So S1 has a loop-carried dependence on itself
Similarly for S2's loop-carried dependence
If non-loop-carried dependences were the only ones, we could execute loop bodies in parallel
Another Loop Carried Dependence
S1 uses the previous value of S2
However, the dependence is not circular, since neither statement depends on itself
And there is no S1-depends-on-S2-depends-on-S1 circularity either
So, with no cycle in the dependences, the loop can be parallelized and unrolled (provided statements are kept in order)

for (I = 1; I <= 100; I++) {
    A[I] = A[I] + B[I];      /* S1 */
    B[I+1] = C[I] + D[I];    /* S2 */
}

A[1] = A[1] + B[1];
for (I = 1; I <= 99; I++) {
    B[I+1] = C[I] + D[I];
    A[I+1] = A[I+1] + B[I+1];
}
B[101] = C[100] + D[100];
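A small C sketch (array size shrunk from 100 to 6 for illustration; names are mine) can check that the original loop and the transformed, parallelizable version compute the same A and B:

```c
#define M 6   /* stands in for the slides' 100 iterations */

/* Original loop, with the loop-carried S2 -> S1 dependence. */
void original_loop(double A[M + 2], double B[M + 2],
                   const double C[M + 2], const double D[M + 2]) {
    for (int i = 1; i <= M; i++) {
        A[i]     = A[i] + B[i];     /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* Transformed loop: peel S1's first instance and S2's last, so each
 * remaining iteration is self-contained and parallelizable. */
void transformed_loop(double A[M + 2], double B[M + 2],
                      const double C[M + 2], const double D[M + 2]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= M - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[M + 1] = C[M] + D[M];
}
```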
Our Infrastructure for Lab 1
In the /home/cs/handin/cs5810/bin directory on CADE:
lcc    - DLX C compiler; use with the -S switch to get assembly code in a .s file
dlxasm - Assembler that converts .s files into .dlx object files that can run on our simulator
bin2a  - A binary-to-ASCII converter that lets you look at object files if you like
dlxsim - A simulator for the DLX processor

Type h or ? at the prompt for a brief listing of commands
Only gives executed instruction counts at the moment
You'll extend it later…
Data Infrastructure
In the /home/cs/handin/cs5810/ directory:
New directory for each lab
I.e. /home/cs/handin/cs5810/lab1
Also a src directory with benchmarks (small toy examples) in C