CIS 662 – Computer Architecture – Fall 2004 - Class 16 – 11/09/04
Compiler Techniques for ILP
So far we have explored dynamic hardware techniques for ILP exploitation:
- BTB and branch prediction
- Dynamic scheduling: scoreboard, Tomasulo's algorithm
- Speculation
- Multiple issue
How can compilers help?
Scheduling On A Simple 5 Stage MIPS
Loop: L.D    F0, 0(R1)
      (stall: wait for F0 value to propagate)
      ADD.D  F4, F0, F2
      (stall: wait for FP add to complete)
      (stall: wait for FP add to complete)
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      (stall: wait for R1 value to propagate)
      BNE    R1, R2, Loop
      (stall: one-cycle branch penalty)
10 cycles
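The MIPS loop above adds a scalar (held in F2) to each element of an array of doubles, walking the array from the top down (R1 steps by -8 bytes per iteration). In C it corresponds to something like the sketch below; the function and variable names are assumptions, not from the slide:

```c
#include <assert.h>

/* Source-level view of the MIPS loop: x[i] = x[i] + s, iterating
   from the last element down to the first, one double per step. */
static void add_scalar(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}
```

All of the scheduling techniques that follow transform the machine code for this loop without changing this source-level meaning.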
We Could Rearrange The Instructions
Interleave these instructions with independent ones. The best we can achieve is 6 cycles:
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      (stall)
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)    ; moved into the branch delay slot; offset adjusted because DADDUI now runs first
6 cycles
Loop Unrolling
Get more useful instructions into the loop and reduce loop overhead.
Step 1: Put several iterations together.
Loop: L.D    F0, 0(R1)        ; 1 stall after each L.D
      ADD.D  F4, F0, F2       ; 2 stalls after each ADD.D
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32     ; 1 stall before BNE
      BNE    R1, R2, Loop     ; 1-cycle branch penalty
14 instructions + 14 stalls = 28 cycles = 7 per iteration
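At the source level, unrolling by four amounts to the sketch below. The names are assumptions, and it assumes the element count is a multiple of 4, which is why the unrolled loop needs no fix-up code:

```c
#include <assert.h>

/* Unrolled-by-4 version of x[i] = x[i] + s, walking downward as the
   MIPS code does. Assumes n is a multiple of 4: one loop test and one
   index update now cover four elements, cutting loop overhead. */
static void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = n - 1; i >= 3; i -= 4) {
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}
```

The four independent bodies are what later gives the scheduler (or a VLIW compiler) instructions to interleave.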
Static Branch Prediction
Analyze the code and figure out which outcome of a branch is likely:
- Always predict taken
- Predict backward branches as taken, forward branches as not taken
- Predict based on a profile of previous runs
Static branch prediction can help us schedule delayed branch slots
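The backward-taken/forward-not-taken heuristic exploits the fact that backward branches usually close loops. A minimal sketch, with an illustrative function name and made-up addresses:

```c
#include <assert.h>

/* Predict a branch "taken" iff its target address is at or below the
   branch's own address, i.e. a backward branch (typically the bottom
   of a loop, which is taken on every iteration but the last). */
static int predict_btfnt(unsigned long branch_pc, unsigned long target_pc) {
    return target_pc <= branch_pc;
}
```

A loop-closing branch (target below the branch) is predicted taken; a forward branch that skips over rarely executed code is predicted not taken.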
Static Multiple Issue: VLIW
Hardware checking for dependencies in issue packets may be expensive and complex. The compiler can instead examine instructions and decide which ones can be scheduled in parallel, grouping them into instruction packets – a Very Long Instruction Word (VLIW). Hardware can then be simplified.
The processor has multiple functional units, and each field of the VLIW is assigned to one unit.
For example, a VLIW could contain 5 fields: one must hold an ALU instruction or branch, two must hold FP instructions, and two must hold memory references.
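One way to model such a packet is a record with one slot class per field, plus a check of the slide's constraint. The type and function names below are illustrative, not a real VLIW encoding:

```c
#include <assert.h>

typedef enum { SLOT_EMPTY, SLOT_ALU_OR_BRANCH, SLOT_FP, SLOT_MEM } SlotClass;

/* One VLIW packet: 5 fields, each wired to one functional unit. */
typedef struct {
    SlotClass slot[5];
} VliwPacket;

/* Enforce the slide's shape: slot 0 holds an ALU instruction or branch,
   slots 1-2 hold FP instructions, slots 3-4 hold memory references.
   Empty slots are allowed (they become no-ops). */
static int packet_is_valid(const VliwPacket *p) {
    static const SlotClass required[5] = {
        SLOT_ALU_OR_BRANCH, SLOT_FP, SLOT_FP, SLOT_MEM, SLOT_MEM
    };
    for (int i = 0; i < 5; i++)
        if (p->slot[i] != SLOT_EMPTY && p->slot[i] != required[i])
            return 0;
    return 1;
}
```

Because the check is purely positional, the hardware never has to discover which operations go to which unit – the compiler already decided.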
Example
Assume the VLIW contains 5 fields: one for an ALU instruction or branch, two for FP instructions, and two for memory references. Ignore the branch delay slot.
(Slide table: the schedule, with columns labeled "Memory reference | Memory reference | FP instruction | FP instruction | ALU instruction or branch".)
The loop to schedule:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Overall: 9 cycles for 7 iterations, about 1.29 cycles per iteration. But the VLIW slots were always only about half full.
Detecting and Enhancing Loop Level Parallelism
Determine whether data in later iterations depends on data in earlier iterations – loop-carried dependence
Easier to detect at the source-code level than at the machine-code level:
for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
S1 calculates a value A[i+1] which will be used in the next iteration of S1, and S2 calculates a value B[i+1] which will be used in the next iteration of S2. These are loop-carried dependences and prevent parallelism.
S1 also calculates the value A[i+1] that is used in the current iteration of S2. This is a dependence within the loop, which by itself does not prevent parallelism.
Now consider a loop where S1 calculates a value A[i] which is not used in later iterations, and S2 calculates a value B[i+1] which will be used in the next iteration of S1. This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1. Such a loop can be made parallel if we transform it so that there is no loop-carried dependence.
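The slide's concrete code for this case did not survive the transcript, so the statements below are an assumed example with exactly that dependence structure, together with the transformation that removes the loop-carried dependence:

```c
#include <assert.h>

#define N 100

/* Original: S2's B[i+1] feeds the NEXT iteration's S1, but neither
   statement depends on itself, so the dependence chain can be broken. */
static void original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i++) {
        A[i]   = A[i] + B[i];   /* S1: A[i] is not reused later */
        B[i+1] = C[i] + D[i];   /* S2: feeds S1 of iteration i+1 */
    }
}

/* Transformed: peel S1's first execution and S2's last one; inside the
   new loop, S1 now consumes the B value produced in the SAME iteration,
   so there is no loop-carried dependence and iterations can run in parallel. */
static void transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[N+1] = C[N] + D[N];
}
```

Both versions compute identical A and B arrays; only the placement of the dependence changes.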
Software pipelining enables loop iterations to run at top speed by eliminating the RAW hazards that create latencies within an iteration: each iteration of the pipelined loop mixes instructions drawn from different iterations of the original loop. It requires more complex transformations.
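At the source level the same idea can be sketched for an x[i] = x[i] + s loop: each kernel iteration combines the store of one element, the add for the next, and the load for the one after that, so no value is consumed in the iteration that produces it. This is a sketch, assuming n >= 2; the names are illustrative:

```c
#include <assert.h>

/* Software-pipelined x[i] = x[i] + s. The kernel's store, add, and load
   come from three different original iterations, which is what removes
   the load->add and add->store RAW latencies from the loop body. */
static void add_scalar_pipelined(double *x, int n, double s) {
    double loaded = x[0];        /* prologue: load for iteration 0 */
    double summed = loaded + s;  /* prologue: add  for iteration 0 */
    loaded = x[1];               /*           load for iteration 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = summed;           /* store for iteration i   */
        summed = loaded + s;     /* add   for iteration i+1 */
        loaded = x[i + 2];       /* load  for iteration i+2 */
    }
    x[n - 2] = summed;           /* epilogue: drain the pipeline */
    x[n - 1] = loaded + s;
}
```

The prologue fills the pipeline and the epilogue drains it, mirroring the start-up and wind-down code a compiler emits around the software-pipelined kernel.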
Homework #8 Due Tuesday, November 16 by the end of the class
Submit either in class (paper), by e-mail (PS or PDF only), or bring a paper copy to my office
Do exercises 4.2, 4.6, 4.9 (skip parts d. and e.), 4.11