CIS 662 – Computer Architecture – Fall 2004 - Class 16 – 11/09/04
Compiler Techniques for ILP
So far we have explored dynamic hardware techniques for ILP exploitation:
- BTB and branch prediction
- Dynamic scheduling: scoreboard, Tomasulo's algorithm
- Speculation
- Multiple issue
How can compilers help?
Scheduling On A Simple 5 Stage MIPS
Loop: L.D    F0, 0(R1)
      (stall: wait for F0 value to propagate)
      ADD.D  F4, F0, F2
      (stall: wait for FP add to complete)
      (stall: wait for FP add to complete)
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      (stall: wait for R1 value to propagate)
      BNE    R1, R2, Loop
      (stall: one-cycle branch penalty)
10 cycles
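The MIPS loop above adds a scalar (held in F2) to each element of an array of doubles, walking the array from the top down (R1 steps by -8 bytes per iteration). In C it corresponds to something like the sketch below; the function and variable names are assumptions, not from the slide:

```c
#include <assert.h>

/* Source-level view of the MIPS loop: x[i] = x[i] + s, iterating
   from the last element down to the first, one double per step. */
static void add_scalar(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}
```

All of the scheduling techniques that follow transform the machine code for this loop without changing this source-level meaning.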
We Could Rearrange The Instructions
Interleave these instructions with independent ones. The best we can achieve is 6 cycles:
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      (stall)
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)    ; moved into the branch delay slot; offset adjusted because DADDUI now runs first
6 cycles
Loop Unrolling
Get more useful instructions into the loop and reduce loop overhead.
Step 1: Put several iterations together.
Loop: L.D    F0, 0(R1)        ; 1 stall after each L.D
      ADD.D  F4, F0, F2       ; 2 stalls after each ADD.D
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32     ; 1 stall before BNE
      BNE    R1, R2, Loop     ; 1-cycle branch penalty
14 instructions + 14 stalls = 28 cycles = 7 per iteration
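At the source level, unrolling by four amounts to the sketch below. The names are assumptions, and it assumes the element count is a multiple of 4, which is why the unrolled loop needs no fix-up code:

```c
#include <assert.h>

/* Unrolled-by-4 version of x[i] = x[i] + s, walking downward as the
   MIPS code does. Assumes n is a multiple of 4: one loop test and one
   index update now cover four elements, cutting loop overhead. */
static void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = n - 1; i >= 3; i -= 4) {
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}
```

The four independent bodies are what later gives the scheduler (or a VLIW compiler) instructions to interleave.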
Static Branch Prediction
Analyze the code and figure out which outcome of a branch is likely:
- Always predict taken
- Predict backward branches as taken, forward branches as not taken
- Predict based on a profile of previous runs
Static branch prediction can help us schedule delayed branch slots
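The backward-taken/forward-not-taken heuristic exploits the fact that backward branches usually close loops. A minimal sketch, with an illustrative function name and made-up addresses:

```c
#include <assert.h>

/* Predict a branch "taken" iff its target address is at or below the
   branch's own address, i.e. a backward branch (typically the bottom
   of a loop, which is taken on every iteration but the last). */
static int predict_btfnt(unsigned long branch_pc, unsigned long target_pc) {
    return target_pc <= branch_pc;
}
```

A loop-closing branch (target below the branch) is predicted taken; a forward branch that skips over rarely executed code is predicted not taken.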
Static Multiple Issue: VLIW
Hardware checking for dependencies in issue packets may be expensive and complex. The compiler can instead examine instructions and decide which ones can be scheduled in parallel, grouping them into instruction packets – a Very Long Instruction Word (VLIW). Hardware can then be simplified.
The processor has multiple functional units, and each field of the VLIW is assigned to one unit.
For example, a VLIW could contain 5 fields: one must hold an ALU instruction or branch, two must hold FP instructions, and two must hold memory references.
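One way to model such a packet is a record with one slot class per field, plus a check of the slide's constraint. The type and function names below are illustrative, not a real VLIW encoding:

```c
#include <assert.h>

typedef enum { SLOT_EMPTY, SLOT_ALU_OR_BRANCH, SLOT_FP, SLOT_MEM } SlotClass;

/* One VLIW packet: 5 fields, each wired to one functional unit. */
typedef struct {
    SlotClass slot[5];
} VliwPacket;

/* Enforce the slide's shape: slot 0 holds an ALU instruction or branch,
   slots 1-2 hold FP instructions, slots 3-4 hold memory references.
   Empty slots are allowed (they become no-ops). */
static int packet_is_valid(const VliwPacket *p) {
    static const SlotClass required[5] = {
        SLOT_ALU_OR_BRANCH, SLOT_FP, SLOT_FP, SLOT_MEM, SLOT_MEM
    };
    for (int i = 0; i < 5; i++)
        if (p->slot[i] != SLOT_EMPTY && p->slot[i] != required[i])
            return 0;
    return 1;
}
```

Because the check is purely positional, the hardware never has to discover which operations go to which unit – the compiler already decided.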
Example
Assume the VLIW contains 5 fields: one for an ALU instruction or branch, two for FP instructions, and two for memory references. Ignore the branch delay slot.
(Slide table: the schedule, with columns labeled "Memory reference | Memory reference | FP instruction | FP instruction | ALU instruction or branch".)
The loop to schedule:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Overall: 9 cycles for 7 iterations, about 1.29 cycles per iteration. But the VLIW slots were always only about half full.
Detecting and Enhancing Loop Level Parallelism
Determine whether data in later iterations depends on data in earlier iterations – loop-carried dependence
Easier to detect at the source-code level than at the machine-code level:
for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
S1 calculates a value A[i+1] which will be used in the next iteration of S1, and S2 calculates a value B[i+1] which will be used in the next iteration of S2. These are loop-carried dependences and prevent parallelism.
S1 also calculates the value A[i+1] that is used in the current iteration of S2. This is a dependence within the loop, which by itself does not prevent parallelism.
Now consider a loop where S1 calculates a value A[i] which is not used in later iterations, and S2 calculates a value B[i+1] which will be used in the next iteration of S1. This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1. Such a loop can be made parallel if we transform it so that there is no loop-carried dependence.
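The slide's concrete code for this case did not survive the transcript, so the statements below are an assumed example with exactly that dependence structure, together with the transformation that removes the loop-carried dependence:

```c
#include <assert.h>

#define N 100

/* Original: S2's B[i+1] feeds the NEXT iteration's S1, but neither
   statement depends on itself, so the dependence chain can be broken. */
static void original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i++) {
        A[i]   = A[i] + B[i];   /* S1: A[i] is not reused later */
        B[i+1] = C[i] + D[i];   /* S2: feeds S1 of iteration i+1 */
    }
}

/* Transformed: peel S1's first execution and S2's last one; inside the
   new loop, S1 now consumes the B value produced in the SAME iteration,
   so there is no loop-carried dependence and iterations can run in parallel. */
static void transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[N+1] = C[N] + D[N];
}
```

Both versions compute identical A and B arrays; only the placement of the dependence changes.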
Software pipelining enables loop iterations to run at top speed by eliminating the RAW hazards that create latencies within an iteration: each iteration of the pipelined loop mixes instructions drawn from different iterations of the original loop. It requires more complex transformations.
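At the source level the same idea can be sketched for an x[i] = x[i] + s loop: each kernel iteration combines the store of one element, the add for the next, and the load for the one after that, so no value is consumed in the iteration that produces it. This is a sketch, assuming n >= 2; the names are illustrative:

```c
#include <assert.h>

/* Software-pipelined x[i] = x[i] + s. The kernel's store, add, and load
   come from three different original iterations, which is what removes
   the load->add and add->store RAW latencies from the loop body. */
static void add_scalar_pipelined(double *x, int n, double s) {
    double loaded = x[0];        /* prologue: load for iteration 0 */
    double summed = loaded + s;  /* prologue: add  for iteration 0 */
    loaded = x[1];               /*           load for iteration 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = summed;           /* store for iteration i   */
        summed = loaded + s;     /* add   for iteration i+1 */
        loaded = x[i + 2];       /* load  for iteration i+2 */
    }
    x[n - 2] = summed;           /* epilogue: drain the pipeline */
    x[n - 1] = loaded + s;
}
```

The prologue fills the pipeline and the epilogue drains it, mirroring the start-up and wind-down code a compiler emits around the software-pipelined kernel.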
Homework #8 Due Tuesday, November 16 by the end of the class
Submit either in class (paper), by e-mail (PS or PDF only), or bring a paper copy to my office
Do exercises 4.2, 4.6, 4.9 (skip parts d. and e.), 4.11