CS 211: Computer Architecture, Lecture 6

Exploiting Instruction Level Parallelism with Software Approaches

Instructor: Morris Lancaster


Basic Compiler Techniques for Exposing ILP

• Crucial for processors that use static issue, and important for processors that make dynamic issue decisions but use static scheduling


Basic Pipeline Scheduling and Loop Unrolling

• Exploiting parallelism among instructions
  – Finding sequences of unrelated instructions that can be overlapped in the pipeline
  – Separating a dependent instruction from a source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction, to avoid the stall

• The compiler works with knowledge of the amount of available ILP in the program and of the latencies of the functional units within the pipeline
  – This couples the compiler to the specific chip version, or at least requires setting the appropriate compiler flags


Assumed Latencies

Instruction producing result   Instruction using result   Latency in clock cycles
                                                          (needed to avoid stall)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

The result of the load can be bypassed to the store without stalling.


Basic Pipeline Scheduling and Loop Unrolling (cont)

• Assume a standard 5-stage integer pipeline
  – Branches have a delay of one clock cycle

• Functional units are fully pipelined or replicated (as many times as the pipeline depth)
  – An operation of any type can be issued on every clock cycle and there are no structural hazards


Basic Pipeline Scheduling and Loop Unrolling (cont)

• Sample code

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• MIPS code

    Loop: L.D     F0,0(R1)     ;F0 = array element
          ADD.D   F4,F0,F2     ;add scalar in F2
          S.D     F4,0(R1)     ;store back
          DADDUI  R1,R1,#-8    ;decrement index
          BNE     R1,R2,Loop   ;R2 is precomputed so that 8(R2)
                               ;is the last element to be computed


Basic Pipeline Scheduling and Loop Unrolling (cont)

• MIPS code with the stalls implied by the latencies above (9 cycles per iteration)

    Loop: L.D     F0,0(R1)     ;clock cycle 1
          stall                ;2
          ADD.D   F4,F0,F2     ;3
          stall                ;4
          stall                ;5
          S.D     F4,0(R1)     ;6
          DADDUI  R1,R1,#-8    ;7
          stall                ;8
          BNE     R1,R2,Loop   ;9


Rescheduling Gives

• Sample code

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Scheduled MIPS code (6 cycles per iteration)

    Loop: L.D     F0,0(R1)     ;clock cycle 1
          DADDUI  R1,R1,#-8    ;2
          ADD.D   F4,F0,F2     ;3
          stall                ;4
          BNE     R1,R2,Loop   ;5
          S.D     F4,8(R1)     ;6 (branch delay slot; offset adjusted
                               ;because DADDUI already decremented R1)


Unrolling Gives

• MIPS code, unrolled 4 times with the registers renamed

    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
          L.D     F6,-8(R1)
          ADD.D   F8,F6,F2
          S.D     F8,-8(R1)
          L.D     F10,-16(R1)
          ADD.D   F12,F10,F2
          S.D     F12,-16(R1)
          L.D     F14,-24(R1)
          ADD.D   F16,F14,F2
          S.D     F16,-24(R1)
          DADDUI  R1,R1,#-32
          BNE     R1,R2,Loop


Unrolling and Removing Hazards Gives

• MIPS code, unrolled and scheduled (total of 14 clock cycles, 3.5 per element)

    Loop: L.D     F0,0(R1)
          L.D     F6,-8(R1)
          L.D     F10,-16(R1)
          L.D     F14,-24(R1)
          ADD.D   F4,F0,F2
          ADD.D   F8,F6,F2
          ADD.D   F12,F10,F2
          ADD.D   F16,F14,F2
          S.D     F4,0(R1)
          S.D     F8,-8(R1)
          DADDUI  R1,R1,#-32
          S.D     F12,16(R1)   ;16-32 = -16
          BNE     R1,R2,Loop
          S.D     F16,8(R1)    ;8-32 = -24 (branch delay slot)


Unrolling Summary for Above

• Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset

• Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code

• Use different registers to avoid unnecessary constraints that would be forced by using the same registers

• Eliminate the extra test and branch instruction and adjust the loop termination and iteration code.

• Determine that the loads and stores can be interchanged by determining that the loads and stores from different iterations are independent

• Schedule the code, preserving any dependencies
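At the source level, the transformation these steps perform corresponds to the C sketch below (function names are mine; as in the slides, the trip count of 1000 is assumed to be a multiple of the unroll factor 4, so no clean-up loop is shown):

    /* Original loop from the slides. */
    void add_scalar(double x[], double s) {
        for (int i = 1000; i > 0; i = i - 1)
            x[i] = x[i] + s;
    }

    /* Unrolled by 4: one index update and one test-and-branch per
     * four elements, and four independent bodies the scheduler can
     * interleave. A clean-up loop would handle leftover iterations
     * if the trip count were not a multiple of 4. */
    void add_scalar_unrolled(double x[], double s) {
        for (int i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }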


Unrolling Summary (continued)

• Limits to the impact of unrolling loops
  – As we unroll more, each additional unroll yields less improvement, since the fixed loop overhead is amortized over more iterations
  – Growth in code size
    • Bad for embedded computers
    • Large code size may increase the cache miss rate
  – Shortfall in available registers (register pressure)
    • Scheduling the code to increase ILP causes the number of live values to increase
    • This can create a shortage of registers and undermine the optimization

• Despite these limits, unrolling is useful in a variety of processors today


Unrolling and Pipeline Scheduling with Static Multiple Issue

• Assume a two-issue, statically scheduled superscalar MIPS pipeline; here the loop is unrolled 5 times, which would take 17 cycles on the single-issue pipeline of the previous example

       Integer instruction        FP instruction        Clock
Loop:  L.D    F0,0(R1)                                  1
       L.D    F6,-8(R1)                                 2
       L.D    F10,-16(R1)        ADD.D F4,F0,F2         3
       L.D    F14,-24(R1)        ADD.D F8,F6,F2         4
       L.D    F18,-32(R1)        ADD.D F12,F10,F2       5
       S.D    F4,0(R1)           ADD.D F16,F14,F2       6
       S.D    F8,-8(R1)          ADD.D F20,F18,F2       7
       S.D    F12,-16(R1)                               8
       DADDUI R1,R1,#-40                                9
       S.D    F16,16(R1)                                10
       S.D    F20,8(R1)                                 11
       BNE    R1,R2,Loop                                12


Unrolling and Pipeline Scheduling with Static Multiple Issue

• The unrolled loop now takes 12 cycles for 5 elements, or 2.4 cycles per element, versus 3.5 cycles per element (14 cycles for 4 elements) for the scheduled, unrolled loop on the ordinary pipeline


Static Branch Prediction

• The expectation is that branch behavior is highly predictable at compile time (compile-time information can also be used to help dynamic predictors)

• In practice, misprediction rates vary widely, from 9% to 59% across benchmarks

• Consider this example, in which the load causes a stall before the dependent DSUBU and BEQZ:

        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9


Static Branch Prediction

• Suppose this branch was almost always taken and that the value of R7 was not needed on the fall-through path. The DADDU R7,R8,R9 can be hoisted into the load delay slot:

        LD     R1,0(R2)
        DADDU  R7,R8,R9    ;moved up from L
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:                     ;the DADDU R7,R8,R9 was here

• Original code:

        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9


Static Branch Prediction

• Suppose instead this branch was rarely taken and that the value of R4 was not needed on the taken path. The OR R4,R5,R6 can be hoisted into the load delay slot:

        LD     R1,0(R2)
        OR     R4,R5,R6    ;moved up from below the branch
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        DADDU  R10,R4,R3   ;the OR was here
    L:  DADDU  R7,R8,R9

• Original code:

        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9


Static Branch Prediction

• Prediction schemes
  – Predict branch as taken
    • Average misprediction rate equals the untaken-branch frequency, about 34% for the SPEC programs
    • For some programs the frequency of taken forward branches may be significantly less than 50%
  – Predict on branch direction
    • Backward branches predicted as taken
    • Forward branches predicted as not taken
    • Misprediction rate still around 30% to 40%
  – Profile-based scheme
    • Based on information collected from earlier runs
    • Exploits the fact that an individual branch is often highly biased toward taken or untaken
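As a modern illustration of compiler-visible static prediction (not from the slides): GCC and Clang let the programmer, or a profile-feedback pass, annotate branches with their expected direction through the __builtin_expect intrinsic. A minimal sketch:

    /* likely/unlikely wrappers, in the style common in systems code. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* The compiler lays out the expected path as straight-line
     * fall-through code and moves the rare path out of line,
     * the same effect the previous slides obtain by moving
     * instructions across the branch by hand. */
    long sum_nonnegative(const int *a, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (unlikely(a[i] < 0))   /* assumed to be rare */
                continue;
            sum += a[i];
        }
        return sum;
    }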


Misprediction Rate on SPEC 92 for Profile-Based Prediction

[Figure]


Instructions Between Mispredictions

[Figure]


Static Multiple Issue: The VLIW Approach

• An alternative to the superscalar approach
  – Superscalars decide dynamically how many instructions to issue

• Relies on compiler technology to
  – Minimize potential data hazards and stalls
  – Format the instructions in a potential issue packet so that the hardware does not need to check for dependences
    • The compiler either ensures that dependences within an issue packet cannot be present, or
    • Indicates when a dependence may be present

• Simpler hardware

• Good performance through extensive compiler optimization


VLIW Processors

• The first multiple-issue processors that required the instruction stream to be explicitly organized used wide instructions with multiple operations per instruction
  – Very Long Instruction Word (VLIW)
  – 64, 128, or more bits wide

• Early VLIW processors were rigid in formats

• New, less rigid architectures being pursued for modern desktops


VLIW Approach

• Multiple, independent functional units

• Multiple operations packaged into one very long instruction, or a requirement that the instructions in an issue packet satisfy the same constraints

• For this discussion we assume multiple operations are placed in one VLIW instruction

• No hardware is needed to make instruction issue decisions
  – As the maximum issue rate grows, the hardware for making those decisions becomes significantly more complex


VLIW Approach (continued)

• Example
  – Instructions might contain five operations, including one integer operation (which could be a branch), two floating-point operations, and two memory references
  – A set of fields for each functional unit, on the order of 16 to 24 bits per unit, yields an instruction length of between 112 and 168 bits
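To make the field layout concrete, here is a hypothetical encoding sketched as C bit-fields; the slot names and the uniform 24-bit width are my assumptions, not a real instruction format:

    /* One hypothetical 5-slot VLIW instruction word. */
    typedef struct {
        unsigned mem_op1 : 24;   /* first memory reference          */
        unsigned mem_op2 : 24;   /* second memory reference         */
        unsigned fp_op1  : 24;   /* first floating-point operation  */
        unsigned fp_op2  : 24;   /* second floating-point operation */
        unsigned int_op  : 24;   /* integer ALU operation or branch */
    } vliw_word;                 /* 120 bits of payload before padding */

Note that a slot with no useful work still occupies its full field (encoded as a no-op), which is the wasted-bits problem raised on the later slide about code size.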


VLIW Approach (continued)

• Scheduling
  – Local scheduling, where unrolling generates straight-line code
    • Operates on a single basic block
  – Global scheduling, where scheduling occurs across branches
    • More complex
    • Trace scheduling is an example


VLIW Example: Local Scheduling on a Single Block (1.29 Cycles Per Result, Equivalent of 7 Unrolls)

Memory reference 1     Memory reference 2     FP operation 1       FP operation 2       Integer operation/branch

L.D  F0,0(R1)          L.D  F6,-8(R1)
L.D  F10,-16(R1)       L.D  F14,-24(R1)
L.D  F18,-32(R1)       L.D  F22,-40(R1)       ADD.D F4,F0,F2       ADD.D F8,F6,F2
L.D  F26,-48(R1)                              ADD.D F12,F10,F2     ADD.D F16,F14,F2
                                              ADD.D F20,F18,F2     ADD.D F24,F22,F2
S.D  F4,0(R1)          S.D  F8,-8(R1)         ADD.D F28,F26,F2
S.D  F12,-16(R1)       S.D  F16,-24(R1)                                                 DADDUI R1,R1,#-56
S.D  F20,24(R1)        S.D  F24,16(R1)
S.D  F28,8(R1)                                                                          BNE R1,R2,Loop

7 results in 9 clocks; 23 operations in 9 clocks (9/7 = 1.29 clocks per result)


VLIW Original Model Issues

• Code size increase
  – Extreme loop unrolling
  – Wasted bits in the instruction encoding

• Limitations of lockstep operation
  – Early VLIWs operated in lockstep with no hazard detection
  – A stall in any functional unit in the pipeline caused the entire processor to stall (all functional units were kept synchronized)
  – Predicting which data accesses will encounter a cache stall is difficult, and any miss affects all instructions in the word
  – For large numbers of memory references the lockstep restriction becomes unacceptable

• In more recent processors, functional units operate independently


VLIW Original Model Issues

• Binary code compatibility issue
  – In the VLIW approach, the code makes use of both the instruction set definition and the detailed pipeline structure, including the functional units and their latencies
  – This requires different versions of the code for different implementations
  – A new processor design on the old instruction set will require recompilation of the code
  – Object code translation is one way to mitigate this

• Loops
  – Where loops are unrolled, the individual loop iterations were independent and most likely would have run well on a vector processor; the multiple-issue approach, however, is still preferred


Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism

• Loop-level parallelism is normally analyzed at or near the source code level

• The question is what dependences exist among the operations in a loop across the iterations of that loop
  – Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations

• Most examples considered so far have no loop-carried dependence:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

  – There is a dependence between the two uses of x[i], but it occurs within a single iteration and does not carry across iterations
  – There is a dependence between successive uses of i in different iterations, which is loop-carried, but it involves an induction variable


Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism

• Consider the loop:

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    //S1
        B[i+1] = B[i] + A[i+1];  //S2
    }

  – Assume A, B, and C are distinct, non-overlapping arrays
    • Note that establishing this can be tricky and requires sophisticated analysis of the program

• Dependences
  – S1 uses a value computed by S1 in a previous iteration (A[i]), and S2 uses a value computed by S2 in a previous iteration (B[i])
    • Loop-carried dependences
  – S2 uses a value computed by S1 in the same iteration (A[i+1])
    • Not loop-carried
    • A dependence that is not loop-carried does not prevent parallelism: multiple iterations can execute in parallel as long as the dependent statements within an iteration are kept in order


Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism

• Another loop:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      //S1
        B[i+1] = C[i] + D[i];    //S2
    }

• Dependences

– S1 uses a value computed by S2 in the previous iteration (B[i+1])

– The dependence is not circular: neither statement depends on itself

– Although S1 depends on S2, S2 does not depend on S1

• A loop is parallel if it can be written without a cycle in its dependences


Detecting and Enhancing Loop Level Parallelism – Transforming the Loop

• The transformed loop:

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

• Transformation
  – There was no dependence from S1 to S2, so interchanging the two statements does not affect the outcome
  – On the first iteration of the new loop body, statement S1 depends on the value B[1] computed prior to entering the loop

• Original loop:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      //S1
        B[i+1] = C[i] + D[i];    //S2
    }


Finding Dependencies – Array Oriented Dependence Difficulties

• Situations in which array-oriented dependence analysis cannot give all the information needed
  – When objects are referenced via pointers rather than array indices
  – When array indexing is indirect through another array (many sparse-array representations use this)
  – When a dependence may exist for some values of the inputs but does not exist in actual runs, because the inputs never take on those values
  – When an optimization depends on knowing more than the mere possibility of a dependence, namely on which write of a variable a given read of that variable depends


Finding Dependencies – Points To Analysis

• Deals with analyzing programs with pointers

• Three major sources of analysis information
  – Type information, which restricts what a pointer can point to
    • Problematic in loosely typed languages
  – Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to
    • If p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object
  – Information derived from pointer assignments
    • If p may be assigned the value of q, then p may point to anything q points to
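A small C illustration of why this matters (my example, not from the slides): unless the compiler can prove dst and src never point into the same storage, every store to dst[i] may change a value later loaded from src, so loads and stores cannot be reordered across iterations. C99's restrict qualifier lets the programmer assert the no-alias property:

    /* If dst and src may alias, iterations must be kept in order. */
    void scale_copy(double *dst, const double *src, int n, double s) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * s;
    }

    /* 'restrict' asserts no aliasing, restoring the compiler's
     * freedom to unroll and schedule across iterations. */
    void scale_copy_restrict(double *restrict dst,
                             const double *restrict src,
                             int n, double s) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * s;
    }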


Eliminating Dependent Computations

• Example 1 – Collapse dependent immediates. Change the instruction stream

    DADDUI R1,R2,#4
    DADDUI R1,R1,#4

  to

    DADDUI R1,R2,#8

• Example 2 – Reorder to shorten the dependence chain. Change

    ADD R1,R2,R3
    ADD R4,R1,R6
    ADD R8,R4,R7

  to

    ADD R1,R2,R3
    ADD R4,R6,R7
    ADD R8,R1,R4


Eliminating Dependent Computations

• Example 3 – Change how operations are performed when unrolling sum = sum + x

  – From

    sum = sum + x1 + x2 + x3 + x4 + x5;

  – To

    sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);

  In the first form the sum is computed strictly left to right, a chain of five dependent additions. In the reassociated form the critical path is only three additions, since (x2 + x3) and (x4 + x5) can be computed in parallel.
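The same reassociation written out in C with explicit temporaries (function and variable names are mine):

    /* Left to right: a chain of five dependent additions. */
    double sum_serial(double sum, double x1, double x2,
                      double x3, double x4, double x5) {
        return sum + x1 + x2 + x3 + x4 + x5;
    }

    /* Reassociated: b and c are independent of sum, so the
     * critical path is only three additions deep. */
    double sum_tree(double sum, double x1, double x2,
                    double x3, double x4, double x5) {
        double a = sum + x1;
        double b = x2 + x3;
        double c = x4 + x5;
        return (a + b) + c;
    }

For floating point this changes rounding behavior, so compilers normally require explicit permission (for example GCC's -ffast-math) to reassociate this way.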


Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations

• Software pipelining (symbolic loop unrolling) reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop

• The idea is to separate the dependences within the original loop body
  – Register management can be tricky, but the idea is to turn the code into a single loop body
  – In practice, both unrolling and software pipelining are necessary due to register limitations


Software Pipelining

• Original

    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
          DADDUI  R1,R1,#-8
          BNE     R1,R2,Loop

• Software-pipelined version: unroll the loop and select one instruction from each of three iterations

    i:    L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
    i+1:  L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
    i+2:  L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)

    Loop: S.D     F4,16(R1)    ;store for iteration i
          ADD.D   F4,F0,F2     ;add for iteration i+1
          L.D     F0,0(R1)     ;load for iteration i+2
          DADDUI  R1,R1,#-8
          BNE     R1,R2,Loop

    it0    it1    it2
    L.D
    ADD.D  L.D
    S.D    ADD.D  L.D
           S.D    ADD.D
                  S.D

  Above we show only 3 iterations; we could do more.
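The same idea sketched in C (function and variable names are mine; x[1..1000] and the scalar s are as in the running example):

    void add_scalar_swp(double x[], double s) {
        /* Start-up (prologue): begin the first two iterations. */
        double loaded = x[1000];        /* load for iteration 1000 */
        double summed = loaded + s;     /* add  for iteration 1000 */
        loaded = x[999];                /* load for iteration 999  */

        /* Kernel: one store, one add, one load per pass, each
         * taken from a different original iteration. */
        for (int i = 1000; i > 2; i--) {
            x[i]   = summed;            /* store for iteration i   */
            summed = loaded + s;        /* add   for iteration i-1 */
            loaded = x[i - 2];          /* load  for iteration i-2 */
        }

        /* Wind-down (epilogue): drain the last two iterations. */
        x[2] = summed;
        x[1] = loaded + s;
    }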


Software Pipelining

• The complete software-pipelined loop, with start-up and wind-down code:

    ;start-up code
          L.D     F0,16(R1)    ;load for first iteration
          ADD.D   F4,F0,F2     ;add for first iteration
          L.D     F0,8(R1)     ;load for second iteration
    ;kernel
    Loop: S.D     F4,16(R1)
          ADD.D   F4,F0,F2
          L.D     F0,0(R1)
          DADDUI  R1,R1,#-8
          BNE     R1,R2,Loop
    ;wind-down code
          S.D     F4,16(R1)    ;store next-to-last element
          ADD.D   F4,F0,F2     ;add for last element
          S.D     F4,8(R1)     ;store last element

[Figure: memory cross-section showing array elements 6.0, 7.0, 8.0, 9.0 laid out between R2 and R1, with F2 = 2.0]


Cross Section View

[Figure]