Topics covered: Pipelining
CSE243: Introduction to Computer Architecture and Hardware/Software Interface

Dec 14, 2015

Antony Eaton

Transcript
Page 1: Topics covered: Pipelining CSE243: Introduction to Computer Architecture and Hardware/Software Interface.

Topics covered: Pipelining

CSE243: Introduction to Computer Architecture and Hardware/Software Interface

Page 2

Basic concepts

Speed of execution of programs can be improved in two ways:
- Use faster circuit technology to build the processor and the memory.
- Arrange the hardware so that a number of operations can be performed simultaneously. The number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.

Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.

Page 3

Basic concepts (contd..)

• Processor executes a program by fetching and executing instructions one after the other.
• This is known as sequential execution.
• If Fi refers to the fetch step, and Ei refers to the execution step of instruction Ii, then sequential execution looks like:

[Figure: timing diagram of sequential execution — F1 E1, F2 E2, F3 E3 in successive time slots 1, 2, 3]

What if the execution of one instruction is overlapped with the fetching of the next one?

Page 4

Basic concepts (contd..)

[Figure: instruction fetch unit and execution unit connected by interstage buffer B1]

• Computer has two separate hardware units, one for fetching instructions and one for executing them.
• An instruction is fetched by the instruction fetch unit and deposited in an intermediate buffer B1.
• The buffer enables the instruction execution unit to execute the instruction while the fetch unit is fetching the next instruction.
• Results of the execution are deposited in the destination location specified by the instruction.

Page 5

Basic concepts (contd..)

[Figure: two-stage pipeline timing — instructions I1–I3, steps F1 E1, F2 E2, F3 E3 overlapped across clock cycles 1–4]

• The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle.
• First clock cycle: the fetch unit fetches instruction I1 (F1) and stores it in B1.
• Second clock cycle: the fetch unit fetches instruction I2 (F2), and the execution unit executes instruction I1 (E1).
• Third clock cycle: the fetch unit fetches instruction I3 (F3), and the execution unit executes instruction I2 (E2).
• Fourth clock cycle: the execution unit executes instruction I3 (E3).

Page 6

Basic concepts (contd..)

In each clock cycle, the fetch unit fetches the next instruction, while the execution unit executes the current instruction stored in the interstage buffer. The fetch and execute units can thus be kept busy all the time.

If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by sequential operation.

The fetch and execute units constitute a two-stage pipeline:
- Each stage performs one step in processing an instruction.
- The interstage buffer holds the information that needs to be passed from the fetch stage to the execute stage.
- New information is loaded into the buffer every clock cycle.
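The timing behavior of the two-stage pipeline above can be checked with a small sketch. This is an illustrative model (the function name and the assumption of exactly one cycle per step are mine, not from the slides): sequential execution needs two cycles per instruction, while the overlapped version needs only one fill cycle plus one cycle per completion.

```python
def total_cycles(n_instructions, overlapped):
    """Cycles to run n instructions with a 1-cycle fetch and 1-cycle execute.

    Sequential: each instruction occupies 2 cycles (F then E).
    Two-stage pipeline: the fetch of I(i+1) overlaps the execute of I(i),
    so after the first fetch one instruction completes every cycle.
    """
    if overlapped:
        return n_instructions + 1   # 1 fill cycle + n completion cycles
    return 2 * n_instructions

# For 3 instructions: sequential takes 6 cycles, pipelined takes 4
# (matching the F1 E1 / F2 E2 / F3 E3 diagram above).
# For a long run, throughput approaches twice the sequential rate:
# 1000 instructions take 2000 cycles sequentially but 1001 pipelined.
```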

Page 7

Basic concepts (contd..)

• Suppose the processing of an instruction is divided into four steps:
  F Fetch: read the instruction from the memory.
  D Decode: decode the instruction and fetch the source operands.
  E Execute: perform the operation specified by the instruction.
  W Write: store the result in the destination location.
• There is a distinct hardware unit for each of the four steps.
• Information is passed from one unit to the next through an interstage buffer.
• Three interstage buffers (B1, B2, B3) connect the four units.
• As an instruction progresses through the pipeline, the information needed by the downstream units must be passed along.

[Figure: four-stage pipeline — F: Fetch instruction, D: Decode instruction and fetch operands, E: Execute operation, W: Write results, connected by interstage buffers B1, B2, B3]

Page 8

Basic concepts (contd..)

[Figure: four-stage pipeline timing — instructions I1–I4, steps F, D, E, W over clock cycles 1–7]

Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
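The cycle-by-cycle schedule above follows a simple rule: in an ideal pipeline, instruction i performs stage s in cycle i + s. A small sketch (names and structure are mine, for illustration only) generates the same table:

```python
def pipeline_schedule(n, stages=("F", "D", "E", "W")):
    """Map each clock cycle to the steps active in an ideal pipeline.

    Instruction i (1-based) performs stage number s (0-based)
    in clock cycle i + s, so stages of consecutive instructions overlap.
    """
    schedule = {}
    for i in range(1, n + 1):
        for s, name in enumerate(stages):
            schedule.setdefault(i + s, []).append(f"{name}{i}")
    return schedule

sched = pipeline_schedule(4)
# Cycle 4 has all four stages busy: ['W1', 'E2', 'D3', 'F4'];
# cycle 6 is ['W3', 'E4'], and the last write W4 lands in cycle 7.
```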

Page 9

Basic concepts (contd..)

[Figure: four-stage pipeline timing — instructions I1–I4, steps F, D, E, W over clock cycles 1–7]

During clock cycle 4:
• Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit. Instruction I3 was fetched in cycle 3.
• Buffer B2 holds the source and destination operands for instruction I2. It also holds the information needed for the Write step (W2) of instruction I2. This information will be passed to stage W in the following clock cycle.
• Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.

Page 10

Role of cache memory

Each stage in the pipeline is expected to complete its operation in one clock cycle:
- The clock period should be sufficient to complete the longest task.
- Units which complete their tasks early remain idle for the remaining clock period.
- Tasks performed in different stages should require about the same amount of time for pipelining to be effective.

If instructions are fetched from the main memory, the instruction fetch stage could take as much as ten times longer than the other stage operations inside the processor.

However, if instructions are fetched from the cache memory on the processor chip, the time required to fetch an instruction is more or less similar to the time required for the other basic operations.
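The point about the slowest stage setting the clock can be made concrete. The numbers below are purely illustrative assumptions (the slides give only the "ten times" ratio, not actual latencies):

```python
def clock_period(stage_times_ns):
    """The clock period must be long enough for the slowest stage."""
    return max(stage_times_ns.values())

# Fetching from main memory makes F dominate every cycle:
slow = clock_period({"F": 10.0, "D": 1.0, "E": 1.0, "W": 1.0})   # 10.0 ns
# Fetching from an on-chip cache balances the stages:
fast = clock_period({"F": 1.0, "D": 1.0, "E": 1.0, "W": 1.0})    # 1.0 ns
```

With the unbalanced stages, D, E, and W sit idle for 9 ns of every 10 ns cycle, which is why balanced stage times matter for effective pipelining.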

Page 11

Pipeline performance

The potential increase in performance achieved by using pipelining is proportional to the number of pipeline stages. For example, if the number of pipeline stages is 4, then the rate of instruction processing is 4 times that of sequential execution.

Pipelining does not cause a single instruction to be executed faster; it is the throughput that increases.

This rate can be achieved only if the pipelined operation can be sustained without interruption throughout program execution.

If a pipelined operation cannot be sustained without interruption, the pipeline is said to “stall”.

A condition that causes the pipeline to stall is called a “hazard”.
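The claim that a k-stage pipeline approaches a k-fold speedup can be sketched numerically. This is an idealized model (function names and the stall parameter are mine): k cycles fill the pipeline, then one instruction completes per cycle unless hazards add stalls.

```python
def cycles_sequential(n, k):
    # Without pipelining each instruction occupies all k steps by itself.
    return n * k

def cycles_pipelined(n, k, stall_cycles=0):
    # k cycles to fill the pipeline, then one completion per cycle,
    # plus any cycles lost to hazards ("stalls").
    return k + (n - 1) + stall_cycles

# For a long instruction stream the ideal 4-stage speedup approaches 4:
n, k = 1000, 4
speedup = cycles_sequential(n, k) / cycles_pipelined(n, k)   # ~3.99
```

Adding stall cycles to the denominator shows directly how hazards erode the ideal speedup.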

Page 12

Data hazard

• Execution of the instruction occurs in the E stage of the pipeline.
• Execution of most arithmetic and logic operations would take only one clock cycle.
• However, some operations, such as division, take more time to complete.
• For example, the operation specified in instruction I2 takes three cycles to complete, from cycle 4 to cycle 6.

[Figure: four-stage pipeline timing — instructions I1–I4 over clock cycles 1–7, with step E2 extended over cycles 4–6, delaying the later instructions]

Page 13

Data hazard (contd..)

[Figure: four-stage pipeline timing with E2 extended over cycles 4–6; the Write stage idles in cycles 5 and 6]

• In cycles 5 and 6, the Write stage is idle because it has no data to work with.
• Information in buffer B2 must be retained until the execution of instruction I2 is complete.
• Stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten.
• Steps D4 and F5 must be postponed.
• A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.

Page 14

Control or instruction hazard

• The pipeline may also be stalled because an instruction is not available at the expected time.
• For example, a cache miss may occur while fetching an instruction, so the instruction has to be fetched from the main memory.
• Fetching an instruction from the main memory takes much longer than fetching it from the cache.
• Thus, the fetch step of the instruction cannot be completed in one cycle.
• For example, suppose the fetch of instruction I2 results in a cache miss, so F2 takes 4 clock cycles instead of 1.

[Figure: pipeline timing over clock cycles 1–9 with F2 stretched over cycles 2–5, delaying D2, E2, W2 and all later instructions]

Page 15

Control or instruction hazard (contd..)

• The fetch operation for instruction I2 results in a cache miss, and the instruction fetch unit must fetch this instruction from the main memory.
• Suppose fetching instruction I2 from the main memory takes 4 clock cycles; I2 will then be available in buffer B1 at the end of clock cycle 5.
• The pipeline resumes its normal operation at this point.
• The decode unit is idle in cycles 3 through 5, the execute unit is idle in cycles 4 through 6, and the write unit is idle in cycles 5 through 7.
• Such idle periods are called stalls or bubbles.
• Once created in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

[Figure: stage-by-stage view over clock cycles 1–9 — the F row repeats F2 during the miss in cycles 2–5; the D, E, and W rows each show three idle cycles before resuming with D2, E2, W2]

Page 16

Structural hazard

A structural hazard occurs when two instructions require the use of a hardware resource at the same time. The most common case is access to the memory:
- One instruction needs to access the memory as part of the Execute or Write stage while another instruction is being fetched.
- If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed.
- Many processors have separate data and instruction caches to avoid this delay.

In general, structural hazards can be avoided by providing sufficient resources on the processor chip.

Page 17

Structural hazard (contd..)

[Figure: pipeline timing for I1–I5 over clock cycles 1–7, where I2 is Load X(R1),R2 and includes an extra memory-access step M2 in cycle 5]

• The memory address X+[R1] is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6.
• Execution of instruction I2 thus takes two clock cycles, 4 and 5.
• In cycle 6, both instructions I2 and I3 require access to the register file.
• The pipeline is stalled because the register file cannot handle two operations at once.

Page 18

Pipelining and performance

Pipelining does not cause an individual instruction to be executed faster; rather, it increases the throughput. Throughput is defined as the rate at which instruction execution is completed.

When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle. The pipeline stalls, causing a degradation in performance.

Completion of one instruction in each clock cycle is the upper limit for the throughput that can be achieved in a pipelined processor.

Page 19

Data hazards

• A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed.
• Consider two instructions:
  I1: A = 3 + A
  I2: B = 4 × A
• If A = 5, and I1 and I2 are executed sequentially, B = 32.
• In a pipelined processor, the execution of I2 can begin before the execution of I1 completes.
• The value of A used in the execution of I2 would then be the original value of 5, leading to an incorrect result.
• Thus, instructions I1 and I2 depend on each other, because the data used by I2 depend on the results generated by I1.
• Results obtained using sequential execution of instructions should be the same as the results obtained from pipelined execution.
• When two instructions depend on each other, they must be performed in the correct order.
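The I1/I2 example can be checked directly. The snippet below models both orderings; the variable `stale_A` is my own illustrative device for the value a pipeline would read before I1 writes back:

```python
# Correct (sequential) order: I2 reads the value that I1 wrote.
A = 5
A = 3 + A          # I1
B = 4 * A          # I2 reads the updated A
# B is 32, as the slide states.

# If the pipeline let I2 read A before I1 wrote it back,
# I2 would use the stale value:
A = 5
stale_A = A        # value of A captured before I1 completes
A = 3 + A          # I1
B_wrong = 4 * stale_A
# B_wrong is 20 — the incorrect result the data hazard would produce.
```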

Page 20

Data hazards (contd..)

[Figure: pipeline timing for I1: Mul R2,R3,R4 and I2: Add R5,R4,R6 over clock cycles 1–9; D2 stalls until W1 completes in cycle 4]

• The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4.
• Register R4 is used as a source operand in the Add instruction, so the decode unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete.
• The data dependency arises because the destination of one instruction is used as a source in the next instruction.

Page 21

Operand forwarding

• The data hazard occurs because the destination of one instruction is used as the source in the next instruction.
• Hence, instruction I2 has to wait for the data to be written into the register file by the Write stage at the end of step W1.
• However, these data are available at the output of the ALU once the Execute stage completes step E1.
• The delay can be reduced, or even eliminated, if the result of instruction I1 can be forwarded directly for use in step E2.
• This is called “operand forwarding”.

Page 22

Operand forwarding (contd..)

[Figure: datapath with the register file, interstage registers SRC1, SRC2 and RSLT, and the ALU; forwarding paths run from the destination bus to multiplexers at the ALU inputs]

• The datapath is similar to the three-bus organization, with registers SRC1, SRC2 and RSLT added.
• SRC1, SRC2 and RSLT serve as interstage buffers for pipelined operation: SRC1 and SRC2 are part of buffer B2, and RSLT is part of buffer B3.
• The data forwarding mechanism consists of two paths from the destination bus back to the ALU inputs.
• Two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of the SRC1 and SRC2 registers.

Page 23

Operand forwarding (contd..)

[Figure: same forwarding datapath as the previous slide]

I1: Mul R2, R3, R4
I2: Add R5, R4, R6

Clock cycle 3:
- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.
Clock cycle 4:
- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2, so instruction I2 proceeds without interruption.
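The multiplexer selection at the ALU inputs can be sketched as a simple lookup: if a pending (not-yet-written-back) result exists for a source register, take it off the result bus instead of the register file. All names and values below are illustrative assumptions, not part of the slides:

```python
def alu_input(register_file, reg_name, forwarded):
    """Select one ALU operand: a forwarded result wins over the register file.

    `forwarded` maps register names to results sitting in RSLT that have
    not yet been written back to the register file.
    """
    if reg_name in forwarded:
        return forwarded[reg_name]     # take the value from the result bus
    return register_file[reg_name]     # normal path through SRC1/SRC2

regs = {"R2": 6, "R3": 7, "R4": 0, "R5": 10}
rslt = {"R4": 42}                      # Mul result not yet written back
# Add R5, R4, R6 can execute immediately using the forwarded R4:
total = alu_input(regs, "R5", rslt) + alu_input(regs, "R4", rslt)   # 52
```

This mirrors the datapath: the multiplexer control is exactly the "is a pending result destined for this register?" test.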

Page 24

Handling data dependency in software

A data dependency may be detected by the hardware while decoding the instruction:
- The control hardware may delay reading a register by an appropriate number of clock cycles, until its contents become available.
- The pipeline stalls for that many clock cycles.

Detecting data dependencies and handling them can also be accomplished in software:
- The compiler can introduce the necessary delay by inserting an appropriate number of NOP instructions.
- For example, if a two-cycle delay is needed between two instructions, two NOP instructions can be inserted between them.

I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
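A compiler pass of this kind can be sketched in a few lines. This is a hypothetical, simplified pass: it assumes a three-operand "Op src1, src2, dest" format with the destination last (as in the slide's Mul/Add example), and a fixed delay; real compilers try to fill these slots with independent instructions instead.

```python
def dest(instr):
    # Last operand is the destination register (slide's 3-operand form).
    return instr.replace(",", " ").split()[-1]

def sources(instr):
    # The middle operands are the sources.
    return instr.replace(",", " ").split()[1:-1]

def raw_dependency(a, b):
    # True if b reads the register that a writes (read-after-write).
    return dest(a) in sources(b)

def insert_nops(instructions, delay):
    """Insert `delay` NOPs wherever the next instruction has a RAW
    dependency on the current one."""
    out = []
    for i, cur in enumerate(instructions):
        out.append(cur)
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if nxt is not None and raw_dependency(cur, nxt):
            out.extend(["NOP"] * delay)
    return out

prog = ["Mul R2, R3, R4", "Add R5, R4, R6"]
# insert_nops(prog, 2) reproduces the slide's sequence:
# ['Mul R2, R3, R4', 'NOP', 'NOP', 'Add R5, R4, R6']
```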

Page 25

Side effects

Data dependencies are explicit and easy to detect if a register specified as the destination in one instruction is used as a source in a subsequent instruction.

However, some instructions also modify registers that are not specified as the destination. For example, in the autoincrement and autodecrement addressing modes, the source register is modified as well.

When a location other than the one explicitly specified in the instruction as the destination is affected, the instruction is said to have a “side effect”.

Another example of a side effect is the condition code flags, which implicitly record the result of the previous instruction; these results may be used by a subsequent instruction.

Page 26

Side effects (contd..)

I1: Add R3, R4
I2: AddWithCarry R2, R4

Instruction I1 sets the carry flag and instruction I2 uses the carry flag, leading to an implicit dependency between the two instructions.

• Instructions with side effects can lead to multiple data dependencies.
• This results in a significant increase in the complexity of the hardware or software needed to handle the dependencies.
• Side effects should be kept to a minimum in instruction sets designed for execution on pipelined hardware.

Page 27

Instruction hazards

The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If the stream is interrupted, the pipeline stalls.

The stream of instructions may be interrupted because of a cache miss or a branch instruction.

Page 28

Instruction hazards (contd..)

Consider a two-stage pipeline in which the first stage is instruction fetch and the second stage is instruction execute. Instructions I1, I2 and I3 are stored at successive memory locations, and I2 is an unconditional branch instruction with branch target Ik.

Clock cycle 3:
- The fetch unit is fetching instruction I3.
- The execute unit is decoding I2 and computing the branch target address.
Clock cycle 4:
- The processor must discard I3, which has been incorrectly fetched, and fetch Ik.
- The execution unit is idle, and the pipeline stalls for one clock cycle.

Page 29

Instruction hazards (contd..)

[Figure: two-stage pipeline timing over clock cycles 1–6 — I2 is a branch; I3 is fetched in cycle 3 and discarded (X); Ik is fetched in cycle 4 while the execution unit is idle]

• The pipeline stalls for one clock cycle.
• The time lost as a result of a branch instruction is called the branch penalty.
• Here the branch penalty is one clock cycle.

Page 30

Instruction hazards (contd..)

The branch penalty depends on the length of the pipeline and may be higher for a longer pipeline. For a four-stage pipeline:
- The branch target address is computed in stage E2.
- Instructions I3 and I4 have to be discarded.
- The execution unit is idle for 2 clock cycles.
- The branch penalty is 2 clock cycles.

[Figure: four-stage pipeline timing over clock cycles 1–8 — I2 is a branch; I3 and I4 are discarded (X) after E2 computes the target in cycle 4, and Ik is fetched in cycle 5]
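The pattern in this and the next slide can be summarized in one line: one wrong-path instruction enters the pipeline in each cycle before the branch target becomes known. A minimal sketch (the stage numbering convention is my assumption):

```python
def branch_penalty(target_known_in_stage):
    """Cycles lost to a taken branch in a simple in-order pipeline.

    Stages are numbered 1 = Fetch, 2 = Decode, 3 = Execute, 4 = Write.
    One wrong-path instruction is fetched per cycle until the branch
    target address is available, and each must be discarded.
    """
    return target_known_in_stage - 1

# Target computed in Execute (stage 3): I3 and I4 discarded, penalty 2.
# Target computed in Decode (stage 2): only I3 discarded, penalty 1.
```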

Page 31

Instruction hazards (contd..)

• The branch penalty can be reduced by computing the branch target address earlier in the pipeline.
• The instruction fetch unit has special hardware to identify a branch instruction after it is fetched.
• The branch target address can then be computed in the Decode stage (D2), rather than in the Execute stage (E2).
• The branch penalty is then only one clock cycle.

[Figure: four-stage pipeline timing over clock cycles 1–7 — the target is known after D2, so only I3 is discarded (X) and Ik is fetched in cycle 4]

Page 32

Instruction hazards (contd..)

[Figure: instruction fetch unit feeding an instruction queue, followed by a dispatch/decode unit and then the E (execute) and W (write) stages]

• The fetch unit fetches instructions before they are needed and stores them in a queue. The queue can hold several instructions.
• The dispatch unit takes instructions from the front of the queue and dispatches them to the execution unit. The dispatch unit also decodes the instructions.

Page 33

Instruction hazards (contd..)

The fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.

When the pipeline stalls because of a data hazard:
- The dispatch unit cannot issue instructions from the queue.
- The fetch unit continues to fetch instructions and add them to the queue.

When there is a delay in fetching because of a cache miss or a branch:
- The dispatch unit continues to dispatch instructions from the instruction queue.

Page 34

Instruction hazards (contd..)

[Figure: pipeline timing for I1–I6 and Ik over clock cycles 1–10, with the queue length per cycle shown as 1 1 1 1 2 3 2 1 1 1 — E1 takes cycles 4–6, I5 is a branch to Ik, and I6 is discarded (X)]

• The initial length of the queue is 1. A fetch adds 1 to the queue length, and a dispatch reduces it by 1, so the queue length remains the same for the first 4 clock cycles.
• Instruction I1 stalls the pipeline for 2 cycles, but the queue has space, so the fetch unit continues and the queue length rises to 3 in clock cycle 6.
• I5 is a branch instruction with target instruction Ik. Ik is fetched in cycle 7, and I6 is discarded.
• However, this does not stall the pipeline, since I4 is dispatched instead: I2, I3, I4 and Ik are executed in successive clock cycles.
• The fetch unit computes the branch address concurrently with the execution of other instructions. This is called branch folding.
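The way the queue absorbs stalls can be modeled with a short simulation. This is a deliberately simplified sketch (the per-cycle input lists and the starting queue are illustrative assumptions): each cycle, the fetch unit may add one instruction and the dispatch unit may remove one, unless it is stalled.

```python
from collections import deque

def queue_lengths(fetch_ready, dispatch_stalled, initial=("I1",)):
    """Trace the instruction-queue length cycle by cycle.

    fetch_ready[c] is True if an instruction arrives from the cache in
    cycle c; dispatch_stalled[c] is True if the execution pipeline cannot
    accept an instruction in that cycle.
    """
    queue = deque(initial)
    lengths = []
    for ready, stalled in zip(fetch_ready, dispatch_stalled):
        if ready:
            queue.append("I")       # fetch adds one instruction
        if not stalled and queue:
            queue.popleft()         # dispatch removes one instruction
        lengths.append(len(queue))
    return lengths

# A two-cycle dispatch stall while fetching continues: the queue grows
# to 3 and then drains, so later fetch delays need not stall dispatch.
trace = queue_lengths([True] * 4, [False, True, True, False])
```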

Page 35

Instruction hazards (contd..)

Branch folding can occur only if at least one instruction other than the branch instruction is available in the queue. The queue should therefore ideally be full most of the time:
- This requires increasing the rate at which the fetch unit reads instructions from the cache. Most processors allow more than one instruction to be fetched from the cache in one clock cycle.
- The fetch unit must also replenish the queue quickly after a branch has occurred.

The instruction queue also mitigates the impact of cache misses:
- In the event of a cache miss, the dispatch unit continues to send instructions to the execution unit as long as the queue is not empty. In the meantime, the desired cache block is read.
- If the queue does not become empty, the cache miss has no effect on the rate of instruction execution.

Page 36

Conditional branches and branch prediction

The outcome of a conditional branch instruction depends on the result of a preceding instruction, so the decision on whether to branch cannot be made until the execution of that instruction is complete.

Branch instructions represent about 20% of the dynamic instruction count of most programs. (The dynamic instruction count takes into consideration that some instructions are executed repeatedly.)

Branch instructions may therefore incur a branch penalty, reducing the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of the branch penalty on performance.

Page 37

Delayed branch

• The branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded.
• The location(s) following a branch instruction are called branch delay slots.
• There may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch; in this case there are two.
• The instructions in the delay slots are always fetched, and at least partially executed, before the branch decision is made and the branch address is computed.

[Figure: four-stage pipeline timing over clock cycles 1–8 — I2 is a branch whose target is computed in E2; delay-slot instructions I3 and I4 are discarded (X)]

Page 38

Delayed branch (contd..)

Delayed branching can minimize the penalty incurred as a result of conditional branch instructions.

Since the instructions in the delay slots are always fetched and partially executed, it is better to arrange for them to be fully executed whether or not the branch is taken:
- If we can place useful instructions in these slots, they will always be executed whether or not the branch is taken.
- If we cannot place useful instructions in the branch delay slots, we can fill them with NOP instructions.

Page 39

Delayed branch (contd..)

(a) Original program loop:

LOOP  Shift_left  R1
      Decrement   R2
      Branch=0    LOOP
NEXT  Add         R1,R3

(b) Reordered instructions:

LOOP  Decrement   R2
      Branch=0    LOOP
      Shift_left  R1
NEXT  Add         R1,R3

• Register R2 is used as a counter to determine how many times R1 is to be shifted.
• The processor has a two-stage pipeline, and hence one delay slot.
• The instructions can be reordered so that the Shift_left instruction appears in the delay slot.
• The Shift_left instruction is then always executed, whether the branch condition is true or false.
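That the reordering preserves the program's meaning can be checked with a small model. The Python below is an illustrative translation of the loop semantics (it models "loop until R2 reaches 0" with the shift in the delay slot always executing, which is the intended behavior of the reordered code):

```python
def original_loop(r1, r2):
    # LOOP: Shift_left R1 ; Decrement R2 ; branch back while the loop
    # has not finished.
    while True:
        r1 <<= 1           # Shift_left R1
        r2 -= 1            # Decrement R2
        if r2 == 0:
            break
    return r1

def reordered_loop(r1, r2):
    # LOOP: Decrement R2 ; Branch ; Shift_left R1 in the delay slot.
    # The delay-slot shift executes whether or not the branch is taken.
    while True:
        r2 -= 1            # Decrement R2
        taken = r2 != 0    # branch decision
        r1 <<= 1           # delay-slot instruction: always executed
        if not taken:
            break
    return r1

# Both versions shift R1 exactly R2 times: with R1 = 1, R2 = 4
# each produces 16.
```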

Page 40

Delayed branch (contd..)

[Figure: two-stage (F, E) pipeline timing over clock cycles 1–8 for the reordered loop — Decrement, Branch, Shift (delay slot), then either Decrement again (branch taken) or Add (branch not taken)]

The Shift instruction in the delay slot is executed whether the branch is taken or not.

Page 41

Delayed branch (contd..)

Logically, the program is executed as if the branch instruction were placed after the shift instruction: branching takes place one instruction later than where the branch instruction appears in the instruction sequence. Hence, this technique is termed “delayed branch”.

Delayed branching requires reordering as many instructions as there are delay slots. It is usually possible to find one instruction to fill one delay slot, but difficult to find two or more instructions to fill two or more delay slots.

Page 42

Branch prediction

To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken.

The simplest form of branch prediction is to assume that the branch will not take place and to continue fetching instructions in sequential execution order:
- Until the branch condition is evaluated, instruction execution along the predicted path must proceed on a speculative basis.
- “Speculative execution” means that the processor is executing instructions before it is certain that they are in the correct sequence.
- Processor registers and memory locations should not be updated until the sequence is confirmed.
- If the branch prediction turns out to be wrong, the instructions that were executed on a speculative basis and their data must be purged, and the correct sequence of instructions must be fetched and executed.

Page 43

Branch prediction (contd..)

[Figure: pipeline timing over clock cycles 1–6 — I1 is a Compare, I2 is Branch>0; I3 and I4 are fetched speculatively and discarded (X) when the prediction proves wrong, and Ik is fetched from the branch target]

• I1 is a compare instruction and I2 is a branch instruction.
• The branch prediction takes place in cycle 3 (step D2/P2), while I3 is being fetched.
• The fetch unit predicts that the branch will not be taken, and continues to fetch I4 in cycle 4 while I3 is being decoded.
• The results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4.
• If the branch prediction is incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded, and Ik is fetched from the branch target address.

Page 44

Branch prediction (contd..)

If branch outcomes were random, the simple approach of always assuming that the branch will not be taken would be correct 50% of the time.

However, branch outcomes are not random, and it may be possible to determine a priori whether a branch will be taken, depending on the expected program behavior:
- A branch instruction at the end of a loop branches back to the start of the loop on every pass except the last one. Better performance is achieved if this branch is always predicted as taken.
- A branch instruction at the beginning of a loop is not taken most of the time. Better performance is achieved if this branch is always predicted as not taken.

Page 45

Branch prediction (contd..)

The prediction (taken or not taken) may be made in hardware, depending on whether the target address of the branch instruction is lower or higher than the address of the branch instruction itself:
- If the target address is lower (a backward branch), the branch is predicted as taken.
- If the target address is higher (a forward branch), the branch is predicted as not taken.

Branch prediction can also be handled by the compiler:
- The compiler can set a branch prediction bit to 0 or 1 to indicate the desired behavior.
- The instruction fetch unit checks the branch prediction bit to predict whether the branch will be taken.
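The hardware heuristic above reduces to a single comparison. A minimal sketch (addresses chosen for illustration only):

```python
def predict_taken(branch_addr, target_addr):
    """Static heuristic: backward branches (typically loop-closing) are
    predicted taken; forward branches are predicted not taken."""
    return target_addr < branch_addr

# A branch at 0x1040 back to a loop head at 0x1000 is predicted taken;
# a branch forward to 0x1080 is predicted not taken.
```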

Page 46

Branch prediction (contd..)

Branch prediction decision is the same every time an instruction is executed. This is “static branch prediction”.

Branch prediction decision may change depending on the execution history. This is “dynamic branch prediction”.


Branch prediction (contd..)

Branch prediction algorithms should minimize the probability of making a wrong branch prediction decision.

In dynamic branch prediction the processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed.

The simplest form of execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction. The processor assumes that the next time the instruction is executed, the result is likely to be the same. For example, if the branch was taken the last time the instruction was executed, then the branch is likely to be taken this time as well.


Branch prediction (contd..)

The branch prediction algorithm may be described as a machine with two states:
 LT : Branch is likely to be taken
 LNT : Branch is likely not to be taken
Let the initial state of the machine be LNT. When the branch instruction is executed, if the branch is taken, the machine moves to state LT; if the branch is not taken, it remains in state LNT. When the same branch instruction is executed the next time, the branch is predicted as taken if the state of the machine is LT, else it is predicted as not taken.

[State diagram: two states LNT and LT. BT (branch taken) moves LNT to LT and keeps LT in LT; BNT (branch not taken) moves LT to LNT and keeps LNT in LNT.]
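The two-state machine above is a one-bit predictor: it simply remembers the most recent outcome. A minimal simulation (function and variable names are illustrative) shows how it behaves on a loop executed twice:

```python
def run_one_bit_predictor(outcomes, initial_taken=False):
    """Simulate the two-state (LNT/LT) predictor over a sequence of
    actual branch outcomes (True = taken). Returns the number of
    mispredictions."""
    state = initial_taken          # False = LNT, True = LT
    mispredictions = 0
    for taken in outcomes:
        if state != taken:         # prediction is simply the stored state
            mispredictions += 1
        state = taken              # remember only the most recent outcome
    return mispredictions

# A 4-pass loop executed twice: the closing branch is taken, taken,
# taken, then not taken on the final pass, and the loop is re-entered.
outcomes = [True, True, True, False] * 2
# Starting in LNT: mispredicts pass 1, the last pass of each run, and
# the first pass of the second run (the state had flipped to LNT).
print(run_one_bit_predictor(outcomes))  # → 4
```

This illustrates the weakness discussed on the next slide: each loop entry after the first costs two mispredictions (last pass and first pass) instead of one.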


Branch prediction (contd..)

Requires only one bit of history information for each branch instruction.

Works well inside loops:
 Once a loop is entered, the branch instruction that controls the looping will always yield the same result until the last pass.
 In the last pass, the branch prediction will turn out to be incorrect, and the branch history state machine will be changed to the opposite state.
 If the same loop is entered again and has more than one pass, the state machine will mispredict the branch on the first pass as well.

Better performance may be achieved by keeping more execution history.


Branch prediction (contd..)

[State diagram: four states SNT, LNT, LT, ST. BT (branch taken) moves SNT to LNT, LNT to ST, LT to ST, and keeps ST in ST; BNT (branch not taken) moves ST to LT, LT to SNT, LNT to SNT, and keeps SNT in SNT.]

ST : Strongly likely to be taken
LT : Likely to be taken
LNT : Likely not to be taken
SNT : Strongly likely not to be taken

•The initial state of the algorithm is LNT.
•After the branch instruction is executed, if the branch is taken, the state is changed to ST.
•For a branch instruction, the fetch unit predicts that the branch will be taken if the state is ST or LT; otherwise it predicts that the branch will not be taken.
•In state SNT:
 - The prediction is that the branch is not taken.
 - If the branch is actually taken, the state changes to LNT.
 - The next time the branch is encountered, the prediction again is that it is not taken.
 - If the prediction is wrong the second time, the state changes to ST.
 - After that, the branch is predicted as taken.
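The four-state machine can be simulated directly from its transition table. The table below is one reading of the slide's diagram (the state names are from the slide; the encoding and function names are illustrative):

```python
# States of the four-state predictor: strongly/likely not taken, likely/strongly taken.
SNT, LNT, LT, ST = range(4)

# Transition table read off the diagram: (state, actual outcome) -> next state.
NEXT = {
    (SNT, True): LNT, (SNT, False): SNT,
    (LNT, True): ST,  (LNT, False): SNT,
    (LT,  True): ST,  (LT,  False): SNT,
    (ST,  True): ST,  (ST,  False): LT,
}

def run_two_bit_predictor(outcomes, state=LNT):
    """Return the number of mispredictions over a sequence of actual
    outcomes (True = taken); the prediction is 'taken' in LT or ST."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state in (LT, ST)
        if predicted_taken != taken:
            mispredictions += 1
        state = NEXT[(state, taken)]
    return mispredictions

# The same 4-pass loop executed twice as in the one-bit example:
outcomes = [True, True, True, False] * 2
print(run_two_bit_predictor(outcomes))  # → 3
```

Compared with the one-bit predictor's 4 mispredictions on this sequence, the extra history removes the misprediction on re-entry: only the first pass ever and the last pass of each execution of the loop are mispredicted.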


Branch prediction (contd..)

Consider a loop with a branch instruction at the end, and let the initial state of the branch prediction algorithm be LNT.
 In the first pass, the algorithm will predict that the branch is not taken. This prediction is incorrect, and the state changes to ST.
 In the subsequent passes, the algorithm will predict that the branch is taken. This prediction is correct, except for the last pass.
 In the last pass, the branch is not taken, and the state changes from ST to LT.
 When the loop is entered the second time, the algorithm will predict that the branch is taken. This prediction is correct.


Branch prediction (contd..)

The branch prediction algorithm still mispredicts the outcome of the branch instruction in the first pass. The prediction in the first pass depends on the initial state of the branch prediction algorithm; if the initial state can be set correctly, the misprediction in the first pass can be avoided. The information necessary to set the initial state can be provided by the static prediction schemes discussed earlier:
 Comparing the branch target address with the address of the branch instruction.
 Checking the branch prediction bit set by the compiler.
For a branch instruction at the end of the loop, the initial state is set to LT; for a branch instruction at the start of the loop, the initial state is set to LNT.

With this, the only misprediction that occurs is on the final pass through the loop. This misprediction is unavoidable.


Superscalar operation

Pipelining enables multiple instructions to be executed concurrently by dividing the execution of an instruction into several stages:
 Instructions enter the pipeline in strict program order.
 If the pipeline does not stall, one instruction enters the pipeline and one instruction completes execution in each clock cycle.
 The maximum throughput of a pipelined processor is therefore one instruction per clock cycle.

An alternative approach is to equip the processor with multiple processing units, so that it can handle several instructions in parallel in each stage.


Superscalar operation (contd..)

If a processor has multiple processing units then several instructions can start execution in the same clock cycle. Processor is said to use “multiple issue”.

These processors are capable of achieving instruction execution throughput of more than one instruction per cycle.

These processors are known as “superscalar processors”.


Superscalar operation (contd..)

[Block diagram: an instruction fetch unit (F) feeds an instruction queue; a dispatch unit takes instructions from the queue and issues them to two execution units, an integer unit and a floating-point unit; results go to a write stage (W: write results).]

•The processor has two execution units: integer and floating point.
•The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue.
•The dispatch unit retrieves up to two instructions at a time from the front of the queue.
•If there is one integer and one floating-point instruction, and no hazards, both instructions are dispatched in the same clock cycle.


Superscalar operation (contd..)

Various hazards cause an even greater deterioration in performance in a superscalar processor.

The compiler can avoid many hazards through careful ordering of instructions:
 For example, the compiler should try to interleave floating-point and integer instructions.
 The dispatch unit can then dispatch two instructions in most clock cycles and keep both the integer and floating-point units busy most of the time.

If the compiler can order instructions in such a way that the available hardware units can be kept busy most of the time, high performance can be achieved.
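The effect of instruction ordering on a two-way dispatch unit can be sketched with a deliberately simplified model (the function name and the assumption that a second instruction dispatches only if it targets the other unit are illustrative, not from the slides):

```python
def dispatch_cycles(instructions):
    """Count the cycles a two-way, in-order dispatch unit needs, assuming
    it can issue at most one integer ("int") and one floating-point
    ("fp") instruction per cycle, with no other hazards."""
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        first = instructions[i]      # always dispatch the next instruction
        i += 1
        # Dispatch a second instruction only if it needs the other unit.
        if i < len(instructions) and instructions[i] != first:
            i += 1
    return cycles

# Poorly ordered: back-to-back instructions compete for the same unit.
print(dispatch_cycles(["int", "int", "fp", "fp"]))   # → 3
# Interleaved by the compiler: both units busy every cycle.
print(dispatch_cycles(["int", "fp", "int", "fp"]))   # → 2
```

The same four instructions take 3 cycles in one order and 2 in the other, which is exactly the benefit of compiler interleaving described above.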


Superscalar operation (contd..)

[Timing diagram: instructions I1 (Fadd), I2 (Add), I3 (Fsub), and I4 (Sub) across clock cycles 1-7. Each instruction passes through fetch (F1-F4), decode/dispatch (D1-D4), execute, and write (W1-W4); the floating-point instructions I1 and I3 spend three cycles in execution (E1A/E1B/E1C and E3A/E3B/E3C), while the integer instructions I2 and I4 execute in one cycle (E2, E4).]

•Instructions in the floating-point unit take three cycles to execute; the floating-point unit is organized as a three-stage pipeline.
•Instructions in the integer unit take one cycle to execute; the integer unit is organized as a single-stage pipeline.
•Clock cycle 1: Instructions I1 (floating point) and I2 (integer) are fetched.
•Clock cycle 2: Instructions I1 and I2 are decoded and dispatched; I3 is fetched.


Superscalar operation (contd..)

[Timing diagram: same as on the previous slide.]

•Clock cycle 3: I1 and I2 begin execution; I2 completes execution. I3 is dispatched to the floating-point unit and I4 is dispatched to the integer unit.
•Clock cycle 4: I1 continues execution; I3 begins execution; I2 completes the Write stage; I4 completes execution.
•Clock cycle 5: I1 completes execution; I3 continues execution; I4 completes Write.
•The order of completion is I2, I4, I1, I3.


Out-of-order execution

Instructions are dispatched in the same order as they appear in the program; however, they may complete execution out of order.
 Dependencies among instructions need to be handled correctly so that this does not lead to any problems.
 What if an exception occurs during the execution of an instruction after one or more of the succeeding instructions have already been executed to completion? For example, instruction I1 may cause an exception after instruction I2 has completed execution and written its results to the destination location.

If a processor permits succeeding instructions to complete execution and write to the destination locations before knowing whether the prior instructions cause exceptions, it is said to allow "imprecise exceptions".


Out-of-order execution (contd..)

[Timing diagram: the same instruction sequence, but the write step W2 of I2 is delayed until cycle 6, when I1 enters its write stage, and W4 is delayed correspondingly.]

•To guarantee a consistent state when exceptions occur, the results of execution must be written to the destination locations strictly in program order.
•Step W2 must be delayed until cycle 6, when I1 enters the write stage.
•The integer unit must retain the results of I2 until cycle 6, and cannot accept another instruction until then.
•If an exception occurs during an instruction, then all subsequent instructions that may have been partially executed are discarded.
•This is known as a "precise exception."


Execution completion

It is beneficial to allow out-of-order execution, so that the execution unit is freed up to execute other instructions.

However, instructions must be completed in program order to allow precise exceptions.

These requirements conflict. The conflict can be resolved by allowing execution to proceed and writing the results into temporary registers.

The contents of the temporary registers are then transferred to the permanent registers in correct program order.


Execution completion (contd..)

[Timing diagram: the same instruction sequence, but I2 and I4 first write their results to temporary registers in steps TW2 and TW4; the final write steps W2 and W4 into the permanent registers occur later, in program order.]

•Step TW is a write into a temporary register.
•Step W is the final step, in which the contents of the temporary register are transferred into the appropriate permanent register.
•Step W is called the "commitment step" because the effect of the instruction execution cannot be reversed after that point.
•Before the commitment step, if any instruction causes an exception, the results of the succeeding instructions that are still in the temporary registers can be safely discarded.


Execution completion (contd..)

•The temporary register is given the same name and is treated in the same way as the permanent register whose data it is holding.
•For example, if the destination register of I2 is R5, then the temporary register used in step TW2 is treated as R5 in clock cycles 6 and 7.
•This technique is called "register renaming."
•If any succeeding instruction refers to R5 during clock cycles 6 and 7, the contents of the temporary register are forwarded to it.

[Timing diagram: same as on the previous slide, with TW2 and TW4 writing to the renamed temporary registers.]


Execution completion (contd..)

A special control unit called the "commitment unit" is needed to ensure in-order commitment when out-of-order execution is allowed.

The commitment unit has a queue called the "reorder buffer" to determine which instruction should be committed next:
 Instructions are entered in the queue strictly in program order as they are dispatched for execution.
 When an instruction reaches the head of the queue and its execution has been completed:
  Results are transferred from the temporary registers to the permanent registers.
  All resources assigned to the instruction are released.
  The instruction is said to have "retired."

Instructions are retired strictly in program order, though they may complete execution out of order.
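The reorder buffer's retirement rule can be sketched as a small simulation (instruction labels follow the earlier example; the function name and event model are illustrative):

```python
from collections import deque

def retire_in_order(dispatch_order, completion_order):
    """Model a reorder buffer: instructions enter the queue in program
    order at dispatch; an instruction retires only when it has completed
    execution AND has reached the head of the queue. Returns the
    retirement order."""
    rob = deque(dispatch_order)    # reorder buffer, head = oldest instruction
    completed = set()
    retired = []
    for instr in completion_order: # process completion events as they occur
        completed.add(instr)
        # Retire from the head for as long as the head instruction is done.
        while rob and rob[0] in completed:
            retired.append(rob.popleft())
    return retired

# From the earlier example: the completion order is I2, I4, I1, I3,
# but the reorder buffer retires strictly in program order.
print(retire_in_order(["I1", "I2", "I3", "I4"], ["I2", "I4", "I1", "I3"]))
# → ['I1', 'I2', 'I3', 'I4']
```

Note that I2 and I4 sit completed in the buffer until I1 retires, mirroring the delayed W2 and W4 steps in the timing diagrams above.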