Page 1: 2.2

2.2 Pipelining

Pipelining is a technique of decomposing a sequential process into

sub-processes, with each sub-process being executed in a special

dedicated segment that operates concurrently with all other

segments. Any operation that can be decomposed into a sequence

of sub-operations of about the same complexity can be

implemented by a pipeline processor.

2.2.1 Linear Pipeline Processors

A linear pipeline processor is a cascade of processing stages which

are linearly connected to perform a fixed function over a stream of

data flowing from one end to the other. In modern computers,

linear pipelines are applied for instruction execution, arithmetic

computation and memory access operations.

2.2.1.1 Asynchronous and Synchronous Models

A linear pipeline is constructed with k processing stages

(segments). External inputs (operands) are fed into the pipeline at

the first stage S1. The processed results are passed from stage Si to

stage Si+1 for all i = 1, 2, …, k−1. The final result emerges from the pipeline at the last stage. Depending on the control of data flow along the pipeline, linear pipelines are modeled in two categories: asynchronous and synchronous.

Asynchronous Model

As shown in Fig.2-2a, data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol.

When stage Si is ready to transmit, it sends a ready signal to stage

Si+1. After stage Si+1 receives the incoming data, it returns an

acknowledge signal to Si.

Synchronous Model

Synchronous pipelines are illustrated in Fig.2-2b. The operands

pass through all segments in a fixed sequence. Each segment

consists of a combinational circuit Si that performs a sub-operation over the data stream flowing through the pipe. Isolating registers R (latches) are used to interface between stages and hold the


intermediate results between the stages. Upon the arrival of a clock

pulse, all registers transfer data to the next stage simultaneously.

Fig.2-2 (a) An asynchronous pipeline model: stages S1 … Sk linked by Ready/Ack handshake signals. (b) A synchronous pipeline model: stages S1 … Sk separated by registers R and driven by a common clock.

Synchronous pipeline will be simply described by the following

example.

Example 2.1

In certain scientific computations it is necessary to perform the

arithmetic operation (Ai + Bi)*(Ci + Di) with a stream of numbers.

Specify a pipeline configuration to carry out this task. List the

contents of all registers in the pipeline for i = 1 through 6.

Solution

Each sub-operation is to be implemented in a segment within a pipeline. Each

segment will have one or more registers and a combinational

circuit as shown. The sub-operations performed in each segment of

the pipeline are as follows:

R1 ← Ai, R2 ← Bi, R3 ← Ci, R4 ← Di      Input Ai, Bi, Ci and Di
R5 ← R1 + R2, R6 ← R3 + R4               Perform the additions Ai + Bi and Ci + Di


R7 ← R5 * R6                             Multiply (Ai + Bi)*(Ci + Di)

Fig.2-3 Pipeline for (Ai + Bi)*(Ci + Di): Segment 1 holds the input registers R1–R4 (Ai, Bi, Ci, Di); Segment 2 contains two adders feeding R5 and R6; Segment 3 contains the multiplier feeding R7.

The seven registers are loaded with new data every clock pulse.

The effect of each clock will be as shown:

Clock   Segment 1          Segment 2         Segment 3
pulse   R1  R2  R3  R4     R5      R6        R7
1       A1  B1  C1  D1     -       -         -
2       A2  B2  C2  D2     A1+B1   C1+D1     -
3       A3  B3  C3  D3     A2+B2   C2+D2     (A1+B1)*(C1+D1)
4       A4  B4  C4  D4     A3+B3   C3+D3     (A2+B2)*(C2+D2)
5       A5  B5  C5  D5     A4+B4   C4+D4     (A3+B3)*(C3+D3)
6       A6  B6  C6  D6     A5+B5   C5+D5     (A4+B4)*(C4+D4)
7       -   -   -   -      A6+B6   C6+D6     (A5+B5)*(C5+D5)
8       -   -   -   -      -       -         (A6+B6)*(C6+D6)
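The register transfers above can be checked with a short simulation. This is an illustrative sketch (the register model and function name are ours, not from the text): on each clock pulse every register loads from the previous segment simultaneously.

```python
# Simulate the three-segment pipeline of Example 2.1 computing (Ai+Bi)*(Ci+Di).
def run_pipeline(a, b, c, d):
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = r6 = r7 = None
    results = []
    for pulse in range(n + 2):                         # k - 1 = 2 extra pulses to drain
        # Clock edge: compute all next-stage values before assigning, so the
        # registers appear to transfer "simultaneously".
        nr5 = r1 + r2 if r1 is not None else None      # Segment 2: first adder
        nr6 = r3 + r4 if r3 is not None else None      # Segment 2: second adder
        nr7 = r5 * r6 if r5 is not None else None      # Segment 3: multiplier
        if pulse < n:                                  # Segment 1: load the next inputs
            r1, r2, r3, r4 = a[pulse], b[pulse], c[pulse], d[pulse]
        else:
            r1 = r2 = r3 = r4 = None
        r5, r6, r7 = nr5, nr6, nr7
        if r7 is not None:
            results.append(r7)
    return results

print(run_pipeline([1, 2], [1, 2], [1, 2], [1, 2]))    # [(1+1)*(1+1), (2+2)*(2+2)] = [4, 16]
```

After the two-pulse fill delay, one result emerges per clock pulse, matching the table.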

2.2.2 Space-time diagram

This is a diagram that shows segment utilization as a function of

time. The space-time diagram of a four segment pipeline is shown

in Fig.2-4. The horizontal axis displays the time in clock cycles and

the vertical axis gives the segment number. This diagram shows

six tasks T1 through T6 executed in four segments. The first task

T1 is completed after the fourth (k-th) clock cycle. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain each new output.

Task : A task is defined as the total operation performed going

through all the segments in the pipe.

Clock cycles:  1   2   3   4   5   6   7   8   9
Segment 1:     T1  T2  T3  T4  T5  T6  -   -   -
Segment 2:     -   T1  T2  T3  T4  T5  T6  -   -
Segment 3:     -   -   T1  T2  T3  T4  T5  T6  -
Segment 4:     -   -   -   T1  T2  T3  T4  T5  T6

Fig.2-4

Example 2.2

Draw the space-time diagram for a six-segment pipeline showing

the time it takes to process eight tasks.

Solution

Clock cycles:  1   2   3   4   5   6   7   8   9   10  11  12  13
Segment 1:     T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -   -
Segment 2:     -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -
Segment 3:     -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -
Segment 4:     -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -
Segment 5:     -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -
Segment 6:     -   -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8

It takes 13 clock cycles to process 8 tasks.
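The diagram follows a simple rule: task Tj occupies segment i during clock cycle i + j − 1. A small helper (our sketch, not part of the text) prints the space-time diagram for any k and n and returns the total cycle count k + n − 1:

```python
# Print the space-time diagram for a k-segment pipeline processing n tasks.
def space_time(k, n):
    total = k + n - 1                     # clock cycles to finish all n tasks
    for seg in range(1, k + 1):
        # Task Tj occupies segment `seg` during clock cycle seg + j - 1.
        row = ["T%d" % (c - seg + 1) if seg <= c <= seg + n - 1 else "-"
               for c in range(1, total + 1)]
        print("Segment %d: %s" % (seg, " ".join(row)))
    return total

print(space_time(6, 8))   # prints the diagram of Example 2.2, then 13
```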

2.2.3 Pipeline Speedup

Consider the case where a k-segment pipeline with a clock cycle

time tp is used to execute n tasks. The first task T1 will take a

time equal to ktp. All remaining n−1 tasks will take a time equal to (n−1)tp (see the space-time diagram in section 2.2.2). Therefore, to complete n tasks through a

k-segment pipeline requires k + (n - 1) clock cycles.


Next, consider a non-pipeline unit that performs the same

operation and takes a time equal to tn to complete each task. The

total time required for n tasks is ntn. The speedup of a pipeline

processing over an equivalent non-pipeline processing is defined

by the ratio:

S = (Non-pipeline time) / (Pipeline time) = n·tn / [(k + n − 1)·tp]

As the number of tasks increases, n becomes much larger than k − 1, and (k + n − 1) approaches the value of n. Under this condition, the speedup becomes

S = tn / tp

If we assume that the time it takes to process a task is the same in

the pipeline and non-pipeline circuits, we will have tn = ktp.

With this assumption, the speedup reduces to

S = tn / tp = k·tp / tp = k

This shows that the theoretical maximum speedup that a pipeline

can provide is k, where k is the number of segments in the

pipeline. Fig 2-5 plots the speedup as a function of n, the number

of tasks performed by the pipeline.

Fig.2-5 Speedup factor versus the number of tasks n (n = 1 to 1024), plotted for k = 6 and k = 10; each curve approaches its k as n grows.
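The behavior plotted in Fig.2-5 can be checked numerically. This is a quick sketch (function name is ours), assuming tn = k·tp as above, so that S = n·k / (k + n − 1):

```python
# Speedup of a k-segment pipeline over a non-pipeline unit, for n tasks,
# assuming tn = k * tp (equal total work in both units).
def speedup(k, n):
    return n * k / (k + n - 1)

for n in (1, 8, 64, 1024):
    print(n, round(speedup(6, n), 2), round(speedup(10, n), 2))
# As n grows, the k = 6 column approaches 6 and the k = 10 column approaches 10.
```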


2.2.4 Pipeline Efficiency

The efficiency Ek of a linear k-segment pipeline is defined as

Ek = Speedup / Number of segments = S / k

If we assume that the time it takes to process a task is the same in

the pipeline and non-pipeline circuits (tn = k·tp), the speedup will be

S = n·k·tp / [(k + n − 1)·tp] = n·k / (k + n − 1)

Then the efficiency will be defined as

)1( −+

==

nk

n

kk

S

kE

Example 2.3

A non-pipeline system takes 50 ns to process a task. The same

task can be processed in a six-segment pipeline with a clock

cycle of 10 ns. Determine the speedup and the efficiency of the

pipeline for 100 tasks. What is the maximum speedup and

efficiency that can be achieved?

Solution

Given:
- For the non-pipeline system: tn = 50 ns
- For the pipeline system: k = 6, tp = 10 ns
- Number of tasks: n = 100

Required: the speedup and efficiency for 100 tasks, and the maximum speedup and efficiency.

Speedup:
S = n·tn / [(k + n − 1)·tp] = (100 × 50) / [(6 + 100 − 1) × 10] = 4.76

Efficiency:
E = S / k = 4.76 / 6 = 79.33%

The maximum speedup is achieved when the number of tasks increases to a value much larger than k; then we can neglect k − 1 and the ratio reduces to
Smax = n·tn / (n·tp) = tn / tp = 50 / 10 = 5

Maximum efficiency = Maximum speedup / Number of segments:
Emax = Smax / k = 5 / 6 = 83.33%
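These figures can be verified in a few lines (our sketch; note that 79.33% comes from dividing the rounded speedup 4.76 by 6, while the unrounded value gives 79.37%):

```python
# Worked check of Example 2.3, using the values given in the text.
tn, k, tp, n = 50, 6, 10, 100        # non-pipeline ns, segments, clock ns, tasks

S = (n * tn) / ((k + n - 1) * tp)    # speedup over the non-pipeline unit
E = S / k                            # efficiency
S_max = tn / tp                      # limiting speedup for large n
E_max = S_max / k                    # maximum efficiency

print(round(S, 2), round(E * 100, 2))    # 4.76 and 79.37 (%)
print(S_max, round(E_max * 100, 2))      # 5.0 and 83.33 (%)
```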

2.2.5 Arithmetic pipeline

Pipelining techniques can be applied to speed up numerical

arithmetic computations. Pipeline arithmetic units are usually

found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and

similar computations encountered in scientific problems.

Fixed-Point Operations

Fixed-point numbers are represented internally in machines in

sign-magnitude, one's complement, or two's complement

notation. Add, subtract, multiply, and divide are the four primitive

arithmetic operations.

Floating-Point Numbers

A floating-point number X is represented by a pair (m, e), where

m is the mantissa (or fraction) and e is the exponent with an

implied base (or radix). The algebraic value is represented as

X = m × r^e

The sign of X can be embedded in the mantissa.


Floating-Point Operations

The four primitive arithmetic operations are defined below for a

pair of floating-point numbers represented by X = (mx, ex) and Y

= (my, ey), assuming ex ≤ ey and base r = 2.

X + Y = (mx × 2^(ex−ey) + my) × 2^ey
X − Y = (mx × 2^(ex−ey) − my) × 2^ey
X × Y = (mx × my) × 2^(ex+ey)
X ÷ Y = (mx ÷ my) × 2^(ex−ey)

These operations can be divided into two halves:

One half is for exponent operations such as comparing their

relative magnitudes or adding/subtracting them; the other half is

for mantissa operations including four types of fixed-point

operations.

The floating-point addition and subtraction can be

performed in four segments as shown in Fig.2-6. The registers

labeled R are latches used to interface between segments and to

store intermediate results. The sub-operations that are performed

in the four segments are:

1. Compare the exponents: The exponents are compared by subtracting them to determine

their difference. The larger exponent is chosen as the

exponent of the result.

2. Align the mantissa: The mantissa associated with the smaller exponent is shifted

to the right a number of times equal to the exponent

difference.

3. Add or subtract the mantissas:

The two mantissas are added or subtracted.

4. Normalize the result: The result is normalized so that it has a fraction (mantissa) with a nonzero first digit.


Fig.2-6 Pipeline for floating-point addition/subtraction. Inputs: exponents ex, ey and mantissas mx, my. Segment 1: compare exponents by subtraction. Segment 2: choose the exponent and align the mantissas. Segment 3: add or subtract the mantissas. Segment 4: normalize the result and adjust the exponent. Registers R interface the segments.

The pipeline in Fig.2-6 refers to binary numbers, but in the

following numerical example we use decimal numbers for

simplicity (r = 10). Consider the two normalized floating-point

numbers:

X = 0.99 × 10^4
Y = 0.5 × 10^3

Sub-operations will be carried out through the four segments as

follows:

Segment 1: The two exponents are subtracted to obtain 4 – 3 = 1. The larger

exponent 4 is chosen to be the exponent of the result.

Segment 2: The mantissa of Y is shifted to the right once to obtain


X = 0.99 × 10^4
Y = 0.05 × 10^4

This aligns the two mantissas under the same exponent.

Segment 3:

The two mantissas are added to produce the sum Z = 1.04 × 10^4

Segment 4: The sum is adjusted by normalizing the result. This is done by shifting the mantissa once to the right and incrementing the exponent to obtain Z = 0.104 × 10^5
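The four segments can be sketched in code. This is an illustrative model (the function name and register handling are ours), working on decimal (mantissa, exponent) pairs with r = 10 as in the example, and assuming normalized fractional mantissas:

```python
# Four-segment floating-point addition, following Fig.2-6 with r = 10.
def fp_add(x, y, r=10):
    (mx, ex), (my, ey) = x, y
    # Segment 1: compare the exponents by subtraction; choose the larger.
    diff = ex - ey
    e = max(ex, ey)
    # Segment 2: align the mantissa of the smaller-exponent operand by
    # shifting it right |diff| digit positions.
    if diff > 0:
        my = my / (r ** diff)
    else:
        mx = mx / (r ** -diff)
    # Segment 3: add the mantissas.
    m = mx + my
    # Segment 4: normalize so the mantissa is a fraction with a nonzero
    # first digit, adjusting the exponent accordingly.
    while abs(m) >= 1:
        m, e = m / r, e + 1
    while 0 < abs(m) < 1 / r:
        m, e = m * r, e - 1
    return m, e

print(fp_add((0.99, 4), (0.5, 3)))   # approximately (0.104, 5), i.e. Z = 0.104 * 10^5
```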

Example 2.4

The time delays in the four segments in the pipeline of Fig.2-6

are as follows:

t1 = 50 ns, t2 = 30 ns, t3 = 95 ns, t4 = 45 ns. The interface register delay time is tr = 5 ns.

a. How long would it take to add 100 pairs of numbers in the

pipeline?

b. How can we reduce the total time to about one-half of the

time calculated in part (a)?

Solution

(a) The time taken to add n pairs of numbers in a k-segment pipeline = (k + n − 1) tp

Given: k = 4 and n = 100

Calculating the clock cycle tp:

The total time delay in segment i is

TDi = the time delay in the segment circuit + the time delay of the interface register

⇒ The maximum time delay over all segments:

TDmax = the maximum time delay in the segment circuits + the time delay of the interface register = 95 ns + 5 ns = 100 ns

As we have a different time delay in each segment, the clock

cycle must be larger than or equal to the maximum delay

⇒ The clock cycle tp = 100 ns

⇒ The time taken to add 100 pairs of numbers in the 4-segment pipeline
= (k + n − 1) tp = (4 + 100 − 1) × 100 ns = 10,300 ns ≈ 10.3 μs

(b) To reduce the time calculated in part (a) to one-half, we have to

reduce the clock cycle tp. We can achieve this if we could reduce

the maximum time delay TDmax to 50ns.

This can be reached by reducing t1 from 50 ns to 45 ns and t3 from 95 ns to 45 ns.
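The arithmetic of both parts can be reproduced with a short helper (an illustrative sketch; the names are ours):

```python
# Clock cycle = slowest segment delay + register delay;
# total time for n tasks = (k + n - 1) * tp, as derived above.
def pipeline_time(seg_delays_ns, reg_delay_ns, n_tasks):
    tp = max(seg_delays_ns) + reg_delay_ns   # clock must cover the slowest segment
    k = len(seg_delays_ns)
    return tp, (k + n_tasks - 1) * tp

tp, total = pipeline_time([50, 30, 95, 45], 5, 100)
print(tp, total)        # 100 ns clock, 10300 ns = 10.3 us total

# Part (b): reducing t1 to 45 ns and t3 to 45 ns halves the clock cycle.
tp2, total2 = pipeline_time([45, 30, 45, 45], 5, 100)
print(tp2, total2)      # 50 ns clock, 5150 ns total
```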

2.2.6 Instruction Pipeline

An instruction pipeline reads consecutive instructions from

memory while previous instructions are being executed in other

segments. This causes the instruction fetch and execute phases to

overlap and perform simultaneous operations.

Instruction cycle

Computers with complex instructions (CISC) require other

phases in addition to the fetch and execute to process an

instruction completely. In the most general case, the computer

needs to process each instruction with the following sequence of

steps:

1. Fetch the instruction from memory.


2. Decode instruction.

3. Calculate the effective address.

4. Fetch operands from memory.

5. Execute the instruction.

6. Store the result.

There are certain difficulties that will prevent the instruction

pipeline from operating at its maximum rate. Different segments

may take different times to operate on the incoming information.

Therefore, the design of an instruction pipeline will be

most efficient if the instruction cycle is divided into segments of

equal duration.

2.2.6.1 Four Segment Instruction Pipeline

A pipelined processor may process each instruction in four steps,

as follows:

F Fetch: read the instruction from the memory.

D Decode: decode the instruction and fetch the source

operand(s).

E Execute: Perform the operation specified by the

instruction.

W Write: Store the result in the destination location.

Fig.2-7 shows how the instruction cycle in the CPU can be

processed with a four-segment pipeline.

An instruction in the sequence may be a program control type

that causes a branch out of normal sequence. In that case, the

pending operations in the last two segments are completed and

the instruction buffer is emptied. The pipeline then restarts from

the new address stored in the program counter. Similarly, an

interrupt request, when acknowledged, will cause the pipeline to

empty and restart from a new address value, which is the

beginning of an interrupt service routine.


Fig.2-7 Four-segment instruction pipeline. Segment 1: fetch the instruction from memory. Segment 2: decode the instruction and fetch the source operand. Segment 3: perform the operation specified by the instruction. Segment 4: store the result. A taken branch updates the PC and empties the pipe; an acknowledged interrupt invokes interrupt handling, updates the PC, and empties the pipe.

Fig.2-8 shows the operation of the four-segment pipeline.

For a variety of reasons, one of the pipeline stages may not be

able to complete its processing task for a given instruction in the

time allotted. Any condition that causes the pipeline to deviate

from its normal operation (to stall) is called a hazard. In general,

there are three major hazards:


Fig.2-8 Operation of the four-segment pipeline: instructions I1–I4 pass through F, D, E, W in overlapping clock cycles 1–7, one instruction entering per cycle.

Data hazard.

Instruction hazard.

Structural hazard.

2.2.6.2 Data hazards

A data hazard is any condition in which either the source or the

destination operands of an instruction are not available at the

time expected in the pipeline. As a result some operation has to

be delayed, and the pipeline stalls. An example of that situation is when the source operand(s) of an instruction is the result of the previous instruction. The dependent instruction may be in the D segment while the previous one is still in the E segment, which means that the result of the previous instruction is not yet stored in the destination while the dependent instruction is requesting it, which will cause a conflict.

Pipelined computers deal with such conflicts in a variety of ways.

Hardware Interlock

The most straightforward method is to insert a hardware interlock. An interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline. Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to

resolve the conflict. Hardware is used to insert the required

delays.

Operand forwarding


Instead of transferring an ALU result into a destination register, hardware checks the destination operand and, if it is needed as a source in the next instruction, passes the result directly into the ALU input. Fig.2-9 shows the part of a pipeline processor which carries out this task.

Fig.2-9 Operand forwarding: a forwarding path carries the result from the output of the E (Execute, ALU) stage back to its source input, bypassing the write to the register file in the W stage.
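A toy stage scheduler shows the effect (our sketch; the stage model is assumed, not from the text): with a hardware interlock the dependent instruction's D stage waits until the producer's W completes, while forwarding makes the result usable right after E, removing the stalls.

```python
# Schedule n instructions through a 4-stage F/D/E/W pipeline, one fetch
# per cycle. dep_pair = (producer_index, consumer_index) marks a RAW
# dependency between two instructions.
def schedule(n_instr, dep_pair=None, forwarding=False):
    """Return (F, D, E, W) cycle numbers for each instruction, 1-based."""
    stages = []
    for i in range(n_instr):
        f = i + 1
        d = f + 1
        if dep_pair and i == dep_pair[1] and not forwarding:
            producer_w = stages[dep_pair[0]][3]
            d = max(d, producer_w + 1)      # interlock: wait for the write-back
        stages.append((f, d, d + 1, d + 2))
    return stages

# Four instructions; the fourth reads a register the third writes.
print(schedule(4, dep_pair=(2, 3), forwarding=False)[3])  # (4, 7, 8, 9): two stall cycles
print(schedule(4, dep_pair=(2, 3), forwarding=True)[3])   # (4, 5, 6, 7): no stalls
```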

Example 2.5

Consider the four instructions in the following program. Suppose

that the first instruction starts from step 1 in the pipeline used in

Fig.2-8. Specify what operations are performed in the four

segments during steps 5 and 6, assuming:

(a) The pipeline system uses hardware interlock technique to

handle data hazards.

(b) The pipeline system uses the operand forwarding

technique in Fig.2-9 to handle data hazards.

LOAD    R1 ← M[312]
ADD     R2 ← R2 + M[313]
INC     R3 ← R4 + 1
STORE   M[314] ← R3

Solution

The STORE instruction will cause a data conflict, as its source operand (R3) is the result of the previous instruction (INC); this will result in a data hazard. The timing for the pipeline of the

above program will be as follows:


Clock cycles   1   2   3   4   5   6   7
LOAD           F1  D1  E1  W1
ADD                F2  D2  E2  W2
INC                    F3  D3  E3  W3
STORE                      F4  D4  E4  W4

(a)

In a pipeline system using hardware interlock, the interlock

circuit will detect that the STORE instruction’s source operand is

the destination of the INC instruction. We assume that the

interlock circuit detects such an instruction after it passes the F segment. After detecting this situation, the interlock will cause the STORE instruction to be delayed for 2 clock cycles, which is

the minimum number of cycles for the INC instruction to write

the result to the destination (R3). The timing of the pipeline will

become as follows:

Clock cycles   1   2   3   4   5   6   7   8   9
LOAD           F1  D1  E1  W1
ADD                F2  D2  E2  W2
INC                    F3  D3  E3  W3
STORE                      F4  -   -   D4  E4  W4

As shown above, at steps 5 and 6 the situation will be as follows:

Instruction   Step 5                                   Step 6
LOAD          Completed                                Completed
ADD           Writing the result to destination (R2)   Completed
INC           Performing the operation (R4 + 1)        Writing the result to destination (R3)
STORE         Delayed                                  Delayed

(b)

In a pipeline system using operand forwarding, as shown in Fig.2-9, the result of an instruction can be fed back directly into the E

segment in the case of data dependency. In the above program,

hardware will detect data dependency between the INC and the

STORE instructions. This will cause the result of the INC

instruction to be passed directly into the input of the E segment.

The timing of the pipeline will become as follows:

Clock cycles   1   2   3   4   5   6   7
LOAD           F1  D1  E1  W1
ADD                F2  D2  E2  W2
INC                    F3  D3  E3  W3
STORE                      F4  D4  E4  W4

In the above figure it can be seen that no delays are needed, as the operand will be forwarded directly from the output of the E segment to its input. In this situation, the D segment of the STORE instruction skips fetching the source operand. At steps 5 and 6, the situation will be as follows:

Instruction   Step 5                                   Step 6
LOAD          Completed                                Completed
ADD           Writing the result to destination (R2)   Completed
INC           Performing the operation (R4 + 1)        Writing the result to destination (R3)
STORE         Decoding the instruction                 Performing the operation (no operation to be performed)

2.2.6.3 Instruction hazards


The purpose of the instruction fetch unit is to supply the

execution units with a steady stream of instructions. Whenever

this stream is interrupted, the pipeline stalls, as in the case of a

cache miss. A branch instruction may also cause the pipeline to

stall. Fig.2-10 illustrates the effect of branching on the instruction

stream.

Fig.2-10 Effect of branching on the instruction stream in a four-stage pipeline: (a) successful branch condition — the instructions fetched after the branch I2 are discarded and fetching resumes at the target Ik, Ik+1; (b) unsuccessful branch condition — the already-fetched I3 is used after a one-cycle delay, and I4 follows.

The above figure shows a sequence of instructions being

executed in a four-stage pipeline. Instructions I1 to I4 are stored

at successive memory addresses, and I2 is a branch instruction.

Let the branch target be instruction Ik. In clock cycle 3, the fetch


operation for instruction I3 is in progress at the same time the

branch instruction is being decoded and the target address

computed. In clock cycle 4, if the branch is taken (Fig.2-10a), the

processor must discard I3 and fetch instruction Ik. If the branch is

not taken (Fig.2-10b), I3 can be used. In the meantime, the hardware unit responsible for the Execute (E) step must be told

stalled for two clock cycles. The time lost as a result of a branch

instruction is often referred to as the branch penalty.

Pipelined computers employ various hardware techniques

to minimize the performance degradation caused by instruction

branching.

Pre-fetch target instruction

One way of handling a conditional branch is to pre-fetch the

target instruction in addition to the instruction following the

branch. Both are saved until the branch is executed. If the branch

condition is successful, the pipeline continues from the branch

target instruction.

Branch target buffer (BTB)

Another possibility is the use of a branch target buffer or BTB.

BTB is an associative memory included in the fetch segment of

the pipeline. Each entry of the BTB consists of the address of a

previously executed branch instruction and the target instruction

of that branch. It also stores the next few instructions after the

branch target instruction. When the pipeline decodes a branch

instruction, it searches the BTB for the address of that instruction. If it is in the BTB, the target instruction is available directly

and fetch continues from the new path.

Loop buffer

A loop buffer is a small, very high-speed register file included in the

fetch segment of the pipeline. When a program loop is detected

in the program, it is stored in the loop buffer including all

branches. The program loop can be executed directly without

accessing memory until the loop mode is removed by the final

branch.


Branch prediction

A pipeline with branch prediction uses some additional logic to

guess the outcome of a conditional branch instruction before it is

executed. A correct prediction eliminates the wasted time caused

by the branch penalties.

2.2.6.4 Structural hazards

This is the situation when two instructions require the use of a

given hardware resource at the same time. The most common

case is in access to memory. One instruction may need to access memory as part of the Execute (E) or Write (W) stage, while

another instruction is being fetched. Many processors use

separate instruction and data memories to avoid this conflict.

2.2.7 RISC Instruction Pipeline

Reduced Instruction Set Computers (RISC) are able to use an

efficient instruction pipeline, taking advantage of the following RISC characteristics:

• Because of the fixed-length instruction format, the

decoding of the operation can occur at the same time as the

register selection.

• All data manipulation instructions are register-to-register operations. Since all operands are in registers, there is no

need for calculating the effective address or fetching of

operands from memory.

Single cycle instruction execution

One of the major advantages of RISC over CISC is that RISC can

achieve pipeline segments each requiring just one clock cycle, while CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles.

Compiler support


Instead of designing hardware to handle difficulties (hazards)

associated with data conflicts and branch penalties, RISC

processors rely on the efficiency of the compiler to detect and

minimize the delays.

2.2.7.1 Three-Segment Instruction Pipeline

The instruction cycle for a RISC computer can be divided into

three sub-operations and implemented in three segments:

I Instruction fetch: fetches the instruction from the

program memory.

A ALU operation: decodes the instruction and

performs an ALU operation. An ALU operation may

be one of three types, data manipulation, data

transfer, or program control.

E Execute instruction: directs the output of the ALU to

the destination.

Fig.2-11 shows the operation of the three-segment pipeline.

Fig.2-11 Operation of the three-segment (I, A, E) pipeline: instructions I1, I2, I3 overlap across clock cycles 1–5, one starting each cycle.

The following sections illustrate how RISC computers use the

compiler to handle data and instruction hazards.


Delayed load

The compiler of RISC computers is designed to detect the data

conflicts (data hazards) and reorder the instructions as necessary

to delay the loading of the conflicting data by inserting no-operation instructions. For example, the two instructions (written in Berkeley RISC I format)

ADD R1, R2, R3    R3 ← R1 + R2
SUB R3, R4, R5    R5 ← R3 − R4

give rise to a data dependency (data hazard). The result of the ADD

instruction is placed into register R3, which in turn is one of the

two source operands of the Subtract instruction. There will be a

conflict in the Subtract instruction. This can be seen from the

pipeline timing shown in Fig.2-12a. The A segment in clock

cycle 3 is using data from R3 which will not be the correct value

since the addition operation is not yet completed. When the compiler detects such a situation, it searches for a useful instruction to put after the ADD; if it cannot find such an instruction, it inserts a no-operation (NOP) instruction, as illustrated in Fig.2-12b. This is a type of instruction that is fetched from memory but performs no operation, thus wasting a clock cycle.

Fig.2-12 (a) Pipeline timing with data conflict: ADD and SUB overlap, so SUB's A segment in clock cycle 3 uses R3 before ADD has written it. (b) Pipeline timing with delayed load: a NOP between ADD and SUB delays SUB by one clock cycle.
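The compiler pass described above can be sketched as follows (hypothetical code; the instruction tuple format is ours):

```python
# Delayed-load sketch: instructions are (op, src1, src2, dest) tuples.
# A real compiler would first try to move a useful independent instruction
# into the slot; this sketch only inserts a NOP when the next instruction
# reads the current instruction's destination register.
def delayed_load(program):
    out = []
    for i, (op, s1, s2, dest) in enumerate(program):
        out.append((op, s1, s2, dest))
        nxt = program[i + 1] if i + 1 < len(program) else None
        if nxt and dest is not None and dest in (nxt[1], nxt[2]):
            out.append(("NOP", None, None, None))    # fill the load-delay slot
    return out

prog = [("ADD", "R1", "R2", "R3"),     # R3 <- R1 + R2
        ("SUB", "R3", "R4", "R5")]     # R5 <- R3 - R4, reads R3
print(delayed_load(prog))              # a NOP appears between ADD and SUB
```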

Delayed Branch

In Fig.2-10a, the processor fetches instruction I3 before it

determines whether the current instruction I2 is a branch

instruction. When execution of I2 is completed and a branch is to

be made, the processor must discard I3 and fetch the instruction

at the branch target Ik. The location following the branch

instruction is called a branch delay slot. The instructions in the

delay slots are always fetched and at least partially executed

before the branch decision is made and the branch target address

is computed. The compiler for a processor that uses delayed

branches is designed to analyze the instructions before and after

the branch and rearrange the program sequence by inserting

useful (or no-operation) instructions in the delay slots.

RISC computers use delayed branch to handle branch related

instruction hazards. An example of delayed branch is shown in

Fig.2-13. The program sequence for this example consists of the

following instructions (written in Berkeley RISC I format).

AND R0, #10, R0    R0 ← R0 ∧ #10
SLL R2, #1, R1     R1 ← shift-left R2 once
ADD R3, R4, R5     R5 ← R3 + R4
JMP Z, #2(R6)      PC ← R6 + 2 (conditional jump, result = 0)
SUB R7, #3, R6     R6 ← R7 − 3

In the above program sequence, the branch (JMP) instruction

must be followed by two delay slots to avoid incorrect fetching

for the SUB instruction (in the case of successful branch

condition). When the compiler searches for two useful

instructions to insert in the delay slots instead of inserting no-

operation (as in Fig.2-13a), the AND instruction and the shift-

left (SLL) instruction will be a good choice as there source

operands and result are not related to any other instruction. The

program will be rearranged as follows.

ADD R3, R4, R5     R5 ← R3 + R4
JMP Z, #2(R6)      PC ← R6 + 2 (conditional jump, result = 0)
AND R0, #10, R0    R0 ← R0 ∧ #10
SLL R2, #1, R1     R1 ← shift-left R2 once
SUB R7, #3, R6     R6 ← R7 − 3
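The reordering above can be sketched as a tiny compiler pass (hypothetical; the independence test is supplied by hand here, where a real compiler would derive it from dependence analysis):

```python
# Move up to n_slots independent instructions from before the branch into
# the delay slots that follow it, instead of filling them with NOPs.
def fill_delay_slots(program, is_independent, n_slots=2):
    branch = next(i for i, ins in enumerate(program) if ins.startswith("JMP"))
    movable = [ins for ins in program[:branch] if is_independent(ins)][:n_slots]
    rest = [ins for ins in program if ins not in movable]
    b = next(i for i, ins in enumerate(rest) if ins.startswith("JMP"))
    return rest[:b + 1] + movable + rest[b + 1:]

prog = ["AND R0, #10, R0", "SLL R2, #1, R1", "ADD R3, R4, R5",
        "JMP Z, #2(R6)", "SUB R7, #3, R6"]
new = fill_delay_slots(prog, lambda ins: ins.startswith(("AND", "SLL")))
print(new)
# ['ADD R3, R4, R5', 'JMP Z, #2(R6)', 'AND R0, #10, R0',
#  'SLL R2, #1, R1', 'SUB R7, #3, R6']
```

This reproduces the rearranged program: AND and SLL occupy the two delay slots, so SUB is fetched only after the branch outcome is known.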


The pipeline timing for this new arrangement is shown in Fig.2-13b.

Fig.2-13 Delayed branch: (a) using no-operation instructions — NOPs fill the two delay slots after JMP and the sequence takes 9 clock cycles; (b) rearranging the instructions — AND and SLL fill the delay slots and the sequence takes 7 clock cycles.