What We Have Learned About Pipelining So Far

Savio Chau

• Pipelining Helps the Throughput of the Entire Workload But Doesn’t Help the Latency of a Single Task

• Pipeline Rate is Limited by the Slowest Pipeline Stage

• Multiple Instructions are Operating Simultaneously

• Potential Speedup = Number of Pipeline Stages, Under the Ideal Situation That All Instructions Are Independent and There Are No Branch Instructions

• Soon, We Will Learn About Hazards That Degrade the Performance of the Ideal Pipeline
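The speedup bullet can be checked with a little arithmetic. The sketch below is illustrative only: it assumes every stage takes one clock and that no hazards occur, and the function name `pipeline_speedup` is made up for this example.

```python
def pipeline_speedup(num_instructions: int, num_stages: int) -> float:
    """Ideal speedup of a pipeline over an unpipelined datapath.

    Assumption: one clock per stage, all instructions independent,
    no branches (the ideal situation described above).
    """
    unpipelined_cycles = num_instructions * num_stages
    # The first instruction needs num_stages cycles to pass through the pipe;
    # each later instruction completes one cycle after its predecessor.
    pipelined_cycles = num_stages + (num_instructions - 1)
    return unpipelined_cycles / pipelined_cycles


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, round(pipeline_speedup(n, 5), 2))
```

A single instruction sees no speedup (latency is unchanged), while a long instruction stream approaches a speedup equal to the number of stages.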

Pipeline Hazards

• Pipelining Limitations: Hazards are Situations that Prevent the Next Instruction from Executing During its Designated Cycle

– Structural Hazard: Resource Conflict When Several Pipelined Instructions Need the Same Functional Unit Simultaneously

– Data Hazard: An Instruction Depends on the Result of a Prior Instruction that is Still in the Pipeline

– Control Hazard: Pipelining of Branches and Other Instructions that Change the PC

• Common Solution: Stall the Pipeline by Inserting “Bubbles” Until the Hazard is Resolved

Savio Chau

Graphical Representation to Analyze Pipeline Hazards

[Figure: each instruction is drawn as a row of pipeline stages (Instruction Mem, Reg, ALU, Mem, Reg). Labels: Operations, Bypass.]

Structural Hazard: Conflict in Resources

Example: Assuming Instructions and Data Share the Same Memory

[Figure: pipeline diagram. Load is reading data from memory while Instruction 3 is fetching an instruction from the same memory.]

Resolution Option 1: Don’t Share the Memory

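[Figure: pipeline diagram in which every instruction uses an instruction memory (IM) for fetch and a separate data memory (DM) for data access]

Use different memories for instructions and data (just like the single cycle data path)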

Resolution Option 2: Using a Two-Port Memory

Use a 2-port memory that has two read output ports and can be read and written at the same time

[Figure: Load example. Labels: “Load instruction reading memory from output port #1”; “Load instruction reading memory from output port #2 of the same memory”.]

Resolution Option 2: Using a Two-Port Memory

[Figure: Store example. Labels: “Instruction 3 fetching instruction from the same memory”; “Store writing data to memory”.]

Resolution Option 3: Stall the Pipeline

Delay the start of conflicting successor instructions (i.e., for Load instructions, delay the 3rd succeeding instruction by 3 clocks)

To Insert a Bubble

Don’t Change the PC, Keep Fetching the Same Instruction, and Set All Control Signals in the ID/EX Pipeline Register to Benign Values (0). More discussion about the implementation later.

[Figure: sub r4, r1, r3 is refetched while the PC is not updated. Each refetch creates a bubble (all control signals set to 0, i.e., do nothing) until sub finally executes.]

Data Hazard: Dependencies Backwards in Time

• sub needs r1 2 clocks before add can supply it

• and needs r1 1 clock before add can supply it

• or gets the data in the same clock when add is done

• r1 is ready for xor

Note: The register file design allows data to be written in the first half of the clock cycle and read in the second half of the clock cycle

Data Hazard Example

[Figure: cycle-by-cycle trace of an add followed by four sub instructions that all read r1, with no hazard handling. The add computes 6 + 4 = 10 for r1. The first two subs still read r1 = 3 and compute 3 - 6 and 3 - 40; the last two subs read r1 = 10 and compute 10 - 60 and 10 - 80.]

Source registers: r2 = 4, r3 = 6, r7 = 40, r9 = 60, r11 = 80

Register   Initial   Correct Answer   Current Value
r1         3         10               10
r4         0         4                -3
r6         0         -30              -37
r8         0         -50              -50
r10        0         -70              -70
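The wrong values in the data hazard example can be reproduced with a small timing model. This is only a sketch: the instruction sequence (one add followed by four subs that read r1) is inferred from the register values on the slide, and the model assumes a 5-stage pipeline with no forwarding and no stalls, where the register file is written in the first half of a clock and read in the second half. The function name is made up for illustration.

```python
def run_without_hazard_handling(program, initial_regs):
    """Execute (op, rd, rs, rt) tuples on a 5-stage pipeline with no
    forwarding and no stalls. Operands are read in ID; results are written
    back three cycles later, in WB."""
    regs = dict(initial_regs)
    pending = []  # (write-back cycle, destination, value), oldest first
    ops = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}
    for i, (op, rd, rs, rt) in enumerate(program):
        id_cycle = i + 2  # IF happens in cycle i+1, ID in cycle i+2
        wb_cycle = i + 5  # EX, MEM, and WB follow ID
        # A write in the first half of a cycle is visible to a read in the
        # second half of that same cycle.
        while pending and pending[0][0] <= id_cycle:
            _, dest, value = pending.pop(0)
            regs[dest] = value
        pending.append((wb_cycle, rd, ops[op](regs[rs], regs[rt])))
    for _, dest, value in pending:  # drain the remaining write-backs
        regs[dest] = value
    return regs


initial = {"r1": 3, "r2": 4, "r3": 6, "r4": 0, "r6": 0, "r7": 40,
           "r8": 0, "r9": 60, "r10": 0, "r11": 80}
program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r3"),
    ("sub", "r6", "r1", "r7"),
    ("sub", "r8", "r1", "r9"),
    ("sub", "r10", "r1", "r11"),
]
result = run_without_hazard_handling(program, initial)
```

Under these assumptions the first two subs read the stale r1, and the third sub is the first to see the new value because its register read shares a clock with the add's write-back.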

Resolution Option 1: HW Stalls

See structural hazard Resolution Option 3 (Stall the Pipeline) for how to generate a bubble

Resolution Option 2: Reordering of Instructions

add r5, r6, r7

sub r8, r9, r10

Software inserts independent instructions instead of bubbles. It may have to insert NOP instructions if no independent instructions are found.
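The NOP fallback can be sketched as a small padding pass. This is an illustrative sketch, not a real compiler scheduler: it assumes no forwarding and the half-cycle register file described earlier, so a consumer must sit at least 3 instruction slots after its producer. The names `insert_nops`, `Instr`, and `min_distance` are hypothetical.

```python
from typing import List, Optional, Tuple

# An instruction is (text, destination register, source registers).
Instr = Tuple[str, Optional[str], Tuple[str, ...]]
NOP: Instr = ("nop", None, ())


def insert_nops(program: List[Instr], min_distance: int = 3) -> List[Instr]:
    """Pad with NOPs so that no instruction reads a register written by
    any of the previous (min_distance - 1) instruction slots."""
    window = max(min_distance - 1, 0)
    scheduled: List[Instr] = []
    for instr in program:
        sources = instr[2]

        def conflict() -> bool:
            recent = scheduled[-window:] if window else []
            return any(p[1] is not None and p[1] in sources for p in recent)

        while conflict():
            scheduled.append(NOP)
        scheduled.append(instr)
    return scheduled


add = ("add r1, r2, r3", "r1", ("r2", "r3"))
sub = ("sub r4, r1, r3", "r4", ("r1", "r3"))
# Independent instructions taken from the slide above.
ind1 = ("add r5, r6, r7", "r5", ("r6", "r7"))
ind2 = ("sub r8, r9, r10", "r8", ("r9", "r10"))

padded = insert_nops([add, sub])                 # needs two NOPs
reordered = insert_nops([add, ind1, ind2, sub])  # needs none
```

With the two independent instructions placed between the add and the dependent sub, the padding pass finds nothing to insert, which is the point of reordering.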

Resolution Option 3: Forwarding

• Insight: The Needed Data is Actually Available! It is Contained in the Pipeline Registers.

Hardware Change for Forwarding

• Add Paths From Pipeline Registers to Stages That Need the Data

• Add Multiplexors to Select The Pipeline Registers

• Register File Forwarding: Register Read During Write Gets New Value (write in 1st half of the clock cycle and read in 2nd half)

Data Hazard Detection For Forwarding

4 types of instruction dependencies cause data hazards:

1a. Rd of instruction in execution = Rs of instruction in operand fetch (EX/MEM.RegisterRd = ID/EX.RegisterRs)

1b. Rd of instruction in execution = Rt of instruction in operand fetch (EX/MEM.RegisterRd = ID/EX.RegisterRt)

2a. Rd of instruction writing back = Rs of instruction in execution (MEM/WB.RegisterRd = ID/EX.RegisterRs)

2b. Rd of instruction writing back = Rt of instruction in execution (MEM/WB.RegisterRd = ID/EX.RegisterRt)

add r1, r2, r3
sub r4, r1, r3    (Type 1a: r1 not valid yet)
and r6, r1, r7    (Type 2a: r1 not valid yet)

Forwarding Control

• For Mux A

– Select ALU operand from previous ALU result in EX/MEM (Type 1a): if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs))

– Select ALU operand from MEM/WB (Type 2a): if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRs))

• For Mux B

– Same as Mux A except replacing Rs with Rt

[Figure: datapath with a Forwarding Unit. Its inputs are rs and rt from ID/EX and rd from EX/MEM and MEM/WB. Its control outputs, Fwd A and Fwd B, select among inputs 0, 1, and 2 of Mux A and Mux B in front of the ALU.]

Forwarding Example

add r1, r2, r3
sub r4, r1, r3
and r6, r7, r1

[Figure: snapshots of the forwarding datapath as add, sub, and and move through ID/EX, EX/MEM, and MEM/WB. sub reads r1 as rs while add is in EX/MEM (Type 1a hazard). and reads r1 as rt while add is in MEM/WB (Type 2b hazard). The forwarding unit outputs shown are 10 and 01.]

One More Problem

Question: If Rd is used repeatedly such that rd in all three stages are the same (i.e., MEM/WB.RegRd = EX/MEM.RegRd = ID/EX.RegRs (or ID/EX.RegRt)), should EX/MEM or MEM/WB be forwarded?

Answer: Forward EX/MEM because it is more up to date than MEM/WB. Therefore, MEM/WB is forwarded only if rd in all three stages are not the same. That is:

– For Mux A, Select ALU operand from MEM/WB (Type 2a): if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (EX/MEM.RegRd ≠ ID/EX.RegRs) and (MEM/WB.RegRd = ID/EX.RegRs))

– For Mux B: Same as Mux A except replacing Rs with Rt (Type 2b)

[Figure: the forwarding datapath with the Forwarding Unit driving Fwd A and Fwd B, as on the Forwarding Control slide]
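The forwarding conditions, including the rule that EX/MEM takes priority over MEM/WB, can be sketched as a small function. This is an illustrative model rather than the actual control logic: `PipeReg` and `forward_select` are hypothetical names, and the return values are descriptive strings instead of real mux select codes.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two fields of a pipeline register that the forwarding unit reads."""
    reg_write: bool = False
    rd: int = 0


def forward_select(id_ex_src: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Pick the source of one ALU operand.

    Call with ID/EX.RegRs for Mux A and with ID/EX.RegRt for Mux B.
    """
    # Type 1a/1b: the most recent result wins, so EX/MEM is checked first.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == id_ex_src:
        return "EX/MEM"
    # Type 2a/2b: reached only when EX/MEM does not supply this register.
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == id_ex_src:
        return "MEM/WB"
    return "REGFILE"


# add r1, r2, r3 ; sub r4, r1, r3 ; and r6, r7, r1
add_result = PipeReg(reg_write=True, rd=1)
sub_result = PipeReg(reg_write=True, rd=4)

sub_mux_a = forward_select(1, ex_mem=add_result, mem_wb=PipeReg())   # rs = r1
and_mux_b = forward_select(1, ex_mem=sub_result, mem_wb=add_result)  # rt = r1
repeated = forward_select(1, ex_mem=add_result, mem_wb=add_result)
```

Checking EX/MEM first has the same effect as the extra (EX/MEM.RegRd ≠ ID/EX.RegRs) term in the MEM/WB condition.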

Forwarding Removes Data Hazard in Most Cases

add r1, r2, r3

add r4, r1, r3

add r5, r4, r1

sw r5, 0(r4)


Except in One Case: lw Instruction

Problem: The lw instruction is still reading memory when the sub instruction needs the data for EX. Still need to handle the 1 hazard cycle.

Why not forward the EX/MEM register to the ALU? Would that remove the 1 hazard cycle?

The Case Forwarding Can’t Avoid Stalling

lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1

Problem: lw followed by R-type. The lw instruction is still reading memory when the sub instruction needs the data for EX. Need to stall 1 cycle.

[Figure: snapshots of the forwarding datapath. This is a Type 1a hazard, but the EX/MEM output cannot be forwarded: for lw it is the memory address, not the data. The memory data for lw, Mem[addr], is forwarded as Type 2a once it is in MEM/WB.]

Option 1: Software Solution

• Software inserts independent instructions; in the worst case, it inserts NOP instructions

Option 2: Hardware Solution

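• Control logic checks for data hazard and stalls one cycle (i.e., inserts a bubble) if necessary

[Figure: pipeline diagram with one bubble after lw. The bubble does nothing in each stage; by the time a later instruction reads the register, the loaded value is already in the register file.]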

Hardware to Stall The Pipeline

• Step 1: Detecting the hazard (check if lw is being executed and if the memory data read by lw will be loaded to one of the operands in the next instruction)

– Stall = ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt))

• Step 2: If Stall is true

– Do not fetch the next instruction by disabling the writing to PC and IF/ID registers

– Disable all control signals of the current instruction

[Figure: forwarding datapath with a Hazard Detect unit added. Inputs: ID/EX.MemRead, ID/EX.rt, IF/ID.rs, IF/ID.rt, IF/ID.opcode. Outputs: PCWr, IF/IDWr, and the select of a mux that replaces the control signals with 0.]
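The two steps above can be sketched in a few lines. This is a simplified model under stated assumptions: it applies only the Stall condition from Step 1, ignores the IF/ID.opcode input, and represents each instruction as a hypothetical tuple (opcode, register written, IF/ID.rs field, IF/ID.rt field). The function names are made up for illustration.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """Step 1: Stall = ID/EX.MemRead and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def count_bubbles(program) -> int:
    """Count the bubbles inserted for load-use hazards between adjacent
    instructions. Each stall delays the consumer by exactly one cycle."""
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_written, _, _ = prev
        _, _, cur_rs, cur_rt = cur
        if load_use_stall(prev_op == "lw", prev_written, cur_rs, cur_rt):
            # Step 2: PCWr = 0, IF/IDWr = 0, control signals forced to 0.
            bubbles += 1
    return bubbles


# lw r1, 0(r2) ; sub r4, r1, r3 ; and r6, r7, r1 ; or r8, r1, r9
program = [
    ("lw", 1, 2, 1),
    ("sub", 4, 1, 3),
    ("and", 6, 7, 1),
    ("or", 8, 1, 9),
]
bubbles = count_bubbles(program)
total_cycles = 5 + (len(program) - 1) + bubbles  # 5-stage pipeline assumed
```

Only the lw/sub pair stalls; and and or get r1 through forwarding or the register file, so one bubble is enough for this sequence.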

Stalling The Pipeline

lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1
or r8, r1, r9

[Figure: sequence of datapath snapshots for the code above, showing the Hazard Detect unit, the Forwarding Unit, and the control signals carried in ID/EX, EX/MEM, and MEM/WB]

• lw is in ID/EX (MemRead = 1, MemWr = 0, RegWr = 1) and sub is in IF/ID. ID/EX.MemRead = 1 means an lw instruction; ID/EX.rt = R1 and IF/ID.rs = R1, so the hazard is detected.

• The Hazard Detect unit sets PCWr = 0 and IF/IDWr = 0.

• In the next clock, lw moves on to EX/MEM. IF/ID is reading the same instruction (sub) again, and the control signals written into ID/EX are MemRead = 0, MemWr = 0, RegWr = 0: not doing anything (i.e., it is a bubble!).

• In the following clocks, sub, and, and or move down the pipeline behind the bubble, and the lw data is available from MEM/WB.

The bubble has not changed any state of the pipeline

Control Hazard: Change in Control Flow Due to Branching

beq $1, $3, 36

ld $4, $7, 100

[Figure: pipeline diagram. While waiting for the result of the comparison, the slots after beq are bubbles (all control signals set to 0). Once the result of the comparison is branch to target, the instruction at the branch target is fetched.]

Have to stall 3 cycles before the branch decision is made

Option 1: Static Branch Prediction

Predict Branch Not Taken

[Figure: pipeline diagram with instructions fetched at PC=12, PC=16, PC=20, PC=24, and PC=28 (or $15, $7, $3), assuming branch not taken. The result of the comparison is not to branch.]

Prediction is correct, branching does not cause any penalty

Penalty of Wrong Prediction




[Figure: pipeline diagram with instructions fetched at PC=12, PC=16, PC=20, and PC=24, then the branch target at PC=36 once the result of the comparison is branch taken]

Prediction is incorrect, so the pipe must be flushed; the penalty is the same as without branch prediction (3 cycles). Note: Need to make sure that no instruction after beq has updated the register file or memory.

Example of Wrong Prediction Penalty (e.g., Incorrectly Predict Branch Not Taken)

Ck   PC   Instruction
1    00   lw $2, 0($3)
2    04   add $4, $0, $5
3    08   sw $6, 4($3)
4    12   beq $7, $2, 5
5    16   add $8, $2, $5
6    20   add $9, $2, $4
7    24   sub $10, $4, $7
8    28   add $11, $7, $8
9    32   j 11
8    36   sub $8, $2, $5
9    40   sub $9, $2, $4

[Figure: pipelined datapath shown over Clock 1 to Clock 5 with a Branch Hazard Detect unit that drives the flush signals. See Set 8 Class Example.]

Reset by (Branch and Zero). No permanent change to Register File or Memory. Pipe flushed.

To Reduce Branch Penalty

Let’s Take a Look at Why 3 Cycles of Penalty Are Needed: the decision is made after the execution stage

[Figure: datapath marking the 1st, 2nd, and 3rd clock delays. The branch address is computed from PC+4 and Ext(imm16).]

Need a separate comparator since the ALU is computing the branch address

To Reduce Branch Penalty

But in fact all the information required to make the decision can be obtained in the “decode/operand fetch” stage. That means we can move the address calculation forward.

[Figure: datapath with the branch address computed from PC+4 and Ext(imm16) in the decode stage, leaving only the 1st clock delay]

Pipeline After Branch Penalty Reduction

Branch Decision and Calculate Address Done in Decode Stage

[Figure: pipeline diagram for predict branch not taken. Instructions are fetched at PC = 20, 24, 56, 60, 64, …; the instructions fetched from PC = 56 onward are lw $4, 100($7), add $5, $6, $7, and sw $5, 100($7). The slot behind beq has all control signals set to 0.]

Prediction is wrong, but only IF/ID needs to be flushed

Penalty = 1 cycle, instead of 3 cycles
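The effect of cutting the penalty from 3 cycles to 1 can be estimated with the usual CPI arithmetic. This is a sketch with made-up numbers: the branch fraction and misprediction rate below are hypothetical inputs chosen for illustration, not measurements from these slides, and the function name is an assumption.

```python
def cpi_with_branches(branch_fraction: float, mispredict_rate: float,
                      penalty_cycles: int, base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when every mispredicted branch
    costs penalty_cycles extra cycles and everything else is ideal."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, half of them mispredicted.
cpi_decide_in_ex = cpi_with_branches(0.20, 0.5, penalty_cycles=3)
cpi_decide_in_id = cpi_with_branches(0.20, 0.5, penalty_cycles=1)
```

Under these assumed inputs, moving the branch decision to the decode stage removes two thirds of the branch-related stall cycles.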

Flushing Pipe in the New Approach if Prediction is Wrong

[Figure: datapath snapshots with a Branch Hazard Detection unit. beq (addr = 20) is in decode and and has been fetched as if the branch were not taken. But Zero has decided that the branch should have been taken, so and is flushed (its control signals become 00…0) and lw is fetched from addr = 56, followed by add (addr = 60) and sw (addr = 64).]

Need Both Styles of Branch Prediction: Branch Taken and Branch Not Taken

• MIPS code that favors branch taken:

Loop: mult $19, $10         # (HI, LO) regs = i * 4
      mflo $9               # reg $9 = least sig. 32 product bits
      lw   $8, Aaddr($9)    # Temporary reg $8 = A[i*4]
      add  $17, $17, $8     # g = g + A[i]
      add  $19, $19, $20    # i = i + j
      bne  $19, $18, Loop   # goto Loop if i != h

Branch to Loop most of the time

• MIPS code that favors branch not taken:

Loop: mult $19, $10         # (HI, LO) regs = i * 4
      mflo $9               # reg $9 = least sig. 32 product bits
      lw   $8, Aaddr($9)    # Temporary reg $8 = A[i*4]
      add  $17, $17, $8     # g = g + A[i]
      add  $19, $19, $20    # i = i + j
      beq  $19, $18, Exit   # if i = h, get out of the loop
      j    Loop             # otherwise, goto Loop

Do not branch to Exit most of the time

Option 2: Dynamic Branch Prediction

• Rather than always assuming branch not taken, use a branch history table (also called a branch prediction buffer) to achieve better prediction

• The branch history table is implemented as a one- or two-bit register

Example: state transition of a 2-bit history table

– State 00: predict taken

– State 01: predict taken

– State 10: predict not taken

– State 11: predict not taken

[Figure: state transition diagram among the four states, with edges labeled taken and not taken]

If the branch test is in instruction N, then:

– predict taken means the PC is set to the target address by default, and set to N+4 if wrong

– predict not taken means the PC is set to N+4 by default, and set to the target address if wrong
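A 2-bit history entry can be sketched as a small state machine. The transitions below are an assumption: they follow the common saturating-counter scheme (taken moves toward state 00, not taken moves toward state 11), which may differ in detail from the diagram on the slide. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """One 2-bit branch history entry.

    States 00 and 01 predict taken; states 10 and 11 predict not taken.
    Assumed transitions: saturating counter (see the note above).
    """

    def __init__(self, state: int = 0b00) -> None:
        self.state = state

    def predict_taken(self) -> bool:
        return self.state <= 0b01

    def update(self, taken: bool) -> None:
        if taken:
            self.state = max(self.state - 1, 0b00)
        else:
            self.state = min(self.state + 1, 0b11)


def count_mispredictions(outcomes, predictor: TwoBitPredictor) -> int:
    """Run the predictor over a sequence of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict_taken() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch that is taken 9 times and then falls through, run twice.
loop_outcomes = ([True] * 9 + [False]) * 2
mispredictions = count_mispredictions(loop_outcomes, TwoBitPredictor())
```

With two bits of history, a single loop exit does not flip the prediction, so the branch is mispredicted only once per pass through the loop in this model.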

Option 3: Delayed Branch

• Make use of the time while the branch decision is being made: Execute an unrelated instruction subsequent to the branch instruction

• Where To Get Instructions to Fill Branch Delay Slot? Three Strategies:

– Before Branch Instruction (best if possible): add $s1, $s2, $s3 comes before “if $s2=0 then”; the add is used to fill the delay slot

– From Target (good if always branch): sub $t4, $t5, $t6 is at the target of “add $s1, $s2, $s3; if $s1=0 then”; the sub is used to fill the delay slot

– From Fall Through (good if always don’t branch): sub $t4, $t5, $t6 follows the delay slot of “add $s1, $s2, $s3; if $s1=0 then”; the sub is used to fill the delay slot

• Compiler Effectiveness for Single Branch Delay Slot:

– Fills About 60% of Branch Delay Slots

– About 80% of Instructions Executed in Branch Delay Slots Useful in Computation

– About 50% (60% x 80%) of Slots Usefully Filled

• Worst Case, Compiler Inserts NOP into Branch Delay Slot
