Savio Chau
What We Have Learned About Pipelining So Far
• Pipelining Helps the Throughput of the Entire Workload But Doesn’t Help the Latency of a Single Task
• Pipeline Rate is Limited by the Slowest Pipeline Stage
• Multiple Instructions are Operating Simultaneously
• Potential Speedup = Number of Pipeline Stages Under the Ideal Situation That All Instructions Are Independent and There Are No Branch Instructions
• Soon, We Will Learn About Hazards That Degrade the Performance of the Ideal Pipeline
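As a rough illustration of the speedup bullet, the sketch below counts clock cycles for n independent instructions on a k-stage pipeline. It assumes every stage takes exactly one clock and that there are no hazards; the function names are hypothetical and chosen only for this example.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction uses the datapath for n_stages clocks.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    # The first instruction takes n_stages clocks to pass through the pipeline;
    # after that, one instruction completes every clock.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


for n in (1, 10, 1000):
    print(n, round(speedup(n, 5), 2))  # prints 1.0, 3.57, 4.98 for a 5-stage pipeline
```

A single instruction sees no benefit (latency is unchanged), while a long run of independent instructions approaches a speedup equal to the number of stages.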
Pipeline Hazards
• Pipelining Limitations: Hazards are Situations that Prevent the Next Instruction from Executing During its Designated Cycle
– Structural Hazard: Resource Conflict When Several Pipelined Instructions Need the Same Functional Unit Simultaneously
– Data Hazard: An Instruction Depends on the Result of a Prior Instruction that is Still in the Pipeline
– Control Hazard: Pipelining of Branches and Other Instructions that Change the PC
• Common Solution: Stall the Pipeline by Inserting “Bubbles” Until the Hazard is Resolved
Graphical Representation to Analyze Pipeline Hazards
[Figure: each instruction is drawn as a row of stage blocks (Mem, Reg, ALU, Mem, Reg); labels: Instruction, Operations, Bypass]
Structural Hazard: Conflict in Resources
Example: Assuming Instructions and Data Share the Same Memory
Load reading data from memory
Instruction 3 fetching instruction from the same memory
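The conflict in this example can be located mechanically. The sketch below assumes the classic 5-stage pipeline (IF, ID, EX, MEM, WB) with one instruction issued per clock and a single memory shared by instruction fetch and data access; `memory_conflicts` is a hypothetical helper, not part of the slides.

```python
def memory_conflicts(program):
    """Find cycles where a load/store's MEM stage collides with a fetch.

    program: list of mnemonics in issue order (position 0 issues in cycle 1).
    Returns a list of (cycle, data_access_index, fetch_index) tuples.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op not in ("lw", "sw"):
            continue  # only loads and stores use memory in the MEM stage
        mem_cycle = i + 4          # IF is cycle i+1, so MEM is cycle i+4
        fetch_index = mem_cycle - 1  # instruction whose IF falls in that cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


# Load followed by three more instructions, as in the example above.
print(memory_conflicts(["lw", "instr1", "instr2", "instr3"]))  # [(4, 0, 3)]
```

In cycle 4 the load (position 0) reads data while instruction 3 is being fetched, which is the conflict described above.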
Resolution Option 1: Don’t Share the Memory
Use different memories for instructions and data (just like the single-cycle datapath)
[Figure: pipeline diagram in which each instruction row uses a separate instruction memory (IM) and data memory (DM)]
Resolution Option 2: Using a Two-Port Memory
Use a 2-port memory that has two read output ports and can be read and written at the same time
[Figure: pipeline diagram for the Load case, with two callouts]
Load instruction reading memory from output port #1
Load instruction reading memory from output port #2 of the same memory
Resolution Option 2: Using a Two-Port Memory
Use a 2-port memory that has two read output ports and can be read and written at the same time
[Figure: pipeline diagram for the Store case, with two callouts]
Store writing data to memory
Instruction 3 fetching instruction from the same memory
Resolution Option 3: Stall the Pipeline
Delay the start of conflicting successor instructions (i.e., for Load instructions, delay the 3rd succeeding instruction by 3 clocks)
To Insert a Bubble
Don’t Change the PC, Keep Fetching the Same Instruction, and Set All Control Signals in the ID/EX Pipeline Register to Benign Values (0). More discussion about the implementation later.
[Figure: pipeline diagram for sub r4, r1, r3. The PC is not updated, so sub is refetched; each refetch creates a bubble (all control signals set to 0, i.e., do nothing) until sub finally executes]
Data Hazard: Dependencies Backwards in Time
[Figure: pipeline diagram of add followed by sub, and, or, and xor, each of which reads r1]
Sub needs r1 2 clocks before add can supply it
And needs r1 1 clock before add can supply it
Or gets the data in the same clock when add is done
r1 is ready for xor
Note: The register file design allows data to be written in the first half of the clock cycle and read in the second half of the clock cycle
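The clock counts in these callouts follow from simple stage arithmetic. The sketch below assumes a 5-stage pipeline in which the producer writes the register file in WB (first half of the clock) and each consumer reads it in ID (second half of the clock); `clocks_too_early` is a hypothetical name used only here.

```python
def clocks_too_early(distance):
    """How many clocks before the write a consumer tries to read.

    distance: how many instructions after the producer the consumer issues
    (1 = the very next instruction). The producer issues in cycle 1.
    """
    write_cycle = 5            # producer: IF=1, ID=2, EX=3, MEM=4, WB=5
    read_cycle = distance + 2  # consumer: IF=distance+1, ID=distance+2
    return max(0, write_cycle - read_cycle)


for name, distance in (("sub", 1), ("and", 2), ("or", 3), ("xor", 4)):
    print(name, clocks_too_early(distance))  # sub 2, and 1, or 0, xor 0
```

A result of 0 for `or` reflects the write-then-read register file: the read in the second half of the clock sees the value written in the first half.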
Data Hazard Example
[Figure: an add followed by four subs flowing through the pipeline with no hazard handling. The first two subs read the stale value r1=3]
Instruction   Operands Read    ALU Operation   Result Written   Correct Answer
add           r2=4, r3=6       6+4             r1=10            r1=10
sub           r1=3, r3=6       3-6             r4=-3            r4=4
sub           r1=3, r7=40      3-40            r6=-37           r6=-30
sub           r1=10, r9=60     10-60           r8=-50           r8=-50
sub           r1=10, r11=80    10-80           r10=-70          r10=-70
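The wrong results in this example can be reproduced with a small model. The sketch below is a simplification, not the author's datapath: it assumes a result only becomes visible to instructions issued three or more positions after its producer (the write-then-read register file), and it assumes the instruction sequence and operand order shown in the `program` list, which are inferred from the figure values.

```python
def run_without_hazard_handling(program, regs):
    """program: list of (op, dest, src1, src2); regs: dict of initial values."""
    regs = dict(regs)
    pending = []  # (first position that can see the value, dest, value)
    for pos, (op, dest, a, b) in enumerate(program):
        # Commit results whose write-back happens by this instruction's ID stage.
        ready = [p for p in pending if p[0] <= pos]
        pending = [p for p in pending if p[0] > pos]
        for _, d, v in ready:
            regs[d] = v
        x, y = regs[a], regs[b]  # possibly stale operands
        value = x + y if op == "add" else x - y
        pending.append((pos + 3, dest, value))
    for _, d, v in pending:  # drain the pipeline
        regs[d] = v
    return regs


initial = {"r1": 3, "r2": 4, "r3": 6, "r7": 40, "r9": 60, "r11": 80}
program = [  # assumed sequence, inferred from the figure
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r3"),
    ("sub", "r6", "r1", "r7"),
    ("sub", "r8", "r1", "r9"),
    ("sub", "r10", "r1", "r11"),
]
result = run_without_hazard_handling(program, initial)
print(result["r1"], result["r4"], result["r6"], result["r8"], result["r10"])
# prints: 10 -3 -37 -50 -70
```

Only the two subs closest to the add see the stale r1, which is why r4 and r6 come out wrong while r8 and r10 are correct.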
Resolution Option 1: HW Stalls
See structural hazard Resolution Option 3 for how to generate a bubble
Resolution Option 2: Reordering of Instructions
add r5, r6, r7
sub r8, r9, r10
Software inserts independent instructions instead of bubbles. It may have to insert NOP instructions if no independent instructions are found.
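A minimal sketch of the NOP fallback, assuming no forwarding and a register file that is written in the first half of a clock and read in the second half, so a consumer is safe once it sits at least 3 positions after its producer. The tuple layout and `insert_nops` are hypothetical; a real compiler would first try to move independent instructions into the gap.

```python
NOP = ("nop", None, ())


def insert_nops(program, min_distance=3):
    """program: list of (op, dest, sources). Returns a hazard-free schedule."""
    window = min_distance - 1  # how many preceding slots can still conflict
    scheduled = []
    for instr in program:
        sources = instr[2]
        while True:
            recent = scheduled[-window:] if window > 0 else []
            if any(p[1] is not None and p[1] in sources for p in recent):
                scheduled.append(NOP)  # no independent work available here
            else:
                break
        scheduled.append(instr)
    return scheduled


dependent = [("add", "r1", ("r2", "r3")), ("sub", "r4", ("r1", "r3"))]
print([i[0] for i in insert_nops(dependent)])  # ['add', 'nop', 'nop', 'sub']

reordered = [
    ("add", "r1", ("r2", "r3")),
    ("add", "r5", ("r6", "r7")),   # independent instructions fill the gap
    ("sub", "r8", ("r9", "r10")),
    ("sub", "r4", ("r1", "r3")),
]
print([i[0] for i in insert_nops(reordered)])  # ['add', 'add', 'sub', 'sub']
```

With the two independent instructions placed between the add and its consumer, no NOPs are needed.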
Resolution Option 3: Forwarding
• Insight: The Needed Data is Actually Available! It is Contained in the Pipeline Registers.
Hardware Change for Forwarding
• Add Paths From Pipeline Registers to Stages That Need the Data
• Add Multiplexors to Select The Pipeline Registers
• Register File Forwarding: Register Read During Write Gets New Value (write in 1st half of the clock cycle and read in 2nd half)
Data Hazard Detection For Forwarding
4 types of instruction dependencies cause data hazards:
1a. Rd of instruction in execution = Rs of instruction in operand fetch (EX/MEM.RegisterRd = ID/EX.RegisterRs)
1b. Rd of instruction in execution = Rt of instruction in operand fetch (EX/MEM.RegisterRd = ID/EX.RegisterRt)
2a. Rd of instruction writing back = Rs of instruction in execution (MEM/WB.RegisterRd = ID/EX.RegisterRs)
2b. Rd of instruction writing back = Rt of instruction in execution (MEM/WB.RegisterRd = ID/EX.RegisterRt)
Example:
add r1, r2, r3
sub r4, r1, r3    (Type 1a: r1 not valid yet)
and r6, r1, r7    (Type 2a: r1 not valid yet)
Forwarding Control
• For Mux A
– Select ALU operand from previous ALU result in EX/MEM (Type 1a):
if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs))
– Select ALU operand from MEM/WB (Type 2a):
if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRs))
• For Mux B: Same as Mux A except replacing Rs with Rt (Types 1b and 2b)
Question: If Rd is used repeatedly such that rd in all three stages is the same (i.e., MEM/WB.RegRd = EX/MEM.RegRd = ID/EX.RegRs (or ID/EX.RegRt)), should EX/MEM or MEM/WB be forwarded?
Answer: Forward EX/MEM because it is more up to date than MEM/WB. Therefore, MEM/WB is forwarded only if the rd in all three stages is not the same. That is:
– For Mux A, Select ALU operand from MEM/WB (Type 2a):
if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (EX/MEM.RegRd ≠ ID/EX.RegRs) and (MEM/WB.RegRd = ID/EX.RegRs))
– For Mux B: Same as Mux A except replacing Rs with Rt (Type 2b)
[Figure: pipelined datapath with a Forwarding Unit. Labels: RegFile, Control, ID/EX, EX/MEM, MEM/WB, ALU, Data Memory; Mux A and Mux B (inputs 0, 1, 2) are driven by Fwd A and Fwd B; the Forwarding Unit's inputs are rs, rt, and rd]
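The forwarding conditions for the two muxes can be sketched as a small function. This is an illustrative model, not the slide's hardware: pipeline registers are plain dictionaries, the field names mirror the slide's notation, and `forward_source` and `forwarding_unit` are hypothetical names. Checking EX/MEM first gives it priority over MEM/WB; note that this ordering also takes EX/MEM.RegWrite into account, which is slightly stricter than comparing register numbers alone.

```python
def forward_source(src_reg, ex_mem, mem_wb):
    """Pick where one ALU operand should come from."""
    # Type 1: the most recent result is still in EX/MEM, so it wins.
    if ex_mem["RegWrite"] and ex_mem["RegRd"] != 0 and ex_mem["RegRd"] == src_reg:
        return "EX/MEM"
    # Type 2: only used when EX/MEM does not already supply this register.
    if mem_wb["RegWrite"] and mem_wb["RegRd"] != 0 and mem_wb["RegRd"] == src_reg:
        return "MEM/WB"
    return "RegFile"  # no hazard: use the value read in the decode stage


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return the selections for (Mux A, Mux B)."""
    return (
        forward_source(id_ex["RegRs"], ex_mem, mem_wb),
        forward_source(id_ex["RegRt"], ex_mem, mem_wb),
    )


# r1 written by both older instructions; the consumer reads r1 (Rs) and r3 (Rt).
id_ex = {"RegRs": 1, "RegRt": 3}
ex_mem = {"RegWrite": True, "RegRd": 1}
mem_wb = {"RegWrite": True, "RegRd": 1}
print(forwarding_unit(id_ex, ex_mem, mem_wb))  # ('EX/MEM', 'RegFile')
```

The register-number-0 test matters because writes to register 0 are discarded on MIPS, so a result headed there must never be forwarded.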
Forwarding Removes Data Hazard in Most Cases
add r1, r2, r3
add r4, r1, r3
add r5, r4, r1
sw r5, 0(r4)
Except in One Case: lw Instruction
Problem: The lw instruction is still reading memory when the sub instruction needs the data for EX. Still need to handle the 1 hazard cycle.
Why not forward the EX/MEM register to the ALU? Would that remove the 1 hazard cycle?
The Case Forwarding Can’t Avoid Stalling
lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1
Problem: lw followed by R-type. The lw instruction is still reading memory when the sub instruction needs the data for EX. Need to stall 1 cycle.
[Figure: forwarding datapath with lw one stage ahead of the dependent R-type instruction; the ALU output held in EX/MEM for lw is labeled addr]
Type 1a Hazard, but cannot forward the EX/MEM output: for lw, it is the memory address, not the data
• Control logic checks for the data hazard and stalls one cycle (i.e., inserts a bubble) if necessary
Hardware to Stall The Pipeline
• Step 1: Detecting the hazard (check if lw is being executed and if the memory data read by lw will be loaded to one of the operands in the next instruction)
– Stall = (ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt)))
• Step 2: If Stall is true
– Do not fetch the next instruction by disabling the writing to PC and IF/ID registers
– Disable all control signals of the current instruction
The bubble has not changed any state of the pipeline
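The two steps above can be sketched as follows. The dictionaries stand in for the ID/EX and IF/ID pipeline registers; the output names `PCWrite`, `IFIDWrite`, and `ZeroControls` are hypothetical labels for "write enable for PC", "write enable for IF/ID", and "force control signals to 0", and are not taken from the slides.

```python
def must_stall(id_ex, if_id):
    # Step 1: a load is in EX and its destination (rt) is a source
    # of the instruction currently being decoded.
    return bool(id_ex["MemRead"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])


def hazard_unit(id_ex, if_id):
    # Step 2: on a stall, freeze PC and IF/ID and zero the control signals.
    stall = must_stall(id_ex, if_id)
    return {"PCWrite": not stall, "IFIDWrite": not stall, "ZeroControls": stall}


# lw r1, 0(r2) is in ID/EX while sub r4, r1, r3 is in IF/ID.
lw_in_ex = {"MemRead": True, "rt": 1}
sub_in_id = {"rs": 1, "rt": 3}
print(hazard_unit(lw_in_ex, sub_in_id))
# {'PCWrite': False, 'IFIDWrite': False, 'ZeroControls': True}

# One cycle later: sub r4, r1, r3 is in ID/EX, and r6, r7, r1 is in IF/ID.
sub_in_ex = {"MemRead": False, "rt": 3}
and_in_id = {"rs": 7, "rt": 1}
print(must_stall(sub_in_ex, and_in_id))  # False
```

Because only a load sets MemRead, ordinary R-type dependencies fall through to forwarding and never trigger this stall.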
Control Hazard: Change in Control Flow Due to Branching
beq $1, $3, 36
ld $4, $7, 100
[Figure: pipeline diagram. The three instruction slots after beq are bubbles (all control signals set to 0), waiting for the result of the comparison]
Result of comparison: branch to target
Branch target
Have to stall 3 cycles before the branch decision is made
Option 1: Static Branch Prediction
Predict Branch Not Taken
[Figure: pipeline diagram with instructions at PC=12, 16, 20, 24, and 28 (or $15,$7,$3); the three instructions after the branch are fetched assuming branch not taken]
Result of comparison: not to branch
Prediction is correct, branching does not cause any penalty
Penalty of Wrong Prediction
[Figure: pipeline diagram with instructions at PC=12, 16, 20, and 24, followed by the branch target at PC=36; the three instructions after the branch are fetched assuming branch not taken]
Result of comparison: branch taken
Prediction is incorrect, need to flush the pipe; penalty = same as without branch prediction (3 cycles). Note: Need to make sure no instruction after beq has updated the register file or memory.
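To compare always stalling with predicting not taken, the sketch below counts the cycles lost over a sequence of branch outcomes. It assumes a fixed penalty per stalled or mispredicted branch (3 cycles here); the outcome list is made-up illustration data, and `branch_stall_cycles` is a hypothetical helper.

```python
def branch_stall_cycles(outcomes, penalty=3, use_prediction=True):
    """outcomes: list of booleans, True if the branch was actually taken."""
    if not use_prediction:
        # Always stall: every branch waits for the comparison result.
        return penalty * len(outcomes)
    # Predict not taken: only taken branches flush the wrongly fetched work.
    return penalty * sum(1 for taken in outcomes if taken)


outcomes = [False, False, True, False]  # illustrative data only
print(branch_stall_cycles(outcomes, use_prediction=False))  # 12
print(branch_stall_cycles(outcomes, use_prediction=True))   # 3
```

Prediction never costs more than stalling under this model, since a wrong guess pays the same penalty a stall would have paid anyway.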
Example of Wrong Prediction Penalty (e.g., Incorrectly Predict Branch Not Taken)
[Figure: datapath with callouts]
Pipe flushed: reset by (Branch and Zero)
No permanent change to Register File or Memory
To Reduce Branch Penalty
Let’s Take a Look at Why 3 Cycles of Penalty Are Needed: the decision is made after the execution stage
[Figure: datapath showing PC+4, Ext(imm16), and the branch address, with the 1st, 2nd, and 3rd clock delays marked]
Need a separate comparator since the ALU is computing the branch address
To Reduce Branch Penalty
But in fact all the information required to make the decision can be obtained in the “decode/operand fetch” stage. That means we can move the address calculation forward.
[Figure: datapath showing PC+4, Ext(imm16), and the branch address, with only the 1st clock delay remaining]
Pipeline After Branch Penalty Reduction
Branch Decision and Calculate Address Done in Decode Stage
[Figure: pipeline diagram with PC values 20, 24, 56, 60, 64, …; instructions shown include lw $4, 100($7), add $5, $6, $7, and sw $5, 100($7); callouts include “Predict branch not taken” and “Predict branch taken”; the flushed slot has all control signals set to 0]
Prediction is wrong, but need to flush IF/ID only
Penalty = 1 cycle, instead of 3 cycles
Flushing Pipe in the New Approach if Prediction is Wrong
[Figure: datapath with Branch Hazard Detection, shown over successive clock cycles with the control signals and addresses for beq, and, lw, add, and sw in each stage]
The and is fetched as if the branch is not taken
But Zero has decided that the branch should have been taken
The and is flushed (control signals set to 00…0)
Need Both Styles of Branch Prediction: Branch Taken and Branch Not Taken
• MIPS code that favors branch taken (branch to Loop most of the time):
Loop: mult $19,$10          # (HI, LO) regs = i * 4
      mflo $9               # reg $9 = least sig. 32 product bits
      lw   $8, Aaddr($9)    # Temporary reg $8 = A[i*4]
      add  $17,$17,$8       # g = g + A[i]
      add  $19,$19,$20      # i = i + j
      bne  $19,$18, Loop    # goto Loop if i != h
• MIPS code that favors branch not taken (do not branch to Exit most of the time):
Loop: mult $19,$10          # (HI, LO) regs = i * 4
      mflo $9               # reg $9 = least sig. 32 product bits
      lw   $8, Aaddr($9)    # Temporary reg $8 = A[i*4]
      add  $17,$17,$8       # g = g + A[i]
      add  $19,$19,$20      # i = i + j
      beq  $19,$18, Exit    # if i = h, get out of the loop
      j    Loop             # otherwise, goto Loop
Option 2: Dynamic Branch Prediction
• Rather than always assuming branch not taken, use a branch history table (also called a branch prediction buffer) to achieve better prediction
• The branch history table is implemented as a one- or two-bit register
Example: state transition of a 2-bit history table
State 00: predict taken
State 01: predict taken
State 10: predict not taken
State 11: predict not taken
[Figure: state diagram; the arcs between the four states are labeled “taken” and “not taken”]
If the branch test is in Instruction N, then:
– predict taken means PC is set to the target address by default, and set to N+4 if wrong
– predict not taken means PC is set to N+4 by default, and set to the target address if wrong
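A 2-bit history entry can be modeled in a few lines. The state-to-prediction mapping below follows the slide (00 and 01 predict taken, 10 and 11 predict not taken), but the transition arcs are an assumption: the sketch uses a common saturating-counter scheme, since the exact arcs cannot be recovered from the diagram. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """One branch-history entry: states 0..3 correspond to 00, 01, 10, 11."""

    def __init__(self, state=0):
        self.state = state

    def predict_taken(self):
        return self.state <= 1  # 00 and 01 predict taken

    def update(self, taken):
        # Assumed saturating counter: move toward 00 on taken, toward 11 on not taken.
        if taken:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(3, self.state + 1)


def count_mispredictions(outcomes, predictor=None):
    predictor = predictor or TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict_taken() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch taken 9 times and then falling through, executed 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop_outcomes))  # 3
```

Under these assumptions the loop-closing branch is mispredicted only once per pass, on the final fall-through, because a single not-taken outcome moves the entry from 00 to 01, which still predicts taken.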
Option 3: Delayed Branch
• Make use of the time while the branch decision is being made: Execute an unrelated instruction subsequent to the branch instruction
• Where To Get Instructions to Fill Branch Delay Slot? Three Strategies:
• Compiler Effectiveness for Single Branch Delay Slot:
– Fills About 60% of Branch Delay Slots
– About 80% of Instructions Executed in Branch Delay Slots Useful in Computation
– About 50% (60% x 80%) of Slots Usefully Filled
• Worst Case, Compiler Inserts NOP into Branch Delay