EECS 470 Pipeline Control Hazards Lecture 5 Coverage: Chapter 3 & Appendix A
Pipeline function for BEQ
• Fetch: read instruction from memory
• Decode: read source operands from reg
• Execute: calculate target address and test for equality
• Memory: Send target to PC if test is equal
• Writeback: Nothing left to do
Control Hazards
beq 1 1 10
sub 3 4 5
time →
beq: fetch  decode  execute  memory  writeback
sub:        fetch   decode   execute
Approaches to handling control hazards
• Avoidance
– Make sure there are no hazards in the code
• Detect and Stall
– Delay fetch until the branch is resolved
• Speculate and Squash if wrong
– Go ahead and fetch more instructions in case the speculation is correct, but stop them if they shouldn’t have been executed
Handling branch hazards: avoid all hazards
• Don’t have branch instructions!
– Maybe a little impractical
• Predication can eliminate some branches
– If-conversion
– Hyperblocks
if-conversion
Source:
if (a == b) { x++; y = n / d; }

With a branch:
sub t1 a, b
jnz t1, PC+2
add x x, #1
div y n, d

With predicated instructions:
sub t1 a, b
add(t1) x x, #1
div(t1) y n, d

With conditional moves:
sub t1 a, b
add t2 x, #1
div t3 n, d
cmov(t1) x t2
cmov(t1) y t3
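The branch and conditional-move variants above can be compared with a small Python model. This is only a sketch under assumptions: the function names are hypothetical, registers are modeled as local variables, and Python's `//` stands in for `div` (the two agree for the non-negative values used here).

```python
def with_branch(a, b, x, y, n, d):
    """Branch version: skip the body when a != b."""
    t1 = a - b                  # sub t1 a, b
    if t1 != 0:                 # jnz t1, ... (skip the add and div)
        return x, y
    x = x + 1                   # add x x, #1
    y = n // d                  # div y n, d
    return x, y


def with_cmov(a, b, x, y, n, d):
    """If-converted version: always compute, commit only when a == b."""
    t1 = a - b                  # sub t1 a, b
    t2 = x + 1                  # add t2 x, #1
    t3 = n // d                 # div t3 n, d (executes even when a != b)
    x = t2 if t1 == 0 else x    # cmov(t1) x t2
    y = t3 if t1 == 0 else y    # cmov(t1) y t3
    return x, y
```

In this model both versions return the same `(x, y)` for the same inputs. One difference worth noticing: the divide in `with_cmov` always runs, so this sketch assumes `d != 0` even when `a != b`.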
Removing hazards by redefining a branch instruction
• Redefine branch instructions: ptbeq regA regB offset
– "Prepare to branch if equal"
– If (R[regA] == R[regB]), execute the instructions at PC+1, PC+2, PC+3, then continue at PC+1+offset
ptbnz example
Before:
t = 5
n = 7
g = c + 2
bnz g, PC + 1
m = 5
a = 3

After:
g = c + 2
bnz g, PC + 4
t = 5
n = 7
noop
m = 5
a = 3
Problems with this solution
• Old programs (legacy code) may not run correctly on new implementations
– Longer pipelines tend to need more noops
• Programs get larger as noops are included
– Especially a problem for machines that try to execute more than one instruction every cycle
– Harder to find useful instructions
• Program execution is slower
– CPI is one, but some instructions are noops
Handling control hazards: detect and stall
• Detection:
– Must wait until decode
– Compare opcode to beq or jalr
– Alternatively, this is just another control signal
• Stall:
– Keep current instructions in fetch
– Pass noop to decode stage (not execute!)
[Figure: pipelined datapath (PC, instruction memory, register file, ALU, data memory, sign extend, control, and the IF/ID, ID/EX, EX/Mem, Mem/WB pipeline registers) with bnz r1 in the pipeline]
[Figure: the same datapath with an additional MUX that inserts a noop]
Control Hazards
beq 1 1 10
sub 3 4 5
time →
beq:     fetch  decode  execute  memory  writeback
stall:          fetch   fetch    fetch
sub:                                     fetch
or
Target:                                  fetch
Problems with detect and stall
• CPI increases every time a branch is detected!
• Is that necessary? Not always!
– The branch is taken only about ½ of the time
• Let’s assume that it is NOT taken…
– In this case, we can ignore the beq (treat it like a noop)
– Keep fetching PC + 1
• What if we are wrong?
– OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)
Handling control hazards: speculate and squash
• Speculate: assume not equal
– Keep fetching from PC+1 until we know that the branch is really taken
• Squash: stop bad instructions if taken
– Send a noop to Decode, Execute and Memory
– Send target address to PC
[Figure: pipelined datapath executing beq, sub, add, nand; when the equal signal indicates a taken branch, three noops are inserted and the PC MUX selects the target]
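The speculate-and-squash behavior can be sketched as a tiny cycle-by-cycle model in Python. Everything here is an illustrative assumption rather than the course's simulator: the function name `simulate`, the representation of instructions as names, and the `taken` dictionary are hypothetical. The model always fetches PC+1, resolves a branch in the memory stage, and turns the three younger instructions into noops when the branch is taken.

```python
NOOP = "noop"


def simulate(program, taken, max_cycles=100):
    """Model a 5-stage pipeline that fetches PC+1 and squashes on a taken branch.

    program: list of instruction names, indexed by PC.
    taken:   dict mapping the PC of each taken branch to its target PC.
    Returns (names of completed instructions, number of cycles).
    """
    pipe = [None] * 5   # [fetch, decode, execute, memory, writeback]
    pc, cycles, completed = 0, 0, []
    while cycles < max_cycles:
        # Advance every instruction one stage and fetch the next one.
        fetched = (pc, program[pc]) if pc < len(program) else None
        pipe = [fetched] + pipe[:4]
        pc += 1
        if all(slot is None for slot in pipe):
            break       # pipeline has drained
        cycles += 1
        # Writeback: only instructions that reach this stage complete.
        wb = pipe[4]
        if wb is not None and wb[1] != NOOP:
            completed.append(wb[1])
        # Memory: a taken branch resolves here and redirects the PC.
        mem = pipe[3]
        if mem is not None and mem[0] in taken:
            # The instructions in fetch, decode and execute become noops,
            # so noops enter decode, execute and memory next cycle.
            pipe[0:3] = [(None, NOOP)] * 3
            pc = taken[mem[0]]
    return completed, cycles
```

For example, with the hypothetical program `["beq", "sub", "add", "nand", "lw"]` and `taken = {0: 4}`, only `beq` and `lw` complete, because `sub`, `add` and `nand` are squashed before writeback.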
Problems with fetching PC+1
• CPI increases every time a branch is taken!
– About ½ of the time
• Is that necessary?
No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?
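The CPI cost of detect-and-stall versus always fetching PC+1 can be estimated with a simple model. This is a rough sketch: the function names are hypothetical, the base CPI is taken to be 1, and the branch frequency is an assumed input, not a figure from the lecture.

```python
def cpi_detect_and_stall(branch_fraction, stall_cycles):
    """Every branch pays the stall, whether or not it is taken."""
    return 1.0 + branch_fraction * stall_cycles


def cpi_predict_not_taken(branch_fraction, taken_fraction, squash_penalty):
    """Only taken branches pay, one cycle per squashed instruction."""
    return 1.0 + branch_fraction * taken_fraction * squash_penalty


# Assumed example: 20% of instructions are branches (hypothetical),
# half of them are taken, and three instructions are stalled or squashed.
stall_cpi = cpi_detect_and_stall(0.2, 3)
speculate_cpi = cpi_predict_not_taken(0.2, 0.5, 3)
```

Under these assumed inputs the stall scheme comes out near 1.6 CPI and the speculative scheme near 1.3 CPI. The gap comes entirely from the not-taken branches, which cost nothing when fetching PC+1.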
[Figure: pipelined datapath extended with branch prediction at fetch; labels: beq, bpc, target, eq?, and a MUX selecting the next PC]
Branch Target Buffer
• Each entry: Fetch PC, Predicted target PC
• Send PC to BTB
• Found?
– Yes: use target
– No: use PC+1
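The BTB lookup can be sketched in Python as a table keyed by fetch PC. This is a minimal model under assumptions: the class and method names are hypothetical, the table is unbounded (real BTBs are finite and usually tagged), and removing the entry on a not-taken outcome is one possible update policy, not something the slide specifies.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}   # fetch PC -> predicted target PC

    def predict_next_pc(self, pc):
        """Found in the BTB: use the target. Not found: use PC+1."""
        return self.entries.get(pc, pc + 1)

    def update(self, pc, taken, target):
        """Record the outcome once the branch resolves."""
        if taken:
            self.entries[pc] = target
        else:
            # Assumed policy: forget the branch so the next fetch uses PC+1.
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first_guess = btb.predict_next_pc(8)    # miss, so PC+1
btb.update(8, taken=True, target=2)
second_guess = btb.predict_next_pc(8)   # hit, so the stored target
```

Because the lookup uses only the fetch PC, the prediction is available before the instruction has even been decoded as a branch.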
Branch prediction
• Predict not taken: ~50% accurate
– No BTB needed; always use PC+1
• Predict backward taken: ~65% accurate
– BTB holds targets for backward branches (loops)
• Predict same as last time: ~80% accurate
– Update BTB for any taken branch
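The three schemes above can be compared on a branch trace with a few lines of Python. This is a sketch only: the function names and the synthetic trace are hypothetical, and the resulting counts describe this toy trace, not the accuracy figures quoted on the slide.

```python
def predict_not_taken(pc, target, history):
    return False


def predict_backward_taken(pc, target, history):
    return target < pc              # backward branches are usually loops


def predict_same_as_last_time(pc, target, history):
    return history.get(pc, False)   # unseen branches default to not taken


def count_correct(predictor, trace):
    """Count correct predictions over a trace of (pc, target, taken)."""
    history, correct = {}, 0
    for pc, target, taken in trace:
        if predictor(pc, target, history) == taken:
            correct += 1
        history[pc] = taken         # remember the most recent outcome
    return correct


def loop_trace(outer=2, inner=10):
    """Synthetic trace: a usually-taken forward branch inside a loop."""
    trace = []
    for _ in range(outer):
        for i in range(inner):
            trace.append((5, 8, i != 0))           # forward branch
            trace.append((10, 2, i < inner - 1))   # backward loop branch
    return trace
```

Calling `count_correct(predict_same_as_last_time, loop_trace())` and the other two predictors on the same trace shows how strongly the ranking depends on the branch mix, so real evaluations use traces from actual programs.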
What about indirect branches?
• Could use same approach
– PC+1 is an unlikely indirect target
– Indirect jumps often have multiple targets (for the same instruction)
• Switch statements
• Virtual function calls
• Shared library (DLL) calls
Indirect jump: Special Case
• Return address stack
– Function returns have deterministic behavior (usually)
• Return to different locations (BTB doesn’t work well)
• Return location known ahead of time
– In some register at the time of the call
– Build a specialized structure for return addresses
• Call instructions write return address to R31 AND RAS
• Return instructions pop predicted target off stack
– Issues: finite size (save or forget on overflow?)
– Issues: long jumps (clear when wrong?)
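A return address stack with finite size can be sketched as follows. The class and method names are hypothetical, and the overflow policy shown (forget the oldest entry) is just one of the options the slide raises; an empty stack simply yields no prediction.

```python
class ReturnAddressStack:
    def __init__(self, size=4):
        self.size = size
        self.stack = []

    def on_call(self, return_address):
        """A call pushes its return address (it is also written to R31)."""
        if len(self.stack) == self.size:
            self.stack.pop(0)       # overflow: forget the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        """A return pops the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(size=2)
for address in (101, 205, 309):     # three nested calls, room for two
    ras.on_call(address)
predictions = [ras.on_return() for _ in range(3)]
```

Here the innermost two returns are predicted and the outermost one is lost to overflow, which is the cost of choosing "forget" over "save".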
Branch prediction
• Pentium: ~85% accurate
• Pentium Pro: ~92% accurate
• Best paper designs: ~96% accurate
Costs of branch prediction/speculation
• Performance costs?
– Minimal: no difference between waiting and squashing; and it is a huge gain when prediction is correct!
• Power?
– Large: in very long/wide pipelines many instructions can be squashed
• Squashed = # mispredictions × pipeline length/width before target resolved
• Area?
– Can be large: predictors can get very big as we will see next time
• Complexity?
– Designs are more complex
– Testing becomes more difficult
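The power cost can be put in rough numbers. The sketch below reads the squashed-instruction estimate as mispredictions times the number of stages before the target resolves times the fetch width, which is an interpretation of the slide rather than a stated formula. The function name and the example values are hypothetical.

```python
def squashed_instructions(mispredictions, stages_before_resolve, width):
    """Upper-bound estimate of wasted instructions, assuming a full pipeline."""
    return mispredictions * stages_before_resolve * width


# Hypothetical comparison for 1000 mispredictions:
short_scalar = squashed_instructions(1000, 3, 1)    # 3 stages, 1-wide
deep_wide = squashed_instructions(1000, 15, 4)      # 15 stages, 4-wide
```

With these assumed parameters the deep, wide pipeline wastes twenty times as many instructions per misprediction as the short scalar one.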
What else can be speculated?
• Dependencies
– I think this data is coming from that store instruction
• Values
– I think I will load a 0 value
• Accuracy?
– Branch prediction (direction) is Boolean (T, NT)
– Branch targets are stable or predictable (RAS)
– Dependencies are limited
– Values cover a huge space (0 – 4B)