Branch Prediction
High-Performance Computer Architecture
Joe Crop
Oregon State University
School of Electrical Engineering and Computer Science
Chapter 2: A Five Stage RISC Pipeline 2
Control Hazard
beq r1,r3,label
and r2,r3,r5
or r6,r1,r7
add r8,r1,r9
label: xor r10,r1,r11
[Figure: pipeline timing diagram showing each instruction passing through the Ifetch, Reg, ALU, DMem, and Reg stages]
Branch Penalty Impact
• If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1.9!
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
  – beqz R4, name
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for a branch versus 3
Modified MIPS Datapath
[Figure: modified MIPS datapath across the Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, and Write Back stages, with the Zero? test and an adder for the next PC placed in the decode stage]
Branch Resolved in ID Stage
beq r1,r3,label
and r2,r3,r5
Label: xor r10,r1,r11
…
…
[Figure: pipeline timing diagram (Ifetch, Reg, ALU, DMem, Reg) for each instruction when the branch is resolved in the ID stage]
Branch Prediction
• Predict Branch Not Taken
  – Execute successor instructions in sequence.
  – “Squash” instructions in the pipeline if the branch is actually taken.
  – 47% of MIPS branches are not taken on average.
  – PC+4 is already calculated, so use it to get the next instruction.
• Predict Branch Taken
  – 53% of MIPS branches are taken on average.
  – But the branch target address has not been calculated yet
• MIPS still incurs 1 cycle branch penalty
• Other machines: branch target known before outcome
• Delay Branch Technique
Delay Branches
• This technique relies on software to fill the delay slots with valid and useful instructions. The n instructions after the branch are executed regardless of whether the branch is taken.

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (successors 1 to n: branch delay of length n)
    branch target if taken

• 1 delay slot allows the proper decision and branch target address in a 5-stage pipeline
• MIPS uses this.
Performance Effect of Branch Penalty
Let
$p_b$ = the probability that an instruction is a branch
$p_t$ = the probability that a branch is taken
$b$ = the branch penalty
CPI = the average number of cycles per instruction.
Then
$CPI = (1 - p_b) + p_b\,[\,p_t(1 + b) + (1 - p_t)\,]$
$CPI = 1 + b\,p_t\,p_b$
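The two forms of the CPI expression can be checked numerically. The following is a minimal sketch; the function names and the sample workload (20% branches, 60% of them taken, 3-cycle penalty) are illustrative assumptions, not values from the slides.

```python
def branch_cpi(p_b: float, p_t: float, b: float) -> float:
    """Expanded form: non-branches cost 1 cycle, taken branches cost 1 + b."""
    return (1 - p_b) + p_b * (p_t * (1 + b) + (1 - p_t))


def branch_cpi_simplified(p_b: float, p_t: float, b: float) -> float:
    """Simplified form: CPI = 1 + b * p_t * p_b."""
    return 1 + b * p_t * p_b


# Hypothetical workload: 20% branches, 60% of them taken, 3-cycle penalty.
cpi_expanded = branch_cpi(0.2, 0.6, 3)
cpi_simple = branch_cpi_simplified(0.2, 0.6, 3)

# Stalling on every branch corresponds to p_t = 1 (30% branches, 3-cycle stall).
cpi_stall = branch_cpi_simplified(0.3, 1.0, 3)
```

Setting p_t = 1 reproduces the "stall on every branch" case from the Branch Penalty Impact slide.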
Delay Branch Technique
Delay Branch Technique (1)
      A := B + C
      If B > C Then Goto Next
      <delay slot>
      ...
Next:

becomes

      If B > C Then Goto Next
      A := B + C          (fills the delay slot)
      ...
Next:

“From before”
Delay Branch Technique (2)
Next: X := Y * Z
      ...
      B := A + C
      If B > C Then Goto Next
      <delay slot>

becomes

      X := Y * Z
Next: ...
      ...
      B := A + C
      If B > C Then Goto Next
      X := Y * Z          (fills the delay slot)

“From target”
• Must be OK to execute when the branch is not taken
• May need to duplicate the instruction
Delay Branch Technique (3)
      B := A + C
      If B > C Then Goto Next
      <delay slot>
      X := Y * Z
      ...
Next:

becomes

      B := A + C
      If B > C Then Goto Next
      X := Y * Z          (fills the delay slot)
      ...
Next:

“From fall through”
• Must be OK to execute when the branch is taken
Delay Branch Technique (cont.)
The performance of delay branches can be modeled by the following equation:
$CPI = 1 + b\,p_b\,p_{nop}$
where $p_{nop}$ is the fraction of the $b$ delay slots filled with nops. Thus, if $f_i$ is the probability that delay slot $i$ is filled with a useful instruction, then
$p_{nop} = 1 - (f_1 + f_2 + \dots + f_b)/b$
Example: Suppose we have the following characteristics:
$b = 4,\ f_1 = 0.6,\ f_2 = 0.1,\ f_3 = f_4 = 0,\ p_b = 0.2$
We have
$CPI = 1 + 4 \times 0.2 \times 0.825 = 1.66$
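The worked example can be reproduced with a few lines of code. This is a sketch only; the helper names `nop_fraction` and `delayed_branch_cpi` are made up for illustration.

```python
def nop_fraction(fill_probs):
    """p_nop = 1 - (f_1 + ... + f_b) / b, where b is the number of delay slots."""
    b = len(fill_probs)
    return 1 - sum(fill_probs) / b


def delayed_branch_cpi(p_b, fill_probs):
    """CPI = 1 + b * p_b * p_nop for a machine with b delay slots."""
    b = len(fill_probs)
    return 1 + b * p_b * nop_fraction(fill_probs)


# Values from the example: b = 4, f = (0.6, 0.1, 0, 0), p_b = 0.2.
fills = [0.6, 0.1, 0.0, 0.0]
p_nop = nop_fraction(fills)
cpi = delayed_branch_cpi(0.2, fills)
```

With these inputs, `p_nop` evaluates to 0.825 and `cpi` to 1.66, matching the slide.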
Delay Branch Technique (cont.)
The concept of squashing or annulling can be used in conjunction with delay branches.

      X := Y * Z
Next: ...
      …
      B := A + C
      If B > C Then Goto Next
      X := Y * Z          => This instruction is nullified

bne,a rs,rt,label

a bit      Branch outcome    Delay inst. executed?
(not set)  taken             yes
(not set)  not taken         yes
a          taken             yes
a          not taken         no (annulled)
Delay Branch Technique (cont.)
• For processors with this capability, the performance can be modeled as
$CPI = 1 + b\,p_b\,[\,p_{nop}(1 - p_{null}) + p_{null}\,]$
where $p_{null} = (1 - p_t)$ for nullify-on-branch-not-taken.
• Suppose $b = 4,\ f_1 = 0.8,\ f_2 = 0.3,\ f_3 = 0.1,\ f_4 = 0,\ p_b = 0.2,\ p_{null} = 0.35$
=> CPI = 1.644
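The 1.644 figure follows from the model in two steps: first the nop fraction, then the fraction of delay-slot cycles that do no useful work. A small sketch, with hypothetical variable and function names:

```python
def annulling_cpi(p_b, fill_probs, p_null):
    """CPI = 1 + b * p_b * [p_nop * (1 - p_null) + p_null]."""
    b = len(fill_probs)
    p_nop = 1 - sum(fill_probs) / b
    # A slot is wasted if it is annulled, or if it is not annulled but holds a nop.
    wasted = p_nop * (1 - p_null) + p_null
    return 1 + b * p_b * wasted


# Values from the slide: f = (0.8, 0.3, 0.1, 0), p_b = 0.2, p_null = 0.35.
# Intermediate values: p_nop = 1 - 1.2/4 = 0.7, wasted = 0.7 * 0.65 + 0.35 = 0.805.
cpi = annulling_cpi(0.2, [0.8, 0.3, 0.1, 0.0], 0.35)
```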
Delayed Branch Performance
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots.
  – About 80% of instructions executed in branch delay slots are useful in computation.
  – About 50% (60% x 80%) of slots are usefully filled.
Evaluating Branch Alternatives
Suppose conditional and unconditional branches = 14% of instructions, and 65% of them change the PC

Prediction scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mis-predicted branches
Reduce branch penalty:
1. Predict branch/jump instructions AND branch direction (taken or not taken)
2. Predict branch/jump target address (for taken branches)
3. Speculatively execute instructions along the predicted path
What to Use and What to Predict
Available info:
  – Current predicted PC
  – Past branch history (direction and target)
What to predict:
  – Conditional branch inst: branch direction and target address
  – Jump inst: target address
  – Procedure call/return: target address
Instructions may need to be pre-decoded
[Figure: predictors placed alongside the instruction memory (IM), taking the PC as input and producing pred_PC, with pred info feedback and PC & inst fed back to the predictors]
Mis-prediction Detections and Feedbacks
Detections:
• At the end of decoding
  – Target address is known at decoding and does not match the prediction
  – Flush the fetch stage
• At commit (most cases)
  – Wrong branch direction, or target address does not match
  – Flush the whole pipeline
Feedbacks:
• Any time a mis-prediction is detected
• At a branch’s commit (if done at EXE, it is called speculative update)
[Figure: pipeline stages FETCH, RENAME, SCHD, REB/ROB, EXE, WB, and COMMIT, with feedback paths to the predictors]
Branch Direction Prediction
• Predict branch direction: taken or not taken (T/NT)
• Static prediction: compilers decide the direction
• Dynamic prediction: hardware decides the direction using dynamic information
  1. 1-bit Branch-Prediction Buffer
  2. 2-bit Branch-Prediction Buffer
  3. Correlating Branch Prediction Buffer
  4. Tournament Branch Predictor
  5. and more …

      BNE R1, R2, L1
      …               (not taken: execution falls through)
L1:   …               (taken: execution continues here)
Predictor for a Single Branch
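[Figure: general form of a predictor for a single branch: (1) the PC accesses the predictor state, (2) the state produces the predicted output T/NT, (3) the actual outcome T/NT is fed back to update the state. For 1-bit prediction, state 1 predicts taken and state 0 predicts not taken; a T outcome moves the predictor to state 1 and an NT outcome moves it to state 0.]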
Branch History Table of 1-bit Predictor
The BHT is also called the Branch Prediction Buffer in the textbook
• Can use only one 1-bit predictor, but accuracy is low
• BHT: use a table of simple predictors, indexed by bits from the PC
• Similar to a direct-mapped cache
• More entries cost more, but give fewer conflicts and higher accuracy
• BHT can contain complex predictors
[Figure: k bits of the branch address index a table of 2^k predictors, and the selected entry supplies the prediction]
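The indexing scheme can be sketched as a small table of 1-bit predictors. The class name, the 4-byte instruction size, and the sample addresses below are assumptions for illustration; because the table has no tags, two branches whose PCs share the same k index bits share one entry.

```python
class BranchHistoryTable:
    """Table of 2**k one-bit predictors indexed by low-order PC bits (no tags)."""

    def __init__(self, k: int):
        self.mask = (1 << k) - 1
        self.table = [0] * (1 << k)  # 0 = predict not taken, 1 = predict taken

    def index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so the two lowest PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self.index(pc)] == 1

    def update(self, pc: int, taken: bool) -> None:
        self.table[self.index(pc)] = 1 if taken else 0


bht = BranchHistoryTable(k=4)
bht.update(0x1000, taken=True)

# 0x1040 maps to the same entry as 0x1000, so it inherits that prediction.
aliased_prediction = bht.predict(0x1040)
# 0x1004 maps to a different entry, which still holds its initial value.
other_prediction = bht.predict(0x1004)
```

Increasing k reduces this kind of conflict at the cost of a larger table, which is the trade-off the slide describes.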
1-bit BHT Weakness
• Example: in a loop, a 1-bit BHT will cause 2 mispredictions
• Consider a loop of 9 iterations before exit:

    for (…) {
        for (i=0; i<9; i++)
            a[i] = a[i] * 2.0;
    }

  – End-of-loop case, when it exits instead of looping as before
  – First time through the loop on the next time through the code, when it predicts exit instead of looping
  – Only 80% accuracy even if the branch loops 90% of the time
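A short simulation makes the 80% figure concrete. As an assumption that follows the slide's 90% figure, the loop branch is modeled as taken nine times and then not taken once per execution of the inner loop, and the predictor starts in the not-taken state, as it would after a previous loop exit.

```python
def one_bit_accuracy(outcomes, state=False):
    """Simulate a single 1-bit predictor; `state` is the current prediction."""
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken  # a 1-bit predictor remembers only the last outcome
    return correct / len(outcomes)


# Ten executions of the inner loop: 9 taken branches, then 1 not taken (exit).
loop_branch = ([True] * 9 + [False]) * 10
accuracy = one_bit_accuracy(loop_branch)
```

Each pass through the loop mispredicts twice (the first iteration and the exit), so 8 of every 10 predictions are correct.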
2-bit Saturating Counter
• Solution: a 2-bit scheme that changes the prediction only after mispredicting twice (Figure 3.7, p. 249)
• Gray: stop, not taken
• Blue: go, taken
• Adds hysteresis to the decision-making process
[Figure: state diagram of the 2-bit scheme. States 11 and 10 predict taken; states 01 and 00 predict not taken. Edges are labeled with the branch outcome, T or NT.]
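The effect of the hysteresis can be seen by running a 2-bit saturating counter on the loop branch from the 1-bit BHT Weakness slide. This is a sketch under stated assumptions: a plain saturating counter (which may differ in detail from the textbook's state diagram), a branch pattern of nine taken outcomes followed by one not taken, and an initial state of 11.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Simulate one 2-bit saturating counter; values 2 and 3 predict taken."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Saturate at 0 and 3 instead of wrapping around.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Ten executions of the loop: 9 taken branches, then 1 not taken (exit).
loop_branch = ([True] * 9 + [False]) * 10
accuracy = two_bit_accuracy(loop_branch)
```

The single not-taken outcome at loop exit only moves the counter from 11 to 10, so the first iteration of the next pass is still predicted taken and only the exit is mispredicted.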
Correlating Branches
Code example showing the potential:

    if (d==0)
        d=1;
    if (d==1)
        …

Assembly code:

        BNEZ   R1, L1
        DADDIU R1,R0,#1
L1:     DADDIU R3,R1,#-1
        BNEZ   R3, L2
L2:
        …

Observation: if the first BNEZ is not taken, then the second BNEZ is also not taken
Chapter 3 - Exploiting ILP 27
(1, 1) Predictor
• (1,1) predictor: last branch, 1-bit prediction
• We use a pair of bits, where the first bit is the prediction if the last branch in the program was not taken, and the second bit is the prediction if the last branch was taken.

Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             Not Taken                             Not Taken
NT/T              Not Taken                             Taken
T/NT              Taken                                 Not Taken
T/T               Taken                                 Taken
(1, 1) Predictor: Example
• Consider the following code, assuming d is assigned to R1.

    if (d==0)
        d=1;
    if (d==1)

        bnez R1,L1      ; branch b1 (d!=0)
        addi R1,R0,#1   ; d==0, so d=1
L1:     subi R3,R1,#1
        bnez R3,L2      ; branch b2 (d!=1)
        ...
L2:

• Suppose d alternates between 2 and 0, with the (1, 1) predictor initialized to not taken. Brackets mark the bit used for each prediction.
• The only mispredictions are on the first iteration, when d=2, because b1 was not correlated with the previous prediction of b2

d=?   b1 pred    b1 action   new b1 pred   b2 pred    b2 action   new b2 pred
2     [NT]/NT    T           T/NT          NT/[NT]    T           NT/T
0     T/[NT]     NT          T/NT          [NT]/T     NT          NT/T
2     [T]/NT     T           T/NT          NT/[T]     T           NT/T
0     T/[NT]     NT          T/NT          [NT]/T     NT          NT/T
(1, 1) Predictor: Example
• If we had used a 1-bit predictor
• We would have had all the branches mispredicted!

d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
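Both tables can be reproduced by a short simulation of the two branches b1 and b2. The function below is an illustrative sketch: the name `count_mispredictions` is made up, and the global history is assumed to start as "not taken".

```python
def count_mispredictions(correlating: bool, d_values=(2, 0, 2, 0)) -> int:
    """Run b1 and b2 for each d and count wrong predictions.

    Each branch keeps two 1-bit predictions: index 0 is used when the last
    branch was not taken, index 1 when it was taken. A plain 1-bit predictor
    is modeled by always using index 0.
    """
    bits = {"b1": [False, False], "b2": [False, False]}  # initialized to NT
    last_taken = False  # assumption: history starts as "not taken"
    mispredictions = 0
    for d in d_values:
        b1_taken = d != 0          # bnez R1,L1
        if not b1_taken:
            d = 1                  # fall-through path: d = 1
        b2_taken = d != 1          # bnez R3,L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken) if correlating else 0
            if bits[name][slot] != taken:
                mispredictions += 1
            bits[name][slot] = taken
            last_taken = taken
    return mispredictions


correlating_misses = count_mispredictions(correlating=True)
one_bit_misses = count_mispredictions(correlating=False)
```

Over the four values of d (eight branch executions), the (1,1) predictor mispredicts only the two branches of the first iteration, while the 1-bit predictor mispredicts all eight.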
(m, n) Predictor
In general, an (m,n) predictor uses the behavior of the last m branches (recorded in a shift register) to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
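The general scheme can be sketched as a table whose rows are selected by PC bits and whose columns are selected by the m-bit global history. Everything named here (the class, `index_bits`, the 4-byte instruction assumption, the test pattern) is hypothetical and chosen only to keep the example small.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history pick one of 2**m n-bit counters."""

    def __init__(self, m: int, n: int, index_bits: int = 4):
        self.m = m
        self.history = 0                         # m-bit global shift register
        self.max_count = (1 << n) - 1            # n-bit saturating counter limit
        self.index_mask = (1 << index_bits) - 1
        self.counters = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc: int):
        # Assumption: 4-byte instructions, so drop the two lowest PC bits.
        return self.counters[(pc >> 2) & self.index_mask]

    def predict(self, pc: int) -> bool:
        # Predict taken when the counter is in the upper half of its range.
        return self._row(pc)[self.history] > self.max_count // 2

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def run(predictor, pc, outcomes) -> int:
    """Return the number of mispredictions over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 20
misses_1_2 = run(CorrelatingPredictor(m=1, n=2), 0x40, alternating)
misses_0_2 = run(CorrelatingPredictor(m=0, n=2), 0x40, alternating)
```

With m = 0 the structure degenerates to a plain table of n-bit counters, which mispredicts every taken outcome of a strictly alternating branch; one bit of history is enough to learn that pattern after a short warm-up.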
Performance of (2, 2) Predictor
• Improvement is most noticeable in integer benchmarks.
• An (m,n) predictor outperforms a 2-bit predictor, even one with unlimited entries!
[Figure: predictor comparison chart, with the integer benchmarks marked]
Tournament Predictors
• Uses multiple predictors, usually one based on local information and one based on global information.
  – Local predictors are better for some branches
  – Global predictors are better at utilizing correlation
• A selector is used to choose among the predictors, usually a 2-bit saturating counter.
[Figure: state diagram of the 2-bit selector with states 11, 10, 01, and 00. Each edge is labeled n/m, where n refers to the left predictor and m to the right predictor; 0 means incorrect and 1 means correct.]
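The selector logic can be sketched as a 2-bit saturating counter that moves toward whichever predictor was right when the two disagree. The class name, the encoding (0 and 1 favor predictor A, 2 and 3 favor predictor B), and the initial value are assumptions for illustration.

```python
class TournamentSelector:
    """2-bit saturating selector: 0-1 favor predictor A, 2-3 favor predictor B."""

    def __init__(self):
        self.counter = 0  # assumption: start by strongly favoring A

    def choose(self, pred_a: bool, pred_b: bool) -> bool:
        return pred_b if self.counter >= 2 else pred_a

    def update(self, pred_a: bool, pred_b: bool, taken: bool) -> None:
        a_ok, b_ok = pred_a == taken, pred_b == taken
        if b_ok and not a_ok:
            self.counter = min(self.counter + 1, 3)
        elif a_ok and not b_ok:
            self.counter = max(self.counter - 1, 0)
        # If both are right or both are wrong, the selector stays put.


selector = TournamentSelector()
# Hypothetical case: A always says not taken, B always says taken, branch is taken.
first_choice = selector.choose(False, True)
selector.update(False, True, taken=True)
selector.update(False, True, taken=True)
later_choice = selector.choose(False, True)
```

After B has been right twice while A was wrong, the selector switches to B.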
Example: Alpha 21264 Branch Predictor
The 21264 uses one of the most sophisticated branch predictors of its time.
[Figure: 21264 predictor block diagram, labeled: last 10 outcomes of this branch, 3-bit saturating counter, 2-bit predictor, 2-bit saturating counter, last 12 outcomes of all the branches]
Tournament Predictor in Alpha 21264
• The local predictor is a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
[Figure: local history table (1K x 10 bits) indexing the local prediction table (1K x 3 bits)]
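The two-level local predictor can be sketched as two tables: per-branch histories, and counters indexed by a history value. This is a simplified model, not the 21264's actual implementation; the PC indexing, the "upper half predicts taken" threshold, and the test pattern are assumptions.

```python
class LocalTwoLevelPredictor:
    ENTRIES = 1024       # 1K entries in both tables (from the slide)
    HISTORY_BITS = 10    # outcomes remembered per branch
    COUNTER_MAX = 7      # 3-bit saturating counter

    def __init__(self):
        self.history = [0] * self.ENTRIES    # level 1: local history table
        self.counters = [0] * self.ENTRIES   # level 2: indexed by a history

    def _slot(self, pc: int) -> int:
        # Assumption: index level 1 with PC bits above the 2-bit byte offset.
        return (pc >> 2) % self.ENTRIES

    def predict(self, pc: int) -> bool:
        h = self.history[self._slot(pc)]
        # Assumption: counter values 4..7 predict taken.
        return self.counters[h] >= 4

    def update(self, pc: int, taken: bool) -> None:
        slot = self._slot(pc)
        h = self.history[slot]
        c = self.counters[h]
        self.counters[h] = min(c + 1, self.COUNTER_MAX) if taken else max(c - 1, 0)
        self.history[slot] = ((h << 1) | int(taken)) & ((1 << self.HISTORY_BITS) - 1)


# A branch with a repeating taken, taken, not-taken pattern (hypothetical).
predictor = LocalTwoLevelPredictor()
pattern = [True, True, False] * 20
late_misses = 0
for i, taken in enumerate(pattern):
    if i >= 30 and predictor.predict(0x400) != taken:
        late_misses += 1
    predictor.update(0x400, taken)

# Storage budget from the slide, in bits.
total_bits = 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3
```

Once the 10-bit history has filled and the counters have warmed up, each history value is always followed by the same outcome, so the repeating pattern is predicted without further misses.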
% of predictions from local predictor in Tournament Prediction Scheme

Benchmark    Predictions from local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%
Accuracy of Branch Prediction
[Figure 3.40: branch prediction accuracy (0% to 100%) on gcc, espresso, li, fpppp, doduc, and tomcatv for three schemes: profile-based, 2-bit counter, and tournament]
• Profile: branch profile from the last execution (static in that it is encoded in the instruction, but based on a profile)
Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local 2-bit counters, the correlating (2,2) scheme, and the tournament predictor]
Power Consumption
BlueRISC’s Compiler-driven Power-Aware Branch Prediction
Comparison with 512-entry BTAC bimodal (patent-pending)
Copyright 2007 CAM & BlueRISC
Pitfall: Sometimes dumber is better
• Alpha 21264 uses a tournament predictor (29 Kbits)
• Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
• On SPEC95 benchmarks, the 21264 outperforms the 21164
  – 21264 avg. 11.5 mispredictions per 1000 instructions
  – 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
  – 21264 avg. 17 mispredictions per 1000 instructions
  – 21164 avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
• What about power?
  – Large predictors give some increase in prediction rate, but at a large power cost
Branch Target Buffer
The BTB acts as a cache for branch target addresses (BTAs). This eliminates the cycles otherwise wasted on each branch to calculate the BTA.
BTB (cont.)
The BTA and the outcome of the branch are known by the end of the ID stage
…but not relayed until the EX stage
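The caching behavior of a BTB can be sketched with a dictionary keyed by branch PC. This is a simplified model under stated assumptions: unlimited capacity, 4-byte instructions, and a policy of keeping only branches that were last taken; the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps a branch's PC to its predicted target address."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target address

    def next_fetch_pc(self, pc: int) -> int:
        # Hit: fetch from the cached target. Miss: fall through to pc + 4.
        return self.entries.get(pc, pc + 4)

    def update(self, pc: int, taken: bool, target: int) -> None:
        if taken:
            self.entries[pc] = target
        else:
            # Assumption: entries for branches that turn out not taken are dropped.
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
miss_pc = btb.next_fetch_pc(100)          # no entry yet: fall through
btb.update(100, taken=True, target=500)
hit_pc = btb.next_fetch_pc(100)           # hit: predicted target is available at fetch
```

Because the lookup uses only the fetch PC, the predicted target is available in the fetch stage, before the branch has been decoded.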
BTB (cont.)
Return Address Prediction
The BTB and BPB do a good job of predicting how future behavior will repeat. However, the subroutine call/return paradigm makes correct prediction difficult.
The BTB then contains the following after the subroutine is called for the second time:

Inst. Addr.   Target Addr.
100           500
520           104
112           500

When we return from subr, we get a hit on a valid entry in the BTB (Inst. Addr. = 520) and predict that we will return to address 104. However, this is not correct. The next instruction should be 116!
Subroutine Return Stack
In order to avoid such mispredictions, a subroutine return stack can be used to augment the BTB.
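The difference between the two predictions for the return at address 520 can be shown with the BTB contents listed in the table. The sketch below assumes 4-byte instructions and uses hypothetical helper names `call` and `predict_return`.

```python
# BTB contents after the second call, taken from the table.
btb = {100: 500, 520: 104, 112: 500}
return_stack = []


def call(pc: int, target: int) -> int:
    """On a call, push the address of the instruction after the call."""
    return_stack.append(pc + 4)  # assumption: 4-byte instructions
    return target


def predict_return(pc: int) -> int:
    """On a return, prefer the stack; fall back to the BTB if it is empty."""
    if return_stack:
        return return_stack.pop()
    return btb.get(pc, pc + 4)


call(112, 500)                        # second call to subr
btb_prediction = btb[520]             # stale target left by the first call
srs_prediction = predict_return(520)  # address pushed by the most recent call
```

The BTB alone predicts 104, the return address of the first call, while the return stack supplies 116.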
Performance of SRS
[Figure: SRS performance on SPEC 95]
Pentium 4’s Branch Predictor
• “Unveiling the Intel Branch Predictors”
  – Pentium 4
  – http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026
Neural Branch Predictors
• “Towards a High Performance Neural Branch Predictor”
  – http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
  – The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth
  – Used in IA-64 simulators
Core 2’s Branch Predictor?
• TAGE: TAgged GEometric history length predictor
TAGE Performance
To Learn More