Branch Prediction High-Performance Computer Architecture Joe Crop Oregon State University School of Electrical Engineering and Computer Science.


Chapter 2: A Five Stage RISC Pipeline

Control Hazard

       beq r1,r3,label
       and r2,r3,r5
       or  r6,r1,r7
       add r8,r1,r9
label: xor r10,r1,r11

[Figure: pipeline diagram showing each of the five instructions passing through the Ifetch, Reg, ALU, DMem, and Reg stages.]


Branch Penalty Impact

• If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then the new CPI = 1 + 0.3 × 3 = 1.9!

• Two-part solution:
– Determine whether the branch is taken or not sooner, AND
– Compute the taken-branch address earlier

• MIPS branch tests if register = 0 or ≠ 0
– beqz R4, name

• MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for a branch versus 3


Modified MIPS Datapath

[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Labeled blocks: Instruction Memory, Register File, Sign Extend, Zero?, Adder, ALU, Data Memory, and the Next PC / Next SEQ PC muxes.]


Branch Resolved in ID Stage

       beq r1,r3,label
       and r2,r3,r5
label: xor r10,r1,r11
       …
       …

[Figure: pipeline diagram showing each instruction passing through the Ifetch, Reg, ALU, DMem, and Reg stages, with the branch resolved in the ID stage.]


Branch Prediction

• Predict Branch Not Taken
– Execute successor instructions in sequence.
– “Squash” instructions in pipeline if branch actually taken.
– 47% of MIPS branches are not taken on average.
– PC+4 is already calculated, so use it to get the next instruction.

• Predict Branch Taken
– 53% of MIPS branches are taken on average.
– But the branch target address has not been calculated yet

• MIPS still incurs 1 cycle branch penalty

• Other machines: branch target known before outcome

• Delay Branch Technique


Delay Branches

• This technique relies on software (the compiler) to make the delay slots valid and useful. The n instructions after the branch are executed regardless of whether the branch is taken.

branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken

(Sequential successors 1 to n form a branch delay of length n.)

• 1 delay slot allows proper decision and branch target address in 5 stage pipeline

• MIPS uses this.


Performance Effect of Branch Penalty

Let

$p_b$ = the probability that an instruction is a branch
$p_t$ = the probability that a branch is taken
$b$ = the branch penalty
CPI = the average number of cycles per instruction.

Then

$CPI = (1 - p_b) + p_b [p_t (1 + b) + (1 - p_t)]$

$CPI = 1 + b \, p_t \, p_b$
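As a quick check of the algebra above, the following minimal Python sketch evaluates both the expanded and the simplified form of the CPI expression. The function names and the sample values (20% branches, 65% taken, 3-cycle penalty) are illustrative assumptions, not figures taken from the slide.

```python
def cpi_expanded(p_b, p_t, b):
    """CPI as a weighted average: non-branches cost 1 cycle,
    taken branches cost 1 + b, not-taken branches cost 1."""
    return (1 - p_b) + p_b * (p_t * (1 + b) + (1 - p_t))


def cpi_simplified(p_b, p_t, b):
    """Simplified form: every taken branch adds b penalty cycles."""
    return 1 + b * p_t * p_b


# Illustrative values only (assumptions for this sketch).
p_b, p_t, b = 0.2, 0.65, 3
print(f"expanded:   {cpi_expanded(p_b, p_t, b):.3f}")
print(f"simplified: {cpi_simplified(p_b, p_t, b):.3f}")
```

Both forms should agree for any choice of the three parameters, since the second is just the first with the terms collected.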


Delay Branch Technique


Delay Branch Technique (1)

“From before”

A := B + C
If B > C Then Goto Next
[delay slot]
...
Next:

becomes

If B > C Then Goto Next
A := B + C          (in the delay slot)
...
Next:

“From target” (must be OK to execute when not taken; may need to duplicate)

Next: X := Y * Z
...
B := A + C
If B > C Then Goto Next
[delay slot]

becomes

X := Y * Z
Next: ...
...
B := A + C
If B > C Then Goto Next
X := Y * Z          (in the delay slot)

“From fall through” (must be OK to execute when taken)

B := A + C
If B > C Then Goto Next
[delay slot]
X := Y * Z
...
Next:

becomes

B := A + C
If B > C Then Goto Next
X := Y * Z          (in the delay slot)
...
Next:


Delay Branch Technique (cont.)

The performance of delayed branches can be modeled by the following equation:

$CPI = 1 + b \, p_b \, p_{nop}$

where $p_{nop}$ is the fraction of the $b$ delay slots filled with nops. Thus, if $f_i$ is the probability that delay slot $i$ is filled with a useful instruction, then

$p_{nop} = 1 - (f_1 + f_2 + \dots + f_b)/b$

Example: Suppose we have the following characteristics:

$b = 4$, $f_1 = 0.6$, $f_2 = 0.1$, $f_3 = f_4 = 0$, $p_b = 0.2$

Then $p_{nop} = 1 - 0.7/4 = 0.825$, and we have

$CPI = 1 + 4 \times 0.2 \times 0.825 = 1.66$


Delay Branch Technique (cont.)

The concept of squashing or annulling can be used in conjunction with delayed branches.

X := Y * Z
Next: ...
…
B := A + C
If B > C Then Goto Next
X := Y * Z          => this instruction is nullified

bne,a rs,rt,label

a bit      Branch outcome    Delay inst. executed?
(clear)    taken             yes
(clear)    not taken         yes
a          taken             yes
a          not taken         no (annulled)


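Delay Branch Technique (cont.)

• For processors with this capability, the performance can be modeled as

$CPI = 1 + b \, p_b [p_{nop}(1 - p_{null}) + p_{null}]$

where $p_{null} = (1 - p_t)$ for nullify-on-branch-not-taken.

• Suppose $b = 4$, $f_1 = 0.8$, $f_2 = 0.3$, $f_3 = 0.1$, $f_4 = 0$, $p_b = 0.2$, $p_{null} = 0.35$. Then $p_{nop} = 1 - 1.2/4 = 0.7$

=> $CPI = 1 + 4 \times 0.2 \times (0.7 \times 0.65 + 0.35) = 1.644$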
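The two delayed-branch CPI models can be checked numerically. The sketch below is a minimal Python rendering of the formulas on these slides; the function names `p_nop` and `cpi_delayed` are made up for illustration. Setting `p_null` to 0 gives the model without annulling.

```python
def p_nop(fill_probs):
    """Fraction of the b delay slots left as nops, given the
    probability f_i that each slot i holds a useful instruction."""
    b = len(fill_probs)
    return 1 - sum(fill_probs) / b


def cpi_delayed(p_b, fill_probs, p_null=0.0):
    """CPI = 1 + b * p_b * [p_nop * (1 - p_null) + p_null].
    With p_null = 0 this reduces to CPI = 1 + b * p_b * p_nop."""
    b = len(fill_probs)
    pn = p_nop(fill_probs)
    return 1 + b * p_b * (pn * (1 - p_null) + p_null)


# Example without annulling: b=4, f=(0.6, 0.1, 0, 0), p_b=0.2
print(round(cpi_delayed(0.2, [0.6, 0.1, 0.0, 0.0]), 3))

# Example with annulling: b=4, f=(0.8, 0.3, 0.1, 0), p_b=0.2, p_null=0.35
print(round(cpi_delayed(0.2, [0.8, 0.3, 0.1, 0.0], p_null=0.35), 3))
```

The two calls reproduce the worked examples above (1.66 and 1.644 respectively).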


Delayed Branch Performance

• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots.
– About 80% of instructions executed in branch delay slots are useful in computation.
– About 50% (60% x 80%) of slots usefully filled.


Evaluating Branch Alternatives

Suppose conditional and unconditional branches = 14% of instructions, and 65% of them change the PC

Prediction scheme    Branch penalty    CPI     Speedup v. unpipelined    Speedup v. stall
Stall pipeline       3                 1.42    3.5                       1.0
Predict taken        1                 1.14    4.4                       1.26
Predict not taken    1                 1.09    4.5                       1.29
Delayed branch       0.5               1.07    4.6                       1.31
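The CPI column can be reproduced with a few lines of Python. This is a sketch under two assumptions that the slide does not spell out: the predict-not-taken scheme pays its penalty only for the 65% of branches that change the PC, and the unpipelined machine takes 5 cycles per instruction (one per pipeline stage). The names below are illustrative.

```python
BRANCH_FREQ = 0.14     # conditional + unconditional branches
CHANGE_PC = 0.65       # fraction of branches that change the PC
PIPELINE_DEPTH = 5     # assumption: unpipelined CPI equals the stage count


def scheme_cpi(penalty, fraction_paying=1.0):
    """CPI = 1 + branch frequency * fraction paying the penalty * penalty."""
    return 1 + BRANCH_FREQ * fraction_paying * penalty


schemes = {
    "Stall pipeline": scheme_cpi(3),
    "Predict taken": scheme_cpi(1),
    "Predict not taken": scheme_cpi(1, CHANGE_PC),  # assumed: only taken branches pay
    "Delayed branch": scheme_cpi(0.5),
}

stall_cpi = schemes["Stall pipeline"]
for name, cpi in schemes.items():
    print(f"{name:18s} CPI={cpi:.2f} "
          f"v. unpipelined={PIPELINE_DEPTH / cpi:.2f} "
          f"v. stall={stall_cpi / cpi:.2f}")
```

The computed speedups may differ from the table in the last digit, because the table appears to round or truncate intermediate values.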


Reducing Branch Penalty

Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mispredicted branches

Reduce branch penalty:

1. Predict branch/jump instructions AND branch direction (taken or not taken)

2. Predict branch/jump target address (for taken branches)

3. Speculatively execute instructions along the predicted path


What to Use and What to Predict

Available info:
– Current predicted PC
– Past branch history (direction and target)

What to predict:
– Conditional branch inst: branch direction and target address
– Jump inst: target address
– Procedure call/return: target address

May need instruction pre-decoded

[Figure: block diagram with the PC feeding the instruction memory (IM) and the predictors. Labels: pred_PC, pred info, feedback, PC & Inst.]


Misprediction Detection and Feedback

Detection:
• At the end of decoding
– Target address is known at decoding and does not match the prediction
– Flush the fetch stage
• At commit (most cases)
– Wrong branch direction, or target address does not match
– Flush the whole pipeline

Feedback:
• Any time a misprediction is detected
• At a branch’s commit (updating at EXE instead is called speculative update)

[Figure: pipeline stages FETCH, RENAME, SCHD, REB/ROB, EXE, WB, and COMMIT, with feedback to the predictors.]

Branch Direction Prediction

• Predict branch direction: taken or not taken (T/NT)

    BNE R1, R2, L1      (taken: go to L1; not taken: fall through)
    …
L1: …

• Static prediction: compilers decide the direction
• Dynamic prediction: hardware decides the direction using dynamic information
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more …

Predictor for a Single Branch

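General form: a per-branch state, selected by the PC, is used in three steps:
1. Access (using the PC)
2. Predict: output T/NT
3. Feedback: actual outcome T/NT

1-bit prediction:

[Figure: two-state diagram. State 1 predicts taken and state 0 predicts not taken; a taken (T) outcome moves to state 1 and a not-taken (NT) outcome moves to state 0.]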

Branch History Table of 1-bit Predictor

The BHT is also called the branch-prediction buffer in the textbook

• Can use only one 1-bit predictor, but accuracy is low
• BHT: use a table of simple predictors, indexed by bits from the PC
• Similar to a direct-mapped cache
• More entries cost more, but give fewer conflicts and higher accuracy
• BHT can contain complex predictors

[Figure: k bits of the branch address index a table of 2^k prediction entries.]

1-bit BHT Weakness

• Example: in a loop, 1-bit BHT will cause 2 mispredictions

• Consider a loop of 9 iterations before exit:

for (…) {
    for (i=0; i<9; i++)
        a[i] = a[i] * 2.0;
}

– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it predicts exit instead of looping
– Only 80% accuracy even if loop 90% of the time
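The 80% figure can be reproduced with a small simulation. The Python sketch below assumes the loop condition is evaluated 10 times per pass (9 times "keep looping", then 1 "exit"), which matches the "loop 90% of the time" statement; the function name and the number of passes are illustrative assumptions.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Count mispredictions of a 1-bit predictor, which always
    predicts that a branch repeats its most recent outcome."""
    state = initial
    misses = 0
    for looped in outcomes:
        if state != looped:
            misses += 1
        state = looped          # remember only the last outcome
    return misses


# Assumed outcome pattern per pass through the inner loop:
# True = keeps looping (9 times), False = exits (once).
one_pass = [True] * 9 + [False]
passes = 100
outcomes = one_pass * passes

misses = one_bit_mispredictions(outcomes)
accuracy = 1 - misses / len(outcomes)
print(f"{misses} mispredictions in {len(outcomes)} branches, "
      f"accuracy {accuracy:.1%}")
```

After the first pass, every pass costs two mispredictions: one at the exit, and one when the loop is re-entered and the predictor still remembers the exit.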


2-bit Saturating Counter

• Solution: 2-bit scheme where the prediction changes only after mispredicting twice (Figure 3.7, p. 249)
• Gray: stop, not taken
• Blue: go, taken
• Adds hysteresis to the decision making process

[Figure: four-state diagram. States 11 and 10 predict taken; states 01 and 00 predict not taken; arrows between states are labeled T and NT.]
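A 2-bit saturating counter is easy to simulate. The following Python sketch assumes the common encoding in which counter values 2 and 3 (binary 10 and 11) predict taken and values 0 and 1 predict not taken, and it uses a loop branch that is taken 9 times and then not taken once; both the function name and the test pattern are illustrative assumptions.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter.
    Counter values 2-3 predict taken, 0-1 predict not taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Assumed pattern: loop branch taken 9 times, then not taken once, repeated.
outcomes = ([True] * 9 + [False]) * 100
misses = two_bit_mispredictions(outcomes)
accuracy = 1 - misses / len(outcomes)
print(f"{misses} mispredictions in {len(outcomes)} branches, "
      f"accuracy {accuracy:.1%}")
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the next pass still starts with a taken prediction. That leaves one misprediction per pass instead of the two a 1-bit predictor would incur.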


Correlating Branches

Code example showing the potential

if (d==0)
    d=1;
if (d==1)
    …

Assembly code

    BNEZ R1, L1
    DADDIU R1,R0,#1
L1: DADDIU R3,R1,#-1
    BNEZ R3, L2
    …
L2:

Observation: if BNEZ1 is not taken (d was 0, so d is set to 1 and R3 becomes 0), then BNEZ2 is not taken either

Chapter 3 - Exploiting ILP

(1, 1) Predictor

• (1,1) predictor: last branch, 1-bit prediction
• We use a pair of bits, where the first bit is the prediction if the last branch in the program was not taken, and the second bit is the prediction if the last branch was taken.

Prediction bits    Prediction if last branch not taken    Prediction if last branch taken
NT/NT              Not Taken                              Not Taken
NT/T               Not Taken                              Taken
T/NT               Taken                                  Not Taken
T/T                Taken                                  Taken


(1, 1) Predictor: Example

• Consider the following code, assuming d is assigned to R1.

if (d==0)
    d=1;
if (d==1)

    bnez R1,L1      ; branch b1 (d!=0)
    addi R1,R0,#1   ; d==0, so d=1
L1: subi R3,R1,#1
    bnez R3,L2      ; branch b2 (d!=1)
    ...
L2:

• Suppose d alternates between 2 and 0, and the (1, 1) predictor is initialized to not taken. In each pair of bits, the bit selected by the outcome of the last branch is the prediction used.

• The only mispredictions are on the first iteration, when d=2, because b1 was not correlated with the previous prediction of b2.

d=?    b1 pred    b1 action    new b1 pred    b2 pred    b2 action    new b2 pred
2      NT/NT      T            T/NT           NT/NT      T            NT/T
0      T/NT       NT           T/NT           NT/T       NT           NT/T
2      T/NT       T            T/NT           NT/T       T            NT/T
0      T/NT       NT           T/NT           NT/T       NT           NT/T


(1, 1) Predictor: Example

• If we had used a 1-bit predictor, all the branches would have been mispredicted!

d=?    b1 pred    b1 action    new b1 pred    b2 pred    b2 action    new b2 pred
2      NT         T            T              NT         T            T
0      T          NT           NT             T          NT           NT
2      NT         T            T              NT         T            T
0      T          NT           NT             T          NT           NT
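Both tables can be reproduced by simulation. The Python sketch below models branches b1 and b2 for d alternating between 2 and 0, then counts mispredictions for a plain 1-bit predictor and for a (1,1) predictor. It assumes all prediction bits and the global "last branch" history start as not taken; the function names are illustrative.

```python
def branch_trace(d_values):
    """Return the sequence of (branch, taken) pairs for the code above."""
    trace = []
    for d in d_values:
        b1_taken = d != 0            # bnez R1,L1
        if not b1_taken:
            d = 1                    # addi R1,R0,#1
        b2_taken = d != 1            # bnez R3,L2 with R3 = d - 1
        trace.append(("b1", b1_taken))
        trace.append(("b2", b2_taken))
    return trace


def one_bit_misses(trace):
    """1-bit predictor per branch, initialized to not taken."""
    pred = {"b1": False, "b2": False}
    misses = 0
    for name, taken in trace:
        if pred[name] != taken:
            misses += 1
        pred[name] = taken
    return misses


def one_one_misses(trace):
    """(1,1) predictor: each branch keeps two 1-bit predictions,
    selected by the outcome of the last executed branch."""
    pred = {"b1": [False, False], "b2": [False, False]}
    last = 0                         # assumed initial history: not taken
    misses = 0
    for name, taken in trace:
        if pred[name][last] != taken:
            misses += 1
        pred[name][last] = taken     # update only the selected bit
        last = int(taken)
    return misses


trace = branch_trace([2, 0, 2, 0])
print("1-bit mispredictions:", one_bit_misses(trace), "of", len(trace))
print("(1,1) mispredictions:", one_one_misses(trace), "of", len(trace))
```

Under these assumptions the 1-bit predictor misses all eight branches, while the (1,1) predictor misses only the two branches of the first iteration.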


(m, n) Predictor

In general, an (m,n) predictor uses the behavior of the last m branches (recorded in a shift register) to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.


Performance of (2, 2) Predictor

• Improvement is most noticeable in integer benchmarks.
• (m,n) predictor outperforms 2-bit predictor, even with unlimited entries!

[Figure: benchmark results chart, with the integer benchmarks marked.]


Tournament Predictors

• Uses multiple predictors, usually one based on local information and one based on global information.
– Local predictors are better for some branches
– Global predictors are better at utilizing correlation
• A selector, usually a 2-bit saturating counter, is used to choose among the predictors.

[Figure: selector state diagram with states 11, 10, 01, and 00. Transitions are labeled n/m, where n refers to the left predictor and m to the right predictor, and 0 means incorrect and 1 means correct.]
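The selector logic can be sketched in a few lines of Python. This is a minimal illustration, not the exact state machine in the figure: it assumes counter values 0 and 1 select predictor A, values 2 and 3 select predictor B, and the counter moves only when exactly one of the two predictors was correct. The function names are hypothetical.

```python
def choose(selector, pred_a, pred_b):
    """Return the prediction of the currently selected predictor.
    Assumed encoding: 0-1 select predictor A, 2-3 select predictor B."""
    return pred_b if selector >= 2 else pred_a


def update_selector(selector, a_correct, b_correct):
    """2-bit saturating selector update. The counter moves toward the
    predictor that was right, but only when the two predictors disagree
    in correctness (the 0/1 and 1/0 cases); otherwise it is unchanged."""
    if a_correct and not b_correct:
        return max(selector - 1, 0)
    if b_correct and not a_correct:
        return min(selector + 1, 3)
    return selector


# Start by strongly trusting predictor A, then let B win twice in a row.
selector = 0
for _ in range(2):
    selector = update_selector(selector, a_correct=False, b_correct=True)
print("selector:", selector, "uses B:", selector >= 2)
```

Because the counter saturates, a predictor that has been reliable for a long time is not abandoned after a single mistake.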


Example: Alpha 21264 Branch Predictor

The 21264 used one of the most sophisticated branch predictors of its time.

[Figure: block diagram of the 21264 predictor. Labels: last 10 outcomes of this branch, 3-bit saturating counter, 2-bit predictor, 2-bit saturating counter, last 12 outcomes of all the branches.]

Tournament Predictor in Alpha 21264

• Local predictor consists of a 2-level predictor:
– Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted
– Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction

• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
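The total-size arithmetic can be spelled out term by term. In the Python sketch below, the assignment of the two 4K*2 terms to the global predictor and to the selector is an interpretation of the figure labels (12 bits of global history index 2^12 = 4K entries), not something the slide states explicitly; the variable names are illustrative.

```python
K = 1024

# Assumed mapping of the four terms in the slide's sum:
global_predictor_bits = 4 * K * 2   # 4K 2-bit entries, indexed by global history
selector_bits = 4 * K * 2           # 4K 2-bit saturating counters (chooser)
local_history_bits = 1 * K * 10     # 1K entries of 10-bit local history
local_counter_bits = 1 * K * 3      # 1K 3-bit saturating counters

total_bits = (global_predictor_bits + selector_bits
              + local_history_bits + local_counter_bits)
print(total_bits // K, "Kbits")
```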


% of predictions from local predictor in Tournament Prediction Scheme

Benchmark    From local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%

Accuracy of Branch Prediction

[Figure 3.40: bar chart of branch prediction accuracy (0% to 100%) for gcc, espresso, li, fpppp, doduc, and tomcatv under three schemes: profile-based, 2-bit counter, and tournament.]

• Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile)

Accuracy v. Size (SPEC89)

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, and tournament.]

Power Consumption

BlueRISC’s Compiler-driven Power-Aware Branch Prediction

Comparison with 512 entry BTAC bimodal (patent-pending)

Copyright 2007 CAM & BlueRISC

Pitfall: Sometimes dumber is better

• Alpha 21264 uses a tournament predictor (29 Kbits)
• Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
• On SPEC95 benchmarks, the 21264 outperforms the 21164
– 21264 avg. 11.5 mispredictions per 1000 instructions
– 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
– 21264 avg. 17 mispredictions per 1000 instructions
– 21164 avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
• What about power?
– Large predictors give some increase in prediction rate, but at a large power cost


Branch Target Buffer

The BTB acts as a cache for branch target addresses (BTAs). This eliminates the cycles otherwise wasted on each branch to calculate the BTA.


BTB (cont.)

The BTA and the outcome of the branch are known by the end of the ID stage

…but not relayed until the EX stage


BTB (cont.)


Return Address Prediction

The BTB and BPB do a good job of predicting when future behavior repeats past behavior. However, the subroutine call/return paradigm makes correct prediction difficult.

The BTB then contains the following after the second subroutine is called:

Inst. Addr.    Target Addr.
100            500
520            104
112            500

When we return from subr, we get a hit on a valid entry in the BTB (Inst. Addr. = 520) and predict that we will return to address 104. However, this is not correct. The next instruction should be 116!


Subroutine Return Stack

In order to detect such mispredictions, a subroutine return stack can be used to augment the BTB.
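The difference between the BTB and a return stack can be seen on the call/return example with addresses 100, 112, 500, and 520. The Python sketch below is a simplified model: the BTB is a plain dictionary, the `ReturnStack` class and its depth are hypothetical, and instructions are assumed to be 4 bytes long so that a call at address X returns to X + 4.

```python
INSTR_SIZE = 4  # assumption: fixed 4-byte instructions


class ReturnStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def push_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                 # drop the oldest entry when full
        self.stack.append(call_pc + INSTR_SIZE)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


btb = {}                # instruction address -> last observed target
rs = ReturnStack()

# First call: instruction at 100 calls subr at 500; subr returns from 520.
btb[100] = 500
rs.push_call(100)
first_return = rs.predict_return()            # return stack predicts 104
btb[520] = 104                                # BTB records the return target

# Second call: instruction at 112 calls the same subr.
btb[112] = 500
rs.push_call(112)
btb_prediction = btb[520]                     # stale target from the first call
stack_prediction = rs.predict_return()        # address after the second call

print("BTB predicts:", btb_prediction)
print("Return stack predicts:", stack_prediction)
```

The BTB can only replay the last target it saw for the return instruction, while the stack pairs each return with its most recent call.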


Performance of SRS

SPEC 95

Pentium 4’s Branch Predictor

• “Unveiling the Intel Branch Predictors”
– Pentium 4
– http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026

Neural Branch Predictors

• “Towards a High Performance Neural Branch Predictor”
– http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
– The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth
– Used in IA-64 simulators

Core 2’s Branch Predictor?

• TAGE: Tagged Geometric


TAGE Performance


To Learn More

