Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14)

Dec 21, 2015

Lecture 8: Instruction Fetch, ILP Limits

• Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14)

1-Bit Prediction

• For each branch, keep track of what happened last time and use that outcome as the prediction

• What are prediction accuracies for branches 1 and 2 below:

    while (1) {
        for (i=0; i<10; i++) {
            branch-1 …
        }
        for (j=0; j<20; j++) {
            branch-2 …
        }
    }
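The question above can be checked with a short simulation. The sketch below assumes, since the slide does not say so, that branch-1 and branch-2 are the loop-closing branches of the two `for` loops and that each is taken on every iteration except the last one of a pass. The function names are illustrative only.

```python
def loop_branch_outcomes(trip_count, passes):
    """Assumed pattern per pass: taken trip_count-1 times, then not taken once."""
    return ([True] * (trip_count - 1) + [False]) * passes


def one_bit_accuracy(outcomes, initial_prediction=True):
    """Predict that a branch repeats whatever it did last time."""
    last_outcome = initial_prediction
    correct = 0
    for taken in outcomes:
        if last_outcome == taken:
            correct += 1
        last_outcome = taken  # 1-bit state: remember only the latest outcome
    return correct / len(outcomes)


branch1 = one_bit_accuracy(loop_branch_outcomes(10, passes=1000))
branch2 = one_bit_accuracy(loop_branch_outcomes(20, passes=1000))
print(f"branch-1: {branch1:.2f}, branch-2: {branch2:.2f}")
```

Under these assumptions each pass costs two mispredictions (one at loop exit, one on re-entry), so the accuracies come out near 80% for branch-1 and 90% for branch-2.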

2-Bit Prediction

• For each branch, maintain a 2-bit saturating counter:
    if the branch is taken: counter = min(3, counter+1)
    if the branch is not taken: counter = max(0, counter-1)

• If (counter >= 2), predict taken, else predict not taken

• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)

• Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)

• Can be easily extended to N-bits (in most processors, N=2)
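The counter rules on this slide translate almost directly into code. The following is a minimal sketch; the starting counter value of 3 and the loop-branch pattern (9 taken, then 1 not taken) are assumptions chosen for illustration.

```python
def update_counter(counter, taken):
    """Saturating update: move toward 3 when taken, toward 0 when not taken."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


def predict_taken(counter):
    """Predict taken when the counter is in its upper half."""
    return counter >= 2


def two_bit_accuracy(outcomes, counter=3):
    correct = 0
    for taken in outcomes:
        if predict_taken(counter) == taken:
            correct += 1
        counter = update_counter(counter, taken)
    return correct / len(outcomes)


# Assumed loop branch: taken 9 times, then not taken once, repeated.
loop_outcomes = ([True] * 9 + [False]) * 1000
accuracy = two_bit_accuracy(loop_outcomes)
print(f"2-bit accuracy: {accuracy:.2f}")
```

With this pattern the single not-taken outcome only drops the counter from 3 to 2, so the next pass still starts with a correct "taken" prediction and only the loop exit is mispredicted.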

Correlating Predictors

• Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters) – captures the recent “common case” for each branch

• Can we take advantage of additional information?
    If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
    If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?

Hence, build correlating predictors

Local/Global Predictors

• Instead of maintaining a counter for each branch to capture the common case, maintain a counter for each branch and surrounding pattern

• If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor

• If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor

Global Predictor

• A single register keeps track of recent history for all branches

[Figure: an 8-bit global history register (e.g. 00110101) and 6 bits of the branch PC together index a table of 16K entries of 2-bit saturating counters]

• Also referred to as a two-level predictor
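A global predictor with the sizes shown on this slide (8 history bits, 6 branch PC bits, 16K counters) might be sketched as follows. Which PC bits are used, how the two fields are combined, and the initial counter value are not given in the slide; the choices below are assumptions, and `GlobalPredictor` is an illustrative name.

```python
HISTORY_BITS = 8
PC_BITS = 6
TABLE_ENTRIES = 1 << (HISTORY_BITS + PC_BITS)  # 2^14 = 16K counters


class GlobalPredictor:
    def __init__(self):
        self.history = 0                      # shared by all branches
        self.counters = [1] * TABLE_ENTRIES   # assumed start: weakly not taken

    def _index(self, pc):
        # Assumption: drop 2 byte-offset bits, then keep the low 6 PC bits.
        pc_bits = (pc >> 2) & ((1 << PC_BITS) - 1)
        return (self.history << PC_BITS) | pc_bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history register.
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# A branch that strictly alternates defeats a plain 2-bit counter,
# but the history pattern makes it predictable here.
predictor = GlobalPredictor()
results = []
for i in range(1000):
    taken = (i % 2 == 0)
    results.append(predictor.predict(0x400) == taken)
    predictor.update(0x400, taken)
late_accuracy = sum(results[-100:]) / 100
```

After a short warm-up the two alternating history patterns each train their own counter, so the late predictions in this toy run are all correct.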

Local Predictor

[Figure: 6 bits of the branch PC index into a local history table of 64 entries, each holding a 14-bit history for a single branch (e.g. 10110111011001); the 14-bit history indexes into the next level, a table of 16K entries of 2-bit saturating counters]

• Also a two-level predictor that only uses local histories at the first level
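The two levels of the local predictor can be sketched in the same style, using the sizes from this slide (64 histories of 14 bits, 16K counters). The PC-bit selection, the initial values, and the name `LocalPredictor` are assumptions made for illustration.

```python
LOCAL_ENTRIES = 64
LOCAL_HISTORY_BITS = 14


class LocalPredictor:
    def __init__(self):
        self.histories = [0] * LOCAL_ENTRIES               # level 1
        self.counters = [1] * (1 << LOCAL_HISTORY_BITS)    # level 2, 16K entries

    def _slot(self, pc):
        # Assumption: drop 2 byte-offset bits, then use 6 PC bits.
        return (pc >> 2) % LOCAL_ENTRIES

    def predict(self, pc):
        history = self.histories[self._slot(pc)]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.histories[slot]
        c = self.counters[history]
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into this branch's own history.
        mask = (1 << LOCAL_HISTORY_BITS) - 1
        self.histories[slot] = ((history << 1) | int(taken)) & mask


# Assumed loop branch: taken 9 times, then not taken once, repeated.
predictor = LocalPredictor()
results = []
for taken in ([True] * 9 + [False]) * 200:
    results.append(predictor.predict(0x400) == taken)
    predictor.update(0x400, taken)
late_accuracy = sum(results[-200:]) / 200
```

Because a 14-bit history is longer than the 10-outcome period of this pattern, each history value identifies the position within the loop, and even the loop exit is predicted correctly once the counters are trained.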

Tournament Predictors

• A local predictor might work well for some branches or programs, while a global predictor might work well for others

• Provide one of each and maintain another predictor to identify which predictor is best for each branch

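[Figure: the branch PC indexes the tournament predictor, a table of 2-bit saturating counters, whose output drives a MUX that selects between the local predictor and the global predictor]

Alpha 21264:
    Local predictor: 1K entries in level-1, 1K entries in level-2
    Global predictor: 4K entries, 12-bit global history
    Tournament predictor: 4K entries

Total capacity: ?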
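The selection logic can be sketched as a table of 2-bit counters that is trained only when the two component predictors disagree. This is a common way to build such a chooser, not necessarily the exact Alpha 21264 update rule; the indexing, the initial counter value, and the name `TournamentChooser` are assumptions.

```python
class TournamentChooser:
    def __init__(self, entries=4096):
        self.mask = entries - 1
        # Assumed encoding: counter >= 2 means "trust the global predictor".
        self.counters = [2] * entries

    def _index(self, pc):
        return (pc >> 2) & self.mask  # assumption: drop 2 byte-offset bits

    def select(self, pc, local_pred, global_pred):
        """Act as the MUX: pass through the prediction of the trusted component."""
        if self.counters[self._index(pc)] >= 2:
            return global_pred
        return local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return  # no information about which component is better
        i = self._index(pc)
        if global_pred == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A branch for which the local prediction is right and the global one wrong.
chooser = TournamentChooser()
pc = 0x40
before = chooser.select(pc, local_pred=True, global_pred=False)
for _ in range(3):
    chooser.update(pc, local_pred=True, global_pred=False, taken=True)
after = chooser.select(pc, local_pred=True, global_pred=False)
```

In this run the chooser starts by passing through the global prediction and switches to the local one after the global predictor has been wrong while the local predictor was right.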

Predictor Comparison

• Note that predictors of equal capacity must be compared

• Sizes of each level have to be selected to optimize prediction accuracy

• Influencing factors: degree of interference between branches, program likely to benefit from local/global history

Branch Target Prediction

• In addition to predicting the branch direction, we must also predict the branch target address

• Branch PC indexes into a predictor table; indirect branches might be problematic

• Most common indirect branch: return from a procedure – can be easily handled with a stack of return addresses
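The return-address stack mentioned in the last bullet can be sketched as a small fixed-depth stack: push the fall-through address on a call, pop it to predict the target of a return. The depth of 16 and the overflow policy (discard the oldest entry) are assumptions, as is the name `ReturnAddressStack`.

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Record where the matching return is expected to go."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        """Predicted target of a return, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)   # outer call
ras.on_call(0x2008)   # nested call
inner = ras.predict_return()
outer = ras.predict_return()
empty = ras.predict_return()
```

Nested calls return in last-in, first-out order, which is why a stack predicts them well while a PC-indexed table would see one return instruction jumping to many different targets.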

Multiple Instruction Issue

• The out-of-order processor implementation can be easily extended to have multiple instructions in each pipeline stage

• Increased complexity (lower clock speed!):
    more reads and writes per cycle to register map table
    more read and write ports in issue queue
    more tags being broadcast to issue queue every cycle
    higher complexity for bypassing/forwarding among FUs
    more register read and write ports
    more ports in the LSQ
    more ports in the data cache
    more ports in the ROB

ILP Limits

• The perfect processor:
    Infinite registers (no WAW or WAR hazards)
    Perfect branch direction and target prediction
    Perfect memory disambiguation
    Perfect instruction and data caches
    Single-cycle latencies for all ALUs
    Infinite ROB size (window of in-flight instructions)
    No limit on number of instructions in each pipeline stage

• The last instruction may be scheduled in the first cycle

• The only constraint is a true dependence (register or memory RAW hazards) (with value prediction, how would the perfect processor behave?)
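On such a processor, each instruction can issue one cycle after the last of its producers, so the schedule is determined by the dataflow graph alone. The sketch below models register RAW dependences only (memory dependences are left out for brevity); the instruction encoding as `(dest, sources)` pairs and the function name are assumptions for illustration.

```python
def dataflow_schedule(instructions):
    """Earliest issue cycle per instruction, limited only by register RAW
    dependences, with single-cycle latency and unlimited issue width.

    instructions: list of (dest_register, [source_registers]) in program order.
    """
    producer_cycle = {}  # register -> issue cycle of its most recent writer
    cycles = []
    for dest, sources in instructions:
        # Sources with no earlier writer are treated as ready before cycle 1.
        latest = max((producer_cycle.get(s, 0) for s in sources), default=0)
        issue = latest + 1
        cycles.append(issue)
        producer_cycle[dest] = issue
    return cycles


program = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1"]),        # waits for r1
    ("r5", ["r6"]),
    ("r7", ["r4", "r5"]),  # waits for r4
    ("r8", ["r9"]),        # last instruction, no producers in flight
]
cycles = dataflow_schedule(program)
ilp = len(program) / max(cycles)
print(cycles, round(ilp, 2))
```

In this example the final instruction issues in cycle 1, and the achievable ILP is the instruction count divided by the length of the longest dependence chain.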

Infinite Window Size and Issue Rate

Effect of Window Size

• Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.

• We will use a window size of 2K instrs and a max issue rate of 64 for subsequent experiments

Imperfect Branch Prediction

• Note: no branch mispredict penalty; branch mispredict restricts window size

• Assume a large tournament predictor for subsequent experiments

Effect of Name Dependences

• More registers → fewer WAR and WAW constraints (usually register file size goes hand in hand with in-flight window size)

• 256 int and fp registers for subsequent experiments

Memory Dependences

Limits of ILP – Summary

• Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size

• We have not yet examined the effect of branch mispredict penalty and imperfect caching

• All of the studied factors have relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation

• Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution

Pentium III (P6 Microarchitecture) Case Study

• 14-stage pipeline: 8 for fetch/decode/dispatch, 3+ for o-o-o, 3 for commit → branch mispredict penalty of 10-15 cycles

• Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations

• Each x86 instruction gets converted into RISC-like micro-ops – on average, one CISC instr → 1.37 micro-ops

• Three instructions in each pipeline stage → 3 instructions can simultaneously leave the pipeline → ideal CPI = 0.33 (per micro-op) → ideal CPI = 0.45 (per x86 instruction, 0.33 × 1.37)

Branch Prediction

• 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate

• For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)

• Mispredict penalty is 10-15 cycles

Where is Time Lost?

• Branch mispredict stalls

• Cache miss stalls (dominated by L1D misses)

• Instruction fetch stalls (happens often because subsequent stages are stalled, and occasionally because of an I-cache miss)

CPI Performance

• Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but then recover with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP)

• Overlap of different stalls → CPI is not the sum of individual stalls

• IPC is also an attractive metric
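Since IPC is simply the reciprocal of CPI, the averages quoted on this slide convert as follows (a worked conversion, not figures taken from the slide):

```latex
\mathrm{IPC} = \frac{1}{\mathrm{CPI}}, \qquad
\mathrm{IPC}_{\text{Int}} = \frac{1}{1.15} \approx 0.87, \qquad
\mathrm{IPC}_{\text{FP}} = \frac{1}{2.0} = 0.5
```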
