Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures Vladimir Uzelac and Aleksandar Milenković LaCASA Laboratory Electrical and Computer Engineering Department The University of Alabama in Huntsville {uzelacv | milenka}@ece.uah.edu
Microbenchmarks and Mechanisms for Reverse Engineering
of Branch Predictor Structures
Vladimir Uzelac and Aleksandar Milenković
LaCASA Laboratory
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
{uzelacv | milenka}@ece.uah.edu
Outline
Motivation and Goals
Reverse Engineering Flow
Predictors Details Deconstruction
If we know the branch predictor organization, we could …
  Implement predictor-aware compiler optimizations
    Code alignment to avoid BTB conflicts in critical code sections
    Code splitting to replace long correlations with shorter ones
    Camino environment [PLDI '05]
  Have a “gold standard” for academic research
  Design tools for rapid BP design space exploration and verification
But details are rarely publicly disclosed
  In spite of hints in software optimization manuals
Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units
Goals
Microbenchmarks and mechanisms developed to reverse engineer the Pentium M’s branch predictor, including
  Target predictor
    BTB and IBTB
  Outcome predictor
    Loop predictor
    Global outcome predictor
    Bimodal predictor
Branch predictor parameters
  Organization and size of all branch predictor structures
  Indexing, allocation, update, and replacement policies
  Interdependencies between these structures
Validation of our effort through a functional PIN model
Presentation Outline
Motivation and Goals
Reverse Engineering Flow
Predictors Details Deconstruction
Use B taken branches at a distance D from each other
  Code is executed many times to amplify the effects on performance counters
Control how these branches are presented to the BTB
  To cope with different allocation policies
  Here, we execute each branch twice consecutively
The misprediction rate (MPR) as a function of B and D is sufficient to determine the BTB parameters
[Figure: B taken branches (Branch 1, Branch 2, ..., Branch B) placed D apart, and a plot of MPR (0% to 100%) against B and D; axis ticks: 128, 256, 512, 1024 and 2, 4, 8, 16, 32, 64, 128.]
BTB Capacity Tests
Try to fill the whole BTB using very small distances between branches
Example: 4-way BTB with 512 entries, BTB index = IP[10:4]
  NBTB branches (NBTB being the number of BTB entries) can fit for three distances
    Branches fill sets consecutively
  For larger D, MPR = f(B, D)
    Branches jump over sets
  For very small D, there are more branches in the line than sets
MPR exists for any D if B > NBTB
MPR = f(B, D, BTB parameters) can be mathematically formalized
[Figure: BTB sets 0 to NSET and ways WAY 1 to WAY 3 holding Branch 1 to Branch 5, with an "Evict One" label where Branch 5 is allocated.]
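The capacity test can be sketched as a small simulation. This is only an illustrative model, not the authors' microbenchmark: it assumes a true-LRU replacement policy, allocation on the first miss, entries identified by the full branch address (so no false hits), and it reports BTB misses rather than measured mispredictions. The function name `btb_miss_rate` and its parameters are hypothetical; the defaults follow the slide's example (4 ways, 512 entries, index = IP[10:4]).

```python
from collections import OrderedDict


def btb_miss_rate(num_branches, distance, num_sets=128, num_ways=4,
                  index_lsb=4, passes=10):
    """Fraction of branch executions that miss in a modeled BTB.

    Assumptions (not from the slides): true LRU replacement, allocation
    on the first miss, and entries keyed by the full branch address.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    executions = 0
    for current_pass in range(passes):
        for b in range(num_branches):
            ip = b * distance                        # branch b sits D bytes after b-1
            entries = sets[(ip >> index_lsb) % num_sets]
            for _ in range(2):                       # each branch runs twice in a row
                hit = ip in entries
                if hit:
                    entries.move_to_end(ip)          # mark as most recently used
                else:
                    if len(entries) >= num_ways:
                        entries.popitem(last=False)  # evict the least recently used
                    entries[ip] = True
                if current_pass > 0:                 # skip the warm-up pass
                    executions += 1
                    misses += 0 if hit else 1
    return misses / executions


if __name__ == "__main__":
    for d in (2, 4, 8, 16, 32):
        print(d, btb_miss_rate(512, d))
```

Under these assumptions, 512 branches fit without misses for D = 4, 8, and 16, which matches the three distances mentioned in the example, while D = 2 and D = 32 produce misses because too many branches compete for the same sets.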
BTB Set Tests
Try to fill one BTB set while varying the distance D
When D > NSET, all branches collide in one set
  MPR is a function of B only (only 4 branches can fit)
  Helps find NWAYS and the Index MSB
When D > NSET, change the distance D’ between the last two branches to find the Index LSB
  The D’ for which MPR disappears determines the Index LSB
When D exceeds the Tag MSB distance, false hits occur
  Only two branches produce MPR
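[Figure: IP divided into Not Used, Tag, Index, and Offset fields; Branch 1 and Branch 2 placed D = 2^(TAG.MSB + 1) apart map to the same BTB set (ways WAY 1 to WAY N, sets 0 to NSET) and produce a false hit.]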
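The set-collision and false-hit conditions can be illustrated by splitting an IP into its fields. The sketch below assumes the layout reported on the BTB Findings slide (Index = IP[12:4], Tag = IP[21:13]); the helper names `field` and `btb_lookup_key` and the base address are hypothetical and chosen only for illustration.

```python
def field(ip, msb, lsb):
    """Extract bits ip[msb:lsb], both ends inclusive."""
    return (ip >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def btb_lookup_key(ip, index_bits=(12, 4), tag_bits=(21, 13)):
    """Return the (index, tag) pair a BTB with this layout would use."""
    return field(ip, *index_bits), field(ip, *tag_bits)


base = 0x00401000                 # arbitrary example address (assumption)
same_set = base + (1 << 13)       # D = 2^(INDEX.MSB + 1): same index, new tag
false_hit = base + (1 << 22)      # D = 2^(TAG.MSB + 1): same index and same tag

if __name__ == "__main__":
    for name, ip in (("base", base), ("same_set", same_set), ("false_hit", false_hit)):
        print(name, hex(ip), btb_lookup_key(ip))
```

Two branches whose keys match in both index and tag are indistinguishable to such a BTB, which is why a distance beyond the tag's most significant bit lets the second branch hit on the first branch's entry.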
BTB Findings
Number of BTB entries: 2048
Number of sets: 512
Number of ways: 4
Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
Offset algorithm: when multiple hits occur, selects the target with the lowest offset that is no smaller than the current IP
Bogus branch handling: evict the whole set
Replacement policy: tree-based pseudo-LRU
[Figure: Branch target buffer (BTB), sets 0 to 511, Way 0 to Way 3, Index = IP[12:4], Tag = IP[21:13], looked up with IP[31:4]. Fields: Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits), PLRU (3 bits). Outputs: BTB hit, BTB target, BTB type.]
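A tree-based pseudo-LRU policy for a 4-way set needs only 3 bits, which agrees with the PLRU field width above. The following is a minimal sketch of one common textbook encoding; the slides do not disclose the exact bit assignment used by the Pentium M, so the class name `TreePLRU4` and the meaning of each bit are assumptions.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set (3 bits).

    bits[0] selects the pair holding the next victim (0 = ways 0/1,
    1 = ways 2/3); bits[1] and bits[2] select the victim inside the
    left and right pair. This encoding is an assumption.
    """

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1              # next victim comes from ways 2/3
            self.bits[1] = 1 - way        # inside the left pair, the other way
        else:
            self.bits[0] = 0              # next victim comes from ways 0/1
            self.bits[2] = 1 - (way - 2)  # inside the right pair, the other way

    def victim(self):
        """Return the way the tree currently points at."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


if __name__ == "__main__":
    plru = TreePLRU4()
    for way in (0, 1, 2, 3, 0):
        plru.touch(way)
    print(plru.victim())
```

After the access sequence 0, 1, 2, 3, 0, true LRU would evict way 1, whereas this tree selects way 2. Differences of this kind are what allow a microbenchmark to tell pseudo-LRU from true LRU.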
Outline
Motivation and Goals
Reverse Engineering Flow
Predictors Details Deconstruction
What do we know?
  Each entry has two counters
  Counter MAX_VAL stores the loop branch maximum count value
  Counter CURR_VAL stores the loop branch current iteration
Assumptions:
  Loop BP is an IP-indexed cache
Try to find:
  Counters’ length
  Size and organization of the loop branch predictor buffer (Loop BPB)
  Allocation policy (when a branch becomes a candidate for a loop branch)
  Training policy: how a new loop branch’s MAX_VAL is set
[Figure: loop predictor entry in which CURR_VAL is incremented (+1) and compared for equality (=) with MAX_VAL to produce the Prediction.]
Loop Counters Size Test
Test:
  A “spy” loop (LSpy) has loop modulo L
  MPR exists if L exceeds what the MAX_VAL counter can hold
Results:
  Maximum predictable L is 64 (6-bit counters)
[Figure: spy loop LSpy executing L times between Enter and Exit.]
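The counter-size test can be mimicked with a toy model of a single loop predictor entry. Everything about the model is an assumption made for illustration: the loop branch is taken L - 1 times and then not taken once, CURR_VAL saturates at the largest value the counter can hold, MAX_VAL is learned after the first complete visit, and the entry predicts "not taken" exactly when CURR_VAL equals MAX_VAL. The function name `loop_mispredictions` is hypothetical.

```python
def loop_mispredictions(loop_length, counter_bits=6, visits=10):
    """Count mispredictions of one loop branch after MAX_VAL is trained."""
    limit = (1 << counter_bits) - 1       # largest value a counter can hold
    max_val = None                        # not trained yet
    mispredicted = 0
    for _ in range(visits):
        curr_val = 0
        for i in range(loop_length):
            taken = i < loop_length - 1   # last execution exits the loop
            if max_val is not None:
                predicted_taken = curr_val != max_val
                if predicted_taken != taken:
                    mispredicted += 1
            if taken:
                curr_val = min(curr_val + 1, limit)  # saturating increment
        if max_val is None:
            max_val = curr_val            # train on the first complete visit
    return mispredicted


if __name__ == "__main__":
    for length in (10, 64, 65):
        print(length, loop_mispredictions(length))
```

In this model a loop with L = 64 is predicted without error, while L = 65 mispredicts once per visit because the required count no longer fits in 6 bits, which is in line with the result reported above.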
Loop BPB Capacity and Set Tests
Similar to the BTB Capacity/Set tests
Employ B loops at the distance D from each other
MPR is a function of B, D, and the Loop BPB parameters, as for the BTB
[Figure: B branches (Branch 1, Branch 2, ..., Branch B) and B loops (Loop 1, Loop 2, ..., Loop B) spaced D apart; each loop increments a counter until COUNTER = COUNTER MAX.]
Loop BPB Capacity and Set Tests
Counters’ length: 6 bits
Size and organization of the loop branch predictor buffer
  Two-way cache with 128 entries
  Index = IP[9:4], Tag = IP[15:10]
Allocation policy: branch allocated on first opposite outcome
Training policy: set MAX_VAL during the 2nd loop iteration
[Figure: Loop BPB with Way 0 and Way 1, rows labeled 0 and 64, Index = IP[9:4], Tag = IP[15:10]. Entry fields: Tag (6 bits), MAX_VAL (6 bits), CUR_VAL (6 bits), Pred. (1 bit). Outputs: Hit, Prediction.]
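As a quick consistency check, the reported sizes can be compared with the reported index fields: the number of sets should equal two raised to the number of index bits. The sketch below does this for the Loop BPB and the BTB. The helper `cache_geometry` is hypothetical, and the storage estimate counts only the Loop BPB fields named above (Tag, MAX_VAL, and CUR_VAL at 6 bits each, Pred. at 1 bit), which may not be the complete entry.

```python
def cache_geometry(entries, ways, index_msb, index_lsb):
    """Return (sets, index_bits, consistent) for a set-associative structure."""
    sets = entries // ways
    index_bits = index_msb - index_lsb + 1
    return sets, index_bits, sets == (1 << index_bits)


# Loop BPB: two-way, 128 entries, Index = IP[9:4]
loop_bpb = cache_geometry(128, 2, 9, 4)

# BTB: four-way, 2048 entries, Index = IP[12:4]
btb = cache_geometry(2048, 4, 12, 4)

# Lower-bound storage estimate for the Loop BPB from the listed fields only
loop_entry_bits = 6 + 6 + 6 + 1           # Tag + MAX_VAL + CUR_VAL + Pred.
loop_bpb_bits = 128 * loop_entry_bits

if __name__ == "__main__":
    print("Loop BPB:", loop_bpb, "bits:", loop_bpb_bits)
    print("BTB:", btb)
```

Both structures pass the check: 64 sets with a 6-bit index for the Loop BPB and 512 sets with a 9-bit index for the BTB.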
Outline
Motivation and Goals
Reverse Engineering Flow
Predictors Details Deconstruction