Cache Replacement Policy Using Map-based Adaptive Insertion
Yasuo Ishii 1,2, Mary Inaba 1, and Kei Hiraki 1
1 The University of Tokyo  2 NEC Corporation
Introduction
Modern computers have a multi-level cache system
Performance improvement of the LLC is the key to achieving high performance
The LLC stores many dead blocks
Eliminating dead blocks in the LLC improves system performance
[Diagram: CORE → L1 → L2 → LLC (L3) → Memory]
Introduction
Many multi-core systems adopt a shared LLC
A shared LLC raises issues:
- Thrashing by other threads
- Fairness of the shared resource
Dead-block elimination is even more effective for multi-core systems
[Diagram: CORE1 … COREN, each with private L1 and L2, sharing the LLC (L3) in front of Memory]
Trade-offs of Prior Works

| Replacement Algorithm | Approach | Dead-block Elimination | Additional HW Cost |
|---|---|---|---|
| LRU | Insert to MRU | None | None |
| DIP [2007 Qureshi+] | Partially random insertion | Some | Several counters (Light) |
| LRF [2009 Xiang+] | Predicts from reference pattern | Strong | Shadow tag, PHT (Heavy) |

Problem of dead-block prediction: inefficient use of the data structure (cf. shadow tag)
Map-based Data Structure

A shadow tag stores a full 40-bit tag per tracked cache line (cost: 40 bit/line).
A map-based data structure keeps one tag per zone plus 1 bit per line, so it improves cost-efficiency when there is spatial locality.
Map-based history cost: 15.3 bit/line (= (40b + 6b) / 3 lines in this example)
[Diagram: accesses to neighboring lines within a zone; shadow tag (40 bit/tag) vs. access map (1 bit/line, states I/A)]
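The per-line cost figures above can be reproduced with a quick calculation, following the slide's own formula (40-bit tag plus a 6-bit counter field shared across the 3 lines of the example zone):

```python
# Rough per-line storage cost, using the slide's example numbers.
# Shadow tag: one full 40-bit tag per tracked cache line.
shadow_tag_cost = 40.0  # bits per line

# Map-based history: one 40-bit zone tag plus a 6-bit counter field,
# amortized over the 3 lines of the example zone.
lines_per_zone = 3
map_cost = (40 + 6) / lines_per_zone  # bits per line

print(f"shadow tag : {shadow_tag_cost:.1f} bit/line")
print(f"map history: {map_cost:.1f} bit/line")  # ≈ 15.3 bit/line
```

The larger the zone covered by a single tag, the further the per-line cost drops, which is why the structure pays off under spatial locality.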
Map-based Adaptive Insertion (MAIP)

Modifies the insertion position (from low to high reuse possibility):
(1) Cache bypass
(2) LRU position
(3) Middle of MRU/LRU
(4) MRU position
Adopts a map-based data structure for tracking many memory accesses
Exploits two localities for reuse possibility estimation
Hardware Implementation

Memory access map: collects memory access history & memory reuse history
Bypass filter table: collects the data reuse frequency of memory access instructions
Reuse possibility estimation logic: estimates reuse possibility from the information of the other components
[Diagram: the Memory Access Map and the Bypass Filter Table feed memory access information to the Estimation Logic, which sends the insertion position to the Last Level Cache]
Memory Access Map (1)

[State diagram: each line in a zone starts in the Init state; the first touch moves it to the Access state; a further access to an Access-state line counts as data reuse]
Detects one piece of information: (1) Data reuse — was the accessed line previously touched?
Each map entry holds a map tag, an access count, and a reuse count.
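The state machine above can be sketched in a few lines (a behavioral model, not the paper's exact hardware; entry size and field names are illustrative):

```python
# Behavioral sketch of one memory access map entry: each cache line in
# the zone has a 1-bit state (Init=False, Access=True), and the entry
# keeps per-zone access/reuse counters.

class MemoryAccessMapEntry:
    def __init__(self, map_tag, lines_per_zone=256):
        self.map_tag = map_tag
        self.accessed = [False] * lines_per_zone  # Init/Access state bits
        self.access_count = 0
        self.reuse_count = 0

    def touch(self, line_offset):
        """Record an access; return True if it is a data reuse."""
        reused = self.accessed[line_offset]
        self.accessed[line_offset] = True  # first touch -> Access state
        self.access_count += 1
        if reused:
            self.reuse_count += 1
        return reused

m = MemoryAccessMapEntry(map_tag=0x1234)
print(m.touch(5))  # False: first touch
print(m.touch(5))  # True: data reuse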
Memory Access Map (2)
A A I I
Init Access
ReuseCountAccess
Count
AI
Detects one statistics(2)Spatial locality
How often the neighboring lines are reused?
Access Map
Attaches counters to detect spatial locality
Data Reuse MetricReuse CountAccess Count
=
Memory Access Map (3)
ImplementationMaps are stored in
cache like structureCost-efficiency
Entry has 256 statesTracks 16KB memory
16KB = 64B x 256stats
Requires ~ 1.2bit for tracking 1 cache line at the best case
Tag Access Map
CacheOffset
MapOffset
MapIndex
MapTag
= =ACCESS
MUX
2563030
4
8
Memory Address
Count
Reuse Count
Bypass Filter Table
Each entry is saturating counterCount up on data reuse / Count down on first
touch
Program Counter
Bypass Filter Table(8-bit x 512-entry)
BYPASSUSELESSNORMALUSEFULREUSE
Rarely Reused
Frequently Reused
Detects one statistic(3)Temporal locality:
How often the instruction reuses data?
Reuse Possibility Estimation Logic
Uses 2 localities & data reuse informationData Reuse
Hit / Miss of corresponding lookup of LLC Corresponding state of Memory Access Map
Spatial Locality of Data Reuse Reuse frequency of neighboring lines
Temporal Locality of Memory Access Instruction Reuse frequency of corresponding instruction
Combines information to decide insertion policy
Additional OptimizationAdaptive dedicated set reduction(ADSR)
Enhancement of set dueling [2007Qureshi+]
Reduces dedicated sets when PSEL is strongly biased
Set 7
LRU Dedicated Set
Set 6Set 5Set 4Set 3Set 2Set 1Set 0
Set 7Set 6Set 5Set 4Set 3Set 2Set 1Set 0
MAIP Dedicated SetAdditional FollowerFollower Set
Evaluation
BenchmarkSPEC CPU2006, Compiled with GCC 4.2Evaluates 100M instructions (skips 40G inst.)
MAIP configuration (per-core resource)Memory Access Map: 192 entries, 12-wayBypass Filter: 512 entries, 8-bit countersPolicy selection counter: 10 bit
Evaluates DIP & TADIP-F for comparison
Cache Miss Count (1-core)
MAIP reduces MPKI by 8.3% from LRUOPT reduces MPKI by 18.2% from LRU
40
0.p
erl
40
1.b
zip
42
9.m
cf
43
3.m
ilc
43
4.z
eu
s
43
6.c
act
43
7.l
esl
45
0.s
op
l
45
6.h
mm
e
45
9.G
em
s
46
2.l
ibq
46
4.h
26
4
47
0.l
bm
47
1.o
mn
e
47
3.a
sta
48
1.w
rf
48
2.s
ph
i
48
3.x
ala
Ave
rag
e
0
20
40
60
LRU DIP MAIP OPT
Mis
s p
er
10
00
in
sts.
Speedup (1-core & 4-core)
4-core result
403429433483
429450456482
401434456470
450464473483
401433450462
401450450482
403434450464
403456459473
434450482483
400429473483
400450456462
433434450462
433450470483
433434450462
400416456464
gmean
-6%
0%
6%
12%
18%TADIP MAIP
We
igh
ted
S
pe
ed
up
400.p
erl
401.b
zip
429.m
cf
433.m
ilc
434.z
eus
436.c
act
437.lesl
450.s
opl
456..
..
459..
..
462.lib
q
464.h
264
470.lbm
471..
..
473.a
sta
481.w
rf
482.s
phi
483.x
ala
gm
ean
-10%
0%
10%
20%DIP MAIP
Sp
ee
du
p
1-core result
48
3.x
al
a
Cost Efficiency of Memory Access Map
Requires 1.9 bit / line in average~ 20 times better than that of shadow tag
Covers >1.00MB(LLC) in 9 of 18 benchmarks
Covers >0.25MB(MLC) in 14 of 18 benchmarks
40
0.p
erl
42
9.m
cf
43
4.z
eu
s
43
7.l
esl
45
6.h
mm
e
46
2.l
ibq
47
0.l
bm
47
3.a
sta
48
2.s
ph
i
Ave
rag
e0.0 0.5 1.0 1.5 2.0 2.5 3.0
Co
ve
red
Are
a (
MB
)
Related Work
Uses spatial / temporal localityUsing spatial locality [1997, Johnson+]Using different types of locality [1995,
González+]Prediction-base dead-block elimination
Dead-block prediction [2001, Lai+]Less Reused Filter [2009, Xiang+]
Modified Insertion PolicyDynamic Insertion Policy [2007, Qureshi+]Thread Aware DIP[2008, Jaleel+]
Conclusion
Map-based Adaptive Insertion Policy (MAIP)Map-base data structure
x20 cost-effectiveReuse possibility estimation exploiting
spatial locality & temporal locality Improves performance from LRU/DIP
Evaluates MAIP with simulation studyReduces cache miss count by 8.3% from LRUImproves IPC by 2.1% in 1-core, by 9.1% in
4-core
ComparisonReplacement Algorithm
Dead-blockElimination
AdditionalHW Cost
LRU Insert to MRU None NoneDIP[2007 Qureshi+]
Partially Random Insertion
Some Several countersLight
LRF[2009 Xiang+]
Predicts from reference pattern
Strong Shadow tag, PHT
HeavyMAIP Predicts based
on two localities
Strong Mem access map
Medium
Improves cost-efficiency by map data structure
Improves prediction accuracy by 2 localities
How to Detect Insertion Position
function is_bypass()
if(Sb = BYPASS) return true if(Ca > 16 x Cr) return true return false
endfunction
function get_insert_position()
integer ins_pos=15 if(Hm) ins_pos = ins_pos/2 if(Cr > Ca) ins_pos=ins_pos/2 if(Sb=REUSE) ins_pos=0 if(Sb=USEFUL) ins_pos=ins_pos/2 if(Sb=USELESS) ins_pos=15 return ins_pos
endfunction