A Single-Pass Cache Simulation Methodology for Two-level Unified Caches
Wei Zang and Ann Gordon-Ross+
University of Florida, Department of Electrical and Computer Engineering
+Also affiliated with the NSF Center for High-Performance Reconfigurable Computing
This work was supported by the National Science Foundation (CNS-0953447 and ECCS-0901706).
• Caches are a good candidate for system performance and energy optimization
• Different applications have vastly different cache requirements
– Configurable cache parameters: total size, line size, associativity
• Conflicts: blocks that map to the same cache set as the processed address
• Conflict evaluation: determine the configurations that result in a hit
– Example: # of conflicts = 1 and cache associativity >= 2 -> hit
• Stack update
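The stack-based conflict evaluation above can be sketched in Python. This is a minimal illustration of the general technique (a most-recently-used stack of block addresses, with hits decided by counting same-set conflicts above the re-accessed address), not the authors' actual algorithm; the function names are hypothetical.

```python
def evaluate_access(stack, addr, num_sets, assoc):
    """Decide hit/miss for one access using a stack of previously
    seen block addresses, ordered most recently used first."""
    target_set = addr % num_sets
    conflicts = 0
    for prev in stack:
        if prev == addr:
            # Hit iff fewer than `assoc` conflicting blocks were
            # accessed since the last access to `addr`.
            return conflicts < assoc
        if prev % num_sets == target_set:
            conflicts += 1  # maps to the same set: a conflict
    return False  # first access to this block: compulsory miss

def process(stack, addr, num_sets, assoc):
    hit = evaluate_access(stack, addr, num_sets, assoc)
    # Stack update: move addr to the top (MRU) position.
    if addr in stack:
        stack.remove(addr)
    stack.insert(0, addr)
    return hit
```

With one intervening same-set conflict, the re-access hits when associativity is 2 but misses when the cache is direct-mapped, matching the rule on the slide.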
Challenges in Unified L2 Cache Analysis
• U-SPaCS targets an exclusive hierarchy
– Storage space and simulation time efficiency (T-SPaCS)
• L1 instruction (I) and data (D) caches share the L2 cache -> introduces interference in the unified (U) L2 cache
– Interdependency: depends on the relative eviction ordering of the two L1 caches
• M L1 I cache configurations and N L1 D cache configurations generate M·N unique eviction orderings
– Efficiently maintain and process the orderings for a large design space
U-SPaCS Overview
• Execute the application to produce an access trace file: T[1], T[2], T[3], …, T[N]
• U-SPaCS maintains instruction stacks and data stacks for each block size B
• For each instruction T[t]: L1 analysis -> on an L1 miss, L2 analysis -> stack update
• For each data T[t]: L1 analysis -> on an L1 miss, L2 analysis -> stack update
• Outputs, accumulated for all cache configurations (L1 I cache config., L1 D cache config., L2 U cache config.): L1 I & D cache misses, L2 cache misses & write-backs
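The overview flow can be sketched as a single-pass driver loop. The structure below is illustrative only: the `l1_analysis`, `l2_analysis`, and `stack_update` callbacks are assumed placeholders standing in for U-SPaCS's actual analysis steps, not its real interfaces.

```python
from collections import defaultdict

def single_pass(trace, l1i_configs, l1d_configs, l2_configs,
                l1_analysis, l2_analysis, stack_update):
    """One pass over the trace updates miss counters for every cache
    configuration simultaneously. `trace` is a list of (kind, addr)
    pairs, kind being 'I' or 'D'."""
    misses = defaultdict(int)   # (level, kind, config...) -> miss count
    stacks = {}                 # per-block-size stacks (placeholder)
    for kind, addr in trace:
        l1_configs = l1i_configs if kind == 'I' else l1d_configs
        for c1 in l1_configs:
            if not l1_analysis(stacks, addr, kind, c1):       # L1 miss
                misses[('L1', kind, c1)] += 1
                for c2 in l2_configs:                         # L2 analysis
                    if not l2_analysis(stacks, addr, kind, c1, c2):
                        misses[('L2', kind, c1, c2)] += 1
        stack_update(stacks, addr, kind)
    return misses
```

The key property of the single-pass approach is visible in the loop nesting: the trace is read once, while the configuration loops run entirely in memory.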
Eviction Order Recording
• L2 cache storage depends on relative eviction ordering
• The stack maintains the cache access ordering of unique addresses
– Not sufficient to determine eviction ordering
• Explicitly record eviction ordering
• Each instruction (data) stack entry stores a block address X and an eviction time array E, with one entry for each c1_inst (c1_data)
– c1_inst: L1 I cache configuration; c1_data: L1 D cache configuration; S2: # of sets in the L2 cache
– E: the time when X is evicted from the L1 I (D) cache with L1 configuration c1_inst (c1_data); E = '0' means X is still in L1, not yet evicted
• L2 analysis when processing block address A: A's S2-conflicts that are evicted to L2 later than old A's eviction (E of conflict > E of A) determine A's L2 hit/miss
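Under the rule above, the L2 hit/miss decision for an address A reduces to counting the S2-conflicts whose eviction time exceeds A's. A minimal sketch, with a hypothetical function name:

```python
def l2_hit(evict_time_a, conflict_evict_times, l2_assoc):
    """Among A's S2-conflicts, only blocks evicted from L1 *after* A
    (E > E_A) can land in A's L2 set before A is re-accessed; A hits
    in L2 iff fewer than W2 such conflicts exist. E == 0 means the
    block is still in L1, so it is not in the (exclusive) L2 at all."""
    later = [e for e in conflict_evict_times if e != 0 and e > evict_time_a]
    return len(later) < l2_assoc
```

For instance, with A's eviction time 4 and conflict eviction times {0, 0, 7, 5}, two conflicts were evicted later, so a 2-way L2 set misses while a 4-way set would hit.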
U-SPaCS Processing Example
• L2 cache: S = 2^3, W = 2
• L1 I cache: S = 2^2, W = 2
• L1 D cache: S = 2^3, W = 1
• Trace: D1(010), I1(010), I2(110), I3(010), D2(010), I4(110), D3(010), I1(010)
• The I stack and D stack (with eviction time arrays) are updated as each trace entry is processed below.
Trace order 1: Process D1(010). Compulsory miss.
Trace order 2: Process I1(010). Compulsory miss.
Trace order 3: Process I2(110). Compulsory miss. Conflicts for S1: I1(010); W1 = 2, no eviction.
Trace order 4: Process I3(010). Compulsory miss. Conflicts for S1: I2(110), I1(010); W1 = 2, I1(010) is evicted (E -> 4).
Trace order 5: Process D2(010). Compulsory miss. Conflicts for S1: D1(010); W1 = 1, D1(010) is evicted (E -> 5).
Trace order 6: Process I4(110). Compulsory miss. Conflicts for S1: I3(010), I2(110), I1(010); W1 = 2, I2(110) is evicted (E -> 6).
Trace order 7: Process D3(010). Compulsory miss. Conflicts for S1: D2(010), D1(010); W1 = 1, D2(010) is evicted (E -> 7).
Trace order 8: Process I1(010). Conflicts for S1: I4(110), I3(010), I2(110); W1 = 2, L1 miss, I3(010) is evicted. Conflicts for S2, from the I stack: I3(010) -> 0; from the D stack: D3(010) -> 0, D2(010) -> 7, D1(010) -> 5. The eviction time of I1(010) is 4; two conflicts have eviction time > 4, so with W2 = 2 this is an L2 miss.
Special Case: Occupied Blank Labeling
• Arises when S1 < S2
• Example: L1 set (2 ways), L2 set (4 ways); trace: I4, I6, I2, I3, I2, I1, I5, I7, I5
• Access I2: hit in L2; fetching I2 into L1 evicts I6, which maps to a different L2 set, leaving an occupied blank (BLK) in I2's old L2 set (invalid due to the exclusive hierarchy)
• Access I5:
– From the actual cache: miss in L1 and L2
– Analysis from the stack: conflicts for S2: I2 -> 0, I1 -> 7, I3 -> 5, I4 -> 4 (no conflicts in the D stack); I5's eviction time is 3; three conflicts have eviction time > 3, so with W2 = 4 the analysis predicts an L2 hit
– Inaccurate! The L2 conflict count should include the BLK
• Solution: occupied blank labeling
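The fix can be illustrated by extending the eviction-time rule to count occupied blanks as conflicts. This sketch reuses the slide's I5 numbers; the function name and the `num_blanks` parameter are illustrative, not the paper's actual bookkeeping.

```python
def l2_hit_with_blanks(evict_time, conflict_evict_times, num_blanks, l2_assoc):
    """Eviction-time L2 hit rule, extended so that occupied blanks
    (BLK) left in the L2 set by the exclusive hierarchy also consume
    ways and count as conflicts."""
    later = sum(1 for e in conflict_evict_times if e != 0 and e > evict_time)
    return later + num_blanks < l2_assoc
```

For I5 (eviction time 3, conflicts 0, 7, 5, 4, W2 = 4): ignoring the blank counts 3 conflicts and wrongly predicts a hit; labeling the BLK raises the count to 4 and yields the correct L2 miss.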
Data Address Processing
• Similar to instruction address processing
• Additional dirty data write-back counting
– Writing to a dirty cache block avoids a previous write to memory
– # of write-backs = # of total writes - # of write avoids
• Each D stack entry (D1, D2, D3, …) carries a bit array for dirty-status indication, with one bit for each c_hier(c1_inst, c1_data, c2); '1': dirty, '0': clean
– c_hier: cache hierarchy configuration; c1_inst: L1 I cache configuration; c1_data: L1 D cache configuration; c2: L2 cache configuration
• Now process D4; during the stack update, remove the old D4 and push the new D4 on top
• D4 is a write: the new D4's bit array is all '1' (all dirty)
– Example: old bit array 1 0 1 0 -> new bit array 1 1 1 1
• D4 is a read: the new D4's dirty bits are copied from the old D4, except where a c_hier results in an L2 miss; there the bit is cleared, since the dirty block has already been written back to memory
– Example: old bit array 1 0 1 0 -> new bit array 1 0 0 0 (the third c_hier results in an L2 miss)
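The two update cases can be captured in a small helper. This is one reading of the slides, assumed for illustration: `update_dirty_bits` and its `l2_hit_flags` parameter are hypothetical names, with one entry per c_hier.

```python
def update_dirty_bits(old_bits, is_write, l2_hit_flags):
    """Per-c_hier dirty-bit update when a data block (e.g. D4) is
    re-accessed and moved to the top of the D stack.
    old_bits:     dirty bits of the old entry (1 = dirty, 0 = clean)
    l2_hit_flags: True where this c_hier results in an L2 hit."""
    if is_write:
        # A write makes the block dirty under every configuration.
        return [1] * len(old_bits)
    # A read: dirty status survives only where the block was found in
    # the hierarchy (L2 hit); on an L2 miss the dirty copy was already
    # written back, so the refetched block is clean.
    return [b if hit else 0 for b, hit in zip(old_bits, l2_hit_flags)]
```

Replaying the slide's example: a write turns 1 0 1 0 into 1 1 1 1, and a read with an L2 miss in the third c_hier turns 1 0 1 0 into 1 0 0 0.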
Experiment Setup
• Design space
– L1 I & D caches: cache size (2 KB to 8 KB); block size (16 B to 64 B); associativity (direct-mapped to 4-way)
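A design space like this is typically enumerated as a cross product of the parameter ranges (powers of two). The snippet below is a sketch of that enumeration, assuming power-of-two steps; the sanity filter requiring at least one set is my addition, not stated on the slide.

```python
from itertools import product

# Parameter ranges from the slide, in powers of two.
sizes = [2048, 4096, 8192]        # total cache size: 2 KB to 8 KB
block_sizes = [16, 32, 64]        # block (line) size: 16 B to 64 B
assocs = [1, 2, 4]                # direct-mapped to 4-way

# One L1 configuration per (size, block size, associativity) triple,
# keeping only geometries with at least one set.
l1_configs = [(s, b, w) for s, b, w in product(sizes, block_sizes, assocs)
              if s // (b * w) >= 1]
```

All 27 combinations are valid here; a single-pass simulator evaluates every one of them from one trace read.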