SLAM: High performance and energy efficient shared hybrid last level cache architecture in multicore systems
Presented by: Swapnil Bhosale
Advisor: Dr. Sudeep Pasricha
Committee members: Dr. Sourajeet Roy, Dr. Wim Bohm
1
• Introduction
• Related work
• Analysis of prior works
• Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
2
• Most modern computing systems are multicore, with a multi-level cache memory hierarchy
• The last level cache (LLC) is generally shared among the cores' private caches
Introduction
Source: http://www.eedailynews.com/2012/02/freescale-claims-highest-performance.html
Freescale semiconductor’s multi-core processor (B4860)
Source: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/index.html
3
Introduction
Source: http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf
4
Introduction
Source: http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf
5
• The processor-memory performance gap continues to increase
• Traditional SRAM-based caches cannot cope with the increasing gap
• Need for an alternate memory technology that can provide
o High capacity
o Less energy consumption
o Proximity to the processor
Introduction
Source: https://mzh.io/%E5%A6%82%E4%BD%95%E8%AE%A9Go%E7%A8%8B%E5%BA%8F%E6%9B%B4%E5%BF%AB
6
• Researchers proposed Spin Transfer Torque Random Access Memory (STTRAM)
• Attributes
o High density
o Low static power consumption
o Non-volatility
o Future scalability
o High endurance
o Low read latency
Introduction
Potential replacement for SRAM in cache memory hierarchy
Source: https://www.mram-info.com/stt-mram
7
• Basic storage element is MTJ (Magnetic Tunnel Junction)
• Data is stored as relative magnetic orientation of two ferromagnetic layers
Source: https://www.mram-info.com/stt-mram
Source: https://www.embedded.com/design/real-time-and-performance/4026000/The-future-of-scalable-STT-RAM-as-a-universal-embedded-memory
Introduction
8
SRAM vs. STTRAM (for 1MB, 45nm tech node)
Cell structure: (figure)
Leakage power: SRAM 14.63mW; STTRAM 2.32mW (5-6x lesser than SRAM)
Area: SRAM 3.77 sq.mm; STTRAM 0.95 sq.mm (4-5x denser than SRAM)
Write latency: SRAM 3.18ns; STTRAM 12.01ns (approx. 4x of SRAM)
Write energy: SRAM 0.08nJ; STTRAM 0.64nJ (approx. 8x of SRAM)
Source: J. Ahn, S. Yoo and K. Choi, "DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 25-36.
Introduction
Need for techniques to overcome the drawbacks of STTRAM
9
Introduction
• Related work
• Analysis of prior works
• Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
10
• Some works focus on reducing write energy by tuning MTJ device properties
o "Relaxing non-volatility for fast and energy-efficient STT-RAM caches", [C. Smullen, et al, IEEE HPCA, 2011]
o "Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM)", [C. Smullen, et al, IEEE ISLPED, 2011]
Related work
MTJ thickness ∝ MTJ write time
MTJ thickness ∝ MTJ retention time
Tough compromise between MTJ write time and MTJ retention time
11
• Other works focus on reducing write energy at the cell level
• Basic idea is to update only the bits whose values differ
o "Energy reduction for STT-RAM using early write termination", [P. Zhou, et al, IEEE ICCAD, 2009]
o "Coding last level STT-RAM cache for high endurance and low power", [S. Yazdanshenas, et al, IEEE Computer Architecture Letters, 2013]
These works do not consider the non-uniformity of writes across the cache
Related work
Incoming bit pattern: 1 0 1 0 1 1 1 1
Cache DATA field: 1 0 1 0 1 1 0 0
(a comparator updates only the differing bits)
12
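The bit-comparison idea above can be sketched in code. This is an illustrative model of differential writing (write only the bits that differ), not the circuit-level early-write-termination mechanism from the cited papers.

```python
def differential_write(stored_bits, incoming_bits):
    """Write only the bit positions whose value differs.

    Returns the updated word and the number of bit-writes actually
    performed; the energy saved is proportional to the skipped bits.
    """
    written = 0
    updated = list(stored_bits)
    for i, (old, new) in enumerate(zip(stored_bits, incoming_bits)):
        if old != new:          # comparator detects a differing bit
            updated[i] = new    # only this cell is written
            written += 1
    return updated, written

# Example from the slide: incoming 10101111 over stored 10101100
updated, written = differential_write([1,0,1,0,1,1,0,0], [1,0,1,0,1,1,1,1])
# only 2 of the 8 bit-cells are actually written
```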
• Some works focus on reducing write energy using a hybrid last level cache (LLC) architecture
• Basic idea is to migrate write-intensive cache lines to the SRAM region
o "Exploiting non-uniformity of write accesses for designing a high-endurance hybrid last level cache", [P. Safayenikoo, et al, IEEE CCECE, 2017]
o "High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement", [A. Jadidi, et al, IEEE ISLPED, 2011]
Provides better energy savings by taking advantage of both SRAM and STTRAM
way-0 SRAM
way-1 SRAM
way-2 STTRAM
way-3 STTRAM
way-4 STTRAM
way-5 STTRAM
way-6 STTRAM
way-7 STTRAM
8-way set in
hybrid cache
Related work
13
Introduction
Related work
• Analysis of prior works
• Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
14
Analysis of prior work PTHCM
TAG DATA
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
Main memory
• PTHCM (prediction table based hybrid cache management) uses a hybrid last level cache (LLC) comprised of SRAM and STTRAM
Prediction table
• Uses a prediction table to predict write-intensive cache lines
• Migrates write-intensive cache lines from the STTRAM region to the SRAM region
Citation: Baixing Quan, Tiefei Zhang, Tianzhou Chen and Jianzhong Wu, "Prediction table based management policy for STT-RAM and SRAM hybrid cache," 2012 7th International Conference on Computing and Convergence Technology (ICCCT), Seoul, 2012, pp. 1092-1097
What does
PTHCM do?
15
Analysis of prior work PTHCM
Main memory
TAG WC AC WLC ALC
Prediction table
• Counters keep the access history of each cache line
o AC – actual access count of a cache line (read/write)
o WC – actual write count of a cache line
o ALC – predicted access count of a cache line
o WLC – predicted write count of a cache line
• Migration happens on
o Miss
o Write hit
• Prediction table is populated on
o Eviction
TAG WC AC
Citation: Baixing Quan, Tiefei Zhang, Tianzhou Chen and Jianzhong Wu, "Prediction table based
management policy for STT-RAM and SRAM hybrid cache," 2012 7th International Conference on
Computing and Convergence Technology (ICCCT), Seoul, 2012, pp. 1092-1097
DATA
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
How does it work?
16
Analysis of prior work PTHCM
Main memory
x.TAG x.WC x.AC
TAG WC=0 AC=0 WLC ALC DATA
Line with minimum
WLC is replaced
Miss at line ‘x’
LRU line evicted
Prediction table
TAG WC AC
x.TAG x.WC x.AC
Insert in SRAM to avoid write operation to
STTRAM
If entry not found, WLC and ALC are initialized to user set thresholds
DATA
Line ‘y’
Line ‘z’
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
17
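The miss handling described on this slide can be sketched as follows. This is a simplified model of PTHCM's prediction-table lookup; the threshold values and data layout are illustrative assumptions, not taken from the paper.

```python
# Simplified sketch of PTHCM miss handling: on a miss, look up the
# line's history in the prediction table; if absent, initialize WLC and
# ALC to user-set thresholds. The incoming line is inserted in SRAM to
# avoid a write operation to STTRAM. Threshold values are assumptions.
WLC_INIT, ALC_INIT = 4, 4    # user-set thresholds (illustrative values)

prediction_table = {}        # tag -> (WC, AC) recorded at eviction

def on_llc_miss(tag):
    entry = prediction_table.get(tag)
    if entry is not None:
        wc, ac = entry       # reuse recorded write/access history
    else:
        wc, ac = WLC_INIT, ALC_INIT   # no history: use thresholds
    return {"tag": tag, "WC": 0, "AC": 0,
            "WLC": wc,               # predicted write count
            "ALC": ac,               # predicted access count
            "region": "SRAM"}        # insert in SRAM, not STTRAM
```

A line whose tag was seen before inherits its recorded counts; an unseen line starts from the thresholds.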
Analysis of prior work PTHCM
Main memory
TAG WC AC WLC ALC
Write hit
TAG WC++ AC++ WLC-- ALC--
WC >
threshold
Line with minimum
WLC is replaced
swap
Prediction table
TAG WC AC
DATA
Line ‘x’
Line ‘y’
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
18
Analysis of prior work PTHCM
y.TAG y.WC y.AC
TAG WC AC WLC ALC
Main memory
Eviction
Prediction table
TAG WC AC
y.TAG y.WC y.AC
If entry not found, make new entry in empty slot
If no empty slot, delete entry with minimum AC
DATA
Line ‘x’
Line ‘y’
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
19
• Hardware overhead
o 3 bits to represent each of WC, AC, WLC and ALC
o 12 bits extra added to every cache line in LLC
o 65536 cache lines in 4MB hybrid LLC and 64B blocksize
o 12*65536 ~ 98kB additional space in LLC
o Considering 14 bits to represent TAG
o Each entry in prediction table is 20 bits in size
o 65536 entries in prediction table
o Size of prediction is 20*65536 ~ 163kB
o Size of swap/migration buffer ~ 68B
o Total hardware overhead = 262kB ~ 6.39% of LLC
WC AC WLC ALC
TAG DATA
Prediction table
TAG WC AC
Analysis of prior work PTHCM
163kB prediction table
Notable hardware overhead
20
TAG DATA
68B swap/migration buffer
4MB hybrid LLC
12 bits of extra fields
per cache line
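The overhead arithmetic on this slide can be reproduced directly. With these conventions the total comes to roughly 6.25% of the 4MB LLC; the slide reports 6.39%, which likely reflects a slightly different rounding convention.

```python
# Reproducing the PTHCM hardware-overhead arithmetic from the slide
# (4MB LLC, 64B blocks -> 65536 cache lines).
lines = (4 * 1024 * 1024) // 64                  # 65536 cache lines
extra_line_bits = 12                             # WC+AC+WLC+ALC, 3 bits each
extra_line_bytes = lines * extra_line_bits // 8  # ~98kB added to the LLC
table_entry_bits = 14 + 3 + 3                    # TAG + WC + AC = 20 bits
table_bytes = lines * table_entry_bits // 8      # ~163kB prediction table
total = extra_line_bytes + table_bytes + 68      # + 68B swap/migration buffer
overhead_pct = 100 * total / (4 * 1024 * 1024)   # ~6.25% of the 4MB LLC
```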
(Figure: left panel, P1/P2/P3 each hold x=5 in L1 and shared memory holds x=5; right panel, P1 writes x=6 (Wr->x), P2's copy becomes INV, and shared memory still holds the stale x=5 until write-back on eviction or a peer's read (Rd->x))
• Uniformity of shared resource data
• Achieved by writing back modified data to shared memory when
o Evicted by owner
o Requested by a peer processor
Coherent view of memory Non-coherent view of memory
21
(Figure: after the write-back, shared memory holds x=6, matching P1's copy; P2 remains INV)
Coherent view of memory
22
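The write-invalidate, write-back behavior in these two slides can be sketched in a few lines. This is an illustrative toy model of the protocol's effect on shared memory, not a full MESI implementation.

```python
# Minimal sketch of the coherence behavior above: a writer invalidates
# peer copies, and the modified value reaches shared memory only on a
# write-back (here, triggered by eviction). Illustrative only.
class SharedMem:
    def __init__(self, value):
        self.value = value

class L1:
    def __init__(self, mem):
        self.mem, self.value, self.dirty = mem, None, False
    def read(self):
        if self.value is None:
            self.value = self.mem.value   # fill from shared memory
        return self.value
    def write(self, value, peers):
        self.value, self.dirty = value, True
        for p in peers:                   # write-invalidate: peers drop copies
            p.value, p.dirty = None, False
    def evict(self):
        if self.dirty:                    # write-back modified data
            self.mem.value = self.value
        self.value, self.dirty = None, False

mem = SharedMem(5)
p1, p2 = L1(mem), L1(mem)
p1.read(); p2.read()       # both caches hold x=5 (coherent view)
p1.write(6, peers=[p2])    # p2 invalidated; mem still holds stale x=5
p1.evict()                 # write-back restores the coherent view: x=6
```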
Main memory
Analysis of prior work RWEEHC
Citation: S. Agarwal and H. K. Kapoor, "Restricting writes for energy-efficient hybrid cache in multi-core architectures," 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Tallinn, 2016, pp. 1-6
TAG DATA
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
• RWEEHC (restricting writes for energy-efficient hybrid cache) uses a hybrid last level cache (LLC) comprised of SRAM and STTRAM
• Exploits cache coherence to predict write-intensive cache lines
• Migrates write-intensive cache lines from the STTRAM region to the SRAM region
What does
RWEEHC do?
23
Analysis of prior work RWEEHC
• Adds extra states (STT_STATE) to predict write-intensive cache blocks
• STT_STATEs
o P: Dataless entry into the STTRAM region
o ST-D: Possible candidate for migration to SRAM
o SR-C: Block migrated to the SRAM region
• Migration is done on
o Writeback to a block in ST-D state in the STTRAM region
Main memory
TAG STT_STATE DATA
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
How does it work?
24
Block ‘x’
TAG DATA
L1
Miss at line ‘x’
Analysis of prior work RWEEHC
TAG STT_STATE DATA
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
Main memory
25
Block ‘x’
TAG DATA
10100110 101111001010
Core0 - L1
Miss at line ‘x’
Dataless
entry
x
Analysis of prior work RWEEHC
TAG STT_STATE DATA
10100110 P
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
26
Block ‘x’
TAG DATA
10100110 101111011111
Core 0 - L1
x
TAG DATA
Core 1 - L1
Analysis of prior work RWEEHC
Rd ‘x’
Writeback
Transition to ST-D state on writeback in P state
TAG STT_STATE DATA
10100110 P
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
27
Block ‘x’
TAG DATA
10100110 101111011111
Core 0 - L1
x
TAG DATA
Core 1 - L1
Dirty
eviction
Analysis of prior work RWEEHC
Transition to ST-D state on writeback in P state
Writeback
TAG STT_STATE DATA
10100110 P
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
28
Block ‘x’
TAG DATA
10100110 101111011111
Core 0 - L1
x
TAG DATA
Core 1 - L1
Analysis of prior work RWEEHC
Possible candidate for migration
TAG STT_STATE DATA
10100110 ST-D 101111011111
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
29
Block ‘x’
TAG DATA
10100110 101111011100
Core 0 - L1
x
TAG DATA
10100110 101111011100
Core 1 - L1
PAUSE
Migrate to
SRAM
region
Analysis of prior work RWEEHC
x
Writeback
to ‘x’
TAG STT_STATE DATA
10100110 ST-D 101111011111
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
30
Block ‘x’
Analysis of prior work RWEEHC
TAG DATA
10100110 101111011111
Core 0 - L1
x
TAG DATA
10100110 101111011111
Core 1 - L1
x
TAG STT_STATE DATA
10100110 SR-C 101111011111
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Main memory
31
Analysis of prior work RWEEHC
TAG DATA
10100110 101111011111
Core 0 - L1
x
TAG DATA
10100110 101111011111
Core 1 - L1
x
Resume
the
Writeback
SR-C is a stable state
TAG STT_STATE DATA
10100110 SR-C 101111011100
4-way
SRAM
12-way
STTRAM
A set in hybrid LLC
x
Block ‘x’
Main memory
32
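The STT_STATE transitions walked through on slides 25-32 can be summarized as a small state machine. The state names follow the slides; the code structure and event name are illustrative.

```python
# Sketch of the RWEEHC STT_STATE transitions: P (dataless entry on a
# miss) moves to ST-D on the first writeback; a further writeback to an
# ST-D block triggers migration to the SRAM region (SR-C), which is a
# stable state. Other events leave the state unchanged.
def stt_next_state(state, event):
    if state == "P" and event == "writeback":
        return "ST-D"    # data arrives; possible migration candidate
    if state == "ST-D" and event == "writeback":
        return "SR-C"    # migrate the block to the SRAM region
    return state         # SR-C is stable; unrelated events are ignored

s = "P"                              # dataless entry on LLC miss
s = stt_next_state(s, "writeback")   # first writeback: P -> ST-D
s = stt_next_state(s, "writeback")   # second writeback: ST-D -> SR-C
```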
Analysis of prior work RWEEHC
TAG STT_STATE DATA
• Hardware overhead
o 2 bits to represent STT_STATE
o 65536 cache lines in 4MB hybrid LLC with 64B blocksize
o 66B for swap/migration buffer
o 2*65536 bits + 528 bits ~ 16kB additional space in LLC
o Total hardware overhead = 16kB ~ 0.39% of LLC
Negligible hardware overhead
33
TAG STT_STATE DATA
66B swap/migration buffer
4MB hybrid LLC
16kB space for STT_STATE
Analysis of prior work RWEEHC
• Performance overhead
o Dataless entries cause more writebacks to the LLC
o The writeback buffer gets full more often
o Hence the system stalls more often
P1 P3
L1:
x=6 L1:
Shared mem:
P2
L1:
x=INV
Eviction
Write-back ‘x’
(clean/dirty)
Rd->x, Wr->x
Main memory
On miss at line ‘x’
34
Analysis of prior work RWEEHC
Performance affected due to stalling
P1 P3
L1:
x=6 L1:
Shared mem:
x=6
P2
L1:
x=INV
Eviction
Write-back ‘x’
(clean/dirty)
Rd->x, Wr->x
Main memory
On miss at line ‘x’
35
Introduction
Related work
Analysis of prior works
• Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
36
• Uses a hybrid last level cache (LLC) comprised of SRAM and STTRAM
• Uses the existing cache block state to track eviction of dirty blocks from L1
• Avoids writebacks to the STTRAM region of the LLC due to eviction of dirty blocks from L1
Motivation
What does
SLAM do?
37
System configuration
CPU: x86, 2.66GHz, 4 cores, out-of-order execution
L1 cache: 32kB SRAM split I/D caches; 8-way, 64B blocksize; 4-cycle read and write latency; LRU replacement policy; write-invalidate, write-back; directory-based MESI
L2 cache/LLC: 4MB 16-way inclusive hybrid (1MB SRAM + 3MB STTRAM); 4-way SRAM and 12-way STTRAM, 64B blocksize; 8-cycle SRAM read and write latency; 8-cycle STTRAM read latency; 32-cycle STTRAM write latency; LRU replacement policy; write-back cache
Simulator used: SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)
Benchmarks used: PARSEC-2.1 and SPLASH-2
Motivation
38
Sources of writes to LLC
Coherency writes constitute 60% of all writes to the LLC
Motivation
39
coherency core prefetch
Motivation
Writebacks due to eviction of dirty blocks constitute 88% of all coherency writes
40
Writebacks due to dirty eviction
Writebacks due to request from another
core
Coherency writes
Can we avoid coherency
writes to LLC?
Copy is requested by
another core
(priority writeback)
Copy is NOT requested
by another core
(NOT a priority writeback)
Writeback due to dirty eviction
Motivation
Writeback due to request from peer processor
41
Introduction
Related work
Analysis of prior works
Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
42
C0
L1 L1
43
C3
A set in L1 cache (8-way), showing each line's state and LRU count bits (Tag and Data fields omitted):
Line index: 0  1  2  3  4  5  6  7
State:      M  M  M  M  M  E  M  S
LRU count:  5  3  4  2  7  6  1  0
4-way SRAM region
12-way STTRAM region
TAG DATA
SLAM framework
Line 4 is LRU
Check if the writeback targets the STTRAM region
Search the L1 set for a clean block
Drop the clean block silently (no writeback needed)
16-way set in hybrid L2/LLC
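The eviction policy on this slide can be sketched as follows. One detail is our assumption: when several clean blocks exist, this sketch picks the LRU-most clean one; the slide only states that a clean block is searched for and dropped silently.

```python
# Sketch of SLAM's eviction choice: if the LRU victim in an L1 set is
# dirty (state M) and its writeback would land in the STTRAM region of
# the LLC, prefer evicting a clean block (E or S), which can be dropped
# silently with no LLC write. The data layout here is illustrative.
def choose_victim(l1_set, writeback_targets_sttram):
    lru = max(l1_set, key=lambda b: b["lru"])   # highest count = LRU
    if lru["state"] == "M" and writeback_targets_sttram:
        clean = [b for b in l1_set if b["state"] in ("E", "S")]
        if clean:   # drop the LRU-most clean block instead (assumption)
            return max(clean, key=lambda b: b["lru"]), "drop_silently"
    return lru, "writeback" if lru["state"] == "M" else "drop_silently"

# The slide's 8-way set: states M M M M M E M S, LRU counts 5 3 4 2 7 6 1 0
states = ["M", "M", "M", "M", "M", "E", "M", "S"]
counts = [5, 3, 4, 2, 7, 6, 1, 0]
l1_set = [{"idx": i, "state": s, "lru": c}
          for i, (s, c) in enumerate(zip(states, counts))]
victim, action = choose_victim(l1_set, writeback_targets_sttram=True)
# line 4 (M, count 7) is LRU, but clean line 5 (E) is dropped silently
```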
SLAM
• Hardware overhead
o 32-bit buffer for each L1 to hold the address of the actual LRU dirty block selected for eviction from L1
o Two 2-bit registers for each L1 to represent one cache block state from {M,E,S,I}
o Total hardware overhead = 4*32 + 4*2*2 = 144 bits = 18B
o Negligible compared to 4MB LLC
Negligible hardware overhead
44
32-bit address buffer
TAG DATA
4MB hybrid LLC
2-bit register2-bit register
x4
SLAM
45
• Performance overhead
o Extra access to the 4MB LLC costs 8 cycles
o 1 cycle to load cache block states from L1 into the buffer for comparison
o 1 cycle to perform the comparison
o A clean block is searched iteratively across the entire 8-way set of L1

Extra LLC access cycles | Extra execution cycles | Extra total cycles
Best case: 8 | 2 | 10
Worst case: 8 | 14 | 22

o Best case: clean block is found in the first iteration; number of cycles = 2 + 8 = 10 cycles
o Worst case: clean block is found in the last iteration; number of cycles = 2*7 + 8 = 22 cycles
o Each writeback to the STTRAM region needs 32 cycles (write latency of STTRAM)
o Hence performance of the overall system is maintained
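The cycle arithmetic above can be checked directly; even the worst-case search cost stays below the 32-cycle STTRAM write it avoids.

```python
# Reproducing the SLAM cycle-overhead arithmetic from the slide.
llc_access = 8      # extra access to the 4MB LLC
load_states = 1     # load L1 block states into the buffer
compare = 1         # one comparison per search iteration
ways = 8            # clean block searched across the 8-way L1 set

best_case = llc_access + (load_states + compare) * 1           # 10 cycles
worst_case = llc_access + (load_states + compare) * (ways - 1) # 22 cycles
sttram_write = 32   # each avoided STTRAM writeback costs 32 cycles
# worst_case < sttram_write, so overall performance is maintained
```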
WC AC WLC ALC
TAG STT_STATE DATA
PTHCM RWEEHC SLAM
TAG DATA
Hardware overhead comparison
46
TAG DATA
68B swap/migration buffer 66B swap/migration buffer
TAG WC AC
TAG STT_STATE DATA
163kB prediction table
12 bits for extra fields
per cache line
4MB hybrid LLC 4MB hybrid LLC
16kB space for STT_STATE
32-bit address buffer
2-bit register
x4
x4
TAG DATA
4MB hybrid LLC
2-bit register
Iso-area LLC configurations (44.7305 mm² each):
SRAM LLC: 2MB
Hybrid LLC: 4MB (1MB SRAM + 3MB STTRAM)
STTRAM LLC: 8MB
SLAM
47
SRAM-STTRAM partition in hybrid LLC
• Total energy is least for the 4-12 combination (4-way SRAM and 12-way STTRAM) partition
• The 4-12 combination is the best fit for the selected LLC on-chip area
• Results shown for only 4 workloads for brevity; conclusions are the same across other workloads
SLAM
48
Introduction
Related work
Analysis of prior works
Motivation
Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work
Overview
49
Benchmark selection
• PARSEC-2.1 and SPLASH-2
• Parallel and multi-threaded
• Diverse application domain
• Large usage and exchange of shared data
Workload | Application domain | % coherency writes
swaptions Financial analysis 68%
freqmine Data mining 68%
fluidanimate Animation 30%
raytrace Graphics 32%
cholesky Sparse matrix factorization kernel 66%
barnes N-body problem (3D) 65%
fmm N-body problem (2D) 39%
lu.cont Dense matrix factorization kernel 89%
fft Blocked matrix transpose kernel 36%
ocean.cont Large-scale ocean movements 94%
radix Integer radix sort kernel 75%
Experimental setup
50
Power and energy parameters
• Extracted from CACTI and NVSim for
STTRAM
• Scaled for 45nm technology from various
previous works
• Used to evaluate total LLC energy
consumption
Parameter | 2MB SRAM LLC | 8MB STTRAM LLC | 4MB hybrid LLC (SRAM/STTRAM)
Read energy (nJ/access) | 0.3072 | 0.1484 | 0.3072/0.1484
Write energy (nJ/access) | 0.3072 | 2.78 | 0.3072/2.78
Static power (mW) | 3825 | 1040 | 2302.5
Experimental setup
51
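The parameters in this table feed a standard total-energy model: dynamic energy (accesses times per-access energy) plus static energy (leakage power times runtime). The function below is an illustrative sketch of that model; the access counts in the example are made up.

```python
# Illustrative LLC energy model using the table's parameters:
# total = dynamic (reads/writes x per-access energy) + static
# (leakage power x runtime). Units: nJ per access, mW, seconds -> mJ.
def llc_energy_mj(reads, writes, runtime_s, read_nj, write_nj, static_mw):
    dynamic_mj = (reads * read_nj + writes * write_nj) * 1e-6  # nJ -> mJ
    static_mj = static_mw * runtime_s                          # mW*s = mJ
    return dynamic_mj + static_mj

# Example: 1M STTRAM writes at 2.78 nJ/access cost 2.78 mJ of dynamic
# energy, versus ~0.31 mJ for the same writes hitting the SRAM ways
e_stt = llc_energy_mj(0, 1_000_000, 0.0, 0.0, 2.78, 0.0)
e_sram = llc_energy_mj(0, 1_000_000, 0.0, 0.0, 0.3072, 0.0)
```

This gap in write energy is why avoiding STTRAM writes (as SLAM does) dominates the LLC energy savings.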
Experimental setup
52
• Applications were run to completion on all four cores while exploiting cache coherence with detailed
models of cores, caches and interconnection networks
• The number of LLC accesses was collected over the entire application runtime to evaluate
o Minimized writes to STTRAM
o Decreased total energy consumption of LLC
• Performance is measured in terms of IPC (Instructions Per Cycle)
• Comparison of SLAM’s energy and performance with
o PTHCM [B. Quan, et al, IEEE ICCCT, 2012]
o RWEEHC [S. Agarwal, et al, IEEE VLSI-SoC, 2016]
Simulator setup
• Simulator used – SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)
• Used two metrics for evaluation- Total LLC energy and overall system performance
Introduction
Related work
Analysis of prior works
Motivation
Proposed SLAM framework
Experimental setup
• Results
• Conclusion and future work
Overview
53
Energy evaluation for SLAM
• The use of hybrid LLC architecture saved energy compared to SRAM-only and STTRAM-only LLC architectures
• Negligible use of external hardware led to significant energy savings compared to PTHCM and RWEEHC
Comparison architecture | Average LLC energy savings
SRAM 18.94%
STTRAM 32.31%
PTHCM 38.79%
RWEEHC 8.97%
Results
54
Performance evaluation for SLAM
• SLAM outperforms SRAM-only and STTRAM-only LLC architectures by avoiding writeback operations, thus avoiding saturation of the writeback buffer
• SLAM outperforms PTHCM and RWEEHC by eliminating migration/swapping between SRAM and STTRAM regions
Comparison architecture | Average IPC improvement
SRAM 4.631%
STTRAM 0.607%
PTHCM 6.863%
RWEEHC 0.407%
Results
55
Introduction
Related work
Analysis of prior works
Motivation
Proposed SLAM framework
Experimental setup
Results
• Conclusion and future work
Overview
56
Conclusion
• Designed a framework that
o Tracks writeback operations to the LLC
o Avoids writeback operations to the STTRAM region of the LLC due to dirty evictions from L1
• Performed a comprehensive energy and performance comparison, under the same area constraint, with
o Baseline SRAM based LLC architecture
o Baseline STTRAM based LLC architecture
o PTHCM based hybrid LLC architecture
o RWEEHC based hybrid LLC architecture
• Compared to SRAM, STTRAM, PTHCM and RWEEHC
o Achieved 18.94%, 32.31%, 38.79% and 8.97% total LLC energy savings respectively
o Achieved 4.631%, 0.607%, 6.863% and 0.407% improvement in performance respectively
57
Future work
There are several potential extensions to our work, for example, consideration of:
• Three- and higher-level cache hierarchies, where writeback operations to the LLC may vary with the cache level
• An exclusive LLC, as it is populated only through writebacks due to evictions from L1
• A write-through LLC, wherein writebacks due to conflict misses at L1 are part of non-idle CPU time
• Lower nanometer technologies, wherein writes to STTRAM are unstable because of smaller MTJ thickness
58
Thank you
59