Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture
Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)
Multicore Embedded Systems
• Increasing number of multiprocessor-based embedded systems.
• Low energy requirement with little compromise on performance.
• Significant energy consumption in the memory subsystem (caches, shared bus, main memory).
Symmetric Multiprocessor System
[Figure: CPU 1, CPU 2, ..., CPU n, each with a private D$ and I$, sharing one Shared Memory.]
Cache Coherency Problem
[Figure: CPU 1, CPU 2, ..., CPU n, each with a private D$ and I$, sharing one Shared Memory.]
Snoopy Hardware Coherence Protocols
[Figure: CPU 1, CPU 2, ..., CPU n, each with a private D$ and I$, sharing one Shared Memory.]
Snoop misses consume excessive energy
Snoop Filters
[Figure: CPU 1, CPU 2, ..., CPU n, each with a private D$ and I$ and one snoop filter (SF) per CPU, sharing one Shared Memory.]
A snoop filter lookup costs less energy than a cache lookup
Snoop Filters in Prior Art
• Include, Exclude and Hybrid JETTY
– Expensive for an embedded system in terms of area.
– Energy consumed by the JETTYs themselves is significant.
• Stream Registers
– Present in IBM's BlueGene Supercomputer.
– Inclusive filter.
– Uses a base and mask register pair to track the cache lines.
Stream Registers
Base       Mask       Valid   Address
1 0 0 1    1 1 1 1    1       0b1001
1 0 0 1    1 1 0 0    1       0b1010
---        ---        0
No general mechanism to remove an address from an SR without compromising correctness
Addresses with 10XX result in snoop filter hit
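The base/mask behaviour shown on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BlueGene implementation: the class name StreamRegister, the 4-bit WIDTH, and the choice to keep the base fixed at the first inserted address (as the two table rows suggest) are all hypothetical.

```python
from dataclasses import dataclass

WIDTH = 4                      # assumed address width, matching the 4-bit slide example
ALL_ONES = (1 << WIDTH) - 1


@dataclass
class StreamRegister:
    """Hypothetical model of one base/mask/valid stream register."""
    base: int = 0
    mask: int = 0
    valid: bool = False

    def insert(self, addr: int) -> None:
        """Record that a line with this address was brought into the cache."""
        if not self.valid:
            # First address: every bit is significant.
            self.base, self.mask, self.valid = addr, ALL_ONES, True
        else:
            # Clear the mask bits in which the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & ALL_ONES

    def may_contain(self, addr: int) -> bool:
        """True means the snoop cannot be filtered and must reach the cache."""
        return self.valid and (addr & self.mask) == (self.base & self.mask)


sr = StreamRegister()
sr.insert(0b1001)              # base=1001, mask=1111
sr.insert(0b1010)              # base=1001, mask=1100
hits = [a for a in range(1 << WIDTH) if sr.may_contain(a)]   # the 10XX addresses
```

Note that the sketch has no remove operation: once a mask bit is cleared, nothing in the register says when it is safe to set it again, which is the limitation the slide points out.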
Drawbacks of Stream Register based Snoop Filters
• No efficient way to update the registers when a line is removed from cache
– Degraded filtering performance over time
– Additional logic units introduced but not efficient (e.g., cache wrap detection)
Our Contribution
• Counting Stream Registers
– Eliminates cache wrap detection logic
– Counter to track cache lines
– More robust to workload variability
– Better or similar energy savings compared to SRs
Counting Stream Registers
Base       Mask       Counter   Address
1 0 0 1    1 1 1 1    0x01      0b1001
1 0 0 1    1 1 0 0    0x02      0b1010
---        ---        0
Removes the need for extra logic such as cache wrap detection, active register history, etc.
Invalidated cache lines can be tracked by decrementing the counter
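A minimal Python sketch of the counter idea follows. It assumes a single 4-bit register, and the names CountingStreamRegister, insert, remove and may_contain are hypothetical; the slides do not specify the exact update rules, so treat the details (for example, resetting base and mask when the counter is zero) as one plausible reading.

```python
from dataclasses import dataclass

WIDTH = 4                      # assumed address width, matching the 4-bit slide example
ALL_ONES = (1 << WIDTH) - 1


@dataclass
class CountingStreamRegister:
    """Hypothetical model of one base/mask/counter register."""
    base: int = 0
    mask: int = 0
    counter: int = 0           # cached lines currently covered; 0 means empty

    def insert(self, addr: int) -> None:
        if self.counter == 0:
            # Empty register: start over with an exact match on this address.
            self.base, self.mask = addr, ALL_ONES
        else:
            # Widen the match by clearing the bits that differ from the base.
            self.mask &= ~(self.base ^ addr) & ALL_ONES
        self.counter += 1

    def remove(self, addr: int) -> None:
        """A covered line was evicted or invalidated: decrement the counter."""
        if self.may_contain(addr):
            self.counter -= 1

    def may_contain(self, addr: int) -> bool:
        """True means the snoop cannot be filtered and must reach the cache."""
        return self.counter > 0 and (addr & self.mask) == (self.base & self.mask)


csr = CountingStreamRegister()
csr.insert(0b1001)             # counter=1, mask=1111
csr.insert(0b1010)             # counter=2, mask=1100
csr.remove(0b1001)             # counter=1, mask still 1100, so 10XX still hits
csr.remove(0b1010)             # counter=0, register is empty again
csr.insert(0b0110)             # fresh start: base=0110, mask=1111
```

In this sketch the register regains full precision as soon as its counter returns to zero, without any separate wrap-detection logic; remove is assumed to be called only for lines that were previously inserted.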
Snoop Filter Architecture
Index to direct mapped snoop filter table
Set of cache lines grouped into a page
Used for comparison with base register
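The three labels above describe how a snooped address is split into fields. The sketch below shows one possible split; the field order (page bits lowest, index in the middle, compare bits on top) and all three widths are assumptions made for illustration, since the slide text gives neither.

```python
# Hypothetical field widths for a 32-bit address; none of these come from the slides.
PAGE_BITS = 8      # low bits: position inside a page (a group of cache lines)
INDEX_BITS = 4     # selects one entry of the direct-mapped snoop filter table
TAG_BITS = 20      # upper bits: compared against the entry's base register


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, index, page_offset) for a 32-bit address."""
    page_offset = addr & ((1 << PAGE_BITS) - 1)
    index = (addr >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (PAGE_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, page_offset


tag, index, page_offset = split_address(0x12345678)
# Under these assumptions, all addresses in the same page share tag and index,
# so they are tracked by the same table entry.
same_entry = split_address(0x12345600)[:2] == split_address(0x123456FF)[:2]
```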
Experimental Analysis
• Virtex 2 FPGA running OpenRISC soft cores
– Configurable no. of processors, associativity and size of data and instruction cache, cache type and coherence protocol
• EEMBC MultiBench benchmarks
• CACTI 5.3 energy model
– Total memory subsystem energy accounts for main memory r/w energy, data and instruction cache r/w energy, leakage and snoop energy
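The energy accounting described in the last bullet can be written as a simple sum of components. The sketch below is an assumption about the bookkeeping only: the class and field names are hypothetical, and the numbers are arbitrary placeholders in arbitrary units, not measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class EnergyBreakdown:
    """Hypothetical container for the components listed on the slide."""
    main_memory_rw: float
    dcache_rw: float
    icache_rw: float
    leakage: float
    snoop: float

    def total(self) -> float:
        """Total memory subsystem energy: the sum of all components."""
        return (self.main_memory_rw + self.dcache_rw + self.icache_rw
                + self.leakage + self.snoop)

    def snoop_share(self) -> float:
        """Fraction of the total spent on snoop lookups."""
        return self.snoop / self.total()


# Placeholder values, chosen only to make the arithmetic easy to follow.
example = EnergyBreakdown(main_memory_rw=50.0, dcache_rw=20.0, icache_rw=15.0,
                          leakage=6.0, snoop=9.0)
share = example.snoop_share()
```

Since a snoop filter can only reduce the snoop component, the snoop share is also an upper bound on the fraction of total energy that filtering could save in this model.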
Cache Design Space Exploration
Results: Filtering Percentage
CSR achieves a higher filtering % with a smaller number of registers
Analysis: RGB2CMYK Benchmark
Discussion: Energy Consumption
• For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters
• CSR filters more effective for certain benchmarks (H.264, Image rotation)
– Better filtering performance with smaller no. of stream registers.
• Small reduction in overall energy
– Platform limited to 32 MB of off-chip SDRAM
– No complex data sharing and limited no. of multiple producers of same data
Summary
• Introduced a snoop filter architecture based on counting stream registers
– Lower hardware complexity and ability to track cache line invalidations
• Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across different workloads.