Flexible Snooping:
Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
Karin Strauss, Xiaowei Shen*, Josep Torrellas
University of Illinois at Urbana-Champaign
*IBM Research
http://iacoma.cs.uiuc.edu
Motivation
• CMPs are becoming standard components
• cheaper to build medium size machines
– 32 to 128 cores (multi-CMP)
• shared memory, cache coherent
– easier to program, easier to manage
• supporting cache coherence is difficult
Cache coherence solutions
strategy                   ordered network?   pros       cons
snoopy broadcast bus       yes                simple     difficult to scale
directory based protocol   no                 scalable   indirection, extra hardware
snoopy embedded ring       no                 simple     long latencies
• other proposals (e.g. token coherence)
Contributions
• family of adaptive coherence protocols for rings
• two were chosen as best options
– Superset Aggressive: high performance scheme
– Superset Conservative: energy conscious scheme
• each compared to the fastest state-of-the-art scheme in performance and energy consumption
Multi-CMP multiprocessor
[Figure: multi-CMP system; labels: local network, CMP (Proc + L1 + L2), memory]
• coherence protocol used: only one supplier if line is cached
Ring in action
[Figure: ring diagrams for Lazy, Eager, and Oracle, each with nodes labeled R and S; legend: cmp, supplier predictor, snoop, request, response, data]
Ring in action
[Figure: Lazy, Eager, and Oracle compared on latency, snoops, and messages]
• goal: adaptive schemes that approximate Oracle's behavior
Primitive snooping actions
• snoop and then forward
+ fewer messages
• forward and then snoop
+ shorter latency
• forward only
+ fewer snoops
+ shorter latency
– false negative predictions not allowed
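As a rough illustration of these trade-offs, the sketch below walks a request around a unidirectional ring and counts the snoops performed and the time until the supplier finishes its snoop. It is a toy model, not the simulator used in the talk: the `Action` enum, the `service_miss` function, and the unit hop and snoop costs are hypothetical. It also assumes that Lazy applies snoop-then-forward at every node, Eager applies forward-then-snoop at every node, and Oracle forwards everywhere except at the supplier; the slides do not spell this out.

```python
from enum import Enum


class Action(Enum):
    SNOOP_THEN_FORWARD = "snoop then forward"
    FORWARD_THEN_SNOOP = "forward then snoop"
    FORWARD_ONLY = "forward only"


def service_miss(actions, supplier, hop=1, snoop=1):
    """Toy cost model (assumed, not from the slides).

    actions[i] is the action taken by the i-th node after the requester;
    supplier is the index of the node that can supply the line.
    Returns (snoops, latency), where latency is the time until the
    supplier has finished its snoop.
    """
    if not 0 <= supplier < len(actions):
        raise ValueError("supplier index out of range")
    if actions[supplier] is Action.FORWARD_ONLY:
        raise ValueError("the supplier must snoop: false negatives are not allowed")

    snoops = 0
    arrival = 0      # time at which the request reaches the current node
    latency = None
    for i, action in enumerate(actions):
        arrival += hop
        if action is not Action.FORWARD_ONLY:
            snoops += 1
        if i == supplier:
            latency = arrival + snoop
            if action is Action.SNOOP_THEN_FORWARD:
                break  # the request turns into a reply; nobody else snoops
        elif action is Action.SNOOP_THEN_FORWARD:
            arrival += snoop  # the request waits for this node's snoop
    return snoops, latency


others, supplier = 7, 3  # 8-node ring, supplier is the 4th node downstream
lazy = [Action.SNOOP_THEN_FORWARD] * others
eager = [Action.FORWARD_THEN_SNOOP] * others
oracle = [Action.FORWARD_ONLY] * others
oracle[supplier] = Action.SNOOP_THEN_FORWARD

print(service_miss(lazy, supplier))    # (4, 8)
print(service_miss(eager, supplier))   # (7, 5)
print(service_miss(oracle, supplier))  # (1, 5)
```

Under these assumptions Lazy pays a snoop delay at every node before the supplier, Eager reaches the supplier quickly but makes every node snoop, and Oracle gets both the short latency and the single snoop.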
Predictors and algorithms
predictor / algorithm   action on positive prediction   action on negative prediction
Subset                  snoop                           forward then snoop
Superset Con            snoop then forward              forward
Superset Agg            forward then snoop              forward
Exact                   snoop                           forward
[Figure: for each predictor, the set of addresses in the predictor versus the set of addresses the node can supply]
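The table can be read as a small lookup from (algorithm, prediction) to a snooping action. The sketch below encodes it that way and checks the one safety rule the slides state, namely that forward-only must never be used when the prediction could be a false negative. The names `ALGORITHMS`, `choose_action`, and `is_safe` are hypothetical, and treating Subset as the only predictor with false negatives is an inference from its definition as a subset of the suppliable addresses.

```python
# (action on positive prediction, action on negative prediction)
ALGORITHMS = {
    "Subset": ("snoop", "forward then snoop"),
    "SupersetCon": ("snoop then forward", "forward"),
    "SupersetAgg": ("forward then snoop", "forward"),
    "Exact": ("snoop", "forward"),
}

# A subset predictor can miss addresses the node is able to supply,
# so its negative predictions may be wrong (inferred, see lead-in).
MAY_HAVE_FALSE_NEGATIVES = {"Subset"}


def choose_action(algorithm, prediction_is_positive):
    """Return the snooping action for one node and one request."""
    on_positive, on_negative = ALGORITHMS[algorithm]
    return on_positive if prediction_is_positive else on_negative


def is_safe(algorithm):
    """Forward-only on a negative prediction requires no false negatives."""
    skips_snoop = choose_action(algorithm, False) == "forward"
    return not (skips_snoop and algorithm in MAY_HAVE_FALSE_NEGATIVES)


for name in ALGORITHMS:
    print(name, choose_action(name, True), "|", choose_action(name, False), is_safe(name))
```

Every row passes the check: only Subset can mispredict negatively, and it is the only algorithm that still snoops on a negative prediction.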
Algorithms
Per miss service:
[Figure: Eager, Subset, Lazy, SupersetAgg, SupersetCon, and Oracle / Exact compared on number of snoops, snoop message latency, and number of messages]
algorithm      negative             positive
Subset         forward then snoop   snoop
Superset Con   forward              snoop then forward
Superset Agg   forward              forward then snoop
Exact          forward              snoop
Predictor implementation
• Subset
– associative table: subset of addresses that can be supplied by node
• Superset
– Bloom filter: superset of addresses that can be supplied by node
– associative table (exclude cache): addresses that recently suffered false positives
• Exact
– associative table: all addresses that can be supplied by node
– downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
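A minimal sketch of the Superset predictor described above, combining a Bloom filter with an exclude cache. Everything beyond those two components is an assumption made for illustration: the class and method names, the table sizes, the hash function, LRU replacement in the exclude cache, and the use of counters instead of single bits (so that lines leaving the node can be removed from the filter).

```python
from collections import OrderedDict


class SupersetPredictor:
    """Hypothetical superset predictor: counting Bloom filter + exclude cache."""

    def __init__(self, num_counters=1024, num_hashes=3, exclude_entries=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes
        self.exclude = OrderedDict()  # addresses with recent false positives
        self.exclude_entries = exclude_entries

    def _slots(self, addr):
        # Illustrative multiplicative hashes; not taken from the slides.
        n = len(self.counters)
        slots = []
        for i in range(self.num_hashes):
            mixed = (addr + 1) * (2 * i + 3) * 2654435761
            slots.append((mixed >> 7) % n)
        return slots

    def line_gained(self, addr):
        """The node can now supply addr."""
        for s in self._slots(addr):
            self.counters[s] += 1
        self.exclude.pop(addr, None)  # a stale exclusion would be a false negative

    def line_lost(self, addr):
        """The node can no longer supply addr (only call after line_gained)."""
        for s in self._slots(addr):
            self.counters[s] -= 1

    def report_false_positive(self, addr):
        """A snoop found nothing after a positive prediction."""
        self.exclude[addr] = True
        self.exclude.move_to_end(addr)
        if len(self.exclude) > self.exclude_entries:
            self.exclude.popitem(last=False)  # drop the oldest entry

    def predict(self, addr):
        """True means the node may be able to supply addr."""
        if addr in self.exclude:
            return False
        return all(self.counters[s] > 0 for s in self._slots(addr))


p = SupersetPredictor(num_counters=1, num_hashes=1)  # tiny filter: forces aliasing
p.line_gained(1)
print(p.predict(2))          # True: a false positive caused by aliasing
p.report_false_positive(2)
print(p.predict(2))          # False: filtered by the exclude cache
```

The filter never answers negative for a line the node holds, which is what allows the Superset algorithms to forward without snooping on a negative prediction; the exclude cache only trims repeated false positives.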
Downgrading
[Figure: downgrading example; labels: A, B, E, S]
Negative effects:
• writes by this node need to snoop other nodes
• reads and writes by other nodes need to fetch line from memory
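The downgrading rule for the Exact predictor can be sketched as a fixed-capacity table that forces a downgrade whenever it evicts an entry. The class name, the `downgrade` callback, and the LRU replacement policy are hypothetical; the slides only say that an address evicted from the predictor table requires the corresponding line in the node to be downgraded.

```python
from collections import OrderedDict


class ExactPredictor:
    """Hypothetical exact predictor: tracks every address the node can supply."""

    def __init__(self, capacity, downgrade):
        self.capacity = capacity
        self.downgrade = downgrade   # called with the evicted address
        self.table = OrderedDict()   # insertion order doubles as LRU order

    def line_gained(self, addr):
        """The node can now supply addr; make room in the table if needed."""
        if addr in self.table:
            self.table.move_to_end(addr)
            return
        if len(self.table) >= self.capacity:
            victim, _ = self.table.popitem(last=False)
            # The table must stay exact, so the node gives up the ability
            # to supply the victim line.
            self.downgrade(victim)
        self.table[addr] = True

    def line_lost(self, addr):
        """The node can no longer supply addr."""
        self.table.pop(addr, None)

    def predict(self, addr):
        return addr in self.table


downgraded = []
p = ExactPredictor(capacity=2, downgrade=downgraded.append)
for addr in (0xA0, 0xB0, 0xC0):
    p.line_gained(addr)
print(downgraded)                        # [160]
print(p.predict(0xA0), p.predict(0xC0))  # False True
```

After the eviction, the node can no longer supply line 0xA0, which is where the negative effects listed above come from.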
Experiments
• 8 CMPs, 4 out-of-order cores each = 32 cores
– private L2 caches
• on-chip bus interconnect
• off-chip 2D torus interconnect with embedded unidirectional ring
• per node predictors: latency of 3 processor cycles
• sesc simulator (sesc.sourceforge.net)
• SPLASH-2, SPECjbb, SPECweb
Execution time
[Figure: normalized execution time on SPLASH-2, SPECjbb, and SPECweb for Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, and Exact]
• the fastest of all algorithms is SupersetAgg
• performance of most flexible snooping algorithms is similar to Eager
Miss service energy
[Figure: normalized miss service energy consumption on SPLASH-2, SPECjbb, and SPECweb for Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, and Exact; y-axis from 0 to 2, with one off-scale bar labeled 3.22]
• SupersetCon is the least energy-hungry algorithm
• algorithms that eagerly forward messages use more energy