Token Coherence: Decoupling Performance and Correctness
Article by: Martin, Hill & Wood
Presented by: Michael Tabet
University of Utah, CS 7698, March 24, 2005

A Tale of Two Methods
Snooping based:
  Uses totally ordered broadcasts to preserve correctness
  Uses lots of bandwidth; big (large) buses = BAD!
Directory based:
  Uses indirection to conserve bandwidth
  Indirection adds latency
  Needs a directory controller

Potential Workarounds
Snooping:
  Snooping is fast, but requires a bus. Big, fast buses are complex
  -> Use a virtual bus to broadcast virtually!
Directory:
  Networks require lots of logic (especially big ones)
  -> Use glueless networks!

Token Coherence
Provides both indirection and speedup through unordered broadcasts
Two components:
  Correctness substrate
  Performance protocol

Correctness
Speed is good, correctness is better!
Need to guarantee ordered reads/writes!
Thus, use a correctness “substrate”

Correctness Invariants
1. At all times, each block has T tokens
2. A processor can write a block only if it holds all T tokens for that block
3. A processor can read a block only if it holds at least one token for that block
4. If a coherence message contains one or more tokens, it must contain data
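
The first three invariants reduce to simple token arithmetic. A minimal sketch follows, assuming a hypothetical `CacheLine` record per node and T = 4; the names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

T = 4  # hypothetical: total tokens for the block (e.g. one per node)


@dataclass
class CacheLine:
    tokens: int = 0  # tokens this node currently holds for the block

    def can_read(self) -> bool:
        # Invariant 3: reading requires at least one token.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Invariant 2: writing requires all T tokens.
        return self.tokens == T


def tokens_conserved(lines, in_flight=0) -> bool:
    # Invariant 1: tokens held by nodes plus tokens inside messages sum to T.
    return sum(line.tokens for line in lines) + in_flight == T


# Hypothetical 4-node snapshot: one sharer, one node holding the rest.
nodes = [CacheLine(1), CacheLine(0), CacheLine(3), CacheLine(0)]
```

Because a writer must hold every token, no other node can hold a token (and so read) while the write is in progress.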

Invariant 1 Implications
Allows precise control over blocks of data.

Invariant 2 Implications
Enables a write-control mechanism that allows in-order writes.

Invariant 3 Implications
Restricts reads to processors that hold at least one token.

Invariant 4 Implications
Provides a method to ensure cache coherence.

Starvation
Invariants allow for ordered reads/writes, but how do we prevent starvation?
Persistent requests:
1. A processor times out on transient requests
2. It raises a persistent request (only one per block)
3. All nodes must forward the block to the requesting node
But repeated & persistent requests make up only 1-3% of the messages
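
The escalation steps above can be sketched roughly as follows. `PersistentTable`, `request_block`, `try_transient`, and `max_tries` are hypothetical names for illustration; the timeout is modeled as a callable that returns False.

```python
class PersistentTable:
    """Tracks the single active persistent request allowed per block."""

    def __init__(self):
        self.active = {}  # block address -> requesting node

    def raise_request(self, block, node) -> bool:
        # Step 2: only one persistent request may be active per block.
        if block in self.active:
            return False
        self.active[block] = node
        return True

    def forward_target(self, block):
        # Step 3: every node forwards the block to this node (None if no request).
        return self.active.get(block)


def request_block(block, node, try_transient, table, max_tries=2):
    # Step 1: issue transient requests; each False result models a timeout.
    for _ in range(max_tries):
        if try_transient():
            return "transient"
    # Escalate once the transient requests have timed out.
    if table.raise_request(block, node):
        return "persistent"
    return "wait"  # another node's persistent request is already active


table = PersistentTable()
first = request_block(0x40, "P1", lambda: False, table)
```

In this sketch a second starving node gets "wait" until the first persistent request is cleared; how that clearing happens is left out.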

Persistent Request State Diagram

Performance Protocol
But if you always follow the rules, it can get slow and tedious!
Tokens allow for unordered responses to requests.
This opens the door for all sorts of optimizations.

TokenB: A New Contender
Akin to the MSI snooping protocol:
  Requests are broadcast
  Data exists in one of three states:
    Modified (all tokens)
    Shared (some tokens)
    Invalid (no tokens)
But: the performance protocol allows for better performance!
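
The MSI analogy is a direct function of the token count. A small sketch, with `msi_state` as a hypothetical helper name:

```python
def msi_state(tokens: int, total: int) -> str:
    """Map a node's token count for a block to the analogous MSI state."""
    if total < 1 or not 0 <= tokens <= total:
        raise ValueError("token count out of range")
    if tokens == total:
        return "Modified"  # all tokens: may read and write
    if tokens > 0:
        return "Shared"    # some tokens: may read only
    return "Invalid"       # no tokens: may neither read nor write
```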

TokenB: Optimized Token Counting
MSI was a bit of a lie; token counting can be optimized by altering invariants 1, 3, and 4:
1. At all times, each block has T tokens, one of which is the owner token
3. A processor can read a block only if it holds at least one token for that block and has valid data
4. If a coherence message contains the owner token, it must contain data
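
With the revised invariants, only the owner token has to travel with data, and holding a token no longer implies holding valid data. A hedged sketch of both checks, using a hypothetical `CoherenceMessage` record:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoherenceMessage:
    tokens: int                   # number of tokens carried
    has_owner: bool = False       # whether one of them is the owner token
    data: Optional[bytes] = None  # block contents, if included


def is_valid(msg: CoherenceMessage) -> bool:
    if msg.has_owner and msg.tokens < 1:
        return False  # the owner token is itself one of the tokens
    # Revised invariant 4: data is mandatory only with the owner token.
    return msg.data is not None or not msg.has_owner


def can_read(tokens: int, has_valid_data: bool) -> bool:
    # Revised invariant 3: a token alone is not enough; valid data is needed too.
    return tokens >= 1 and has_valid_data
```

Non-owner tokens can then be sent in small data-less messages, which is where the bandwidth saving comes from.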

TokenB Continued: The Good Stuff
Performance: tokens allow replies to be sent unordered and point-to-point (no broadcast)
This means:
  15-28% faster than snooping
  17-54% faster than directory
  21-25% less bandwidth than snooping

An Example
P1 reads, then P2 writes, then P1 reads
Presume a 4-node system, where P1 has an invalid copy, P2 has a shared copy, and P3 is the “home/owner” node

Example: The Snooping Way
[Diagram: nodes P1, P2, P3, P4 exchanging messages 1 through 5]
All messages broadcast!

Example: The Directory Way
[Diagram: nodes P1, P2, P3, P4 and the Directory exchanging messages 1 through 6, with message 4 sent four times]
Directory processes messages 1, 3, 4, 5!

Example: The Token Way
[Diagram: nodes P1, P2, P3, P4 exchanging messages 1 (broadcast), 2, 3 (broadcast), 4 (sent three times), 5 (broadcast), 6]
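
The read/write/read sequence can be traced in token terms with a short simulation. This is a sketch under assumptions: the starting token distribution (P2 holds 1 token, P3 holds the other 3 including the owner token) and the `TokenBlock` class are hypothetical, so the message count need not match the slide's diagram.

```python
class TokenBlock:
    """Token state for one block; the starting distribution is hypothetical."""

    def __init__(self, tokens, owner):
        self.tokens = dict(tokens)              # node -> tokens held
        self.total = sum(self.tokens.values())  # T for this block
        self.owner = owner                      # node holding the owner token
        self.log = []

    def read(self, node):
        if self.tokens[node] >= 1:
            return  # already holds a token, no messages needed
        self.log.append(f"{node}: broadcast read request")
        self.tokens[self.owner] -= 1
        self.tokens[node] += 1
        self.log.append(f"{self.owner} -> {node}: 1 token + data")
        if self.tokens[self.owner] == 0:
            self.owner = node  # the token just sent was the owner token

    def write(self, node):
        if self.tokens[node] == self.total:
            return  # already holds all tokens
        self.log.append(f"{node}: broadcast write request")
        for other, held in list(self.tokens.items()):
            if other != node and held > 0:
                self.tokens[other] = 0
                self.tokens[node] += held
                data = " + data" if other == self.owner else ""
                self.log.append(f"{other} -> {node}: {held} token(s){data}")
        self.owner = node  # the writer now holds every token, owner included


block = TokenBlock({"P1": 0, "P2": 1, "P3": 3, "P4": 0}, owner="P3")
block.read("P1")   # P1 reads
block.write("P2")  # then P2 writes
block.read("P1")   # then P1 reads again
```

Each request is a broadcast, while every response goes straight to the requester without passing through a directory.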

Real World Results
Examined on a tree structure (virtual broadcast) and on a 2D torus
Migratory optimization: a read request that follows a write is forwarded all tokens
Benchmarked on OLTP, SPECjbb, Apache
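
The migratory optimization can be read as a policy for how many tokens a holder returns on a read request. A minimal sketch, assuming a hypothetical `tokens_to_send_on_read` helper and a per-block flag recording whether the holder has written:

```python
def tokens_to_send_on_read(held: int, total: int, holder_has_written: bool) -> int:
    """How many tokens a holder forwards in response to a read request."""
    if held == 0:
        return 0      # nothing to give
    if held == total and holder_has_written:
        return total  # migratory: hand over everything so the reader can write next
    return 1          # ordinary read: one token is enough
```

Handing over all tokens saves the reader a second request if it goes on to write the block, which is the common pattern for migratory data.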

Results
Token vs Snooping: Token wins!

Results
Directory vs Token: Token mostly wins!

Conclusion
TokenB offers good performance for small to mid-sized parallel systems
Broadcast limits scalability past 16 nodes
But other performance protocols could scale to larger systems!