Distributed Shared Memory
Directory-Based Cache Coherence
Why is snooping a bad idea? Broadcasting is expensive
Directory: maintain the cache state explicitly
List of caches that have a copy: many read-only copies, but only one writable copy
[Figure: processing nodes (processor P, cache $, communication assist CA), each with its own memory and directory, connected by a scalable interconnection network]
Directory Protocol
[Figure: block X resides in its home memory together with its directory entry; caches across the interconnection network may hold copies of X]
Terminology
Home node: the node in whose main memory the block is allocated
Dirty node: the node that has a copy of the block in its cache in modified state
Owner node: the node that currently holds the valid copy of the block
Exclusive node: the node that has a copy of the block in exclusive state
Local node (requesting node): the node containing the processor that issues a request for the block
Local block: a block whose home is local to the issuing processor
Basic Operations
Read miss to a block in modified state
[Figure: requestor, home, and owner nodes (each a processor P, cache $, communication assist CA, and memory/directory)]
1. Read request (requestor -> home)
2. Response with owner identifier (home -> requestor)
3. Read request (requestor -> owner)
4a. Data reply (owner -> requestor); 4b. Revision message (owner -> home)
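As a rough illustration of the four steps above, a minimal home-node handler is sketched below in C; the types and the send_msg primitive are hypothetical, not taken from any particular machine.

```c
#include <stdio.h>

/* Hypothetical message types and send primitive, for illustration only. */
enum msg { READ_REQ, OWNER_ID_RESPONSE, DATA_REPLY, REVISION };

static void send_msg(int dst, enum msg type, long block, int arg)
{
    printf("msg %d -> node %d (block %ld, arg %d)\n", (int)type, dst, block, arg);
}

/* Directory entry: a dirty flag plus the owner id when a modified copy exists. */
typedef struct { int dirty; int owner; } dir_entry_t;

/* Step 2 of the figure: the home cannot supply the data itself, so it
 * returns the owner's identity; the requestor then performs steps 3 and 4
 * (read request to the owner; data reply to the requestor, revision to home). */
void home_handle_read_req(dir_entry_t *e, int requestor, long block)
{
    if (e->dirty)
        send_msg(requestor, OWNER_ID_RESPONSE, block, e->owner);
    else
        send_msg(requestor, DATA_REPLY, block, 0);   /* block is clean at home */
}
```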
Basic Operations (Contd)
Write miss to a block with two sharers
[Figure: requestor, home, and two sharer nodes (each a processor, cache, communication assist, and memory/directory)]
1. RdEx request (requestor -> home)
2. Response with sharer identifiers (home -> requestor)
3. Invalidation requests (requestor -> sharers)
4. Invalidation acknowledgements (sharers -> requestor)
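A matching sketch of the requestor side of this exchange, again with hypothetical message names: after the home returns the sharer list (step 2), the requestor fans out invalidations and waits for the acknowledgements.

```c
#include <stdio.h>

/* Hypothetical message types and send primitive, for illustration only. */
enum msg { RDEX_REQ, SHARERS_RESPONSE, INVAL_REQ, INVAL_ACK };

static void send_msg(int dst, enum msg type, long block)
{
    printf("msg %d -> node %d (block %ld)\n", (int)type, dst, block);
}

/* Requestor side, steps 3 and 4: fan out invalidations to the sharers named
 * in the home's response, then wait for one acknowledgement per sharer
 * before performing the write (the ack-collection loop is elided here).   */
void requestor_invalidate_sharers(long block, const int *sharers, int n_sharers)
{
    for (int i = 0; i < n_sharers; i++)
        send_msg(sharers[i], INVAL_REQ, block);      /* step 3 */
    /* step 4: the write completes only after n_sharers INVAL_ACK messages. */
}
```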
Alternatives for Organizing Directories
Directory storage schemes
How to find the source of directory information? Flat, centralized, or hierarchical
Flat: directory information is in a fixed place: the home
Hierarchical: a hierarchy of caches that guarantees the inclusion property
How to locate copies? Within flat schemes, memory-based or cache-based
Memory-based: directory information co-located with the memory module that is the home (Stanford DASH/FLASH, SGI Origin, etc.)
Cache-based: caches holding a copy of the memory block form a linked list (IEEE SCI, Sequent NUMA-Q)
Flat Directory Schemes
Full-Map Directory
Limited Directory
Chained Directory
Memory-Based Directory Schemes
Full bit vector (full-map) directory
Most straightforward
Low latency: invalidations are sent in parallel
Main disadvantage: storage overhead of P presence bits per memory block
Increasing the cache block size reduces the number of entries, but access time and network traffic increase due to false sharing
Use a hierarchical protocol, as in Stanford DASH: each node is a bus-based 4-processor multiprocessor, so the directory tracks nodes rather than individual processors
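A minimal sketch of a full bit-vector entry and the parallel invalidation it enables; the 64-node width and the helper names are illustrative assumptions, not any machine's actual layout.

```c
#include <stdint.h>
#include <stdio.h>

#define P 64                        /* nodes tracked by each entry (illustrative) */

/* Full-map entry: one presence bit per node plus a dirty bit, i.e.        */
/* P + 1 bits of directory state for every memory block.                   */
typedef struct {
    uint64_t presence;              /* bit i set => node i has a copy      */
    int      dirty;                 /* set => exactly one modified copy    */
} full_map_entry_t;

static void send_inval(int node, long block)
{
    printf("invalidate block %ld at node %d\n", block, node);
}

/* Invalidations go out in one pass over the bit vector, which is why      */
/* write latency is low compared with chained (list-based) schemes.        */
void invalidate_sharers(full_map_entry_t *e, long block)
{
    for (int node = 0; node < P; node++)
        if (e->presence & (1ULL << node))
            send_inval(node, block);
    e->presence = 0;
}
```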
Storage-Reducing Optimization: Directory Width
Directory width: the number of bits per directory entry
Motivation: usually only a few caches have a copy of a given block
Limited (pointer) directory: storage overhead of k × log P bits per entry (k = number of pointers/copies)
Overflow methods are needed: Dir_i X
i indicates the number of pointers (i < P)
X indicates the invalidation method: broadcast or non-broadcast
Overflow Methods for Limited Protocol
Dir_i B (Broadcast): set the broadcast bit in case of overflow
Broadcast invalidation messages to all nodes
Simple, but increases write latency and wastes communication bandwidth
Dir_i NB (Not Broadcast): invalidate the copy of one existing sharer to make room
Bad for widely shared, read-mostly data
Degradation for intensive sharing of read-only and read-mostly data due to the increased miss ratio
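The limited-directory entry and the two overflow policies above can be sketched as follows; the pointer count, field layout, and function names are assumptions for illustration.

```c
#include <stdio.h>

#define I_PTRS 4                      /* i: hardware pointers per entry (illustrative) */

/* Dir_i entry: up to I_PTRS sharer pointers (log2(P) bits each in real     */
/* hardware) plus an overflow indication.                                    */
typedef struct {
    int ptr[I_PTRS];
    int n_ptrs;
    int overflow;
} limited_entry_t;

/* Record a new sharer. On overflow, Dir_i B just sets the broadcast bit     */
/* (later writes invalidate every node), while Dir_i NB invalidates one      */
/* existing sharer to make room for the new one.                             */
void add_sharer(limited_entry_t *e, int node, int use_broadcast)
{
    if (e->n_ptrs < I_PTRS) {
        e->ptr[e->n_ptrs++] = node;
    } else if (use_broadcast) {
        e->overflow = 1;                                /* Dir_i B           */
    } else {
        printf("invalidate node %d to make room\n", e->ptr[0]);
        e->ptr[0] = node;                               /* Dir_i NB          */
    }
}
```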
Overflow Methods for Limited Protocol (Contd)
Dir_i CV_r (Coarse Vector)
On overflow, the representation changes to a coarse bit vector
Each bit stands for a region of r processors
Invalidations are sent to all caches in the marked regions
Used in the SGI Origin
Robust across different sharing patterns
70% less memory message traffic than broadcast and at least 8% less than the other schemes
Coarse Bit Vector Scheme
[Figure: 16-processor example (P0-P15). With the overflow bit clear, the directory entry's bits hold two 4-bit pointers; once the overflow bit is set, the same bits are reinterpreted as a coarse vector in which each bit stands for a region of consecutive processors (P0-P3, P4-P7, ...), and an invalidation goes to every cache in a marked region]
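A sketch of the reinterpretation shown in the figure: when the pointers overflow, the same storage becomes a coarse vector. The 16-processor size, region width, and field layout are illustrative assumptions.

```c
#include <stdint.h>

#define N_PROC 16                     /* processors, as in the figure        */
#define REGION  4                     /* processors per coarse-vector bit    */
                                      /* (region size is an assumption)      */

/* The same 8 bits of storage, interpreted two ways depending on the        */
/* overflow bit: as two 4-bit pointers, or as a coarse bit vector.          */
typedef struct {
    uint8_t bits;
    uint8_t overflow;
} coarse_entry_t;

/* On overflow, reinterpret the storage as a coarse vector: set the bit     */
/* for the region of every sharer we can still identify plus the new one.   */
void overflow_to_coarse(coarse_entry_t *e, int new_sharer)
{
    int p0 = e->bits & 0x0F;                  /* the two old pointers        */
    int p1 = (e->bits >> 4) & 0x0F;

    e->bits = 0;
    e->bits |= 1u << (p0 / REGION);
    e->bits |= 1u << (p1 / REGION);
    e->bits |= 1u << (new_sharer / REGION);
    e->overflow = 1;                          /* each set bit now means:     */
                                              /* invalidate that whole region */
}
```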
Overflow Methods for Limited Protocol (Contd)
Dir_i SW (Software): the current i pointers and a pointer to the new sharer are saved into a special portion of local main memory by software
MIT Alewife: LimitLESS. The cost of interrupts and software handling is high.
Dir_i DP (Dynamic Pointers): the directory entry contains a hardware pointer into local memory
Similar to the software mechanism, but without the software overhead
The difference is that list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor
Stanford FLASH
Directory overhead: 7-9% of main memory
Dynamic Pointers
Circular sharing list
Directory: a contiguous region in main memory
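A rough sketch of the dynamic-pointer idea: the directory entry holds only a head index into a pool of list nodes kept in a region of local main memory (on FLASH this manipulation is done by the protocol processor). All names and sizes here are hypothetical, and a plain singly linked list stands in for the circular sharing list.

```c
/* Hypothetical dynamic-pointer directory: the entry itself stores only a   */
/* head index into a pool of list nodes kept in local main memory.          */
#define POOL_SIZE 1024
#define NIL       (-1)

struct list_node { int sharer; int next; };     /* pool-resident links       */
static struct list_node pool[POOL_SIZE];        /* the region in main memory */
static int free_head = 0;

typedef struct { int head; } dp_entry_t;        /* one hardware pointer      */

void dp_init(void)                              /* link up the free list     */
{
    for (int i = 0; i < POOL_SIZE; i++)
        pool[i].next = (i + 1 < POOL_SIZE) ? i + 1 : NIL;
}

/* Prepend a sharer to the block's list (pool exhaustion not handled); on   */
/* FLASH this manipulation is done by the protocol processor, not the CPU.  */
void dp_add_sharer(dp_entry_t *e, int sharer)
{
    int n = free_head;                          /* grab a node from the pool */
    free_head = pool[n].next;
    pool[n].sharer = sharer;
    pool[n].next = e->head;
    e->head = n;
}
```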
Stanford DASH Architecture
DASH => Directory Architecture for SHared memory
Nodes connected by a scalable interconnect
Partitioned shared memory
Processing nodes are themselves multiprocessors
Distributed directory-based cache coherence
[Figure: nodes, each containing several processors with caches plus memory and a directory (presence bits and a dirty bit per block), connected by the interconnection network]
Conclusions
Full-map is most appropriate up to a modest number of processors
Dir_i CV_r and Dir_i DP are the most likely candidates for larger systems
Coarse vector: lack of accuracy on overflow
Dynamic pointers: processing cost due to hardware list manipulation
Storage-Reducing Optimization: Directory Height
Directory height: the total number of directory entries
Motivation: the total amount of cache memory is much less than the total main memory
Sparse directory: organize the directory as a cache
This cache has no need for a backing store
When an entry is replaced, send invalidations to the nodes holding copies
Spatial locality is not an issue: one entry per block
The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches
With a directory size factor of 8, 4-way associativity, and LRU replacement, performance is very close to that of a full-map directory
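A sketch of a sparse directory organized as a set-associative cache of directory entries; the set count, associativity, and victim choice are illustrative (a real design would use LRU, as cited above).

```c
#include <stdint.h>
#include <stdio.h>

#define SETS 1024
#define WAYS 4                         /* 4-way associative, as cited above  */

typedef struct {
    long     tag;                      /* block address tag                  */
    uint64_t presence;                 /* sharers of the tracked block       */
    int      valid;
} sparse_entry_t;

static sparse_entry_t dir[SETS][WAYS];

static void invalidate_copies(long block, uint64_t presence)
{
    printf("invalidate block %ld at nodes 0x%llx\n",
           block, (unsigned long long)presence);
}

/* Look up (or allocate) the entry for a block. On a miss with no free way, */
/* a victim entry is reclaimed by invalidating the copies of the block it   */
/* tracked -- which is why the sparse directory needs no backing store.     */
sparse_entry_t *sparse_lookup(long block)
{
    int set = (int)(block % SETS);
    for (int w = 0; w < WAYS; w++)
        if (dir[set][w].valid && dir[set][w].tag == block)
            return &dir[set][w];

    sparse_entry_t *victim = &dir[set][0];          /* LRU choice elided     */
    if (victim->valid)
        invalidate_copies(victim->tag, victim->presence);
    victim->tag = block;
    victim->presence = 0;
    victim->valid = 1;
    return victim;
}
```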
Protocol Optimization
Two major goals + one
Reduce the number of network transactions per memory operation
Reduces the bandwidth demand
Reduce the number of actions on the critical path
Reduces the uncontended latency
Reduce the endpoint assist occupancy per transaction
Reduces the uncontended latency as well as endpoint contention
Latency Reduction
A read request to a block in exclusive (dirty) state, with L = local (requesting) node, H = home, R = remote owner:
(a) Strict request-response: 1. request (L -> H); 2. response with the owner's identity (H -> L); 3. intervention (L -> R); 4a. revise message (R -> H), 4b. response (R -> L)
(b) Intervention forwarding: 1. request (L -> H); 2. intervention (H -> R); 3. response (R -> H); 4. response (H -> L)
(c) Reply forwarding: 1. request (L -> H); 2. intervention (H -> R); 3a. revise message (R -> H), 3b. response (R -> L)
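A minimal sketch of scheme (c): under reply forwarding the home forwards the intervention and names the requestor, so the owner can reply to the requestor directly. Message names and primitives are hypothetical.

```c
#include <stdio.h>

enum msg { READ_REQ, INTERVENTION, DATA_REPLY, REVISE };

static void send_msg(int dst, enum msg type, long block, int arg)
{
    printf("msg %d -> node %d (block %ld, arg %d)\n", (int)type, dst, block, arg);
}

/* Reply forwarding: the home forwards the intervention and names the      */
/* requestor, so the owner can respond to the requestor directly (3b)      */
/* while sending the revision message back to the home (3a).               */
void home_read_req_reply_forwarding(int requestor, int owner, long block)
{
    send_msg(owner, INTERVENTION, block, requestor);   /* 2. intervention   */
}

void owner_handle_intervention(int requestor, int home, long block)
{
    send_msg(home, REVISE, block, 0);                  /* 3a. revise        */
    send_msg(requestor, DATA_REPLY, block, 0);         /* 3b. response      */
}
```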
Cache-Based Directory Schemes
Directory is a doubly linked list of entries
Read miss
Insert the requestor at the head of the list
Write miss
Insert the requestor at the head of the list
Invalidate the sharers by traversing the list: long latency!
Write back
Delete itself from the list
[Figure: the home memory holds a head pointer for the block; the caches of Node 0, Node 1, and Node 2 are linked into a doubly linked sharing list]
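A simplified sketch of the list operations described above (real SCI adds pending states and home-side coordination): list construction on a read miss, and the serial purge on a write.

```c
#include <stdio.h>

#define NIL (-1)
#define N_NODES 16

/* Per-cache sharing-list entry for one block: forward/backward pointers.   */
typedef struct { int next; int prev; int valid; } cache_entry_t;

static cache_entry_t cache[N_NODES];
static int head = NIL;              /* head pointer kept at the home         */

/* Read miss: list construction -- insert the requestor at the head.         */
void list_construct(int requestor)
{
    cache[requestor].prev = NIL;
    cache[requestor].next = head;
    cache[requestor].valid = 1;
    if (head != NIL)
        cache[head].prev = requestor;
    head = requestor;
}

/* Write by the head: purge the sharers by walking the list. The serial      */
/* traversal is what makes invalidation latency long in this scheme.         */
void purge_from_head(void)
{
    int n = cache[head].next;
    while (n != NIL) {
        printf("invalidate node %d\n", n);
        int next = cache[n].next;
        cache[n].valid = 0;
        cache[n].next = cache[n].prev = NIL;
        n = next;
    }
    cache[head].next = NIL;
}
```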
Tradeoffs with Cache-Based Schemes
Advantages
Small overhead: a single pointer per block
Easier to provide fairness and to avoid livelock
Sending invalidations is not centralized at the home, but rather distributed among the sharers
Disadvantages
Long latency and long occupancy of the communication assist
Modifying the sharing list requires careful coordination and mutual exclusion
Latency Reduction
Invalidating a sharing list (H = home/head, S1-S3 = sharers):
(a) Strict request-response: 1. inv (H -> S1); 2. ack (S1 -> H); 3. inv (H -> S2); 4. ack (S2 -> H); 5. inv (H -> S3); 6. ack (S3 -> H)
(b) Intervention forwarding: 1. inv (H -> S1); 2a. inv (S1 -> S2), 2b. ack (S1 -> H); 3a. inv (S2 -> S3), 3b. ack (S2 -> H); 4b. ack (S3 -> H)
(c) Reply forwarding: 1. inv (H -> S1); 2. inv (S1 -> S2); 3. inv (S2 -> S3); 4. ack (S3 -> H)
Sequent NUMA-Q
IEEE-standard Scalable Coherent Interface (SCI) protocol
Targeted toward commercial workloads: databases and transaction processing
Commodity hardware
Intel SMPs (quads) as the processing nodes
DataPump network interface from Vitesse Semiconductor
Custom IQ-Link: directory logic, remote cache (32 MB, 4-way)
[Figure: four-processor Intel quads, each with memory, PCI I/O, a peripheral interface, and an IQ-Link board; snooping is used within each quad, and the quads are connected by an SCI ring]
IQ-Link Implementation
Directory for locally allocated data
Tags for remotely allocated but locally cached data
Orion bus controller (OBIC): manages the snooping and requesting logic on the quad bus
DataPump: GaAs chip implementing the transport protocol of the SCI standard
SCI link interface controller (SCLIC): manages the SCI coherence protocol; programmable
[Figure: IQ-Link board. The DataPump connects to the SCI ring (1 GB/s); the directory controller (SCLIC) sits between it and the Orion bus controller (OBIC) on the quad bus. The board holds the remote data and tags plus the local directory, with bus-side tags in SRAM and network-side tags in SDRAM]
Directory States
Home
No remote cache (quad) in the system contains a copy of the block
A processor cache in the home quad itself may have a copy, since this is not visible to the SCI protocol but is managed by the bus protocol within the quad
Fresh
One or more remote caches may have a read-only copy
Gone
A remote cache contains a writable (exclusive or dirty) copy; no valid copy exists on the local node
Remote Cache States
7 bits: 29 stable states plus many pending or transient states
Each stable state has two parts
First part: where the cache entry is located in the list: ONLY, HEAD, TAIL, MID
Second part: the actual state
Dirty, clean (exclusive state in MESI), fresh (data may not be written until memory is informed), copy (unmodified and readable), and so on
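The two-part stable state can be pictured as a position x condition pair, as in this simplified sketch; the actual 7-bit NUMA-Q encoding is not reproduced here.

```c
/* Sketch of the two-part stable-state encoding: a list position combined   */
/* with the block's condition (e.g., HEAD + FRESH is HEAD_FRESH). Real      */
/* NUMA-Q packs these, plus pending states, into 7 bits; this simplifies.   */
enum position  { ONLY, HEAD, TAIL, MID };
enum condition { DIRTY, CLEAN, FRESH, COPY };

typedef struct {
    enum position  pos;      /* where this cache sits in the sharing list   */
    enum condition cond;     /* dirty, clean, fresh, copy, ...              */
} remote_state_t;

/* Only a node holding the sole, dirty copy (ONLY_DIRTY) can write without  */
/* further protocol action; a HEAD_DIRTY writer must first purge the list.  */
int can_write_without_protocol_action(remote_state_t s)
{
    return s.pos == ONLY && s.cond == DIRTY;
}
```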
SCI Standard
Three primitive operations
List construction: adding a new node to the head of the list
Rollout: removing a node from the list
Purge (invalidation): the head node invalidates all other nodes
Levels of protocol
Minimal protocol: does not permit read sharing (only one copy at a time)
Typical protocol: used by NUMA-Q
Full protocol: implements the full SCI standard
Handling Read Requests
If the directory state is HOME
The home updates the block's state to FRESH and sends the data
The requestor's state changes from PENDING to ONLY_FRESH
If the directory state is FRESH
Insert the requesting node at the head of the list
The previous head changes its state from HEAD_FRESH to MID_VALID or from ONLY_FRESH to TAIL_VALID
The requestor changes its state from PENDING to HEAD_FRESH
If the directory state is GONE
The home stays in the GONE state and sends a pointer to the previous head
The previous head changes its state from HEAD_DIRTY to MID_VALID or from ONLY_DIRTY to TAIL_VALID
The requestor sets its state to HEAD_DIRTY
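A condensed sketch of the three cases as a home-side transition function; pending states, data movement, and message formats are omitted, and the structure names are hypothetical.

```c
#define NIL (-1)

/* Home-side directory states (per block) and the read handling summarized  */
/* above. Transient states, data transfer, and messages are omitted.        */
enum dir_state { HOME, FRESH, GONE };

typedef struct { enum dir_state state; int head; } sci_dir_t;

/* Returns the node the requestor must contact next (NIL if the home itself */
/* supplies the data), and records the requestor as the new head.           */
int home_handle_read(sci_dir_t *d, int requestor)
{
    int old_head = d->head;

    switch (d->state) {
    case HOME:                   /* no remote copy: home sends the data      */
        d->state = FRESH;        /* requestor: PENDING -> ONLY_FRESH         */
        d->head  = requestor;
        return NIL;
    case FRESH:                  /* read-only copies exist                   */
        d->head = requestor;     /* requestor: PENDING -> HEAD_FRESH         */
        return old_head;         /* old head: HEAD_FRESH -> MID_VALID, or    */
                                 /*           ONLY_FRESH -> TAIL_VALID       */
    case GONE:                   /* a remote cache holds a writable copy     */
        d->head = requestor;     /* requestor ends up HEAD_DIRTY             */
        return old_head;         /* old head: HEAD_DIRTY -> MID_VALID, or    */
                                 /*           ONLY_DIRTY -> TAIL_VALID       */
    }
    return NIL;
}
```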
An Example of Read Miss
[Figure: read miss to a FRESH block with one existing sharer. Initially the requestor is INVALID, the old head is ONLY_FRESH, and home memory is FRESH. In the first round (requestor and home), the requestor goes to PENDING and learns the identity of the old head. In the second round (requestor and old head), the requestor attaches itself at the head of the list and becomes HEAD_FRESH, the old head becomes TAIL_VALID, and home memory stays FRESH]
Handling Write Requests
Only the head node is allowed to write a block and issue invalidations
When the writer is in the middle of the list, it removes itself from the list (rollout) and adds itself again at the head (list construction)
If the writer's cache block is in the HEAD_DIRTY state
Purge the sharing list, in strict request-response manner
The writer stays in a pending state while the purging is in progress
If in the HEAD_FRESH state
The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies
The writer then goes into a different pending state and purges the list
Eventually, the writer's state becomes ONLY_DIRTY
Handling Writeback and Replacements
Rollouts are needed for replacement, invalidation, and writes
Middle-node rollout
First set itself to a pending state to prevent the race condition; the node closer to the tail has priority and is rolled out first
Then set the state to invalid
Head-node rollout
Goes into a pending state while the downstream node changes its state (e.g., MID_VALID -> HEAD_DIRTY)
Updates the pointer in the home node; if the node is the only node, change the home state to HOME
Writeback upon a miss
Serve the writeback first, since buffering is complicated and misses in the remote cache will be infrequent (vs. the write buffer in a bus-based system)
Hierarchical Coherence
Snoop-Snoop System
Simplest way to build large-scale cache-coherent MPs
Coherence monitor
Remote (access) cache
Local state monitor: keeps state information on data that is locally allocated but remotely cached
Remote cache
Should be larger than the sum of the processor caches and quite associative
Should be lockup-free
Issues an invalidation request on the local bus when a block is replaced
Snoop-Snoop with Global Memory
First-level caches: highest-performance SRAM caches
B1 follows a standard snooping protocol
Second-level cache
Much larger than the L1 caches (set associative)
Must maintain inclusion
The L2 cache acts as a filter for the B1 bus and the L1 caches
The L2 cache can be DRAM-based since fewer references reach it
[Figure: two clusters, each with processors and L1 caches on a local bus B1 behind a coherence monitor; the coherence monitors and the global memory M sit on the global bus B2]
Snoop-Snoop with Global Memory (Contd)
Advantages
Misses to main memory require just a single traversal to the root of the hierarchy
Placement of shared data is not an issue
Disadvantages
Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency
Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale
Cluster-Based Hierarchies
Key idea: main memory is distributed among the clusters
Reduces global bus traffic (local data and suitably placed shared data)
Reduces latency (less contention, and local accesses are faster)
Example machine: Encore Gigamax
The L2 cache can be replaced by a tag-only router/coherence switch
[Figure: same organization as before, but each cluster also contains local memory M on its B1 bus; the coherence monitors connect the clusters over the global bus B2]
Summary
Advantages:
Conceptually simple to build (apply snooping recursively)
Can get merging and combining of requests in hardware
Disadvantages:
Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems)
Latencies are often larger than in direct networks
Hierarchical Directory Scheme
The internal nodes contain only directory information
The L1 directory tracks which of its children (processing nodes) have a copy of the memory block
The L2 directory tracks which of its children (L1 directories) have a copy of the memory block
It also tracks which local memory blocks are cached outside
Inclusion is maintained between the processor caches and the L1 directory
Logical trees may be embedded in any physical hierarchy
[Figure: a tree with processing nodes at the leaves, L1 directories above them, and an L2 directory at the root]
A Multirooted Hierarchical Directory
[Figure: processing nodes p0-p7 at the leaves; a separate directory tree is rooted at each node's memory (e.g., the directory tree for p0's memory), with internal nodes holding only directory information. A given processing node appears as a leaf in every tree, so the circles drawn in one rectangle represent the same processing node]
Organization and Overhead
Organization
Separate directory structure for every block
Storage overhead
Each level has about the same amount of memory
C: cache size, b: branch factor, M: memory size, B: block size
Performance overhead
Can reduce the number of network hops
But increases the number of end-to-end transactions, increasing latency
The root becomes the bottleneck
Storage overhead ∝ (P · C · log_b P) / (M · B), where P is the number of processing nodes
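A brief derivation of this figure, assuming C is the per-node cache size and M the total main memory:

```latex
% At level k there are P / b^k directories, each tracking the blocks that its
% b^k descendant nodes can cache, i.e. about b^k (C/B) entries; so every one
% of the log_b P levels holds roughly the same P(C/B) entries in total.
\[
  \text{directory entries} \;\approx\; \frac{P\,C}{B}\,\log_b P
  \qquad\Longrightarrow\qquad
  \text{storage overhead} \;\propto\; \frac{P\,C\,\log_b P}{M\,B}.
\]
```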
Performance Implications of Hierarchical Coherence
Advantages
Combining of requests for a block
Reduces traffic and contention
Locality effect
Reduces transit latency and contention
Disadvantages
Long uncontended latency
High bandwidth requirements near the root of the hierarchy