Distributed Shared Memory (part 1)
Distributed Shared Memory (DSM)
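[Figure: processors proc0, proc1, proc2, ..., procN, each with its own local memory (mem0, mem1, mem2, ..., memN), connected by a network and presented to the program as one shared memory.]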
Shared memory programming
• Standard
  – pthread
• Synchronizations
  – Barriers
  – Locks
  – Semaphores
Sequential SOR
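for some number of timesteps/iterations {
    /* compute new interior values from the four neighbours */
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            temp[i][j] = 0.25 *
                (grid[i-1][j] + grid[i+1][j] +
                 grid[i][j-1] + grid[i][j+1]);
    /* copy the new values back into grid */
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            grid[i][j] = temp[i][j];
}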
Parallel SOR with Barriers (1 of 2)
void* sor (void* arg)
{
    int slice = (int)(long)arg;            /* thread index, 0..p-1 */
    int from = (slice * (n-1))/p + 1;      /* first row of this slice */
    int to = ((slice+1) * (n-1))/p + 1;    /* one past the last row */

    for some number of iterations { … }
}
Parallel SOR with Barriers (2 of 2)
for (i=from; i<to; i++)
    for (j=1; j<n; j++)
        temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
barrier();
for (i=from; i<to; i++)
    for (j=1; j<n; j++)
        grid[i][j]=temp[i][j];
barrier();
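The two slides above can be put together into a complete program. The following is a minimal sketch, assuming a POSIX system that provides pthread_barrier_t (for example Linux). The grid size N, the thread count P, the boundary values, and the helper run_sor are assumptions made for illustration; they are not from the slides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define N 8   /* assumed: interior points are rows/columns 1..N-1 */
#define P 2   /* assumed: number of worker threads */

static double grid[N + 1][N + 1], temp[N + 1][N + 1];
static pthread_barrier_t bar;
static int iterations;

static void *sor(void *arg)
{
    int slice = (int)(intptr_t)arg;
    int from = (slice * (N - 1)) / P + 1;
    int to = ((slice + 1) * (N - 1)) / P + 1;

    for (int it = 0; it < iterations; it++) {
        for (int i = from; i < to; i++)
            for (int j = 1; j < N; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        pthread_barrier_wait(&bar);   /* every thread has finished reading grid */
        for (int i = from; i < to; i++)
            for (int j = 1; j < N; j++)
                grid[i][j] = temp[i][j];
        pthread_barrier_wait(&bar);   /* every thread has finished writing grid */
    }
    return NULL;
}

/* Hypothetical driver: boundary fixed at 1.0, interior starts at 0.0. */
void run_sor(int iters)
{
    pthread_t tid[P];

    assert(iters >= 0);
    iterations = iters;
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            grid[i][j] = (i == 0 || i == N || j == 0 || j == N) ? 1.0 : 0.0;

    pthread_barrier_init(&bar, NULL, P);
    for (intptr_t s = 0; s < P; s++)
        pthread_create(&tid[s], NULL, sor, (void *)s);
    for (int s = 0; s < P; s++)
        pthread_join(tid[s], NULL);
    pthread_barrier_destroy(&bar);
}
```

Calling run_sor(1) performs one sweep. The first barrier keeps any thread from overwriting grid rows that a neighbouring slice is still reading; the second keeps the next sweep from reading rows that have not been copied back yet.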
Differences between SMP and Software DSM
• Delay: tradeoffs, such as block size
• Software => traps: cost of read/write misses
• Goals of caches: multiprocessor = performance, dist. system = transparency
• Bus vs. long networks: reliance on serialization and broadcast.
Consequent differences in protocols and applications
• Bigger block size
  – Cost amortization, higher hit ratio for larger blocks?
  – Reduced overhead
• But therefore...
  – Migration vs. Replication
  – False sharing increases
• DSM protocol more complex: Must handle lost, corrupted, and out-of-order packets
• Above, coupled with cost of traps, => SDSM consistency cost much higher!
Results of high consistency costs
• Manage sharing more carefully
• Align data to page boundaries
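One way to align data to page boundaries is to pad each processor's private data to a whole page and allocate it page-aligned. The sketch below assumes a 4096-byte page and C11's aligned_alloc; the names DSM_PAGE_SIZE, per_proc, round_up_to_page, and alloc_per_proc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DSM_PAGE_SIZE 4096  /* assumed page size; a real system would query it */

/* Hypothetical per-processor record, padded to fill exactly one page. */
struct per_proc {
    double partial_sum;
    char pad[DSM_PAGE_SIZE - sizeof(double)];
};

/* Round a byte count up to a whole number of pages. */
size_t round_up_to_page(size_t bytes)
{
    return (bytes + DSM_PAGE_SIZE - 1) / DSM_PAGE_SIZE * DSM_PAGE_SIZE;
}

/* Allocate one page-aligned record per processor; release with free(). */
struct per_proc *alloc_per_proc(size_t nprocs)
{
    assert(nprocs > 0);
    return aligned_alloc(DSM_PAGE_SIZE, nprocs * sizeof(struct per_proc));
}
```

With this layout, element k of the returned array starts on its own page, so a write by processor k does not touch the page that holds another processor's record.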
Consistency Models
• Sequential Consistency
  – All processors observe the same order of operations
  – That order must correspond to some serial order
  – The only ordering constraint is that the reads/writes of any one processor (say P1) appear in the order that processor issued them; there are no restrictions on the relative ordering between processors.
Common consistency protocols
• Write update
  – Multicast update to all replicas
• Write invalidate
  – Invalidate cached copies in p2, p3
  – Cache miss if p2/p3 access X
    • Valid data from other cache
Conventional Implementation
• As proposed by Li & Hudak, TOCS '86.
• Use virtual memory to implement sharing.
• Shared memory divided up by virtual memory pages.
• Use single-writer, multiple-reader write-invalidate coherence protocol.
• Keep pages in one of three states:
  – invalid, read-only, read-write
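The three page states and the single-writer, multiple-reader rule can be sketched as a small state machine. This is only a simulation for one page: a real implementation would change page protections and take the faults through the virtual memory system. NPROC, the fault counter, and the function names are assumptions, as is the choice to downgrade the writer to read-only when another processor reads.

```c
#include <assert.h>

enum page_state { INVALID, READ_ONLY, READ_WRITE };

#define NPROC 4  /* assumed number of processors */

static enum page_state state[NPROC];  /* state of one shared page on each processor */
static int faults;                    /* read and write faults taken so far */

void init_page(int first_owner)
{
    assert(first_owner >= 0 && first_owner < NPROC);
    for (int q = 0; q < NPROC; q++)
        state[q] = INVALID;
    state[first_owner] = READ_WRITE;
    faults = 0;
}

void read_page(int p)
{
    assert(p >= 0 && p < NPROC);
    if (state[p] != INVALID)
        return;                        /* read hit: no trap */
    faults++;                          /* read fault */
    for (int q = 0; q < NPROC; q++)
        if (state[q] == READ_WRITE)
            state[q] = READ_ONLY;      /* current writer keeps a read-only copy */
    state[p] = READ_ONLY;              /* replicate the page on the reader */
}

void write_page(int p)
{
    assert(p >= 0 && p < NPROC);
    if (state[p] == READ_WRITE)
        return;                        /* write hit: no trap */
    faults++;                          /* write fault (from invalid or read-only) */
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            state[q] = INVALID;        /* invalidate every other copy */
    state[p] = READ_WRITE;             /* p is now the single writer */
}
```

After init_page(0), a read by processor 1 leaves processors 0 and 1 both read-only; a later write by processor 2 invalidates both copies and leaves processor 2 read-write.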
Example
[Figure: proc0, proc1, proc2, ..., procN over a single shared memory.]
Example: Read Access Hit
[Figure: proc0 ... procN; label: read]
Example: Write Access Hit
[Figure: proc0 ... procN; label: write]
Example: Read Access Miss
[Figure: proc0 ... procN; label: read]
Example: Read Fault
[Figure: proc0 ... procN; label: read fault]
Example: Replication on Read
[Figure: proc0 ... procN; label: read]
Example: Write Access Miss
[Figure: proc0 ... procN; label: write]
Example: Write Fault
[Figure: proc0 ... procN; label: write fault]
Example: Write Invalidation
[Figure: proc0 ... procN; label: write]
Example: Write Access to Read-Only
[Figure: proc0 ... procN; label: write]
Example: Write Fault
[Figure: proc0 ... procN; label: write fault]
Example: Write Invalidation
[Figure: proc0 ... procN; label: write]
How to Remember Locations?
• Broadcast on miss (as in SMP).
• Static home.
• Dynamic home or owner.
Ownership and Owner Location
• Owner is the last writer.
• Owner maintains copyset.
• Every processor maintains probable owner (not always the real owner).
Ownership Location
• Every read or write miss is sent to (local) probable owner.
• If owner, handle appropriately, else forward to probable owner.
Ownership Modification
• If write miss, new writer becomes owner, and all forwarders set probable owner to requester.
• If read miss, set probable owner to responding processor.
Example
• Initially, owner(page0) = p0, and probable owner(page0) = p0 everywhere.
• Write miss by p1: p1 sends a message to its probable owner (p0), where it is handled. New owner(page0) = p1, and probable owner(page0) on p0 becomes p1.
• Read miss by p2: p2 sends a message to its probable owner (p0), which forwards it to its own probable owner (p1), where it is handled. Probable owner(page0) on p2 becomes p1.
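The example above can be replayed with a small simulation of the probable-owner pointers for one page. This is a sketch under stated assumptions: messages are modelled as a loop that follows the pointers, NPROC and all function names are hypothetical, and the old owner is treated like a forwarder when it updates its pointer on a write miss.

```c
#include <assert.h>

#define NPROC 4  /* assumed number of processors */

static int owner;              /* the real owner of the page */
static int prob_owner[NPROC];  /* each processor's probable owner */

void init_owner(int p)
{
    assert(p >= 0 && p < NPROC);
    owner = p;
    for (int q = 0; q < NPROC; q++)
        prob_owner[q] = p;
}

/* Write miss by p: returns the number of messages needed to reach the owner. */
int write_miss(int p)
{
    assert(p >= 0 && p < NPROC && p != owner);
    int cur = prob_owner[p];
    int hops = 1;
    while (cur != owner) {
        int next = prob_owner[cur];
        prob_owner[cur] = p;   /* forwarder now points at the requester */
        cur = next;
        hops++;
        assert(hops <= NPROC); /* pointer chains must end at the owner */
    }
    prob_owner[cur] = p;       /* old owner points at the new owner */
    prob_owner[p] = p;
    owner = p;                 /* the new writer becomes the owner */
    return hops;
}

/* Read miss by p: returns the number of messages needed to reach the owner. */
int read_miss(int p)
{
    assert(p >= 0 && p < NPROC && p != owner);
    int cur = prob_owner[p];
    int hops = 1;
    while (cur != owner) {
        cur = prob_owner[cur]; /* forwarded along the chain */
        hops++;
        assert(hops <= NPROC);
    }
    prob_owner[p] = cur;       /* remember the processor that responded */
    return hops;
}
```

Starting from init_owner(0), write_miss(1) takes one message and read_miss(2) takes two (p2 to p0, then p0 to p1); a second read_miss(2) would take only one, because p2 now points directly at p1.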
Implement synchronizations
• Use messages to implement synchronizations
Barriers
• Designate one processor as barrier manager.
• When a process waits at a barrier, it sends an arrival message to the barrier manager and waits.
• When barrier manager has received all messages, it sends a departure message to all processes.
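The manager's bookkeeping for the barrier above can be sketched as follows. No real messages are sent: each call to barrier_arrive stands for one arrival message, and the departure messages are only counted. NPROC, the struct layout, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 4  /* assumed number of processes */

struct barrier_mgr {
    int  arrived;          /* arrival messages received in this episode */
    bool waiting[NPROC];   /* which processes are blocked at the barrier */
    int  messages;         /* total arrival + departure messages so far */
};

void barrier_mgr_init(struct barrier_mgr *m)
{
    m->arrived = 0;
    m->messages = 0;
    for (int p = 0; p < NPROC; p++)
        m->waiting[p] = false;
}

/* Process p arrives. Returns true if this arrival released the barrier. */
bool barrier_arrive(struct barrier_mgr *m, int p)
{
    assert(p >= 0 && p < NPROC && !m->waiting[p]);
    m->waiting[p] = true;
    m->arrived++;
    m->messages++;               /* the arrival message */
    if (m->arrived < NPROC)
        return false;            /* still waiting for other processes */
    for (int q = 0; q < NPROC; q++) {
        m->waiting[q] = false;   /* departure message wakes process q */
        m->messages++;
    }
    m->arrived = 0;              /* ready for the next barrier episode */
    return true;
}
```

In this model each barrier episode costs 2 * NPROC messages, one arrival and one departure per process.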
Locks
• Designate one process as the lock manager for a particular lock.
• When a process wants to acquire a lock, it sends an acquire message to the manager and waits.
• Manager forwards message to last acquirer.
  – If lock free, send lock grant message.
  – If lock held, hold on to request until free, and then send lock grant message.
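The forwarding scheme above effectively builds a queue of waiters, with each process holding on to at most one pending request. The sketch below simulates that bookkeeping without real messages; NPROC, the struct fields, and the function names are assumptions, and the manager and all processes are modelled in one struct for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 4  /* assumed number of processes */

struct dlock {
    int  last;          /* manager: last process that asked, -1 if none */
    int  holder;        /* process currently holding the lock, -1 if free */
    bool busy[NPROC];   /* process holds the lock or is waiting for it */
    int  next[NPROC];   /* request this process is holding on to, -1 if none */
};

void dlock_init(struct dlock *l)
{
    l->last = -1;
    l->holder = -1;
    for (int p = 0; p < NPROC; p++) {
        l->busy[p] = false;
        l->next[p] = -1;
    }
}

/* Process p asks for the lock. Returns true if it is granted immediately. */
bool dlock_acquire(struct dlock *l, int p)
{
    assert(p >= 0 && p < NPROC && !l->busy[p]);
    int last = l->last;                        /* manager looks up the last acquirer */
    bool is_free = (last == -1 || !l->busy[last]);
    l->last = p;                               /* p becomes the last acquirer */
    l->busy[p] = true;
    if (is_free)
        l->holder = p;                         /* lock free: grant message sent */
    else
        l->next[last] = p;                     /* last acquirer holds on to the request */
    return is_free;
}

/* Process p releases the lock and grants it to the request it held, if any. */
void dlock_release(struct dlock *l, int p)
{
    assert(l->holder == p);
    l->busy[p] = false;
    l->holder = l->next[p];                    /* -1 when nobody is waiting */
    l->next[p] = -1;
}
```

If processes 0, 1, and 2 call dlock_acquire in that order, process 0 is granted the lock at once, and each release hands the lock directly to the next waiter without going back through the manager.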
Problem: False Sharing
• Concurrent access to different data within the same consistency unit.
• With page as consistency unit, lots of opportunity for false sharing.
• Two flavors:
  – read-write
  – write-write
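Whether two accesses falsely share can be decided from their page numbers. The sketch below is an illustration only: it assumes a 4096-byte page, treats each address as one datum (it ignores object sizes), and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DSM_PAGE_SIZE 4096u  /* assumed consistency unit: one page */

struct access {
    uintptr_t addr;      /* address touched */
    bool      is_write;  /* true for a write, false for a read */
};

enum sharing { NO_CONFLICT, TRUE_SHARING, FALSE_SHARING_RW, FALSE_SHARING_WW };

/* Classify two concurrent accesses made by different processors. */
enum sharing classify(struct access a, struct access b)
{
    assert(DSM_PAGE_SIZE != 0);
    if (a.addr / DSM_PAGE_SIZE != b.addr / DSM_PAGE_SIZE)
        return NO_CONFLICT;          /* different pages never interfere */
    if (!a.is_write && !b.is_write)
        return NO_CONFLICT;          /* read-only copies can be replicated */
    if (a.addr == b.addr)
        return TRUE_SHARING;         /* same datum: the traffic is needed */
    return (a.is_write && b.is_write) ? FALSE_SHARING_WW : FALSE_SHARING_RW;
}
```

For example, a write to address 0x1000 and a read of 0x1008 land on the same page but touch different data, so they are classified as read-write false sharing.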
Read-Write False Sharing
[Figure: a page holding two variables, x and y.]
Read-Write False Sharing (Cont.)
[Figure: access timeline with labels w(x); r(y), r(y), r(x); synch; w(x), w(x).]
Write-Write False Sharing
[Figure: access timeline with labels w(x); w(y), w(y), r(x); synch; w(x), w(x).]
Summary
• Software shared memory on distributed memory hardware.
  – Uses virtual memory.
• Home migration to improve locality
  – important because of high latencies.
• Sequential consistency suffers from false sharing