Page 1
Implementation and Performance of Munin (Distributed Shared Memory System)
Dongying Li
Department of Electrical and Computer Engineering
University of Toronto
(Original Authors: J. B. Carter, et al.)
ECE 1147, Parallel Computation
Oct. 30, 2006
Page 2
Distributed Shared Memory
• Shared address space spanning the processors of a distributed memory multiprocessor
[Figure: proc1, proc2, and proc3 each see the shared variable X = 0]
Page 3
Distributed Shared Memory
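[Figure: nodes proc0, proc1, proc2, ..., procN, each with a local memory mem0, mem1, mem2, ..., memN, connected by a network; the memories together are labelled "shared memory"]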
Page 4
Distributed Shared Memory
• Challenges
– Performance comparable to shared memory programs
– No significant deviation from the shared memory coding model
– Low communication and message passing overhead
Page 5
Munin System
• Characteristic features
– Software release consistency
– Multiple consistency protocols
• Deviations from the shared memory model
– Shared variables are annotated with their access pattern
– All synchronization must be visible to the system
Page 6
Contents
• Basic concepts
– Shared object
– Software release consistency
– Multiple consistency protocols
• Software implementation
– Prototype overview
– Execution process
– Advanced programming features
– Data object directory and delayed update queue
– Synchronization
• Performance
• Overview of other DSM systems
• Conclusion
Page 7
Basic Concepts
Page 8
Shared Object
[Figure: shared objects x and y mapped onto three regions, each labelled "8-kilo"]
Page 9
Software Release Consistency
• Sequential consistency
– All processors observe the same order of operations
– That order must correspond to some serial order
– The only ordering constraint is that the reads and writes of each processor (e.g., P1) appear in that processor's program order; there is no restriction on the relative ordering between processors
• Synchronous read/write
– Each write must be propagated before the processor moves on to the next operation
Page 10
Software consistency
• Problems
– Message passing overhead
– False sharing
[Figure: timeline of operations w(x), w(x), w(x) and r(y), r(y), r(x)]
Page 11
Weak Consistency
• Data modifications are propagated only at synchronization points.
• Works correctly if the program is properly synchronized through system primitives.
[Figure: timeline of operations w(x), w(x), w(x) and r(y), r(y), r(x) around a synch point]
Page 12
Weak Consistency
[Figure: timeline of operations w(x), w(x) and r(y), r(y), r(x), followed by a synch point]
Page 13
Software Release Consistency
• Special weak consistency protocol
• Reduction of message passing overhead
• Two categories of shared variable operations
– Ordinary access
  • Read
  • Write
– Synchronization access (lock, semaphore, barrier)
  • Acquire
  • Release
Page 14
Software Release Consistency
• Before an ordinary access (read or write) is allowed to perform, all previous acquires must have been performed
• Before a release is allowed to perform, all previous ordinary accesses must have been performed
• Before an acquire is allowed to perform, all previous releases must have been performed
• Before a release is allowed to perform, all previous acquires must have been performed
• In short, the results of writes made before a release are propagated before the next processor acquires the released lock
Page 15
Eager Release Consistency
• Writes are propagated at release
Page 16
Lazy Release Consistency
• Writes are propagated at acquire
Page 17
Multiple Consistency Protocols
• No single consistency protocol is suitable for all parallel programs
• Shared variables are accessed in different ways within a single program
• The access pattern of a variable can change during execution
• Multiple protocols allow each shared variable to be tuned to its access pattern
Page 18
Multiple Consistency Protocols
• High-level sharing pattern annotation
– Specified in the shared variable declaration
– A combination of low-level protocol parameters
• Low-level protocol parameter
– Specified in the shared variable directory
– Controls one specific aspect of the protocol
Page 19
Protocol Parameters
• I: invalidate or update?
• R: Replicas allowed?
• D: Delayed operation allowed?
• FO: Having fixed owner?
• M: Multiple writers allowed?
• S: Stable access pattern?
• FL: Flushing changes to owner?
• W: Writable? (write protected?)
Page 20
Sharing annotations
• Read-only
– Simplest pattern: once initialized, no further writes
– Suitable for constants, etc.
• Migratory
– Only one thread can access the variable at a time
– Suitable for variables accessed only in a critical section
• Write-shared
– Can be written concurrently by multiple threads
– Different threads update different words of the variable
• Producer-consumer
– Written by only one thread and read by others
– Replicate and update the object rather than invalidate it
Page 21
Sharing annotations
• Example: producer-consumer
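for some number of timesteps/iterations {
    /* grid is assumed to have boundary rows and columns at indices 0 and n */
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            temp[i][j] = 0.25 *
                (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
    /* copy the new interior values back into grid */
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            grid[i][j] = temp[i][j];
}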
Page 22
Sharing annotations
• Reduction
– Accessed through fetch-and-operate sequences (read, write, then release)
– Example: min(), a++
• Result
– Phase 1: multiple writers allowed
– Phase 2: one thread accesses the result exclusively
• Conventional
– Conventional single-writer, invalidation-based protocol for shared variables
Page 23
Sharing annotations
Protocol parameters by sharing annotation ("-" = not applicable)

Sharing annotation   I   R   D   FO  M   S   FL  W
Read-only            N   Y   -   -   -   -   -   N
Migratory            Y   N   -   N   N   -   N   Y
Write-shared         N   Y   Y   N   Y   N   N   Y
Producer-Consumer    N   Y   Y   N   Y   Y   N   Y
Reduction            N   Y   N   Y   N   -   N   Y
Result               N   Y   Y   Y   Y   -   Y   Y
Conventional         Y   Y   N   N   N   -   N   Y
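The table above maps each annotation to a fixed setting of the eight protocol parameters. A minimal sketch of how such a mapping could be stored and looked up is given below. The type names, field names, and `lookup_annotation()` are hypothetical and chosen for illustration only; they are not Munin's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one protocol parameter: no, yes, or not applicable ("-"). */
typedef enum { PARAM_NO, PARAM_YES, PARAM_NA } param_value;

/* One table row. Field order follows the columns I R D FO M S FL W. */
typedef struct {
    const char *annotation;
    param_value invalidate, replicas, delayed, fixed_owner,
                multi_writer, stable, flush, writable;
} protocol_row;

#define N PARAM_NO
#define Y PARAM_YES
#define X PARAM_NA /* shown as "-" in the table */

const protocol_row protocol_table[] = {
    /* annotation          I  R  D  FO M  S  FL W */
    { "Read-only",         N, Y, X, X, X, X, X, N },
    { "Migratory",         Y, N, X, N, N, X, N, Y },
    { "Write-shared",      N, Y, Y, N, Y, N, N, Y },
    { "Producer-Consumer", N, Y, Y, N, Y, Y, N, Y },
    { "Reduction",         N, Y, N, Y, N, X, N, Y },
    { "Result",            N, Y, Y, Y, Y, X, Y, Y },
    { "Conventional",      Y, Y, N, N, N, X, N, Y },
};

#undef N
#undef Y
#undef X

/* Return the row for an annotation name, or NULL if it is unknown. */
const protocol_row *lookup_annotation(const char *name) {
    size_t rows = sizeof protocol_table / sizeof protocol_table[0];
    assert(name != NULL);
    for (size_t i = 0; i < rows; i++)
        if (strcmp(protocol_table[i].annotation, name) == 0)
            return &protocol_table[i];
    return NULL;
}
```

For example, `lookup_annotation("Migratory")->invalidate` yields `PARAM_YES`, matching the Migratory row.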
Page 24
Software Implementation
Page 25
Prototype Overview
• A simple preprocessor that converts annotations into a suitable format
• A linker creating the shared memory segment
• Library routines linked into program
• Operating system support for fault handling and page table manipulation
Page 26
Execution Process
• Compiling
[Figure: sharing annotations → Munin preprocessor → auxiliary file → linker → shared data segment and shared data description table]
Page 27
Execution Process
• Initialization
[Figure: processors P1, P2, ..., Pn; the diagram shows the Munin root thread, User_init(), and the user root thread, plus a Munin worker thread, a code copy, and a data segment on each of the other processors]
Page 28
Execution Process
• Synchronization
[Figure: processors P1, P2, ..., Pn; a synchronization operation issued by a user thread is handled by the Munin root thread and a Munin worker thread]
Page 29
Advanced Programming Features
• Associate data & synch
[Figure: two message timelines following a w(x): one with acq(m), r(x), r(x), rel(m) and one with acq(m), r(x), rel(m), each marked with a msg]
Page 30
Advanced Programming Features
• PhaseChange()
– Changes the producer-consumer relationship
– Example: adaptive mesh SOR
• ChangeAnnotation()
– Changes the access pattern during execution
• Invalidate()
• Flush()
• SingleObject()
• PreAcquire()
Page 31
Data Object Directory
• Start address and size
• Protocol parameters
• Object state (valid, writable, invalid)
• Copyset (which remote processors have copies)
• Synchq (corresponding synchronization object)
• Probable owner
• Home node
• Access control semaphore
• Links
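The directory fields listed above could be grouped into one record per shared object. The sketch below is an illustration under stated assumptions, not Munin's actual layout: the field types, the 32-node bitmask for the copyset, the integer stand-in for the semaphore, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OBJ_INVALID, OBJ_VALID, OBJ_WRITABLE } object_state;

typedef struct dir_entry {
    void        *start;          /* start address of the object */
    size_t       size;           /* size in bytes */
    uint8_t      params;         /* one bit per protocol parameter (I, R, D, FO, M, S, FL, W) */
    object_state state;          /* valid, writable, or invalid */
    uint32_t     copyset;        /* bit n set: node n holds a copy (assumes at most 32 nodes) */
    int          synchq;         /* id of the associated synchronization object, -1 if none */
    int          probable_owner; /* best guess at the current owner */
    int          home_node;      /* node on which the object was created */
    int          access_sem;     /* stand-in for the access control semaphore */
    struct dir_entry *next;      /* "links": assumed here to chain entries in a list */
} dir_entry;

/* Initialise an entry for an object that only its home node knows about. */
void dir_entry_init(dir_entry *e, void *start, size_t size, int home_node) {
    assert(e != NULL);
    e->start = start;
    e->size = size;
    e->params = 0;
    e->state = OBJ_INVALID;
    e->copyset = 0;
    e->synchq = -1;
    e->probable_owner = home_node;
    e->home_node = home_node;
    e->access_sem = 1;
    e->next = NULL;
}

/* Record that a node holds a copy of the object. */
void copyset_add(dir_entry *e, int node) {
    assert(node >= 0 && node < 32);
    e->copyset |= (uint32_t)1 << node;
}

/* Check whether a node is recorded as holding a copy. */
bool copyset_has(const dir_entry *e, int node) {
    assert(node >= 0 && node < 32);
    return ((e->copyset >> node) & 1u) != 0;
}
```

A bitmask keeps copyset checks cheap, at the cost of a fixed upper bound on the number of nodes.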
Page 32
Delayed Update Queue
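[Figure: between acq(m) and rel(m), the writes w(x) and w(y) are recorded in the delayed update queue (first x, then x and y) instead of being sent immediately]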
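The idea of a delayed update queue can be sketched as follows: writes made while a lock is held are only recorded, and the recorded objects are sent out together at release. This is a simplified illustration, not Munin's implementation; `duq_t`, `duq_note_write()`, `duq_flush()`, and the `record_update()` stand-in for a network send are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DUQ_CAPACITY 16

typedef struct {
    const char *pending[DUQ_CAPACITY]; /* names of objects with unsent writes */
    size_t      count;
} duq_t;

int updates_sent = 0; /* counts calls to the stand-in send routine below */

/* Stand-in for sending one update message to the nodes holding copies. */
void record_update(const char *obj) {
    (void)obj;
    updates_sent++;
}

void duq_init(duq_t *q) {
    assert(q != NULL);
    q->count = 0;
}

/* Record a write. Returns 1 if the object was newly queued,
 * 0 if it was already queued, and -1 if the queue is full. */
int duq_note_write(duq_t *q, const char *obj) {
    for (size_t i = 0; i < q->count; i++)
        if (strcmp(q->pending[i], obj) == 0)
            return 0;
    if (q->count == DUQ_CAPACITY)
        return -1;
    q->pending[q->count++] = obj;
    return 1;
}

/* Called at release: send one update per queued object, then empty
 * the queue. Returns the number of updates sent. */
size_t duq_flush(duq_t *q, void (*send_update)(const char *obj)) {
    size_t sent = q->count;
    for (size_t i = 0; i < q->count; i++)
        send_update(q->pending[i]);
    q->count = 0;
    return sent;
}
```

Repeated writes to the same object between an acquire and a release cost a single update in this sketch, which is where the message savings come from.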
Page 33
Multiple Writer Handling
Page 34
Multiple Writer Handling
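The diagrams for these two slides did not survive extraction. As a rough illustration only, the sketch below assumes the twin-and-diff technique commonly used for multiple-writer protocols: each writer copies the object before its first write (the twin), later compares the working copy against the twin, and sends only the changed words. All names are hypothetical, and the diff format (index and value pairs) is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One changed word: its position in the object and its new value. */
typedef struct {
    size_t   index;
    uint32_t value;
} diff_entry;

/* Copy the object before the first write so changes can be found later.
 * Returns NULL if allocation fails; the caller frees the twin. */
uint32_t *make_twin(const uint32_t *obj, size_t nwords) {
    uint32_t *twin = malloc(nwords * sizeof *twin);
    if (twin != NULL)
        memcpy(twin, obj, nwords * sizeof *twin);
    return twin;
}

/* Compare the working copy with its twin and store every changed word
 * in out, which must have room for nwords entries. Returns the count. */
size_t make_diff(const uint32_t *obj, const uint32_t *twin,
                 size_t nwords, diff_entry *out) {
    size_t n = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (obj[i] != twin[i]) {
            out[n].index = i;
            out[n].value = obj[i];
            n++;
        }
    }
    return n;
}

/* Merge a diff received from another writer into the local copy. */
void apply_diff(uint32_t *obj, const diff_entry *diff, size_t n) {
    assert(obj != NULL);
    for (size_t i = 0; i < n; i++)
        obj[diff[i].index] = diff[i].value;
}
```

When two writers change different words of the same object, exchanging and applying their diffs leaves both copies identical, which is the case the write-shared annotation targets.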
Page 35
Synchronization
• Queue based synchronization
• Request – reply – lock forward mechanism
• AcquireLock(), Unlock(), WaitAtBarrier()
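A queue-based lock with forwarding can be illustrated in a single process: a request is granted at once if the lock is free, otherwise the requester is queued, and a release hands the lock directly to the first waiter. This is a sketch under simplifying assumptions (no messages, a fixed-size queue); `queue_lock` and its functions are hypothetical names and do not show how AcquireLock() or Unlock() are implemented.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8
#define NO_HOLDER   (-1)

typedef struct {
    int    holder;               /* node that owns the lock, or NO_HOLDER */
    int    waiters[MAX_WAITERS]; /* FIFO ring buffer of waiting nodes */
    size_t head;                 /* index of the oldest waiter */
    size_t count;                /* number of queued waiters */
} queue_lock;

void queue_lock_init(queue_lock *l) {
    assert(l != NULL);
    l->holder = NO_HOLDER;
    l->head = 0;
    l->count = 0;
}

/* Request the lock for a node. Returns 1 if it is granted immediately,
 * 0 if the node was placed on the wait queue. */
int queue_lock_request(queue_lock *l, int node) {
    if (l->holder == NO_HOLDER) {
        l->holder = node;
        return 1;
    }
    assert(l->count < MAX_WAITERS);
    l->waiters[(l->head + l->count) % MAX_WAITERS] = node;
    l->count++;
    return 0;
}

/* Release the lock held by node and forward it to the oldest waiter.
 * Returns the new holder, or NO_HOLDER if nobody was waiting. */
int queue_lock_release(queue_lock *l, int node) {
    assert(l->holder == node);
    (void)node;
    if (l->count == 0) {
        l->holder = NO_HOLDER;
        return NO_HOLDER;
    }
    l->holder = l->waiters[l->head];
    l->head = (l->head + 1) % MAX_WAITERS;
    l->count--;
    return l->holder;
}
```

Forwarding the lock straight to the next waiter avoids a round trip through a central manager on every handoff.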
Page 36
Performance
Page 37
Matrix Multiply
[Figure: bar chart comparing DM and Munin for 2, 4, 8, and 16 processors (vertical axis 0 to 400), with a second chart of Diff % for the same processor counts (axis 0 to 10)]
Page 38
Matrix Multiply Optimized
[Figure: bar chart comparing DM and Munin for 2, 4, 8, and 16 processors (vertical axis 0 to 400), with a second chart of Diff % for the same processor counts (axis 0 to 1.5)]
Page 39
SOR
[Figure: bar chart comparing DM and Munin for 2, 4, 8, and 16 processors (vertical axis 0 to 70), with a second chart of Diff % for the same processor counts (axis 0 to 10)]
Page 40
Effect of Multiple Protocols
Protocol        Matrix Multiply   SOR
Multiple        72.41             27.64
Write-shared    75.59             64.48
Conventional    75.85             67.64
Page 41
Overview of Other DSM Systems
Page 42
Overview of Other DSM Systems
• Clouds: per-segment (object) based consistency protocol
• Mirage: per-page based
• Orca: reliable ordered broadcast protocol
• Amber: user responsible for the data distribution among processors
• Linda: shared variables in tuple space; atomic operations: insertion, removal, reading
• Midway: uses entry consistency (weaker consistency than release consistency)
• DASH: hardware DSM
Page 43
Conclusion
• Objective: an efficient DSM system with a programming model similar to shared memory programming and low message passing overhead
• Special features: multiple consistency protocols and software release consistency
• Implementation: synchronization is carried out by the Munin root thread and the Munin worker threads