TreadMarks: Shared Memory Computing on Networks of Workstations C. Amza, A. L. Cox, S. Dwarkadas, P.J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel Rice University
Jan 02, 2016
INTRODUCTION
• Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space
• Key issue in building a software DSM is minimizing the amount of data communication among the workstation memories
Why bother with DSM?
• Key idea is to build fast parallel computers that are
  – Cheaper than shared memory multiprocessor architectures
  – As convenient to use
Today’s architecture
• Clusters of workstations are much more cost effective
  – No need to develop complex bus and cache structures
  – Can use off-the-shelf networking hardware
    • Gigabit Ethernet
    • Myrinet (1.5 Gb/s)
  – Can quickly integrate newest microprocessors
Limitations of cluster approach
• Communication within a cluster of workstations is through message passing
  – Much harder to program than concurrent access to a shared memory
• Many big programs were written for shared memory architectures
  – Converting them to a message passing architecture is a nightmare
Distributed shared memory
• DSM makes a cluster of workstations look like a shared memory parallel computer
  – Easier to write new programs
  – Easier to port existing programs
• Key problem is that DSM only provides the illusion of having a shared memory architecture
  – Data must still move back and forth among the workstations
Munin
• Developed at Rice University
• Based on software objects (variables)
• Used the processor virtual memory to detect access to the shared objects
• Included several techniques for reducing consistency-related communication
• Only ran on top of the V kernel
Munin main strengths
• Excellent performance
• Portability of programs
  – Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimum number of changes ("dusty decks")
Munin main weakness
• Very poor portability of Munin itself
  – Depended on some features of the V kernel
• Not maintained since the late 1980s
TreadMarks
• Provides DSM as an array of bytes
• Like Munin,
  – Uses release consistency
  – Offers a multiple writer protocol to fight false sharing
• Runs at user-level on a number of UNIX platforms
• Offers a very simple user interface
First example: Jacobi iteration
• Illustrates the use of barriers
  – A barrier is a synchronization primitive that forces processes accessing it to wait until all processes have reached it
    • Forces processes to wait until all of them have completed a specific step
Jacobi iteration: overall organization
• Operates on a two-dimensional array
• Each processor works on a specific band of rows
  – Boundary rows are shared
[Figure: the array rows are partitioned into horizontal bands, one per processor, Proc 0 … Proc n-1]
• During each iteration step, each array element is set to the average of its four neighbors
  – Averages are stored in a scratch matrix and copied later into the shared matrix
Jacobi iteration: the barriers
• Mark the end of each computation phase
• Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed"
• Include an implicit release() followed by an implicit acquire()
  – To be explained later
Jacobi iteration: declarations
#define M
#define N
float *grid;          // shared array
float scratch[M][N];  // private array
Jacobi iteration: startup

main() {
  Tmk_startup();
  if (Tmk_proc_id == 0) {
    grid = Tmk_malloc(M*N*sizeof(float));
    initialize grid;
  } // if
  Tmk_barrier(0);
  length = M/Tmk_nprocs;
  begin = length*Tmk_proc_id;
  end = length*(Tmk_proc_id + 1);
Jacobi iteration: main loop

  for (number of iterations) {
    for (i = begin; i < end; i++)
      for (j = 0; j < N; j++)
        scratch[i][j] = (grid[i-1][j] + … + grid[i][j+1])/4;
    Tmk_barrier(1);
    for (i = begin; i < end; i++)
      for (j = 0; j < N; j++)
        grid[i][j] = scratch[i][j];
    Tmk_barrier(2);
  } // main loop
} // main
Second example: TSP
• Traveling salesman problem
  – Finding the shortest path through a number of cities
• Program keeps a queue of partial tours
  – Most promising at the end
TSP: startup

main() {
  Tmk_startup();
  queue_lock_id = 0;
  min_lock_id = 1;
  if (Tmk_proc_id == 0) {
    Queue = Tmk_malloc(sizeof(queuetype));
    Shortest_length = Tmk_malloc(sizeof(int));
    initialize Queue and Shortest_length;
  } // if
  Tmk_barrier(0);
TSP: while loop

  while (true) {
    Tmk_lock_acquire(queue_lock_id);
    if (queue is empty) {
      Tmk_lock_release(queue_lock_id);
      Tmk_exit();
    } // if
    Keep adding to queue until a long promising tour appears at the head
    Path = Delete the tour from the head
    Tmk_lock_release(queue_lock_id);

TSP: end of main

    length = recursively try all cities not on Path, find the shortest tour length
    Tmk_lock_acquire(min_lock_id);
    if (length < Shortest_length)
      Shortest_length = length;
    Tmk_lock_release(min_lock_id);
  } // while
} // main
Critical sections
• All accesses to shared variables are surrounded by a pair
  Tmk_lock_acquire(lock_id);
  …
  Tmk_lock_release(lock_id);
Consistency model (I)
• Shared data are sometimes replicated
  – To speed up read accesses
• All workstations must share a consistent view of all data
• Strict consistency is not possible
Consistency model (II)
• Various authors have proposed weaker consistency models
  – Cheaper to implement
  – Harder to use in a correct fashion
• TreadMarks uses software release consistency
  – Only requires the memory to be consistent at specific synchronization points
SW release consistency (I)
• Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
  – P(&mutex) and V(&mutex)
  – lock(&csect) and unlock(&csect)
  – acquire( ) and release( )
• Unprotected accesses can produce unpredictable results
SW release consistency (II)
• SW release consistency will only guarantee correctness of operations performed within an acquire/release pair
• No need to export the new values of shared variables until the release
• Must guarantee that the workstation has received the most recent values of all shared variables when it completes an acquire
SW release consistency (III)
shared int x;

Process 1:
  acquire( );
  x = 1;
  release( );   // export x = 1

Process 2:
  acquire( );   // wait for new value of x
  x++;
  release( );   // export x = 2
SW release consistency (IV)
• Must still decide how to release updated values
  – TreadMarks uses lazy release:
    • Delays propagation until an acquire is issued
  – Its predecessor Munin used eager release:
    • New values of shared variables were propagated at release time
False sharing
[Figure: one process accesses x while another accesses y; x and y lie on the same page]

• The page containing x and y will move back and forth between the main memories of the workstations
Multiple write protocol (I)
• Designed to fight false sharing
• Uses a copy-on-write mechanism
• Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write
• First attempt to modify the contents of the page will result in the creation of a copy of the page modified (the twin)
Multiple write protocol (II)
• At release time, TreadMarks
  – Performs a word by word comparison of the page and its twin
  – Stores the diff in the space used by the twin page
  – Informs all processors having a copy of the shared data of the update
    • These processors will request the diff the first time they access the page
Example

[Figure: before the first write access, the page contains x = 1 and y = 2; the first write creates the twin, holding x = 1 and y = 2; after the write the page contains x = 3 and y = 2. Comparing the page with its twin yields the diff: new value of x is 3]
Multiple write protocol (III)
• TreadMarks could, but does not, check for conflicting updates to write-shared pages
The TreadMarks system
• Runs entirely at user-level
• Links to programs written in C, C++ and Fortran
• Uses UDP/IP for communication (or AAL3/4 if machines are connected by an ATM LAN)
• Uses the SIGIO signal to speed up processing of incoming requests
• Uses the mprotect( ) system call to control access to shared pages
Performance evaluation (II)
• A previous paper compared the performance of TreadMarks with that of Munin
  – Munin performance typically was within 5 to 33% of the performance of hand-coded message passing versions of the same programs
  – TreadMarks was almost always better than Munin, with one exception:
    • A 3-D FFT program
Performance Evaluation (III)
• The 3-D FFT program was an iterative program that read some shared data outside any critical section
  – Doing otherwise would have been too costly
• Munin used eager release, which ensured that the values read were not far from their true value
• Not true for TreadMarks!
Other DSM Implementations (I)
• Sequentially-consistent software DSM (IVY):
  – Sends messages to other copies at each write
  – Much slower
• Software release consistency with eager release (Munin)
Other DSM Implementations (II)
• Entry consistency (Midway):
  – Requires each variable to be associated with a synchronization object (typically a lock)
  – Acquire/release operations on a given synchronization object only involve the variables associated with that object
  – Requires less data traffic
  – Does not handle dusty decks well
Other DSM Implementations (III)
• Structured DSM systems (Linda):
  – Offer to the programmer a shared tuple space accessed using specific synchronized methods
  – Require a very different programming style