TreadMarks: Shared Memory Computing on Networks of Workstations
C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
Rice University


Jan 02, 2016

Transcript
Page 1: TreadMarks: Shared Memory Computing on Networks of Workstations

TreadMarks: Shared Memory Computing on Networks of Workstations

C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
Rice University

Page 2: INTRODUCTION

• Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space

• Key issue in building a software DSM is minimizing the amount of data communication among the workstation memories

Page 3: Why bother with DSM?

• Key idea is to build fast parallel computers that are
– Cheaper than shared-memory multiprocessor architectures
– As convenient to use

Page 4: Conventional parallel architecture

[Diagram: four CPUs, each with its own cache, connected to a single shared memory]

Page 5: Today's architecture

• Clusters of workstations are much more cost effective
– No need to develop complex bus and cache structures
– Can use off-the-shelf networking hardware
  • Gigabit Ethernet
  • Myrinet (1.5 Gb/s)
– Can quickly integrate newest microprocessors

Page 6: Limitations of cluster approach

• Communication within a cluster of workstations is through message passing
– Much harder to program than concurrent access to a shared memory
• Many big programs were written for shared-memory architectures
– Converting them to a message-passing architecture is a nightmare

Page 7: Distributed shared memory

DSM = one shared global address space

[Diagram: the main memories of the workstations presented as one shared global address space]

Page 8: Distributed shared memory

• DSM makes a cluster of workstations look like a shared-memory parallel computer
– Easier to write new programs
– Easier to port existing programs
• Key problem is that DSM only provides the illusion of a shared-memory architecture
– Data must still move back and forth among the workstations

Page 9: Munin

• Developed at Rice University
• Based on software objects (variables)
• Used the processor virtual memory to detect access to the shared objects
• Included several techniques for reducing consistency-related communication
• Only ran on top of the V kernel

Page 10: Munin main strengths

• Excellent performance
• Portability of programs
– Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimum number of changes ("dusty decks")

Page 11: Munin main weakness

• Very poor portability of Munin itself
– Depended on some features of the V kernel
• Not maintained since the late 80's

Page 12: TreadMarks

• Provides DSM as an array of bytes
• Like Munin,
– Uses release consistency
– Offers a multiple-writer protocol to fight false sharing
• Runs at user level on a number of UNIX platforms
• Offers a very simple user interface

Page 13: First example: Jacobi iteration

• Illustrates the use of barriers
– A barrier is a synchronization primitive that forces processes accessing it to wait until all processes have reached it
• Forces processes to wait until all of them have completed a specific step

Page 14: Jacobi iteration: overall organization

• Operates on a two-dimensional array
• Each processor works on a specific band of rows
– Boundary rows are shared

[Diagram: the array divided into horizontal bands, Proc 0 through Proc n-1]

Page 15: Jacobi iteration: overall organization

• During each iteration step, each array element is set to the average of its four neighbors
– Averages are stored in a scratch matrix and copied later into the shared matrix

Page 16: Jacobi iteration: the barriers

• Mark the end of each computation phase
• Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed"
• Include an implicit release() followed by an implicit acquire()
– To be explained later

Page 17: Jacobi iteration: declarations

#define M
#define N
float *grid            // shared array
float scratch[M][N]    // private array

Page 18: Jacobi iteration: startup

main() {
  Tmk_startup();
  if (Tmk_proc_id == 0) {
    grid = Tmk_malloc(M*N*sizeof(float));
    initialize grid;
  } // if
  Tmk_barrier(0);
  length = M/Tmk_nprocs;
  begin = length*Tmk_proc_id;
  end = length*(Tmk_proc_id + 1);

Page 19: Jacobi iteration: main loop

for (number of iterations) {
  for (i = begin; i < end; i++)
    for (j = 0; j < N; j++)
      scratch[i][j] = (grid[i-1][j] + … + grid[i][j+1])/4;
  Tmk_barrier(1);
  for (i = begin; i < end; i++)
    for (j = 0; j < N; j++)
      grid[i][j] = scratch[i][j];
  Tmk_barrier(2);
} // main loop
} // main

Page 20: Second example: TSP

• Traveling salesman problem
– Finding the shortest path through a number of cities
• Program keeps a queue of partial tours
– Most promising at the end

Page 21: TSP: declarations

queue_type *Queue;
int *Shortest_length;
int queue_lock_id, min_lock_id;

Page 22: TSP: startup

main() {
  Tmk_startup();
  queue_lock_id = 0;
  min_lock_id = 1;
  if (Tmk_proc_id == 0) {
    Queue = Tmk_malloc(sizeof(queue_type));
    Shortest_length = Tmk_malloc(sizeof(int));
    initialize Heap and Shortest_length;
  } // if
  Tmk_barrier(0);

Page 23: TSP: while loop

while (true) {
  Tmk_lock_acquire(queue_lock_id);
  if (queue is empty) {
    Tmk_lock_release(queue_lock_id);
    Tmk_exit();
  } // if
  Keep adding to queue until a long promising tour appears at the head
  Path = Delete the tour from the head
  Tmk_lock_release(queue_lock_id);

Page 24: TSP: end of main

  length = recursively try all cities not on Path, find the shortest tour length
  Tmk_lock_acquire(min_lock_id);
  if (length < *Shortest_length)
    *Shortest_length = length;
  Tmk_lock_release(min_lock_id);
} // while
} // main

Page 25: Critical sections

• All accesses to shared variables are surrounded by a pair

Tmk_lock_acquire(lock_id);
…
Tmk_lock_release(lock_id);

Page 26: Implementation Issues

• Consistency issues
• False sharing

Page 27: Consistency model (I)

• Shared data are replicated at times
– To speed up read accesses
• All workstations must share a consistent view of all data
• Strict consistency is not possible

Page 28: Consistency model (II)

• Various authors have proposed weaker consistency models
– Cheaper to implement
– Harder to use in a correct fashion
• TreadMarks uses software release consistency
– Only requires the memory to be consistent at specific synchronization points

Page 29: SW release consistency (I)

• Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
– P(&mutex) and V(&mutex)
– lock(&csect) and unlock(&csect)
– acquire( ) and release( )
• Unprotected accesses can produce unpredictable results

Page 30: SW release consistency (II)

• SW release consistency only guarantees correctness of operations performed within an acquire/release pair
• No need to export the new values of shared variables until the release
• Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire

Page 31: SW release consistency (III)

Process 0:
  shared int x;
  acquire( );
  x = 1;
  release( );   // export x=1

Process 1:
  shared int x;
  acquire( );   // wait for new value of x
  x++;
  release( );   // export x=2

Page 32: SW release consistency (IV)

• Must still decide how to release updated values
– TreadMarks uses lazy release:
  • Delays propagation until an acquire is issued
– Its predecessor Munin used eager release:
  • New values of shared variables were propagated at release time

Page 33: SW release consistency (V)

[Diagram: timelines contrasting eager release and lazy release]

Page 34: False sharing

[Diagram: one process accesses x while another accesses y, with x and y on the same page]

The page containing x and y will move back and forth between the main memories of the workstations

Page 35: Multiple writer protocol (I)

• Designed to fight false sharing
• Uses a copy-on-write mechanism
• Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write
• First attempt to modify the contents of the page will result in the creation of a copy of the page being modified (the twin)

Page 36: Creating a twin

[Diagram: creating a twin]

Page 37: Multiple writer protocol (II)

• At release time, TreadMarks
– Performs a word-by-word comparison of the page and its twin
– Stores the diff in the space used by the twin page
– Informs all processors having a copy of the shared data of the update
• These processors will request the diff the first time they access the page

Page 38: Creating a diff

[Diagram: creating a diff]

Page 39: TreadMarks: Shared Memory Computing on Networks of Workstations

x = 1

y = 2

x = 1

y = 2

First write access

twin

x = 3

y = 2

Before

After

Compare with twinNew value of x is 3

Example

Page 40: Multiple writer protocol (III)

• TreadMarks could, but does not, check for conflicting updates to write-shared pages

Page 41: The TreadMarks system

• Runs entirely at user level
• Links to programs written in C, C++ and Fortran
• Uses UDP/IP for communication (or AAL3/4 if machines are connected by an ATM LAN)
• Uses the SIGIO signal to speed up processing of incoming requests
• Uses the mprotect( ) system call to control access to shared pages

Page 42: Performance evaluation (I)

• Long discussion of two large TreadMarks applications

Page 43: Performance evaluation (II)

• A previous paper compared the performance of TreadMarks with that of Munin
– Munin performance typically was within 5 to 33% of the performance of hand-coded message-passing versions of the same programs
– TreadMarks was almost always better than Munin, with one exception:
  • A 3-D FFT program

Page 44: Performance Evaluation (III)

• The 3-D FFT program was an iterative program that read some shared data outside any critical section
– Doing otherwise would have been too costly
• Munin used eager release, which ensured that the values read were not far from their true values
• Not true for TreadMarks!

Page 45: Other DSM Implementations (I)

• Sequentially-consistent software DSM (IVY):
– Sends messages to other copies at each write
– Much slower
• Software release consistency with eager release (Munin)

Page 46: Other DSM Implementations (II)

• Entry consistency (Midway):
– Requires each variable to be associated with a synchronization object (typically a lock)
– Acquire/release operations on a given synchronization object only involve the variables associated with that object
– Requires less data traffic
– Does not handle dusty decks well

Page 47: Other DSM Implementations (III)

• Structured DSM systems (Linda):
– Offer the programmer a shared tuple space accessed through specific synchronized methods
– Require a very different programming style

Page 48: CONCLUSIONS

• Can build an efficient DSM entirely in user space
– Modern UNIX systems offer all the required primitives
• The software release consistency model works very well
• Lazy release is almost always better than eager release