Page 1: Munin, Clouds and Treadmarks

Munin, Clouds and Treadmarks

Distributed Shared Memory course

Taken from a presentation by Maya Maimon (University of Haifa, Israel).

Page 2: Munin, Clouds and Treadmarks

Introduction

There are two fundamental models for parallel programming:

Shared memory model: a direct extension of the conventional uniprocessor model, in which every processor is given the abstraction of a single memory in the machine. An update to shared data therefore becomes visible to all processors in the system.

Distributed memory model: each processor has a private memory to which no other processor has direct access; communication happens by explicit message passing.

Page 3: Munin, Clouds and Treadmarks

The Goal

Distributed memory machines are easier to build. The shared memory programming model is, however, more attractive, since the message passing paradigm requires programmers to explicitly partition data and manage communication. Using a programming model that supports a global address space, an application programmer can focus on algorithmic development rather than on managing partitioned data sets and communicating values. A distributed shared memory (DSM) system provides the shared memory programming model on a distributed memory machine.

Page 4: Munin, Clouds and Treadmarks

Distributed Shared Memory

The system consists of the same hardware as a distributed memory machine, plus a software layer that provides the abstraction of a single shared memory.

In practice, each memory remains physically independent, and all communication takes place through explicit message passing performed by the DSM software layer.

DSM systems combine the best features of the two models: they support the convenient shared memory programming model on distributed memory hardware, which is more scalable and less expensive to build.

Page 5: Munin, Clouds and Treadmarks

Software Distributed Shared Memory

[Figure: Node 1 through Node n, each with its own memory, connected by a network. A software layer on top provides the abstraction of a distributed shared memory: a single system image with a shared virtual address space, implemented page based using memory permissions.]

Page 6: Munin, Clouds and Treadmarks

The Challenge: Performance

What is the source of the overhead in DSM?

The large amount of communication that is required to maintain consistency; in other words, the maintenance of the shared memory abstraction.

Page 7: Munin, Clouds and Treadmarks

False Sharing

False sharing is a situation in which two or more processes access different variables within a page, and at least one of the accesses is a write.

If only one process is allowed to write to a page at a time, false sharing leads to unnecessary communication, called the "ping-pong" effect.
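To make the ping-pong effect concrete, here is a minimal C sketch of two data layouts, assuming a 4 KB page and a page-granularity single-writer protocol; the layout and the page size are illustrative assumptions, not the parameters of any particular DSM system.

#define PAGE_SIZE 4096

/* x and y are logically unrelated, but they share one page. If
 * process A keeps writing x while process B keeps reading y, a
 * single-writer protocol bounces the whole page between A and B
 * on every access: the "ping-pong" effect. */
struct bad_layout {
    int x;   /* written by process A */
    int y;   /* read by process B    */
};

/* One mitigation: pad so that each variable gets its own page
 * (assuming the structure itself starts on a page boundary). */
struct padded_layout {
    int  x;
    char pad[PAGE_SIZE - sizeof(int)];
    int  y;
};

Padding trades memory for communication; the relaxed-consistency and multiple-writer techniques discussed below attack the same problem without changing the data layout.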

Page 8: Munin, Clouds and Treadmarks

Understanding False Sharing

[Figure: process A repeatedly performs w(x) while process B repeatedly performs r(y). When x and y lie on the same page p, the page bounces between A and B on every access. When x lies on page p1 and y on page p2, each process keeps its own page and no communication occurs.]

Page 9: Munin, Clouds and Treadmarks

Techniques for Reducing the Amount of Communication

1) Software release consistency
2) Multiple consistency protocols
3) Write-shared protocol
4) An update-with-timeout mechanism

These techniques have been incorporated in the Munin DSM system.

Page 10: Munin, Clouds and Treadmarks

The Need for Relaxed Consistency Schemes

Any implementation of sequential consistency needs some global control mechanism. If a processor writes to shared data, all the others must learn about it before it can write again. Either writes or reads must trigger memory synchronization operations.

In most implementations, writes require some kind of memory synchronization:

[Figure: process A performs w(x), w(y), w(x); each write must be propagated to B before the next one can proceed.]

Page 11: Munin, Clouds and Treadmarks

1) Relaxed Consistency Schemes

Relaxed consistency schemes are designed to allow fewer memory synchronization operations. Writes can be delayed, aggregated, or eliminated. This results in less communication and therefore higher performance.

[Figure: process A performs w(x), w(y), w(x) with no per-write synchronization; A and B synchronize only at a barrier.]

Page 12: Munin, Clouds and Treadmarks

The Idea

Release consistency exploits the fact that programmers use synchronization to separate accesses to shared variables by different threads. The system then only needs to guarantee that memory is consistent at (selected) synchronization points.
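As a sketch of the programming idiom this relies on, consider the following C fragment; the dsm_lock_* functions and the shared_counter variable are hypothetical placeholders, not the API of Munin or any other system discussed here.

/* All accesses to shared data are bracketed by acquire/release on
 * a lock; the DSM layer only has to act at those two points. */
extern void dsm_lock_acquire(int lock_id);
extern void dsm_lock_release(int lock_id);

extern int *shared_counter;   /* lives in DSM-managed shared memory */

void increment_counter(void)
{
    dsm_lock_acquire(0);      /* memory is made consistent here ...  */
    (*shared_counter)++;      /* ... so this sees the latest value   */
    dsm_lock_release(0);      /* ... and this write need only become
                                 visible to the next acquirer        */
}

Because every access to shared_counter is bracketed this way, the system only has to make memory consistent at the acquire and the release, not at every individual read and write.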

Page 13: Munin, Clouds and Treadmarks

False Sharing in Relaxed Consistency Schemes

False sharing has a much smaller overhead in relaxed consistency models. The overhead induced by false sharing can be further reduced by the use of multiple-writer protocols, which allow multiple processes to simultaneously modify their local copies of a shared page.

The modifications are merged at certain points of the execution.

Page 14: Munin, Clouds and Treadmarks

Release Consistency [Gharachorloo et al. 1990, DASH](*)

Introduces a special type of variable, called a synchronization variable or lock. Locks cannot be read or written to; they can be acquired and released. For a lock L those operations are denoted by acquire(L) and release(L) respectively. We say that a process that has acquired a lock L but has not yet released it holds the lock L. No more than one process can hold a lock L at a time: one process holds the lock while the others wait.

(*) K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26. IEEE, May 1990.

Page 15: Munin, Clouds and Treadmarks

Using Release and Acquire to Define Execution-Flow Synchronization Primitives

Let one set of processes release tokens by reaching the release operation in their program order. Let another set (possibly overlapping) acquire those tokens by performing the acquire operation, where an acquire can proceed only when all tokens have arrived from all releasing processes.

2-way synchronization = lock/unlock: 1 release, 1 acquire
n-way synchronization = barrier: n releases, n acquires
PARC's synch = k-way synchronization
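As a concrete illustration of the n-way case, here is a minimal counting barrier in C using plain pthreads: each arriving thread releases one token, and no thread leaves until all n tokens of the current round have arrived. This is an ordinary shared-memory sketch, not code from any of the DSM systems discussed here.

#include <pthread.h>

struct barrier {
    pthread_mutex_t mu;
    pthread_cond_t  all_arrived;
    int n;          /* number of participating threads     */
    int arrived;    /* tokens released so far this round   */
    int phase;      /* distinguishes successive rounds     */
};

void barrier_wait(struct barrier *b)
{
    pthread_mutex_lock(&b->mu);
    int my_phase = b->phase;
    b->arrived++;                      /* release my token */
    if (b->arrived == b->n) {          /* last arrival opens the barrier */
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        /* acquire: proceed only when all n tokens of this phase arrived */
        while (b->phase == my_phase)
            pthread_cond_wait(&b->all_arrived, &b->mu);
    }
    pthread_mutex_unlock(&b->mu);
}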

Page 16: Munin, Clouds and Treadmarks

Formal Definition of Release Consistency

Conditions for Release Consistency:

(A) Before a read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed, and

(B) before a release access is allowed to perform with respect to any other processor, all previous read or write accesses must be performed, and

(C) acquire and release accesses are sequentially consistent.

Page 17: Munin, Clouds and Treadmarks

Understanding RC

[Figure: process A performs w(x)1 and then rel(L1); process B, on the same time axis, performs r(x)0, r(x)?, acq(L1), r(x)1, r(x)1.]

From the release onward, all processes must see the value 1 in x. Before that, it is undefined what value a read returns; it can be any value written by some process, so here it can be 0 or 1. Just before B's acquire, 1 must be readable according to rule (B), but the programmer cannot rely on it. After acq(L1), the programmer is sure the reads return 1, according to rules (C) and (A).

Page 18: Munin, Clouds and Treadmarks

Acquire and Release

Release serves as a memory-synchronization operation: a flush of the local modifications to the attention of all other processes.

By the definition, the acquire and release operations are used not only for synchronization of execution, but also for synchronization of memory, i.e. for the propagation of writes from/to other processes.

This allows the two expensive kinds of synchronization to be overlapped. It also turns out to be semantically simpler for the programmer.

Page 19: Munin, Clouds and Treadmarks

Acquire and Release (cont.)

A release followed by an acquire of the same lock guarantees to the programmer that all writes prior to the release will be seen by all reads following the acquire. The idea is to let the programmer decide which blocks of operations need to be synchronized, and to put them between a matching pair of acquire/release operations. In the absence of release/acquire pairs, there is no assurance that modifications will ever propagate between processes.

Page 20: Munin, Clouds and Treadmarks

Implementing RC

The first implementation was proposed by the inventors of RC and is called DASH. DASH combats memory latency by pipelining writes to shared memory.

The processor is stalled only when executing a release, at which time it must wait for all of its previous writes to perform.

[Figure: P1 performs w(x), w(y), w(z); each write is sent to P2 individually and individually acknowledged. P1 stalls at rel(L) until all three acks have arrived.]

Page 21: Munin, Clouds and Treadmarks

Implementing RC (cont.)

It is important to reduce the number of message exchanges, because every message has a fixed overhead, independent of its size. Another implementation of RC, called Munin, reduces the number of messages by buffering writes until a release.

[Figure: P1 performs w(x), w(y), w(z); at rel(L) the buffered updates x, y, z are sent to P2 in a single message and acknowledged with a single ack(x, y, z).]
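The following C sketch shows the buffering idea under assumed names and data structures; note_write and send_updates are placeholders, and Munin actually detects writes through virtual memory protection rather than explicit hooks.

#include <stddef.h>

#define MAX_PENDING 128

struct pending_write {
    void  *addr;    /* location modified since the last release */
    size_t len;     /* number of bytes modified                 */
};

static struct pending_write buffer[MAX_PENDING];
static int n_pending = 0;

/* Hypothetical hook invoked on each shared write. A real system
 * would flush or grow the buffer when it fills up. */
void note_write(void *addr, size_t len)
{
    buffer[n_pending].addr = addr;
    buffer[n_pending].len  = len;
    n_pending++;
}

/* Hypothetical transport: one message to every caching process. */
extern void send_updates(const struct pending_write *w, int n);

/* On release, all accumulated writes go out as a single message,
 * trading per-write latency for far fewer message exchanges. */
void on_release(void)
{
    send_updates(buffer, n_pending);
    n_pending = 0;
}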

Page 22: Munin, Clouds and Treadmarks

Eager Release Consistency [Carter et al. 1991, Munin](*)

An implementation of release consistency (not a new memory model). Sending modifications is postponed until the next release: upon a release, all accumulated modifications are sent to all caching processes. No memory-synchronization operations are needed on an acquire. Upon a miss (no local copy of the variable), the latest modification is fetched from the latest modifier (this needs some extra control to record the modifier's identity, but nothing elaborate).

(*) John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, October 1991.

Page 23: Munin, Clouds and Treadmarks

Understanding ERC

A release operation does not complete (is not performed) until the acknowledgements from all the processes have been received.

[Figure: processes A, B, and C on a common time axis. A acquires L1, writes x, and releases L1; at the release the accumulated changes (x, y) are sent to the other processes, which apply them. C acquires L2, writes z, and releases L2, sending z in turn. Reads such as r(x)0 before the changes are applied, and r(x)1, r(y)1, r(z)1 afterwards, show when the new values become visible.]

Page 24: Munin, Clouds and Treadmarks

Release vs. Sequential Consistency

1. Ordinary reads and writes can be buffered or pipelined between synchronization points.

2. Ordinary reads and writes following a release do not have to be delayed for the release to complete.

3. An acquire access does not have to delay for previous ordinary reads and writes to complete.

Page 25: Munin, Clouds and Treadmarks

2) Write-Shared Protocol (Supporting Multiple Writers in ERC)

Modifications are detected by twinning. When a process writes to an unmodified page, a twin (copy) of the page is created. At release time, the final copy of the page is compared to its twin; the resulting difference is called a diff.

Twinning and diffing not only allow multiple writers, but also reduce communication: sending a diff is cheaper than sending an entire page.

Page 26: Munin, Clouds and Treadmarks

Twinning and Diffing

[Figure: on the first write to page P, a twin of P is saved and the working copy becomes writable; at release, the working copy is compared against the twin to produce a diff.]
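A minimal C sketch of the mechanism, assuming byte-granularity diffs encoded as (offset, value) pairs; real systems drive make_twin from a write fault on a protected page and use a more compact encoding, and error handling is omitted here.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* First write to a clean page: save a twin of its current contents. */
unsigned char *make_twin(const unsigned char *page)
{
    unsigned char *twin = malloc(PAGE_SIZE);
    memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* At release: record every byte that differs from the twin as an
 * (offset, value) pair. Returns the number of pairs produced. */
size_t make_diff(const unsigned char *page, const unsigned char *twin,
                 size_t offsets[], unsigned char values[])
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (page[i] != twin[i]) {
            offsets[n] = i;
            values[n]  = page[i];
            n++;
        }
    }
    return n;   /* usually far smaller than PAGE_SIZE */
}

/* At the receiver: merging a diff is just replaying the pairs. */
void apply_diff(unsigned char *page, const size_t offsets[],
                const unsigned char values[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        page[offsets[i]] = values[i];
}

Because concurrent writers to the same page touch disjoint bytes in a correct (data-race-free) program, their diffs can be applied in either order and the merge is well defined; this is what makes multiple writers per page possible.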

Page 27: Munin, Clouds and Treadmarks

3) Multiple Consistency Protocols

Using a single protocol for all shared data leads to a situation where some programs can be handled effectively by a given DSM system while others cannot. Types of access patterns that should be supported:

1. Conventional shared
2. Read only
3. Migratory
4. Write shared
5. Synchronization

Page 28: Munin, Clouds and Treadmarks

Update-Based vs. Invalidate-Based

In update-based protocols the modifications themselves are sent, whereas in invalidate-based protocols only notifications of modifications are sent.

[Figure: P1 performs w(x)1, w(y)2, rel(L). Update-based: P1 sends "x:=1, y:=2" to P2. Invalidate-based: P1 sends only "I changed x and y".]

Page 29: Munin, Clouds and Treadmarks

Update-Based vs. Invalidate-Based (cont.)

Invalidations are smaller than updates, and the bigger the coherency unit, the bigger the difference. On the other hand, invalidation-based schemes can incur significant overhead due to access misses.

[Figure: P1 performs w(x)1, w(y)2, rel(L) and sends inv(x), inv(y) to P2. After acq(L), P2's r(x) misses and must fetch x (get(x), x=1), and its r(y) misses and must fetch y (get(y), y=2).]

Page 30: Munin, Clouds and Treadmarks

4) Update-with-Timeout Mechanism

The problem: updates to a data item are propagated to all of its replicas, including those that are no longer being used. In DSM systems this problem becomes even more severe, because the main memories of the nodes holding the replicas are very large, so it takes a long time before a page gets replaced, if ever.

Solution: invalidate the old copies.
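A minimal C sketch of the timeout idea, with all names and bookkeeping invented for illustration; in particular, a real system must somehow learn when remote replicas were last accessed, which is glossed over here.

#include <time.h>
#include <stdbool.h>

#define TIMEOUT_SECONDS 60

struct replica {
    int    node;          /* node holding the copy      */
    time_t last_access;   /* updated on every local use */
    bool   valid;
};

/* Hypothetical transports. */
extern void send_update(int node);
extern void send_invalidate(int node);

/* Before pushing an update, drop replicas that have gone unused
 * for too long; they stop costing update traffic from now on. */
void propagate(struct replica *r, int n)
{
    time_t now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (!r[i].valid)
            continue;
        if (now - r[i].last_access > TIMEOUT_SECONDS) {
            send_invalidate(r[i].node);   /* stale copy: drop it    */
            r[i].valid = false;
        } else {
            send_update(r[i].node);       /* active copy: keep fresh */
        }
    }
}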

Page 31: Munin, Clouds and Treadmarks

Reducing the Number of Messages

In the DASH and Munin systems, all processes (or all processes that cache the page) see the updates of a process. Consider the following example of an execution in Munin:

[Figure: P1 performs w(x) and rel(L); P2 performs acq(L), w(x), rel(L); P3 performs acq(L), w(x), rel(L); P4 performs acq(L) and r(x). Each release broadcasts the update of x to all the other processes.]

There are many unneeded messages (in DASH even more), and the problem exists in invalidation-based schemes as well.

Page 32: Munin, Clouds and Treadmarks

Reducing the Number of Messages (cont.)

Logically, however, it suffices to update each processor's copy only when it acquires L.

[Figure: the same execution, but the modification of x travels only with the lock, from P1 to P2 to P3 to P4, one message per acquire.]

Therefore a new algorithm for implementing RC, called Lazy Release Consistency (LRC), was proposed. LRC aims to reduce both the number of messages and the amount of data exchanged.

Page 33: Munin, Clouds and Treadmarks

Lazy Release Consistency [Keleher et al., Treadmarks 1992](*)

The idea is to postpone the sending of modifications until a remote processor actually needs them.

An invalidate-based protocol. The BIG advantage: no need to fetch modifications that are irrelevant because they are already masked by newer ones.

NOTE: this implements a slightly more relaxed memory model than RC!

(*) P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. Treadmarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-132, January 1994.

Page 34: Munin, Clouds and Treadmarks

Lazy Release Consistency (cont.)

Enforces consistency at the time of an acquire, and causes fewer messages to be sent. At the time of a lock release, Munin sends messages to all processors that cache data modified by the releasing processor. In contrast, in lazy release consistency, consistency messages travel only between the last releaser and the new acquirer.

But: it is somewhat more complicated. After a release, Munin can forget about all modifications the releasing processor made prior to the release; a lazy implementation cannot.
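The following C sketch shows the acquire side of the lazy idea, using invented names and structures; "write notices" and "intervals" follow the vocabulary of the Treadmarks papers, but this is not the actual implementation.

#include <stdbool.h>

#define N_PAGES 1024

struct write_notice {        /* "page p was modified in interval i" */
    int page;
    int interval;
};

static bool page_valid[N_PAGES];

/* Hypothetical transport: returns the write notices this acquirer
 * has not yet seen, as recorded by the last releaser of the lock. */
extern int fetch_write_notices(int lock_id, struct write_notice out[]);

void lrc_acquire(int lock_id)
{
    struct write_notice wn[N_PAGES];
    int n = fetch_write_notices(lock_id, wn);

    /* Invalidate rather than update: data moves only on demand, so
     * modifications masked by newer ones are never transferred. */
    for (int i = 0; i < n; i++)
        page_valid[wn[i].page] = false;
}

Pages are then fetched, as diffs, only when a later access actually faults on an invalidated page.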

Page 35: Munin, Clouds and Treadmarks

Multiple Writer Protocol in Treadmarks

The problem with a single-writer protocol: false sharing can cause single-writer protocols to perform even worse. Multiple-writer protocols allow multiple writers to simultaneously modify the same page, with consistency traffic deferred until a later time, in particular until synchronization occurs, using diffs.

Page 36: Munin, Clouds and Treadmarks

Multiple Writer Protocol in Treadmarks (cont.)

The primary benefit of diffs is that they make multiple-writer protocols possible, but they also significantly reduce overall bandwidth requirements, because diffs are typically much smaller than a page.

Page 37: Munin, Clouds and Treadmarks

Formal Definition of Lazy Release Consistency

Conditions for Lazy Release Consistency:

(A) Before a read or write access is allowed to perform with respect to any other process, all previous acquire accesses must be performed with respect to that other process, and

(B) before a release access is allowed to perform with respect to any other process, all previous read or write accesses must be performed with respect to that other process, and

(C) acquire and release accesses are sequentially consistent.

Page 38: Munin, Clouds and Treadmarks

Understanding the LRC Memory Model

[Figure: process A performs w(x)1 and rel(L1). Process B performs r(x)0, r(x)?, acq(L1), r(x)1, r(x)1. Process C performs r(x)0, r(x)?, acq(L2), r(x)?, r(x)?; C acquires a different lock, so all of its reads remain undefined.]

It is guaranteed only that an acquirer of the same lock sees the modifications that precede the release in program order.

Page 39: Munin, Clouds and Treadmarks

Clouds

Clouds is a distributed operating system that integrates a set of loosely coupled machines into a conceptually centralized system. The system is composed of:

Compute servers
Data servers
User workstations

Compute server and data server are logical designations.

Page 40: Munin, Clouds and Treadmarks

The Object-Oriented Model

Structures a software system as a set of objects. An object consists of data and operations, and objects respond to messages.

[Figure: a message is sent to object 1; object 1 responds by executing a method; the method reads or updates the object's data and may send messages to other objects; in the end, the method sends a reply to the message sender.]

Page 41: Munin, Clouds and Treadmarks

The Object-Thread Model

Clouds has a similar object/thread structure, implemented at the operating system level. Clouds objects are large-grained encapsulations of code and data which model distinct, persistent virtual address spaces. The contents of an object are an instance of a class; a class is a compiled program module. Unlike virtual address spaces in conventional operating systems, the contents of a Clouds object are long lived. That is, a Clouds object exists forever and survives system crashes and shutdowns (like a file) until it is explicitly deleted.

Page 42: Munin, Clouds and Treadmarks

Execution

A thread is a logical path of execution that executes code in objects, traversing objects as it executes. Unlike processes in a conventional operating system, a Clouds thread is not bound to a single address space, since each object represents a distinct address space.

Page 43: Munin, Clouds and Treadmarks

Execution Example

[Figure: a thread, created by a user or a program, executes an entry point in an object and accesses the persistent data in that object. It may invoke operations in other objects; the thread temporarily leaves the calling object, enters the called object, and later returns to the calling object.]

Page 44: Munin, Clouds and Treadmarks

Clouds (cont.)

Clouds uses objects as the abstraction of storage and threads as the abstraction of computation.

Clouds is a general-purpose operating system, intended to support all types of languages and applications, distributed or not. All applications can view the system as a monolith, but distributed applications may choose to view it as loosely coupled compute and data servers. Each compute facility in Clouds has access to all resources in the system.

Page 45: Munin, Clouds and Treadmarks

Distributed Computations

A novel feature of Clouds that makes distributed programming simpler is the automatic migration of objects via the Clouds object invocation and distributed shared memory (DSM) mechanisms.

Page 46: Munin, Clouds and Treadmarks

Synchronous Invocation

[Figure: compute server A creates thread 1 to invoke object O1. As thread 1 executes, the data and code pages of O1 are demand-paged from the data server to A.]

While executing in O1, thread 1 invokes O2. If the system chooses to run the invocation on A, the pages of O2, which reside on another node, are demand-paged to node A. Alternatively, another node B performs the computation: the thread invokes B to perform the operation on O2 (B demand-pages O2 from the data server), and the results are then returned to A.

Page 47: Munin, Clouds and Treadmarks

Synchronous Invocation (cont.)

This invocation mechanism is called synchronous because execution in the current environment of the invoking thread waits until the invoked operation completes and the results are returned.

Page 48: Munin, Clouds and Treadmarks

Asynchronous Invocation

When thread t1, executing in object 1, asynchronously invokes an operation of object 2, a new thread t2 is created to execute the invoked operation. The asynchronous invocation returns a variable p, which can be used as a placeholder for an object return value of any type. Thread t1 can continue executing in object 1. At a later time, t1 can wait for the completion of the operation being executed by t2 by executing a claim call on p. When claim(p) returns, t2 has completed and the results returned by the invoked operation are available to t1 through p.
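The placeholder p behaves like what is now called a future or promise. The following C sketch reproduces the claim behavior with plain pthreads; it illustrates the pattern and is not the Clouds kernel interface.

#include <pthread.h>

struct placeholder {
    pthread_mutex_t mu;
    pthread_cond_t  done;
    int             ready;
    void           *result;   /* return value of any type */
};

/* Fields must be initialized before use, e.g.:
 * struct placeholder p = { PTHREAD_MUTEX_INITIALIZER,
 *                          PTHREAD_COND_INITIALIZER, 0, NULL }; */

/* Called by t2 when the invoked operation completes. */
void fulfill(struct placeholder *p, void *result)
{
    pthread_mutex_lock(&p->mu);
    p->result = result;
    p->ready  = 1;
    pthread_cond_signal(&p->done);
    pthread_mutex_unlock(&p->mu);
}

/* Called by t1: blocks until t2 has completed, then yields the result. */
void *claim(struct placeholder *p)
{
    pthread_mutex_lock(&p->mu);
    while (!p->ready)
        pthread_cond_wait(&p->done, &p->mu);
    void *r = p->result;
    pthread_mutex_unlock(&p->mu);
    return r;
}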

Page 49: Munin, Clouds and Treadmarks

More Details

The differences from RPC: object migration. In Clouds, a DSM coherence protocol is used to ensure that the data in an object is seen by concurrent threads in a consistent fashion, even if they are executing on different compute servers. Since several threads can execute in an object simultaneously, access to the object's data must be coordinated. This is handled by Clouds object programmers using Clouds system synchronization mechanisms such as semaphores and locks. Through these mechanisms, threads may synchronize their actions regardless of where they execute.

Page 50: Munin, Clouds and Treadmarks

Advantages

An application programmer does not have to be aware of distribution: a distributed program can be written as a centralized program in which processes communicate and synchronize using shared memory, and the degree of distribution or concurrency does not have to be decided at the time the program is written. In addition, Clouds treats computation and storage orthogonally, and it uses a single-level persistent store for all objects.

Page 51: Munin, Clouds and Treadmarks

Summary

Munin - reduced communication, mainly through (eager) release consistency.
Treadmarks - lazy release consistency.
Clouds - distributed computation and data.