Distributed System.pdf


Distributed systems

Takashi Nanya

Canon, Inc.

March 18–22, 2013

IIITDM-Jabalpur

“Dependable Computing”




Issues in distributed systems

• Distributed Systems

– Include arbitrary number of system processes and user processes

– Consist of two or more processor/memory modules

– Processes communicate with each other by message passing (with no shared memory)

– Global control for inter-process communication and system management

– Variable delays exist in communication between processes

• Purpose: Fault tolerance, Performance enhancement, Extensibility, Resource sharing

• All information describing the global system state (process state and data) must be maintained so that all participating processes have a consistent and identical view of the global state

• Issues – Clock Synchronization, Mutual Exclusion, Concurrency control, Multiple copy update, Error recovery




Clock Synchronization

• Distributed systems are asynchronous in nature

• Variable delays in computing and communication

• Possible inconsistency in processes recognizing the temporal ordering of event occurrences

• P2 perceives “P1 -> P2”, while P3 perceives “P2 -> P1”!

[Figure: processes P1, P2 and P3]




Atomic action (1)

• Main problem in Distributed Systems: Maintaining Consistency

• Basic concept for solution: Atomic Action

• To realize the Atomic Action (consistency control), processes need to reach a common agreement on the following:

– Temporal ordering of event occurrences in the system

– Global system state and state transition

• This agreement may be impaired by delays in inter-process communication and by faults in nodes and/or links => Clock synchronization, Byzantine agreement




Atomic action (2)

• A method of process structuring for allowing the writer of a procedure to secure the same benefit of atomicity, i.e. indivisibility, non-interference, strict sequencing

• Basic notion to solve consistency problems in distributed systems

• Generalized notion of transactions for database concurrency control

• Several definitions:

• 1) An action is atomic if the process performing it is not aware of the existence of any other active process (can detect no spontaneous state change) and no other process is aware of the activity of this process (its state changes are concealed) during the time the process is performing the action

• 2) An action is atomic if the process performing it does not communicate with other processes while it is executing the action

• 3) Actions are atomic if they can be considered, so far as other processes are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent




Nested structure of Atomic Actions

• Defined relatively at any level of process structure

• An atomic action at a level consists of two or more atomic actions at a lower level

[Figure: processes P1, P2 and P3 with nested atomic actions A, B, C, D, E, F and G]




Logical Clock (1)

L. Lamport: “Time, clocks, and the ordering of events in a distributed system”, C. ACM, Vol.21, No.7, pp.558-565 (1978)

• System: a collection of distinct processes, each of which consists of a sequence of events

• Event a “happened before” event b (denoted by a -> b):

– If a and b are in the same process, and a comes before b, then a->b

– If a is the sending of a message by one process and b is the receipt of the same message by another process, then a->b

– If a->b and b->c, then a->c

• (Logical) Clock Ci for each process Pi is defined to be a function which assigns a number Ci(a) to any event a in that process

• (Correct) Clock condition: For any events a, b, if a->b then C(a)<C(b)

• The clock condition is satisfied if the following two conditions hold:

• C1: If a and b are in process Pi, and a comes before b, then Ci(a)<Ci(b)

• C2: If a is the sending of a message by process Pi, and b is the receipt of the same message by process Pj, then Ci(a)<Cj(b)



Logical Clock (2)

• Assume that the processes are algorithms and the events represent certain actions during their execution. Process Pi’s clock is represented by a register Ci, so that Ci(a) is the value contained by Ci during the event a

• Conditions C1 and C2, and therefore the Clock Condition, are satisfied if the following implementation rules are satisfied:

• IR1: Each process Pi increments Ci between any two successive events

• IR2: (a) If event a is the sending of a message m by process Pi, then message m contains a timestamp Tm = Ci(a). (b) Upon receiving a message m, process Pj sets Cj greater than or equal to its present value and greater than Tm

• Hence, the simple implementation rules guarantee a correct system of logical clocks

• A system of clocks satisfying the Clock Condition can be used to place a total ordering on the set of all system events
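
Rules IR1 and IR2 can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the class name `LamportClock` and its methods are hypothetical, and the receive rule uses the common `max(...) + 1` variant, which is one way to satisfy IR2(b).

```python
class LamportClock:
    """Logical clock register Ci for one process (hypothetical helper)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # IR1: increment the clock between any two successive events.
        self.time += 1
        return self.time

    def send(self):
        # IR2(a): the outgoing message carries the timestamp Tm = Ci(a).
        return self.tick()

    def receive(self, tm):
        # IR2(b): move to a value >= the present value and > Tm.
        self.time = max(self.time, tm) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
a = p1.send()       # event a: P1 sends a message
p2.tick()           # an unrelated local event on P2
b = p2.receive(a)   # event b: P2 receives the message
print(a, b)         # a -> b, so C(a) < C(b)
```

Because the receiver always jumps past the timestamp it is handed, condition C2 holds without either process knowing the other's clock in advance.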



Total ordering of events

• Define a relation “=>” as follows:

• “If a is an event in process Pi and b is an event in process Pj, then a=>b if and only if either (i) Ci(a)<Cj(b) or (ii) Ci(a)=Cj(b) and Pi<<Pj, where << is any arbitrary total ordering of processes”

• Then, relation “=>” is a total ordering, i.e. relation “=>” is a way of completing the “happened before” partial ordering to a total ordering

• Total ordering is useful in implementing a distributed system!
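
As a small illustration of the relation “=>”, events can be sorted by the pair (clock value, process). The sketch below assumes that integer process ids stand in for the arbitrary ordering “<<”; the event labels and values are made up for the example.

```python
# Each event is (clock value, process id, label); the data is illustrative only.
events = [
    (2, 1, "b"),
    (1, 2, "c"),
    (1, 1, "a"),
    (2, 0, "d"),
]

# Compare clock values first; break ties with the process ordering "<<".
ordered = sorted(events, key=lambda e: (e[0], e[1]))
labels = [e[2] for e in ordered]
print(labels)
```

Events a and c carry the same clock value, so only the process ordering decides which comes first.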



Mutual Exclusion Problem

• Find an algorithm for granting a single shared resource to a process which satisfies the following three conditions:

• (I) A process which has been granted the resource must release it before it can be granted to another process.

• (II) Different requests for the resource must be granted in the order in which they are made.

• (III) If every process which is granted the resource eventually releases it, then every request is eventually granted

• This is a non-trivial problem. A central scheduling process that grants requests in the order in which it receives them will not work without additional assumptions, because requests may arrive out of order!



Distributed algorithm for M.E.

1. To request the resource, process Pi sends the message “Tm:Pi requests resource” to every other process, and puts that message on its request queue, where Tm is the timestamp of the message

2. When process Pj receives the message “Tm:Pi requests resource”, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi

3. To release the resource, process Pi removes any “Tm:Pi requests resource” message from its request queue, and sends a (timestamped) “Pi releases resource” message to every other process

4. When process Pj receives a “Pi releases resource” message, it removes any “Tm:Pi requests resource” message from its request queue

5. Process Pi is granted the resource when the following two conditions are satisfied: (i) There is a “Tm:Pi requests resource” message in its request queue which is ordered before any other request in its queue by the relation “=>” (ii) Pi has received a message from every other process timestamped later than Tm

Note that conditions (i) and (ii) of rule 5 are tested locally by Pi
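
The local test in rule 5 might be sketched as follows. This is a simplified illustration rather than a full implementation: message passing is not modelled, the function name `may_enter` and its parameters are hypothetical, and integer process ids are assumed to play the role of the ordering “<<”.

```python
def may_enter(me, others, request_queue, latest_from):
    """Rule 5, evaluated locally by process `me`.

    request_queue: set of (Tm, pid) request messages held by this process.
    latest_from:   dict pid -> largest timestamp received from that process.
    """
    mine = [r for r in request_queue if r[1] == me]
    if not mine:
        return False
    my_req = min(mine)
    # (i) the own request is ordered before every other request by "=>".
    first = all(my_req <= r for r in request_queue)
    # (ii) every other process has sent a message timestamped later than Tm.
    heard = all(latest_from.get(p, -1) > my_req[0] for p in others)
    return first and heard


queue = {(1, 0), (2, 1)}                           # P0 asked at time 1, P1 at time 2
print(may_enter(0, [1, 2], queue, {1: 3, 2: 4}))   # P0 holds the earliest request
print(may_enter(1, [0, 2], queue, {0: 5, 2: 5}))   # P1 must wait for P0's release
```

Tuples compare by timestamp first and process id second, which is the same tie-breaking used to define “=>”.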



Anomalous behavior

• Logical clock based on the relation “=>” may cause “anomalous behavior”

• This can happen because the system has no way of knowing the actual precedence information a->b that is based on the phone message external to the system

• => we need a system of physical clocks

[Figure: Computers A, B and C. Request a is issued at Computer A (TA: ReqA) and request b at Computer B (TB: ReqB), with a phone call between them. TB<TA can happen while actually a->b]



Physical clock

• Ci(t) denotes the reading of clock Ci at physical time t

• For Ci to be a true physical clock, the following must be satisfied:

– PC1: There exists a constant κ << 1 such that for all i, |dCi(t)/dt - 1| < κ

– PC2: There exists a constant ε such that for all i, j, |Ci(t) - Cj(t)| < ε

• To prevent anomalous behavior, for a number µ that is less than the shortest transmission time for interprocess messages, it must be ensured that, for any i, j, Ci(t+µ) - Cj(t) > 0

• PC1 implies that Ci(t+µ) - Ci(t) > (1-κ)µ

• Using PC2, Ci(t+µ) - Cj(t) = [Ci(t+µ) - Ci(t)] + [Ci(t) - Cj(t)] > (1-κ)µ - ε, so Ci(t+µ) - Cj(t) > 0 holds if ε/(1-κ) ≤ µ

• Let m be a message sent at physical time t and received at t’, and let the minimum transmission delay µm for m be known to the process that receives m

• Assuming PC1, PC2 can be ensured by the following Implementation Rules:

– IR1’: For each i, if Pi does not receive a message at physical time t, then Ci is differentiable at t and dCi(t)/dt > 0

– IR2’: (a) If Pi sends a message m at physical time t, then m contains a timestamp Tm = Ci(t). (b) Upon receiving a message m at time t’, process Pj sets Cj(t’) equal to MAX(Cj(t’-0), Tm+µm)

• To synchronize physical clocks, a process only needs to know its own clock reading and the timestamps of messages it receives
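
The receive rule IR2’(b) and the condition ε/(1-κ) ≤ µ are both one-line checks. The sketch below is only an illustration; the function names are hypothetical and the clock readings are assumed to be integers in some fixed unit (for example microseconds).

```python
def on_receive(c_j, tm, mu_m):
    # IR2'(b): never set the clock back; jump to Tm + mu_m if that is later.
    return max(c_j, tm + mu_m)


def prevents_anomaly(eps, kappa, mu):
    # Sufficient condition from PC1 and PC2: eps / (1 - kappa) <= mu.
    return eps / (1 - kappa) <= mu


print(on_receive(1000, 900, 20))    # local clock already ahead: unchanged
print(on_receive(1000, 1050, 20))   # local clock behind: advanced
```

The clock only ever moves forward, which keeps IR1’ (a positive rate between receipts) compatible with the jump made on receipt.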




Byzantine Generals Problem

L. Lamport, et al.: “The Byzantine generals problem”, ACM Trans. Prog. Lang. Syst., Vol.4, No.3, pp.382-401 (1982)

• The problem of coping with a situation in which one or more faulty components of a system send conflicting information to different parts of the system

• A group of generals of the Byzantine army are camped with their troops around an enemy city

• Communicating with one another only by messenger, the generals must agree upon a common battle plan

• However, some of the generals may be traitors trying to prevent the loyal generals from reaching agreement

• [Byzantine Generals Problem]: A commanding general must send an order to his n-1 lieutenant generals such that

• IC1: All loyal lieutenants obey the same order

• IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends



Impossibility for n<3m+1

• With oral messages, no solution for fewer than 3m+1 generals can cope with m traitors

• There is no way for Lieutenant 1 to distinguish between the two scenarios

[Figure: two scenarios with a Commander, Lieutenant 1 and Lieutenant 2, with the traitor marked. In one, the Commander sends “attack” to both lieutenants; in the other, the Commander sends “attack” to Lieutenant 1 and “retreat” to Lieutenant 2. In both, Lieutenant 2 tells Lieutenant 1 “he said ‘retreat’”]



Oral message for n≥3m+1

• An oral message is one whose contents are completely under the control of the sender, so a traitorous sender can transmit any possible message

• Assumptions

• A1: Every message that is sent is delivered correctly

• A2: The receiver of a message knows who sent it

• A3: The absence of a message can be detected

• We inductively define the Oral Message algorithm OM(m), for all non-negative integers m, by which a commander sends an order to n-1 lieutenants

• OM(m) solves the Byzantine Generals Problem for 3m+1 or more generals in the presence of at most m traitors

• We consider the case in which the only possible decisions are “attack” or “retreat”

• The algorithm is described in terms of Lieutenants “obtaining a value” rather than “obeying an order”



Oral Message algorithm OM(m)

• Algorithm OM(0)

• (1) The commander sends his value to every lieutenant

• (2) Each lieutenant uses the value he receives from the commander, or uses the value RETREAT if he receives no value

• Algorithm OM(m), m>0

• (1) The commander sends his value to every lieutenant

• (2) For each i, let vi be the value Lieutenant i receives from the commander, or else be RETREAT if he receives no value. Lieutenant i acts as the commander in OM(m-1) to send the value vi to each of the n-2 other lieutenants

• (3) For each i, and each j≠i, let vj be the value Lieutenant i received from Lieutenant j in step (2) (using OM(m-1)), or else RETREAT if he received no such value. Lieutenant i uses the value majority (v1, v2, …, vn-1)



Algorithm OM(1)

[Figure: two examples with a Commander and Lieutenants 1, 2 and 3. In the first, the Commander sends v to every lieutenant and one lieutenant is a traitor who relays x; Lieutenant 2 obtains the correct value v = majority (v, v, x). In the second, the Commander is a traitor and sends x, y and z; all lieutenants obtain the same value majority (x, y, z)]
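
The recursion in OM(m) can be simulated directly. The sketch below is illustrative only: the traitor behaviour in `send` (lying to odd-numbered receivers) is an arbitrary assumption, missing messages (assumption A3) are not modelled, and `majority` falls back to RETREAT when no strict majority exists.

```python
from collections import Counter

RETREAT, ATTACK = "retreat", "attack"


def majority(values):
    # Strict majority if one exists, otherwise the default RETREAT.
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else RETREAT


def send(sender, receiver, value, traitors):
    # Hypothetical traitor behaviour: flip the value for odd-numbered receivers.
    if sender in traitors and receiver % 2 == 1:
        return ATTACK if value == RETREAT else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: value obtained} after running OM(m)."""
    # Step (1): the commander sends his value to every lieutenant.
    received = {i: send(commander, i, value, traitors) for i in lieutenants}
    if m == 0:
        return received
    # Step (2): each lieutenant j acts as commander in OM(m-1) for the others.
    relayed = {
        j: om(m - 1, j, [i for i in lieutenants if i != j], received[j], traitors)
        for j in lieutenants
    }
    # Step (3): lieutenant i takes the majority of its own and the relayed values.
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


loyal_commander = om(1, 0, [1, 2, 3], ATTACK, traitors={3})
traitor_commander = om(1, 0, [1, 2, 3], ATTACK, traitors={0})
print(loyal_commander, traitor_commander)
```

With four generals and one traitor, the loyal lieutenants obtain the commander's value when he is loyal (IC2) and agree with each other when he is not (IC1).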




Signed message

• If the traitors’ ability to lie can be restricted, an algorithm exists to cope with m traitors for any number (≥ m+2) of generals

• A4 (Additional assumption):

• (a) A loyal general’s signature cannot be forged, and any alteration of the contents of his signed messages can be detected

• (b) Anyone can verify the authenticity of a general’s signature

• (No assumption is made about a traitorous general’s signature. His signature is allowed to be forged by another traitor, thereby permitting collusion among the traitors)

• The commander sends a signed order to each of his lieutenants

• Each lieutenant then adds his signature to that order and sends it to the other lieutenants, who add their signatures and send it to others, and so on

• Let x:i denote the value x signed by General i. Thus, x:i:j denotes the value x signed by i, and then that value x:i signed by j

• Let General 0 be the commander

• Each lieutenant i maintains a set Vi of properly signed orders he has received so far



Signed message algorithm SM(m)

• Initially Vi = ∅

• (1) The commander signs and sends his value to every lieutenant

• (2) For each i:

– (A) If Lieutenant i receives a message of the form v:0 from the commander and he has not yet received any order, then

• (i) he lets Vi equal {v};

• (ii) he sends the message v:0:i to every other lieutenant

– (B) If Lieutenant i receives a message of the form v:0:j1: … :jk and v is not in the set Vi, then

• (i) he adds v to Vi;

• (ii) if k<m, then he sends the message v:0:j1: … :jk:i to every lieutenant other than j1, … , jk

• (3) For each i: When Lieutenant i will receive no more messages, he obeys the order choice (Vi)



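[Figure: two examples with a Commander and Lieutenants 1 and 2. In the first, the Commander sends “attack”:0 to Lieutenant 1 and “retreat”:0 to Lieutenant 2, and the lieutenants exchange “attack”:0:1 and “retreat”:0:2. Lieutenants 1 and 2 obey the order choice({“attack”, “retreat”}) and know the commander is a traitor because of his signature on two different orders. In the second, the Commander sends “attack”:0 to both lieutenants, with messages “attack”:0:1 and “attack”:0:x between them. Lieutenant 1 obeys the order choice({“attack”})]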
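
SM(m) can be sketched as a round-by-round simulation. The code below makes several simplifying assumptions that are not in the slides: rounds are synchronous, m ≥ 1, traitorous lieutenants simply stay silent, signatures are modelled as a tuple of signer ids that nobody tampers with, and `choice` returns the single value if there is exactly one and “retreat” otherwise.

```python
def choice(values):
    # Hypothetical choice function: the unique value, else the default.
    return next(iter(values)) if len(values) == 1 else "retreat"


def sm(m, n, orders, traitors):
    """Simulate SM(m) for generals 0..n-1, where general 0 is the commander.

    orders[i] is the value the commander signs and sends to lieutenant i.
    Returns the order obeyed by each loyal lieutenant.
    """
    lieutenants = range(1, n)
    V = {i: set() for i in lieutenants}
    # Step (1): messages in flight as (receiver, value, signature chain).
    inbox = [(i, orders[i], (0,)) for i in lieutenants]
    while inbox:
        next_round = []
        for i, v, chain in inbox:
            if i in traitors or v in V[i]:
                continue                      # silent traitor, or value already known
            V[i].add(v)                       # step (2)(A)(i) / (2)(B)(i)
            k = len(chain) - 1                # lieutenant signatures so far
            if k < m:                         # step (2)(A)(ii) / (2)(B)(ii)
                next_round += [(j, v, chain + (i,)) for j in lieutenants
                               if j != i and j not in chain]
        inbox = next_round
    # Step (3): no more messages will arrive, so each loyal lieutenant decides.
    return {i: choice(V[i]) for i in lieutenants if i not in traitors}


print(sm(1, 3, {1: "attack", 2: "retreat"}, traitors={0}))
print(sm(1, 3, {1: "attack", 2: "attack"}, traitors={2}))
```

In the first call the traitorous commander signs two different orders, both lieutenants end up with the same set, and so they make the same choice.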




Concurrency Control

• Even if mutual exclusion is realized at a basic action level of processes, an inconsistent state may appear when two or more processes try to access the same database

• Example: Two clients A and B may wish to send $10 and $20, respectively, to a common account independently of one another:

Process A

A1: READ BALANCE, ADD $10

A2: WRITE BALANCE

Process B

B1: READ BALANCE, ADD $20

B2: WRITE BALANCE

• Making READ and WRITE individually atomic is not enough! What will happen if the commands are executed in the order A1, B1, A2, B2, for example?

• Transaction: a sequence of READ and WRITE commands sent by a client to the file system

• Concurrency control (Serializability control): Executing multiple transactions that occur simultaneously as serializable atomic actions
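
The question about the order A1, B1, A2, B2 can be worked through with a tiny simulation. The starting balance of 100 and the helper name `run` are assumptions made for the illustration.

```python
def run(schedule, balance=100):
    """Execute the steps A1, A2, B1, B2 in the given order; return the final balance."""
    amount = {"A": 10, "B": 20}
    local = {}  # each client's private copy after READ BALANCE and ADD
    for step in schedule:
        client, number = step[0], step[1]
        if number == "1":
            local[client] = balance + amount[client]   # READ BALANCE, ADD
        else:
            balance = local[client]                    # WRITE BALANCE
    return balance


serial = run(["A1", "A2", "B1", "B2"])
interleaved = run(["A1", "B1", "A2", "B2"])
print(serial, interleaved)
```

In the interleaved order both clients read the same old balance, so the write by B overwrites the write by A and the $10 sent by A is lost.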



Serializability

[Figure: three schedules of two transactions. One transaction moves 10 from X to Y (READ X; X=X-10; WRITE X; READ Y; Y=Y+10; WRITE Y) and the other moves 20 from Y to Z (READ Y; Y=Y-20; WRITE Y; READ Z; Z=Z+20; WRITE Z). The schedules are a serial execution (T1, T2), a serializable execution (T3, T4) and a non-serializable execution (T5, T6)]

• X=20, Y=40, Z=60

• X+Y+Z is preserved




2-phase locking

• A lock is an access privilege on a data item, which is granted to a particular transaction so that only one transaction can access the data item at a time

• When a transaction tries to access a data item, it must lock the item before accessing it, and unlock it on finishing the access

• In order for 2-phase locking to guarantee consistency, each transaction

– does not lock a data item that has already been locked

– locks a data item before accessing it

– unlocks all the data items before finishing the transaction

– once having unlocked a data item, does not acquire any more locks

• Each transaction is divided into two phases, i.e. a growing phase and a shrinking phase. The number of locked items increases monotonically in the growing phase and decreases monotonically in the shrinking phase

• 2-phase locking makes all the transactions serializable, i.e. atomic actions!
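
The rules above can be checked mechanically for the lock and unlock requests of a single transaction. The sketch below is a hypothetical checker, not a lock manager: it does not model READ or WRITE accesses or contention between transactions.

```python
def is_two_phase(ops):
    """ops: list of ("lock" | "unlock", item) issued by one transaction."""
    held = set()
    shrinking = False
    for action, item in ops:
        if action == "lock":
            if shrinking or item in held:
                return False     # lock after an unlock, or item already locked
            held.add(item)
        else:
            if item not in held:
                return False     # unlock of an item that is not held
            held.remove(item)
            shrinking = True     # the shrinking phase has begun
    return not held              # everything is unlocked before finishing


print(is_two_phase([("lock", "X"), ("lock", "Y"), ("unlock", "X"), ("unlock", "Y")]))
print(is_two_phase([("lock", "X"), ("unlock", "X"), ("lock", "Y"), ("unlock", "Y")]))
```

The second sequence fails because it acquires Y after releasing X, which mixes the growing and shrinking phases.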




Timestamp

• Every transaction is given a timestamp when it occurs

• Every request for accessing a data item is given its transaction’s timestamp

• If there is a conflict among requests for accessing a data item, the earliest one is granted according to the order of timestamps

• Algorithm for the scheduler at each site:

– For each data item X, the scheduler records the largest timestamp W(X) of WRITE requests and the largest timestamp R(X) of READ requests that have been processed

– For a READ request with timestamp T, if T<W(X), the scheduler rejects the READ request. Otherwise, it outputs the READ request and sets R(X) to MAX(R(X), T).

– For a WRITE request with timestamp T, if T<MAX(R(X), W(X)), the scheduler rejects the WRITE request. Otherwise, it outputs the WRITE request and sets W(X) to T

• If a READ request or WRITE request is rejected, the requesting transaction is aborted, assigned a new larger timestamp and restarted
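
The scheduler rules above translate almost line by line into code. The following is a minimal sketch for a single site; the class name is hypothetical, timestamps are assumed to be positive integers, and aborting and restarting a rejected transaction is left to the caller.

```python
class TimestampScheduler:
    def __init__(self):
        self.R = {}  # R(X): largest READ timestamp processed per data item
        self.W = {}  # W(X): largest WRITE timestamp processed per data item

    def read(self, x, t):
        if t < self.W.get(x, 0):
            return False                         # reject: a later WRITE was processed
        self.R[x] = max(self.R.get(x, 0), t)
        return True

    def write(self, x, t):
        if t < max(self.R.get(x, 0), self.W.get(x, 0)):
            return False                         # reject: a later READ or WRITE was processed
        self.W[x] = t
        return True


s = TimestampScheduler()
print(s.write("X", 5), s.read("X", 3), s.read("X", 7), s.write("X", 6))
```

The READ with timestamp 3 arrives too late to see the state before the WRITE at 5, and the WRITE at 6 would invalidate the READ already granted at 7, so both are rejected.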




Multiple Copy Update

• Multiple copies of a complete database distributed for higher reliability/availability requirements must be kept consistent

– Commit: make all the updates made by a transaction permanent

– Abort: roll back (or undo) a transaction to ensure that no effect of the transaction remains in the database

• Commit control for replicated database:

– Ensures that a transaction is either committed by every site or aborted by every site (all or nothing)

– Involves a) commit control for a single transaction, and b) serialization of concurrent transactions



2-phase commit protocol

• Given a designated coordinator node, the commit control for a single transaction can be realized by the following 2-phase commit protocol:

– Commit-request phase: The coordinator node sends a query-to-commit message to all the other nodes. Each node replies to the coordinator with an agree-to-commit message if the transaction succeeded, or an abort message if the transaction failed

– Commit phase: If the coordinator receives “agree to commit” from all the other nodes, it sends them a “commit” message, otherwise it sends a “rollback” message to all the nodes

• Access control of replicated database including serialization of concurrent transactions

– L. Svobodova: “Attaining resilience in distributed systems”, Chapter 5 of Dependability of Resilient Computers (Ed. by T. Anderson), BSP Professional Books, 1989
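
The coordinator's side of the two phases described above can be sketched as a single function. This is an illustration under strong assumptions: nodes are modelled as callables that answer the query-to-commit, and node failures, timeouts and message loss are not handled. The names are hypothetical.

```python
def two_phase_commit(participants):
    """participants: dict name -> callable returning True (agree) or False (abort)."""
    # Commit-request phase: query every node and collect its reply.
    votes = {name: bool(vote()) for name, vote in participants.items()}
    # Commit phase: commit only if every node agreed, otherwise roll back.
    decision = "commit" if all(votes.values()) else "rollback"
    # The same decision is sent to every node (all or nothing).
    return decision, {name: decision for name in participants}


print(two_phase_commit({"n1": lambda: True, "n2": lambda: True}))
print(two_phase_commit({"n1": lambda: True, "n2": lambda: False}))
```

A single abort reply is enough to roll the transaction back at every node.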




Error recovery

• When a transient fault or a process abort occurs, affected processes are rolled back to a point (checkpoint) prior to the occurrence of the fault

• Checkpointing: recording a snapshot of the entire state of a process at a moment that is needed to restart the process from that point

• CA: checkpoint for process A, CB: checkpoint for process B

• If the communication line intersects the line that links CA and CB, there will be an inconsistency in the system state when the failed process is rolled back to the checkpoint CA

[Figure: timelines of Process A and Process B with checkpoints CA and CB, a communication between the two processes, and a failure (X)]
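
Whether a communication intersects the line linking CA and CB can be expressed as a simple comparison. The sketch below assumes, purely for illustration, that send, receive and checkpoint instants can be placed on one common time axis; the function name and parameters are hypothetical.

```python
def crosses_recovery_line(send_time, recv_time, sender_ckpt, receiver_ckpt):
    """True if a message lies on different sides of the two checkpoints."""
    sent_before = send_time < sender_ckpt
    received_before = recv_time < receiver_ckpt
    # Consistent: both before or both after. Inconsistent: one of each.
    return sent_before != received_before


# A sends at time 5 (after CA at 4); B receives at 6 (before CB at 8).
print(crosses_recovery_line(5, 6, sender_ckpt=4, receiver_ckpt=8))
# Both the send and the receipt happen before the checkpoints.
print(crosses_recovery_line(1, 2, sender_ckpt=4, receiver_ckpt=8))
```

In the first case, rolling A back to CA would undo the send while B's checkpoint CB still records the receipt.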




Domino effects

• If processes establish their checkpoints independently of each other, the domino effect can occur

[Figure: timelines of Processes A, B and C with starting states SA, SB and SC, checkpoints CA1, CA2, CB1, CB2, CC1 and CC2, and a failure (X); a second panel shows the same timelines with an additional checkpoint CB3]

• A recovery line is created if CB3 is additionally established!
