Distributed systems
Takashi Nanya
Canon, Inc.
March 18–22, 2013
IIITDM-Jabalpur
“Dependable Computing”
7/30/2019 Distributed System.pdf
http://slidepdf.com/reader/full/distributed-systempdf 1/29
Issues in distributed systems
• Distributed Systems
– Include an arbitrary number of system processes and user processes
– Consist of two or more processor/memory modules
– Processes communicate with each other by message passing (with no shared memory)
– Global control for inter-process communication and system management
– Variable delays exist in communication between processes
• Purpose: Fault tolerance, Performance enhancement, Extensibility, Resource sharing
• All information describing the global system state (process state and data) must be maintained so that all participating processes have a consistent and identical view of the global state
• Issues
– Clock Synchronization, Mutual Exclusion, Concurrency control, Multiple copy update, Error recovery
Clock Synchronization
• Distributed systems are asynchronous in nature
• Variable delays in computing and communication
• Possible inconsistency in processes recognizing the temporal ordering of event occurrences
• P2 perceives “P1 -> P2”, while P3 perceives “P2 -> P1”!
[Figure: processes P1, P2, P3]
Atomic action (1)
• Main problem in Distributed Systems: Maintaining Consistency
• Basic concept for solution: Atomic Action
• To realize the Atomic Action (consistency control), processes need to have a common agreement on the following:
– Temporal ordering of event occurrences in the system
– Global system state and state transition
• Possibility that the agreement may be impaired due to delays in inter-process communication and faults in nodes and/or links
=> Clock synchronization, Byzantine agreement
Atomic action (2)
• A method of process structuring for allowing the writer of a procedure to secure the same benefit of atomicity, i.e. indivisibility, non-interference, strict sequencing
• Basic notion to solve consistency problems in distributed systems
• Generalized notion of transactions for database concurrency control
• Several definitions:
• 1) An action is atomic if the process performing it is not aware of the existence of any other active process (can detect no spontaneous state change) and no other process is aware of the activity of this process (its state changes are concealed) during the time the process is performing the action
• 2) An action is atomic if the process performing it does not communicate with other processes while it is executing the action
• 3) Actions are atomic if they can be considered, so far as other processes are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent
Nested structure of Atomic Actions
• Defined relatively at any level of process structure
• An atomic action at a level consists of two or more atomic actions at a lower level
[Figure: processes P1, P2, P3 with nested atomic actions A, B, C, D, E, F, G]
Logical Clock (1)
L. Lamport: “Time, clocks, and the ordering of events in a distributed system”, C. ACM, Vol.21, No.7, pp.558–565 (1978)
• System: a collection of distinct processes, each of which consists of a sequence of events
• Event a “happened before” event b (denoted by a -> b):
– If a and b are in the same process, and a comes before b, then a->b
– If a is the sending of a message by one process and b is the receipt of the same message by another process, then a->b
– If a->b and b->c, then a->c
• (Logical) Clock Ci for each process Pi is defined to be a function which assigns a number Ci(a) to any event a in that process
• (Correct) Clock condition: For any events a, b, if a->b then C(a)<C(b)
• The clock condition is satisfied if the following two conditions hold:
• C1: If a and b are in process Pi, and a comes before b, then Ci(a)<Ci(b)
• C2: If a is the sending of a message by process Pi, and b is the receipt of the same message by process Pj, then Ci(a)<Cj(b)
Logical Clock (2)
• Assume that the processes are algorithms and the events represent certain actions during their execution. Process Pi’s clock is represented by a register Ci, so that Ci(a) is the value contained by Ci during the event a
• Conditions C1 and C2, and therefore the Clock Condition, are satisfied if the following implementation rules are satisfied:
• IR1: Each process Pi increments Ci between any two successive events
• IR2: (a) If event a is the sending of a message m by process Pi, then message m contains a timestamp Tm = Ci(a). (b) Upon receiving a message m, process Pj sets Cj greater than or equal to its present value and greater than Tm
• Hence, the simple implementation rules guarantee a correct system of logical clocks
• A system of clocks satisfying the Clock Condition can be used to place a total ordering on the set of all system events
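Rules IR1 and IR2 can be sketched as a small class. This is a minimal illustration, not code from the lecture: the class name LamportClock and its method names are assumptions, and the receive rule picks max(Cj, Tm) + 1 as one convenient value that is both at least the present value and greater than Tm.

```python
class LamportClock:
    """Logical clock register Ci for one process (hypothetical helper)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # IR1: increment the clock between any two successive events.
        self.time += 1
        return self.time

    def send(self):
        # IR2(a): sending is an event; its timestamp Tm travels with the message.
        return self.tick()

    def receive(self, tm):
        # IR2(b): move to a value >= the present value and > Tm.
        self.time = max(self.time, tm) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
p1.tick()                # a local event in P1
tm = p1.send()           # event a: P1 sends m with timestamp Tm
cb = p2.receive(tm)      # event b: P2 receives m
print(tm, cb)            # Tm is smaller than the receive timestamp
```

Because every receive timestamp exceeds the timestamp carried by the message, condition C2 holds, and tick() alone gives C1.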
Total ordering of events
• Define a relation “=>” as follows:
• “If a is an event in process Pi and b is an event in process Pj, then a=>b if and only if either (i) Ci(a)<Cj(b) or (ii) Ci(a)=Cj(b) and Pi<<Pj, where << is any arbitrary total ordering of processes”
• Then, relation “=>” is a total ordering, i.e. relation “=>” is a way of completing the “happened before” partial ordering to a total ordering
• Total ordering is useful in implementing a distributed system!
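As a small illustration of the relation “=>”, an event can be represented as a pair (clock value, process name). The sketch assumes that the arbitrary ordering << is plain string comparison of process names; the function name precedes is hypothetical.

```python
def precedes(a, b):
    """a => b for events given as (clock_value, process_name) pairs."""
    # Tuple comparison checks the clock value first (case i) and falls back
    # to the process ordering << only on a tie (case ii).
    return a < b


events = [(2, "P2"), (1, "P3"), (2, "P1")]
ordered = sorted(events)   # sorting by "=>" yields one total order
print(ordered)
```

Any two distinct events are comparable, which is what makes “=>” total while “->” is only partial.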
Mutual Exclusion Problem
• Find an algorithm for granting the single shared resource to a process which satisfies the following three conditions:
• (I) A process which has been granted the resource must release it before it can be granted to another process.
• (II) Different requests for the resource must be granted in the order in which they are made.
• (III) If every process which is granted the resource eventually releases it, then every request is eventually granted
• This is a non-trivial problem. A central scheduling process will not work!
Distributed algorithm for M.E.
1. To request the resource, process Pi sends the message Tm:Pi requests resource to every other process, and puts that message on its request queue, where Tm is the timestamp of the message
2. When process Pj receives the message Tm:Pi requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi
3. To release the resource, process Pi removes any Tm:Pi requests resource message from its request queue, and sends a (timestamped) Pi releases resource message to every other process
4. When process Pj receives a Pi releases resource message, it removes any Tm:Pi requests resource message from its request queue
5. Process Pi is granted the resource when the following two conditions are satisfied: (i) There is a Tm:Pi requests resource message in its request queue which is ordered before any other request in its queue by the relation “=>” (ii) Pi has received a message from every other process timestamped later than Tm
Note that conditions (i) and (ii) of rule 5 are tested locally by Pi
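The local test of rule 5 can be sketched as follows. This covers only the grant check, not the message exchange of rules 1 to 4. The function name is_granted and the data layout (requests as (Tm, process id) pairs, plus a table of the latest timestamp received from each other process) are assumptions made for illustration.

```python
def is_granted(me, request_queue, latest_from):
    """Rule 5, evaluated locally by process `me`.

    request_queue: set of (Tm, pid) request messages known to `me`
    latest_from:   {pid: largest timestamp received from pid}, one entry
                   per other process
    """
    mine = [r for r in request_queue if r[1] == me]
    if not mine:
        return False
    my_request = min(mine)
    # (i) own request is ordered before every other request by "=>".
    is_first = my_request == min(request_queue)
    # (ii) a message timestamped later than Tm arrived from every other process.
    heard_all = all(ts > my_request[0] for ts in latest_from.values())
    return is_first and heard_all


queue = {(2, 1), (3, 2)}                      # P1 asked at 2, P2 asked at 3
print(is_granted(1, queue, {2: 4, 3: 5}))     # P1 is first and has heard from all
print(is_granted(2, queue, {1: 5, 3: 5}))     # P2 must wait for P1
```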
Anomalous behavior
• Logical clock based on the relation “=>” may cause “anomalous behavior”
• This can happen because the system has no way of knowing the actual precedence information a->b that is based on the phone message external to the system
• => we need a system of physical clocks
[Figure: Computers A, B, and C. Event a at Computer A issues request TA:ReqA, event b at Computer B issues request TB:ReqB, and a phone call links a to b. TB<TA can happen while actually a->b]
Physical clock
• Ci(t) denotes the reading of clock Ci at physical time t
• For Ci to be a true physical clock, the following must be satisfied:
– PC1: There exists a constant κ << 1 such that for all i, |dCi(t)/dt − 1| < κ
– PC2: There exists a constant ε such that for all i, j, |Ci(t) − Cj(t)| < ε
• To prevent anomalous behavior, for a number µ that is less than the shortest transmission time for interprocess messages, it must be ensured that, for any i, j, Ci(t+µ) − Cj(t) > 0
• Combining the above with PC1 implies that Ci(t+µ) − Ci(t) > (1−κ)µ
• Using PC2, Ci(t+µ) − Cj(t) = [Ci(t+µ) − Ci(t)] + [Ci(t) − Cj(t)] > (1−κ)µ − ε, so Ci(t+µ) − Cj(t) > 0 holds if ε/(1−κ) ≤ µ
• Let m be a message sent at physical time t and received at t’, and let the minimum transmission delay µm for m be known to the process that receives m
• Assuming PC1, PC2 can be ensured by the following Implementation Rules:
– IR1’: For each i, if Pi does not receive a message at physical time t, then Ci is differentiable at t and dCi(t)/dt > 0
– IR2’: (a) If Pi sends a message m at physical time t, then m contains a timestamp Tm = Ci(t). (b) Upon receiving a message m at time t’, process Pj sets Cj(t’) equal to MAX{Cj(t’−0), Tm+µm}
• To synchronize physical clocks, a process only needs to know its own clock reading and the timestamps of messages it receives
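Rule IR2’(b) is a single max operation, sketched below. The function name on_receive and the numeric readings are made up for illustration; local_reading stands for Cj(t’−0), the clock value just before the message arrives.

```python
def on_receive(local_reading, tm, mu_m):
    """IR2'(b): clock value of Pj right after receiving message m.

    local_reading: Cj just before the receipt
    tm:            timestamp Tm carried by the message
    mu_m:          known minimum transmission delay of m
    """
    # The clock never moves backwards; it jumps forward only when the
    # sender's timestamp plus the minimum delay is ahead of it.
    return max(local_reading, tm + mu_m)


print(on_receive(10.0, 12.0, 0.5))   # receiver was behind: jumps to Tm + mu_m
print(on_receive(20.0, 12.0, 0.5))   # receiver was ahead: keeps its reading
```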
Byzantine Generals Problem
L. Lamport, et al: “The Byzantine generals problem”, ACM Trans. Prog. Lang. Syst., Vol.4, No.3, pp.382-401 (1982)
• The problem of coping with a situation in which one or more faulty components of a system send conflicting information to different parts of the system
• A group of generals of the Byzantine army is camped with their troops around an enemy city
• Communicating with one another only by messenger, the generals must agree upon a common battle plan
• However, some of the generals may be traitors trying to prevent the loyal generals from reaching agreement
• [Byzantine Generals Problem]: A commanding general must send an order to his n-1 lieutenant generals such that
• IC1: All loyal lieutenants obey the same order
• IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends
Impossibility for n<3m+1
• With oral messages, no solution for fewer than 3m+1 generals can cope with m traitors
• There is no way for Lieutenant 1 to distinguish between the two scenarios
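[Figure: two scenarios with a Commander, Lieutenant 1, and Lieutenant 2, one of whom is marked as the traitor. In one scenario the commander sends “attack” to both lieutenants; in the other he sends “attack” to Lieutenant 1 and “retreat” to Lieutenant 2. In both, Lieutenant 2 tells Lieutenant 1 “he said ‘retreat’”]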
Oral message for n≥3m+1
• An oral message is one whose contents are completely under the control of the sender, so a traitorous sender can transmit any possible message
• Assumptions
• A1: Every message that is sent is delivered correctly
• A2: The receiver of a message knows who sent it
• A3: The absence of a message can be detected
• We inductively define the Oral Message algorithm OM(m), for all non-negative integers m, by which a commander sends an order to n-1 lieutenants
• OM(m) solves the Byzantine Generals Problem for 3m+1 or more generals in the presence of at most m traitors
• We consider the case in which the only possible decisions are “attack” or “retreat”
• The algorithm is described in terms of the lieutenants “obtaining a value” rather than “obeying an order”
Oral Message algorithm OM(m)
• Algorithm OM(0)
• (1) The commander sends his value to every lieutenant
• (2) Each lieutenant uses the value he receives from the commander, or uses the value RETREAT if he receives no value
• Algorithm OM(m), m>0
• (1) The commander sends his value to every lieutenant
• (2) For each i, let vi be the value Lieutenant i receives from the commander, or else be RETREAT if he receives no value. Lieutenant i acts as the commander in OM(m-1) to send the value vi to each of the n-2 other lieutenants
• (3) For each i, and each j≠i, let vj be the value Lieutenant i received from Lieutenant j in step (2) (using OM(m-1)), or else RETREAT if he received no such value. Lieutenant i uses the value majority(v1, v2, …, vn-1)
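The recursion of OM(m) can be simulated in a few lines. This is a sketch under stated assumptions: the function names, the `lie` callback that models what a traitor sends, and the convention that majority falls back to RETREAT when no strict majority exists are all illustrative choices. Every message is assumed to be delivered (A1 to A3), so the "no value received" case does not arise.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    """Strict-majority value, or RETREAT when there is none (assumed default)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else RETREAT


def om(m, commander, lieutenants, value, traitors, lie):
    """Run OM(m); return {lieutenant: value obtained}."""
    # Step 1: the commander sends a value to every lieutenant.
    # A traitorous commander sends whatever lie(sender, receiver, value) returns.
    received = {
        i: lie(commander, i, value) if commander in traitors else value
        for i in lieutenants
    }
    if m == 0:
        return received
    # Step 2: each lieutenant j acts as commander in OM(m-1) for the others.
    relayed = {
        j: om(m - 1, j, [i for i in lieutenants if i != j],
              received[j], traitors, lie)
        for j in lieutenants
    }
    # Step 3: each lieutenant takes the majority of its own and relayed values.
    return {
        i: majority([received[i]] +
                    [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


# Four generals, one traitor (Lieutenant 3), loyal commander 0 orders ATTACK.
result = om(1, 0, [1, 2, 3], ATTACK, {3}, lambda s, r, v: RETREAT)
print(result[1], result[2])
```

With n = 4 and m = 1 the loyal lieutenants 1 and 2 both obtain the commander's value, as IC2 requires. The values "obtained" by a traitor are computed too but carry no meaning.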
[Figure: Algorithm OM(1) with a Commander and Lieutenants 1, 2, 3, shown for two cases, with the traitor marked. In the case where the commander sends v: Lieutenant 2 obtains the correct value v = majority(v, v, x). In the case where the commander sends x, y, z: All lieutenants obtain the same value majority(x, y, z)]
Signed message
• If the traitors’ ability to lie can be restricted, an algorithm exists to cope with m traitors for any number (≥ m+2) of generals
• A4 (Additional assumption):
• (a) A loyal general’s signature cannot be forged, and any alteration of the contents of his signed messages can be detected
• (b) Anyone can verify the authenticity of a general’s signature
• (No assumption is made about a traitorous general’s signature. His signature is allowed to be forged by another traitor, thereby permitting collusion among the traitors)
• The commander sends a signed order to each of his lieutenants
• Each lieutenant then adds his signature to that order and sends it to the other lieutenants, who add their signatures and send it to others, and so on
• Let x:i denote the value x signed by General i. Thus, x:i:j denotes the value x signed by i, and then that value x:i signed by j
• Let General 0 be the commander
• Each lieutenant i maintains a set Vi of properly signed orders he has received so far
Signed message algorithm SM(m)
• Initially Vi = ∅
• (1) The commander signs and sends his value to every lieutenant
• (2) For each i:
– (A) If Lieutenant i receives a message of the form v:0 from the commander and he has not yet received any order, then
• (i) he lets Vi equal {v};
• (ii) he sends the message v:0:i to every other lieutenant
– (B) If Lieutenant i receives a message of the form v:0:j1: … :jk and v is not in the set Vi, then
• (i) he adds v to Vi;
• (ii) if k<m, then he sends the message v:0:j1: … :jk:i to every lieutenant other than j1, … , jk
• (3) For each i: When Lieutenant i will receive no more messages, he obeys the order choice(Vi)
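A rough simulation of SM(m) for m ≥ 1 is sketched below. Several simplifications are assumptions of the sketch, not part of the lecture: a signed message v:0:j1:…:jk is modeled as a value plus a tuple of signer ids, traitorous lieutenants simply stay silent, a traitorous commander is modeled by handing different orders to different lieutenants, and choice returns the single element of V or "retreat" otherwise.

```python
from collections import deque


def choice(values):
    """Assumed choice function: the only order seen, else 'retreat'."""
    return next(iter(values)) if len(values) == 1 else "retreat"


def sm(m, n, orders, traitors=frozenset()):
    """Simulate SM(m) with generals 0..n-1, general 0 being the commander.

    orders: {lieutenant: value the commander signed and sent to him}
    Returns {loyal lieutenant: order obeyed}.
    """
    V = {i: set() for i in range(1, n) if i not in traitors}
    # Step 1: messages of the form v:0, one per lieutenant.
    queue = deque((i, v, (0,)) for i, v in orders.items())
    while queue:
        i, v, chain = queue.popleft()
        if i in traitors or v in V[i]:
            continue          # silent traitor, or value already in Vi
        V[i].add(v)           # steps 2(A)(i) and 2(B)(i)
        k = len(chain) - 1    # number of lieutenant signatures so far
        if k < m:             # steps 2(A)(ii) and 2(B)(ii): sign and relay
            for j in range(1, n):
                if j != i and j not in chain:
                    queue.append((j, v, chain + (i,)))
    # Step 3: no more messages will arrive, so each lieutenant decides.
    return {i: choice(vs) for i, vs in V.items()}


# Traitorous commander sends different signed orders to two lieutenants.
print(sm(1, 3, {1: "attack", 2: "retreat"}, traitors={0}))
```

Both lieutenants end up with the same set {“attack”, “retreat”} and therefore make the same choice, which is the situation shown for three generals with a traitorous commander.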
[Figure: signed messages among a Commander, Lieutenant 1, and Lieutenant 2, shown for two cases. Case with messages “attack”:0, “attack”:0:1, “retreat”:0:2, “retreat”:0: Lieutenants 1 and 2 obey the order choice({“attack”, “retreat”}) and know the commander is a traitor because of his signature on two different orders. Case with messages “attack”:0, “attack”:0:1, “attack”:0:x, “attack”:0: Lieutenant 1 obeys the order choice({“attack”})]
Concurrency Control
• Even if mutual exclusion is realized at a basic action level of processes, an inconsistent state may appear when two or more processes try to access the same database
• Example: Two clients A and B may wish to send $10 and $20, respectively, to a common account independently of one another:
Process A
A1: READ BALANCE, ADD $10
A2: WRITE BALANCE
Process B
B1: READ BALANCE, ADD $20
B2: WRITE BALANCE
• Making READ and WRITE atomic actions individually is not enough! What will happen if the order in which the READ and WRITE commands are executed is, for example, A1, B1, A2, B2?
• Transaction: a sequence of READ and WRITE commands sent by a client to the file system
• Concurrency control (Serializability control): Executing multiple transactions that occur simultaneously as serializable atomic actions
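The question about the order A1, B1, A2, B2 can be answered by replaying the steps. The sketch below is illustrative only: the function name run, the starting balance of 0, and the step encoding are assumptions.

```python
ADD = {"A": 10, "B": 20}   # amount each client adds


def run(schedule, balance=0):
    """Replay steps such as 'A1' (READ and ADD) and 'A2' (WRITE)."""
    pending = {}           # each client's private copy after its READ
    for step in schedule:
        client, phase = step[0], step[1]
        if phase == "1":
            pending[client] = balance + ADD[client]   # READ BALANCE, ADD
        else:
            balance = pending[client]                 # WRITE BALANCE
    return balance


print(run(["A1", "A2", "B1", "B2"]))   # serial: both deposits arrive
print(run(["A1", "B1", "A2", "B2"]))   # interleaved: A's deposit is lost
```

In the interleaved order both clients read the same old balance, so B's write overwrites A's update even though each READ and WRITE is atomic on its own.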
[Figure: three schedules of two transactions each, titled Serial Execution (T1, T2), Serializable Execution (T3, T4), and Non-serializable Execution (T5, T6). In each schedule one transaction performs READ X, X=X-10, WRITE X, READ Y, Y=Y+10, WRITE Y and the other performs READ Y, Y=Y-20, WRITE Y, READ Z, Z=Z+20, WRITE Z]
Serializability: X=20, Y=40, Z=60; X+Y+Z is preserved
2-phase locking
• A lock is an access privilege on a data item, which is granted to a particular transaction so that only one transaction can access the data item at a time
• When a transaction tries to access a data item, it must lock the item before accessing it, and unlock it on finishing the access
• In order for the 2-phase locking to guarantee consistency, each transaction
– does not lock a data item that has already been locked
– locks a data item before accessing it
– unlocks all the data items before finishing the transaction
– once having unlocked a data item, does not acquire any more locks
• Each transaction is divided into two phases, i.e. growing phase and shrinking phase. The number of locked items increases monotonically in the growing phase and decreases monotonically in the shrinking phase
• The 2-phase locking makes all the transactions serializable, i.e. atomic actions!
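The two-phase rule can be enforced with one flag per transaction. The class below is a sketch, not a full lock manager: the name TwoPhaseTransaction, the shared dictionary used as a lock table, and the choice to return False instead of blocking when an item is held by someone else are all assumptions.

```python
class TwoPhaseTransaction:
    """Tracks the growing and shrinking phases of one transaction."""

    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table   # shared {item: owner name}
        self.held = set()
        self.shrinking = False         # becomes True at the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after an unlock")
        owner = self.lock_table.get(item)
        if owner is not None and owner != self.name:
            return False               # held by another transaction: must wait
        self.lock_table[item] = self.name
        self.held.add(item)
        return True

    def unlock(self, item):
        self.shrinking = True          # the growing phase is over for good
        self.held.discard(item)
        if self.lock_table.get(item) == self.name:
            del self.lock_table[item]


table = {}
t1 = TwoPhaseTransaction("T1", table)
t2 = TwoPhaseTransaction("T2", table)
print(t1.lock("X"), t2.lock("X"))      # T1 gets X, T2 has to wait
t1.unlock("X")
print(t2.lock("X"))                    # now T2 can lock X
```

After its first unlock, any further lock() call by T1 raises an error, which is the "no more locks after unlocking" condition.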
Timestamp
• Every transaction is given a timestamp when it occurs
• Every request for accessing a data item is given its transaction’s timestamp
• If there is a conflict among requests for accessing a data item, the earliest one is granted according to the order of timestamps
• Algorithm for the scheduler at each site:
– For each data item X, the scheduler records the largest timestamp W(X) of WRITE requests and the largest timestamp R(X) of READ requests that have been processed
– For a READ request with timestamp T, if T<W(X), the scheduler rejects the READ request. Otherwise, it outputs the READ request and sets R(X) to MAX(R(X), T)
– For a WRITE request with timestamp T, if T<MAX(R(X), W(X)), the scheduler rejects the WRITE request. Otherwise, it outputs the WRITE request and sets W(X) to T
• If a READ request or WRITE request is rejected, the requesting transaction is aborted, assigned a new larger timestamp, and restarted
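The scheduler rules translate almost directly into code. In this sketch the class name TimestampScheduler is hypothetical, timestamps are assumed to be positive integers (so 0 can stand for "no request processed yet"), and a rejected request is reported as False rather than triggering the abort and restart.

```python
class TimestampScheduler:
    """Per-site scheduler keeping R(X) and W(X) for each data item X."""

    def __init__(self):
        self.R = {}   # largest READ timestamp processed per item
        self.W = {}   # largest WRITE timestamp processed per item

    def read(self, x, t):
        if t < self.W.get(x, 0):
            return False                       # reject the READ
        self.R[x] = max(self.R.get(x, 0), t)   # R(X) = MAX(R(X), T)
        return True

    def write(self, x, t):
        if t < max(self.R.get(x, 0), self.W.get(x, 0)):
            return False                       # reject the WRITE
        self.W[x] = t                          # W(X) = T
        return True


s = TimestampScheduler()
print(s.write("X", 5))   # accepted, W(X) becomes 5
print(s.read("X", 3))    # rejected: an older transaction reads a newer value
print(s.read("X", 7))    # accepted, R(X) becomes 7
print(s.write("X", 6))   # rejected: a later READ has already been processed
```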
Multiple Copy Update
• Multiple copies of a complete database distributed for higher reliability/availability requirements must be kept consistent
– Commit: make all the updates made by a transaction permanent
– Abort: roll back (or undo) a transaction to ensure that no effect of the transaction remains in the database
• Commit control for replicated database:
– Ensures that a transaction is either committed by every site or aborted by every site (all or nothing)
– Involves a) commit control for a single transaction, and b) serialization of concurrent transactions
2-phase commit protocol
• Given a designated coordinator node, the commit control for a single transaction can be realized by the following 2-phase commit protocol:
– Commit-request phase: The coordinator node sends a query-to-commit message to all the other nodes. Each node replies to the coordinator with an agree-to-commit message if the transaction succeeded, or an abort message if the transaction failed
– Commit phase: If the coordinator receives “agree to commit” from all the other nodes, it sends them a “commit” message; otherwise it sends a “rollback” message to all the nodes
• Access control of replicated database including serialization of concurrent transactions
– L. Svobodova: “Attaining resilience in distributed systems”, Chapter 5 of Dependability of Resilient Computers (Ed. by T. Anderson), BSP Professional Books, 1989
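The coordinator's side of the two phases can be sketched as below. The sketch ignores timeouts, node crashes, and logging, which a real protocol has to handle. The function name two_phase_commit and the idea of modeling each node as a callable that returns its vote are assumptions.

```python
def two_phase_commit(participants):
    """participants: {node name: callable returning True (agree to commit)
    or False (abort)}. Returns the message each node receives in phase 2."""
    # Commit-request phase: query every node and collect its reply.
    votes = {node: vote() for node, vote in participants.items()}
    # Commit phase: commit only if every node agreed, else roll back everywhere.
    decision = "commit" if all(votes.values()) else "rollback"
    return {node: decision for node in participants}


print(two_phase_commit({"n1": lambda: True, "n2": lambda: True}))
print(two_phase_commit({"n1": lambda: True, "n2": lambda: False}))
```

Every node receives the same decision, which gives the all-or-nothing outcome.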
Error recovery
• When a transient fault or a process abort occurs, affected processes are rolled back to a point (checkpoint) prior to the occurrence of the fault
• Checkpointing: recording, at some moment, a snapshot of the entire process state that is needed to restart the process from that point
• CA: checkpoint for process A; CB: checkpoint for process B
• If the communication line intersects the line that links CA and CB, there will be an inconsistency in the system state when the failed process is rolled back to the checkpoint CA
[Figure: timelines of Process A and Process B with checkpoints CA and CB, a communication between the processes, and a failure marked X]
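The "intersects the line" condition can be phrased as a comparison of event times against checkpoint times. In this sketch all times are hypothetical local times on each process's own timeline, and the function name crosses_checkpoint_line is an assumption.

```python
def crosses_checkpoint_line(send_time, recv_time,
                            sender_checkpoint, receiver_checkpoint):
    """True if a message intersects the line linking the two checkpoints.

    send_time and sender_checkpoint are on the sender's timeline;
    recv_time and receiver_checkpoint are on the receiver's timeline.
    """
    sent_before = send_time < sender_checkpoint
    received_before = recv_time < receiver_checkpoint
    # The message crosses the line when exactly one end lies before it,
    # so the two checkpoints disagree about whether the message happened.
    return sent_before != received_before


print(crosses_checkpoint_line(5, 3, 4, 6))   # sent after CA, received before CB
print(crosses_checkpoint_line(1, 2, 4, 6))   # entirely before both checkpoints
```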
Domino effects
• If processes establish their checkpoints independently of each other, the domino effect will occur
[Figure: two timeline diagrams of Process A, Process B, and Process C, starting at SA, SB, SC, with checkpoints CA1, CA2, CB1, CB2, CC1, CC2 and a failure marked X; the second diagram adds checkpoint CB3]
A recovery line is created if CB3 is additionally established!