1/33 Kyung Hee University Fault Tolerance Chap 7

Jan 21, 2016


Colin Chambers
Transcript
Page 1: Kyung Hee University 1/33 Fault Tolerance Chap 7.

1/33

Kyung Hee University

Fault Tolerance

Chap 7

Page 2: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Index

Introduction to Fault Tolerance
Process Resilience
Reliable Client-Server Communication
Reliable Group Communication
Distributed Commit
Recovery
Summary

Page 3: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Basic Concepts

Dependable systems:

Availability: the property that a system is ready to be used immediately

Reliability: the property that a system can run continuously without failure

Safety: if a system temporarily fails to operate correctly, nothing catastrophic happens

Maintainability: how easily a failed system can be repaired

Page 4: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Failure Models

Different types of failures.

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times

Page 5: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Failure Masking by Redundancy

Three kinds of redundancy can be used for masking faults: information redundancy, time redundancy, and physical redundancy.

Triple modular redundancy.
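
The voting step behind triple modular redundancy can be sketched in a few lines. This is a minimal illustration, not the circuit from the slide's figure: the names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the replicas are modelled as plain functions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None."""
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) / 2 else None

def tmr(replicas, x):
    """Feed the same input to every replica and let a voter pick the result."""
    return majority_vote([replica(x) for replica in replicas])

# Hypothetical replicas: two correct copies and one that always fails.
def good(x):
    return x * 2

def faulty(x):
    return -1

masked = tmr([good, good, faulty], 21)   # the single fault is outvoted
```

With three replicas, one faulty output is masked; two simultaneous faults are not.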

Page 6: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Flat groups versus Hierarchical groups

a) Communication in a flat group.
b) Communication in a simple hierarchical group.

Page 7: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Failure Masking and Replication

Failures can be masked by having a group of identical processes

A group of processes can be organized in a hierarchical fashion, with a primary and one or more backups

An important issue with using process groups to tolerate faults is how much replication is needed

Page 8: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Agreement in Faulty Systems (1)

Figure: the two-army problem. A Red troop of 5000 faces two Blue troops of 3000 each, one commanded by Napoleon and one by Alexander; both Blue troops are labeled "Attack".

It is easy to show that Alexander and Napoleon will never reach agreement, no matter how many acknowledgements they send, because communication is unreliable.

Page 9: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Agreement in Faulty Systems (2)

The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldiers).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

Page 10: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Agreement in Faulty Systems (3)

The same as in previous slide, except now with 2 loyal generals and one traitor.

Lamport proved that in a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present, for a total of 3m+1.
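
The two scenarios (3 loyal generals plus 1 traitor, then 2 loyal plus 1 traitor) can be replayed with a small simulation. This is a sketch under stated assumptions: the function name `byzantine_agreement` is hypothetical, a traitor is modelled as sending a different bogus value in every message, and `None` stands for an entry on which no majority exists.

```python
from collections import Counter
from itertools import count

def byzantine_agreement(strengths, traitors):
    """Return, for each loyal general, the vector it decides on."""
    n = len(strengths)
    lie = count()  # assumption: every traitor message carries a fresh bogus value

    def sent(sender, true_value):
        return f"lie{next(lie)}" if sender in traitors else true_value

    # Steps 1-2: everyone announces its strength; general j assembles a vector.
    vectors = [
        [strengths[i] if i == j else sent(i, strengths[i]) for i in range(n)]
        for j in range(n)
    ]

    results = {}
    for j in range(n):
        if j in traitors:
            continue
        # Step 3: general j receives the vector of every other general.
        received = [
            [sent(k, vectors[k][i]) for i in range(n)]
            for k in range(n) if k != j
        ]
        # Step 4: take the majority per element, or None if there is none.
        decided = []
        for i in range(n):
            value, votes = Counter(vec[i] for vec in received).most_common(1)[0]
            decided.append(value if votes > len(received) / 2 else None)
        results[j] = decided
    return results

four = byzantine_agreement([1, 2, 3, 4], traitors={2})   # 3m+1 generals, m = 1
three = byzantine_agreement([1, 2, 3], traitors={2})     # too few generals
```

In `four`, every loyal general ends up with the same vector and only the traitor's entry is undecided; in `three`, no entry reaches a majority, which is consistent with the 3m+1 bound.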

Page 11: Kyung Hee University 1/33 Fault Tolerance Chap 7.

RPC Semantics in the Presence of Failures

1. The client is unable to locate the server

2. The request message from the client to the server is lost

3. The server crashes after receiving a request

4. The reply message from the server to the client is lost

5. The client crashes after sending a request

Page 12: Kyung Hee University 1/33 Fault Tolerance Chap 7.

RPC Semantics in the Presence of Failures

1. The client is unable to locate the server
Solution: raise an exception (as with divide by zero), although this destroys transparency

2. The request message from the client to the server is lost
Solution: start a timer when sending the request; if the timer expires before a reply arrives, the request message is sent again
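
The timer-based retransmission for a lost request could look roughly like this. It is a sketch only: `send_request` is a hypothetical transport function that returns the reply, or `None` when the timer expired, and `make_lossy_transport` is a stand-in used to simulate message loss.

```python
def call_with_retransmission(send_request, request, max_attempts=3):
    """Resend `request` until a reply arrives or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        reply = send_request(request)      # None models an expired timer
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no reply after {max_attempts} attempts")

def make_lossy_transport(losses):
    """Simulated transport that silently drops the first `losses` requests."""
    state = {"dropped": 0}

    def send(request):
        if state["dropped"] < losses:
            state["dropped"] += 1
            return None
        return f"reply to {request}"

    return send

reply, attempts = call_with_retransmission(make_lossy_transport(2), "read")
```

Note that the client cannot tell a lost request from a lost reply or a slow server, which is why the later slides discuss idempotent requests and sequence numbers.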

Page 13: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Server Crashes (1)

Three philosophies exist on what to do here:
At-least-once semantics
At-most-once semantics
Do nothing

A server in client-server communication
a) Normal case
b) Crash after execution
c) Crash before execution

Page 14: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Server Crashes (2)

Different combinations of client and server strategies in the presence of server crashes.

Client                   Server
                         Strategy M -> P            Strategy P -> M
Reissue strategy         MPC     MC(P)   C(MP)      PMC     PC(M)   C(PM)
Always                   DUP     OK      OK         DUP     DUP     OK
Never                    OK      ZERO    ZERO       OK      OK      ZERO
Only when ACKed          DUP     OK      ZERO       DUP     OK      ZERO
Only when not ACKed      OK      ZERO    OK         OK      DUP     OK
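
The entries of this table can be derived mechanically. The sketch below assumes the usual reading of the letters (M: the server sends its completion message, P: the server performs the operation, such as printing, C: the server crashes) and assumes that a reissued request is executed exactly once by the recovered server. The names `REISSUE`, `COLUMNS`, `outcome`, and `table` are hypothetical.

```python
# Client reissue strategies: should the request be reissued, given whether
# the completion message (ACK) arrived before the crash?
REISSUE = {
    "Always": lambda acked: True,
    "Never": lambda acked: False,
    "Only when ACKed": lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}

# Table columns mapped to the events the server finished before crashing.
COLUMNS = {
    "MPC": "MP", "MC(P)": "M", "C(MP)": "",
    "PMC": "PM", "PC(M)": "P", "C(PM)": "",
}

def outcome(done_before_crash, strategy):
    """Count how often the operation runs and name the result."""
    performed = "P" in done_before_crash
    acked = "M" in done_before_crash
    runs = int(performed) + int(REISSUE[strategy](acked))
    return ("ZERO", "OK", "DUP")[runs]

def table():
    return {s: [outcome(done, s) for done in COLUMNS.values()] for s in REISSUE}
```

No row is "OK" in every column, which is the point of the slide: no combination of client and server strategy works for all crash sequences.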

Page 15: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Lost Reply Messages and Client Crashes

Lost reply messages
No reply: send the request once more (safe when the request is idempotent)
Assign each request a sequence number, so that the server can recognize a retransmission
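
Sequence numbers let a server filter retransmissions of requests that are not idempotent. The following is a minimal sketch; the `Server` class, its `handle` method, and the bank-style `balance` are hypothetical illustrations, not part of the original slides.

```python
class Server:
    def __init__(self):
        self.balance = 0
        self.replies = {}   # (client_id, seq) -> reply already sent

    def handle(self, client_id, seq, amount):
        """Apply a deposit once per (client, sequence number) pair."""
        key = (client_id, seq)
        if key in self.replies:        # retransmission: do not execute again
            return self.replies[key]
        self.balance += amount         # non-idempotent operation
        self.replies[key] = self.balance
        return self.balance

server = Server()
first = server.handle("c1", 1, 100)
again = server.handle("c1", 1, 100)    # reply was lost, client resends
```

The cached reply is returned for the duplicate, so the deposit is applied only once.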

Client crashes
A client crash leaves behind an orphan (a computation for which no parent is waiting)
Extermination: check the log and kill the orphan
Reincarnation: when the client reboots, it broadcasts a message to all machines declaring the start of a new epoch
Gentle reincarnation: like reincarnation, but an orphan is killed only if its owner cannot be found
Expiration: each RPC is given a standard amount of time

Page 16: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Basic Reliable-Multicasting

A simple solution to reliable multicasting when all receivers are known and are assumed not to fail

a) Message transmission

b) Reporting feedback
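
The scheme in this figure can be sketched as a sender with a history buffer and receivers that report feedback. This is an illustration under simplifying assumptions: calls are synchronous, a gap is only noticed when a later message arrives, and the names `Sender`, `Receiver`, and `lost_for` are hypothetical.

```python
class Receiver:
    def __init__(self):
        self.next_seq = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq > self.next_seq:
            return ("NACK", self.next_seq)   # report the first missing message
        if seq == self.next_seq:
            self.delivered.append(payload)
            self.next_seq += 1
        return ("ACK", seq)

class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.next_seq = 0
        self.history = {}   # seq -> payload, kept until every receiver ACKs
        self.acks = {}      # seq -> indexes of receivers that ACKed

    def multicast(self, payload, lost_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.acks[seq] = set()
        for idx in range(len(self.receivers)):
            if idx not in lost_for:          # lost_for simulates message loss
                self._transmit(idx, seq)

    def _transmit(self, idx, seq):
        feedback = self.receivers[idx].receive(seq, self.history[seq])
        while feedback[0] == "NACK":
            self._transmit(idx, feedback[1])  # retransmit from the history buffer
            feedback = self.receivers[idx].receive(seq, self.history[seq])
        self.acks[seq].add(idx)
        if len(self.acks[seq]) == len(self.receivers):
            del self.history[seq], self.acks[seq]

receivers = [Receiver(), Receiver()]
sender = Sender(receivers)
sender.multicast("a")
sender.multicast("b", lost_for={1})   # receiver 1 misses "b"
sender.multicast("c")                 # receiver 1 detects the gap and NACKs
```

Because every receiver sends feedback for every message, the sender can be flooded as the group grows, which motivates the feedback-control schemes on the next slides.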

Page 17: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Nonhierarchical Feedback Control

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
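
A rough sketch of this suppression idea follows. It assumes that each receiver missing the message picks a random delay before multicasting its retransmission request, and that the first request reaches all other receivers before their own timers fire; the function name `suppress_feedback` is hypothetical.

```python
import random

def suppress_feedback(receivers_missing, rng):
    """Return the receiver whose request is sent and those that are suppressed."""
    delays = {r: rng.random() for r in receivers_missing}
    first = min(delays, key=delays.get)          # earliest timer fires
    suppressed = sorted(r for r in delays if r != first)
    return first, suppressed

first, suppressed = suppress_feedback(["R1", "R2", "R3"], random.Random(7))
```

Only one retransmission request reaches the sender, regardless of how many receivers missed the message.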

Page 18: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Hierarchical Feedback Control

The essence of hierarchical reliable multicasting.
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.

Page 19: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Virtual Synchrony (1)

The logical organization of a distributed system to distinguish between message receipt and message delivery

Page 20: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Virtual Synchrony (2)

The principle of virtual synchronous multicast.

Page 21: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Message Ordering (1)

1. Unordered multicast

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

2. FIFO-ordered multicast

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4
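
FIFO-ordered delivery can be sketched with a per-sender hold-back queue: a message is delivered only after all earlier messages from the same sender. The class `FifoReceiver` and the explicit per-sender sequence numbers are assumptions made for the illustration.

```python
class FifoReceiver:
    def __init__(self):
        self.expected = {}    # sender -> next sequence number to deliver
        self.held = {}        # (sender, seq) -> message that arrived too early
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.held[(sender, seq)] = msg
        nxt = self.expected.get(sender, 1)
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, nxt) in self.held:
            self.delivered.append(self.held.pop((sender, nxt)))
            nxt += 1
        self.expected[sender] = nxt

p3 = FifoReceiver()
p3.receive("P1", 2, "m2")   # arrives early: held back until m1 is delivered
p3.receive("P4", 1, "m3")
p3.receive("P1", 1, "m1")   # releases m1, then the held m2
p3.receive("P4", 2, "m4")
```

Messages from different senders may still interleave differently at different receivers; FIFO ordering only constrains messages from the same sender.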

Page 22: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Message Ordering (2)

3. Reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved

4. Total-ordered delivery

Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
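
One common way to enforce causal-ordered delivery is with vector timestamps; the slides do not name a mechanism, so treat this as one possible technique rather than the one intended here. The function `can_deliver` is a hypothetical name, and each process is assumed to increment its own vector entry when it multicasts.

```python
def can_deliver(msg_clock, sender, local_clock):
    """Vector-timestamp test for causal delivery.

    The message must be the next one expected from `sender`, and the receiver
    must already have delivered everything the sender had seen from others.
    """
    for k in range(len(local_clock)):
        if k == sender:
            if msg_clock[k] != local_clock[k] + 1:
                return False
        elif msg_clock[k] > local_clock[k]:
            return False
    return True

# P1 (index 1) multicast a message after delivering P0's first message.
early = can_deliver([1, 1, 0], sender=1, local_clock=[0, 0, 0])
ready = can_deliver([1, 1, 0], sender=1, local_clock=[1, 0, 0])
```

The first check fails because the receiver has not yet delivered P0's message, so the dependent message is held back until it has.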

Page 23: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Implementing Virtual Synchrony

Isis (point-to-point communication is reliable, using TCP)

a) Process 4 notices that process 7 has crashed, sends a view change

b) Process 6 sends out all its unstable messages, followed by a flush message

c) Process 6 installs the new view when it has received a flush message from everyone else

Page 24: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Two-phase Commit (1)

If a process crashes, other processes may wait indefinitely for a message, so the protocol can easily fail

Timeout mechanisms are therefore used

a) The finite state machine for the coordinator in 2PC.
b) The finite state machine for a participant.

Page 25: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Two-phase Commit (2)

Outline of the steps taken by the coordinator in a two phase commit protocol

actions by coordinator:

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
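
The coordinator outline translates fairly directly into runnable form. The sketch below makes several assumptions: each participant is modelled as a hypothetical callable that receives a message and, for VOTE_REQUEST, returns "VOTE_COMMIT", "VOTE_ABORT", or `None` to model a timeout; the local log is a plain list; and multicasting the decision is reduced to calling every participant once more.

```python
def coordinator(participants, own_vote="COMMIT"):
    """Run the coordinator side of two-phase commit; return (decision, log)."""
    log = ["START_2PC"]
    votes = []
    decision = None

    for ask in participants:
        vote = ask("VOTE_REQUEST")
        if vote is None:                      # timeout while collecting votes
            decision = "GLOBAL_ABORT"
            break
        votes.append(vote)                    # record vote

    if decision is None:
        all_commit = all(v == "VOTE_COMMIT" for v in votes)
        decision = "GLOBAL_COMMIT" if all_commit and own_vote == "COMMIT" else "GLOBAL_ABORT"

    log.append(decision)                      # write the decision before sending it
    for tell in participants:
        tell(decision)                        # stand-in for the multicast
    return decision, log

def make_participant(vote):
    """Hypothetical participant that answers VOTE_REQUEST with a fixed vote."""
    def participant(message):
        return vote if message == "VOTE_REQUEST" else None
    return participant

committed = coordinator([make_participant("VOTE_COMMIT")] * 3)
aborted = coordinator([make_participant("VOTE_COMMIT"), make_participant("VOTE_ABORT")])
timed_out = coordinator([make_participant("VOTE_COMMIT"), make_participant(None)])
```

Writing the decision to the log before sending it is what allows a recovering coordinator to know which outcome it had reached.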

Page 26: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Three-phase Commit

a) Finite state machine for the coordinator in 3PC
b) Finite state machine for a participant

Page 27: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Recovery-Introduction

Goal: replace an erroneous state with an error-free state

Backward recovery: restore a previously recorded state (a checkpoint) when things go wrong

Combining checkpoints and message logging

Forward recovery: an attempt is made to bring the system into a correct new state from which it can continue to execute

Page 28: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Stable Storage

Stable storage is well suited to applications that require a high degree of fault tolerance

a) Stable storage
b) Crash after drive 1 is updated
c) Bad spot
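
The three cases in this figure can be mimicked with two in-memory "drives" and a checksum per block. This is a simplified sketch: the class `StableStorage`, the `crash_after_drive1` flag, and the use of CRC-32 as the error-detecting code are assumptions for illustration.

```python
import zlib

def make_block(data):
    return {"data": data, "crc": zlib.crc32(data)}

def is_good(block):
    return block is not None and zlib.crc32(block["data"]) == block["crc"]

class StableStorage:
    def __init__(self):
        self.drive1 = {}
        self.drive2 = {}

    def write(self, n, data, crash_after_drive1=False):
        self.drive1[n] = make_block(data)      # drive 1 is always written first
        if crash_after_drive1:
            return                             # simulated crash between the writes
        self.drive2[n] = make_block(data)

    def recover(self):
        for n in set(self.drive1) | set(self.drive2):
            b1, b2 = self.drive1.get(n), self.drive2.get(n)
            if is_good(b1) and (not is_good(b2) or b1 != b2):
                self.drive2[n] = dict(b1)      # both good but different: drive 1 wins
            elif is_good(b2) and not is_good(b1):
                self.drive1[n] = dict(b2)      # bad spot on drive 1: copy from drive 2

store = StableStorage()
store.write(0, b"old")
store.write(0, b"new", crash_after_drive1=True)   # case (b)
store.recover()
after_crash = store.drive2[0]["data"]

store.drive1[0]["data"] = b"garbage"              # case (c): checksum no longer matches
store.recover()
after_bad_spot = store.drive1[0]["data"]
```

Because drive 1 is always updated first, a mismatch between two good blocks can only mean that the crash happened between the two writes, so drive 1 holds the newer value.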

Page 29: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Checkpointing

A recovery line.

Page 30: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Independent Checkpointing

The domino effect.

Page 31: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Message Logging

Incorrect replay of messages after recovery, leading to an orphan process.

Page 32: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Summary (1)

Fault tolerance is defined as the characteristic by which a system can mask the occurrence of, and recovery from, failures

Redundancy is the key technique needed to achieve fault tolerance

Reliable group communication is suitable for small groups

Atomic multicasting can be precisely formulated in terms of a virtual synchronous execution model

Page 33: Kyung Hee University 1/33 Fault Tolerance Chap 7.

Summary (2)

Group membership changes require agreement on the same list of members, which can be reached using a commit protocol

Recovery in fault-tolerant systems is invariably achieved by checkpointing with message logging

Open question: in the discussion of RPC failures, only ways of killing an orphan are mentioned; why not reuse it instead?

Page 34: Kyung Hee University 1/33 Fault Tolerance Chap 7.

End of chapter 7

Thank you for joining us!