Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial

Jan 23, 2016

Page 1: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial

2/23/2009 CS5090 1

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial

Fred B. Schneider

Presenter: Aly Farahat


Contents

• Introduction
• State Machines
• Fault-Tolerance
• Agreement & Order
  – Logical Clocks
  – Synchronized Clocks
  – Server Side Ordering
• Faulty Output Devices
• Faulty Clients
• Using Time to Make Requests
• Reconfiguration
  – Managing Reconfiguration
  – Integrating Repaired Replicas


Client/Server Model

[Figure: Client1 through Client4 connect over a Network to a single Server]


Fault Types

• Fail Stop Faults: a faulty component enters a predefined state and halts

• Byzantine Faults: arbitrary malicious faults

Q: Why do we need logic for programs?


Fault Tolerance

• Based on the concept of Replication
• t-tolerant: system delivers correct service as long as no more than t components fail
• Identical Replicas of Server
• t+1 replicas for Fail Stop faults
• 2t+1 replicas for Byzantine faults
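
The replica counts above follow from how replica outputs are combined. The following is a minimal sketch, assuming hypothetical helper names `combine_fail_stop` and `combine_byzantine` that are not part of the slides: with fail-stop faults any output that arrives is correct, so t+1 replicas leave at least one survivor; with Byzantine faults a value is trusted only when at least t+1 of the 2t+1 replicas report it.

```python
from collections import Counter

def combine_fail_stop(outputs):
    """Fail-stop replicas never emit a wrong output, so the first one received will do."""
    for out in outputs:
        if out is not None:  # None models a replica that has halted
            return out
    raise RuntimeError("all replicas halted: more than t faults occurred")

def combine_byzantine(outputs, t):
    """Accept a value only if at least t + 1 replicas report it."""
    counts = Counter(out for out in outputs if out is not None)
    for value, n in counts.most_common():
        if n >= t + 1:
            return value
    raise RuntimeError("no value was reported by t + 1 replicas")
```

With t = 1, three replicas reporting 5, 5 and 9 yield 5, because the single faulty replica cannot outvote the two correct ones.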

Q: What kind of fault tolerance is this? What types of faults can it tolerate?


Replication Scheme

[Figure: Client1 through Client4 connect over a Network to Server Replicas 1, ..., n-2, n-1, n]


State Machine Model

• Each Server Replica is an identical state machine

• State Machines are Request Driven Machines and cannot progress on their own

• A client Issues a Request to the State Machine
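
A request-driven state machine can be sketched as follows. This is an illustrative example only, assuming a hypothetical `MemoryStateMachine` with `read` and `write` requests; the class name, the request tuples and the `run` helper are assumptions, not taken from the slides. The point is that the output depends only on the sequence of requests, so identical replicas fed the same sequence produce identical outputs.

```python
class MemoryStateMachine:
    """Deterministic state machine: state changes only when a request is applied."""

    def __init__(self):
        self.store = {}  # state variables

    def apply(self, request):
        op, loc, *rest = request
        if op == "write":
            self.store[loc] = rest[0]
            return None
        if op == "read":
            return self.store.get(loc)
        raise ValueError(f"unknown request: {op}")

def run(requests):
    """Feed a request sequence to a fresh replica and collect its outputs."""
    replica = MemoryStateMachine()
    return [replica.apply(r) for r in requests]
```

Calling `run` twice on the same request list models two replicas; both return the same list of outputs.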


State Machine Behavior with respect to clients

• O1: Requests Issued by a single client should be processed in the same order they were issued

• O2: If a request r2 is causally related to r1, r1 should be processed before r2


Example

Q: Find the analogy between a state machine in this context and the FSMs used in sequential circuit synthesis


Agreement and Order

• Coordination is necessary to assure O1 and O2

• Agreement: All Replicas agree upon the value of each request they should process

• Order: All Replicas should process requests in the same order (agree on order of requests)

• Stable Request: a request whose value and order are agreed among Replicas


Agreement

• IC1: All nonfaulty processors agree on the same value

• IC2: If the transmitter is nonfaulty, all nonfaulty processors use its value as the one on which they agree

Q: How can faulty processors be identified, assuming a Byzantine fault model?


Order and Stability

• Order: all replicas process the requests in the same order

• Stability: a property of a request, meaning that it is in the correct order

• Protocols:
  – Logical Clocks
  – Synchronized Clocks
  – Server Side Identification

Q: Suggest a scenario in which requests are received out of order


Logical Clocks


Stability Test

• r is stable at a replica once the replica has received, from every client, a request r’ with T(r) < T(r’), where T returns the logical clock value appended to a request

• Because message delays are unbounded, agreement is impossible when Byzantine faults can occur
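
The stability test above can be sketched in a few lines. This is a rough illustration under stated assumptions: `LogicalClock` follows the usual Lamport update rules, and `is_stable` takes a hypothetical dictionary `latest_by_client` that must list every client together with the largest timestamp received from it so far. It also assumes that requests from one client arrive in timestamp order.

```python
class LogicalClock:
    """Lamport-style logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event, for example issuing a request
        self.time += 1
        return self.time

    def observe(self, received):
        # On message receipt, jump past the sender's timestamp
        self.time = max(self.time, received) + 1
        return self.time

def is_stable(t_r, latest_by_client):
    """r is stable once every client has sent a request with a larger timestamp."""
    return all(ts > t_r for ts in latest_by_client.values())
```

A client that has halted never sends a larger timestamp, which is why this test needs a way to detect failed clients.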


Synchronized Real-Time Clocks

• Each processor has a real-time clock synchronized with the clocks of all other processors.

• Upper bounds on request delays guarantee order in the case of Byzantine failures


Stability Test

• 1- Replica waits long enough to guarantee that no request with a smaller identifier can still arrive; disadvantage: the replica has to wait

• 2- Check for a request from every client with a larger identifier

• In practice the disjunction of both tests is used
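
The two tests and their disjunction can be sketched as below. This is only an illustration under assumptions: a request identifier `uid` is taken to be the synchronized-clock time at which the request was issued, `delta` is a hypothetical name for the upper bound on request delay, and `latest_by_client` lists every client with the largest identifier received from it.

```python
def stable_by_waiting(uid, now, delta):
    """Test 1: anything issued before uid must have arrived once delta has elapsed."""
    return uid < now - delta

def stable_by_clients(uid, latest_by_client):
    """Test 2: every client has already sent a request with a larger identifier."""
    return all(latest > uid for latest in latest_by_client.values())

def is_stable(uid, now, delta, latest_by_client):
    """In practice either test passing is enough."""
    return stable_by_waiting(uid, now, delta) or stable_by_clients(uid, latest_by_client)
```

The second test lets a busy system avoid waiting out the full delay bound, while the first keeps requests from blocking when some client is silent.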

Q: How are Byzantine failures handled in this case?


Replica Generated Identifiers

• Advantage: not all processors need to communicate

• Phase 1: each replica proposes a candidate unique ID for the received request; at this point the request is said to be seen

• Phase 2: all replicas agree upon the request ID; at this point the request is said to be accepted


Requirements for Stability Agreement

• Stability Test: an accepted request r is stable once every request r’ that has been seen but not yet accepted has a candidate identifier strictly greater than the identifier of r
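
The test above can be written as a one-line check. The sketch below assumes two hypothetical helpers: `agree_on_uid` picks the largest proposed candidate as the agreed identifier (one possible selection rule, assumed here for illustration), and `is_stable` compares an accepted identifier against the candidates of requests that are seen but not yet accepted.

```python
def agree_on_uid(candidates):
    """Phase 2 (assumed rule): choose the largest candidate proposed in phase 1."""
    return max(candidates)

def is_stable(accepted_uid, pending_candidates):
    """Stable when no seen-but-unaccepted request could still be ordered before it."""
    return all(cuid > accepted_uid for cuid in pending_candidates)
```

Choosing the maximum means an agreed identifier is never smaller than any replica's own candidate, so a pending candidate above `accepted_uid` guarantees the final identifier will be above it too.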


Generating Unique Identifiers

Q: What is the significance of i/N term?
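
The formula on this slide did not survive in the transcript. As a hedged illustration of what an i/N term can do, the sketch below assumes a candidate identifier made of an integer part (one more than the largest identifier the replica has seen or accepted) plus the fraction i/N, where i is the replica index in 0..N-1. The function name and parameters are assumptions, not the slide's notation.

```python
import math
from fractions import Fraction

def candidate_uid(seen, accepted, i, n):
    """Integer part orders requests; the fraction i/n (assumed 0 <= i < n) breaks ties."""
    base = max(math.floor(seen), math.floor(accepted)) + 1
    return base + Fraction(i, n)
```

Because i/N is below 1 and differs for every replica, two replicas that compute the same integer part still propose distinct candidates.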


Tolerating Faulty Output Devices

• Outputs Used Outside the System
  – Replicate Output Devices
  – Replicate Voters

• Outputs Used Inside the System
  – Outputs go back to Clients
  – Each Client has a voter inside it


Tolerating Faulty Clients

• Replication
  – Server State Machine Modification
  – Voter Inside the State Machine

• Requests having same content but different identifiers

• Requests having different content and identifiers

Q: How is a voter failure inside the server handled?


Defensive Programming

• Replicas are not always possible
  – Lack of hardware
  – Application Semantics do not allow replication

• Defensive Programming: additional requirements on state machines to prevent some possibly destructive actions from a faulty client

• Examples:
  – Memory Partitioning and prevention of shared access
  – Bounded time shared resources by using scheduled requests on the server side


Timed Requests

• Pro: No need to transmit requests

• Con: A request that is not transmitted cannot carry parameters

• Default Request: Executes on time at the server unless the client sends a different request
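
A default request can be pictured as follows. This is a minimal sketch with assumed names (`request_to_execute`, `deadline`, `received`, `default`), not a construct from the slides: the server runs whatever the client sent, and falls back to the default once the deadline passes with nothing received.

```python
def request_to_execute(deadline, now, received, default):
    """Return the request to run, or None if the server should keep waiting."""
    if received is not None:
        return received  # the client overrode the default
    if now >= deadline:
        return default   # time has passed, so the default request fires
    return None
```

A faulty client that stops sending requests therefore cannot stall the server, since the default executes on time anyway.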


Reconfiguration


C, O and S

• A configuration is a Triplet <C,O,S>
  – C: the set of operational clients
  – O: the set of operational output devices
  – S: the set of operational state machine replicas

• C and O are needed by the state machine replicas

• S is needed by the agreement protocol


Configurators

• Each configurator manages a single object in C, O or S

• It detects failures and repairs of this object

• Configurators are themselves clients

• They issue reconfiguration requests to the State Machine Replicas

• State machines use application-dependent mechanisms for failure detection


Note

The next slides are adapted from a presentation of the same paper by Leon Traille from Georgia Tech


Integrating a Repaired Object

• e[r_i]: the state that a non-faulty system element e should be in after processing requests r_0 through r_i

• An element joining the configuration immediately after request r_join must be in state e[r_join] before it can participate

• Fail-stop failures
  – an output device: e[r_join] is likely to be a small amount of setup information that can be provided by state variables of sm_i
  – a client: e[r_join] is frequently based on previous sensor values and can be determined by information from other clients
  – a state machine replica: the information for e[r_join] is stored in state variables and pending requests at sm_i

• Byzantine failures
  – require identical information from t + 1 replicas instead of just one


Integration with Logical Clocks

• Integrating element e by state machine replica sm_i at request r_join

• Fail-stop processors

If e is a client or e is an output device then
  send any relevant portion of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on r_join

If e is a state machine replica sm_new then
  1) send the values of its state variables and copies of any pending requests to sm_new
  2) send to sm_new every subsequent request r received from each client c such that uid(r) < uid(r_c), where r_c is the first request sm_new received directly from c after being restarted

• Byzantine failures
  – Because information from sm_i might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained
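
The Byzantine rule above amounts to a threshold check on the copies a restarting replica collects. The sketch below is illustrative only: `trusted_copy` is a hypothetical helper, and each copy is assumed to be a hashable value such as serialized state bytes.

```python
from collections import Counter

def trusted_copy(copies, t):
    """Return the value seen in at least t + 1 identical copies, else None."""
    counts = Counter(copies)
    for value, n in counts.most_common():
        if n >= t + 1:
            return value
    return None  # keep waiting for more copies
```

With at most t faulty replicas, any value backed by t + 1 identical copies was sent by at least one correct replica.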


Integration with Real-time Clocks

• Integrating element e by state machine replica sm_i at request r_join

• Fail-stop processors

If e is a client or e is an output device then
  send relevant portions of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on r_join

If e is a state machine replica sm_new then
  1) send the values of its state variables and copies of any pending requests to sm_new
  2) send to sm_new every request received during the next interval of duration Δ

• Byzantine failures
  – Because information from sm_i might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained


Stability Test During Restart

• Relaying of messages breaks the stability tests

• A request r may be received directly from client c, but later a request r’, also from c, is relayed by sm_i with uid(r) > uid(r’)

• Solution: requests from c must be considered stable only after no more relayed requests from c can arrive

• Stability Test During Restart: A request r received directly from a client c by restarting state machine replica sm_new is stable only after the last request from c relayed by another processor has been received by sm_new


Thank you!