Transcript
Page 1:

Understanding Fault-Tolerant Distributed Systems

A. Mok 2013

CS 386C

Page 2:

Dependable Real-Time System Architecture

Application stack

Real-time services

Real-time scheduler

Page 3:

Basic Concepts

Terms: service, server, "depends" relation

• Failure semantics

• Failure masking:
– hierarchical
– group

• Hardware architecture issues

• Software architecture issues

Page 4:

Basic architectural building blocks

Service: collection of operations that can be invoked by users

• The operations can only be performed by a server, which hides state representation and operation implementation details. Servers can be implemented in hardware or software.

Examples:
– IBM 4381 processor service: all operations in a 4381 manual
– DB2 database service: set of query and update operations

Page 5:

The “depends on” relation

• We say that server s depends on server t if s relies on the correctness of t’s behavior to correctly provide its own service.

• Graphic representation: server s (user, client) → server t (server, resource)

• System-level description: levels of abstraction

Page 6:

Failure Classification

• Service specification: For all operations, specify state transitions and timing requirements

• Correct service: meets state transition and timing specs
• Failure: erroneous state transition or timing in response to a request
• Crash failure: no state transition/response to all subsequent requests.
• Omission failure: no state transition/response in response to a request; distinguishable from crash failure only by timeout.
• Timing failure: either an omission or the correct state transition occurs too early or too late; also called a performance failure.
• Arbitrary failure: either a timing failure or a bad state transition; also known as a Byzantine failure.

Page 7:

Failure Classification

[Nested failure classes: Crash ⊂ Omission ⊂ Timing ⊂ Arbitrary (Byzantine)]
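A minimal sketch (my illustration, not from the slides) of this containment hierarchy: each class also covers every behavior permitted by the classes nested inside it, so a server assumed to have, say, omission failure semantics also covers crash behavior.

```python
from enum import IntEnum

class FailureClass(IntEnum):
    CRASH = 1       # stops responding to all subsequent requests
    OMISSION = 2    # misses individual requests (a crash omits everything)
    TIMING = 3      # responds, but too early or too late (a.k.a. performance)
    ARBITRARY = 4   # any behavior at all, including wrong values (Byzantine)

def covers(assumed: FailureClass, observed: FailureClass) -> bool:
    """A failure-semantics assumption covers any behavior at or below its level."""
    return observed <= assumed

print(covers(FailureClass.OMISSION, FailureClass.CRASH))    # True
print(covers(FailureClass.OMISSION, FailureClass.TIMING))   # False
```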

Page 8:

Examples of Failures

• Crash failure
– "Clean" operating system shutdown due to a power outage
• Omission failure
– Message loss on a bus
– Impossibility of reading a bad disk sector
• Timing failure
– Early scheduling of an event due to a fast clock
– Late scheduling of an event due to a slow clock
• Arbitrary failure
– A "plus" procedure returns 5 for plus(2,2)
– A search procedure finds a key that was never inserted
– The contents of a message are altered en route

Page 9:

Server Failure Semantics

• When programming recovery actions for a failed server, it is important to know what failure behavior that server can exhibit.

Example: client s and server r connected by a network n

d - max time to transport a message
p - max time needed to receive and process a message

• If n, r can only suffer omission failures, then if no reply arrives at s within 2(d + p) time units, no reply will arrive at s. Hence s can assume "no answer" to message.

• If n, r can exhibit performance failures, then s must keep local data to discard replies to "old" messages.

• It is the responsibility of a server implementer to ensure that the specified failure semantics is properly implemented, e.g., a traffic light control system needs to ensure a "flashing red or no light" failure semantics.

[Diagram: client s and server r connected through network n]
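A minimal client-side sketch of the timeout reasoning above, assuming omission-only failure semantics for r and n. The names send_request and recv_reply and the numeric values of d and p are illustrative stand-ins, not an API from the slides.

```python
import time

D = 0.050   # assumed max one-way message transport time d (seconds)
P = 0.020   # assumed max receive-and-process time p (seconds)

def call(server, request, send_request, recv_reply):
    """Under omission-only failure semantics for the network and the server,
    silence for 2(d + p) lets the client safely conclude 'no answer'."""
    send_request(server, request)
    deadline = time.monotonic() + 2 * (D + P)
    reply = recv_reply(timeout=deadline - time.monotonic())  # None on timeout
    if reply is None:
        return None   # no reply will ever come for this request
    return reply
```

If performance failures are also possible, a reply may still arrive after the deadline, which is why the client must then keep local data to recognize and discard replies to "old" messages.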

Page 10:

Failure Semantics Enforcement

• Usually, failure semantics is implemented to the extent of satisfying the required degree of plausibility.

Examples:

– Use error-detecting code to implement networks with performance failure semantics.

– Use error-detecting code, circuit switching and a real-time executive to implement networks with omission failure semantics.

– Use lock-step duplication and highly reliable comparator to implement crash failure semantics for CPUs.

• In general, the stronger the desired failure semantics, the more expensive it is to implement; e.g., it is cheaper to build a memory without error-detecting code (with arbitrary failure semantics) than to build one with error-detecting code (with omission failure semantics).
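To make the role of an error-detecting code concrete, here is a minimal sketch (my illustration, not from the slides) using a single parity bit: a detected corruption is reported as "no data", turning an arbitrary read failure into an omission failure.

```python
def encode(word: int) -> tuple[int, int]:
    parity = bin(word).count("1") % 2
    return word, parity

def decode(word: int, parity: int):
    """Return the word if the check passes, None (omission) if corruption is detected."""
    if bin(word).count("1") % 2 != parity:
        return None          # detected error behaves like an omission failure
    return word

stored = encode(0b1011)
print(decode(*stored))                         # 11: check passes
print(decode(stored[0] ^ 0b0001, stored[1]))   # None: single-bit flip detected
```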

Page 11:

Failure Masking: Hierarchical Masking

• Suppose server u depends on server r. If server u can provide its service by taking advantage of r's failure semantics, then u is said to mask r’s failure to its own (u’s) clients.

• Typical sequence of events consists of:
1. A masking attempt.
2. If impossible to mask, then recover consistent state and propagate failure upwards.

Examples:
– Retry I/O on same disk assuming omission failure on bus (time redundancy).
– Retry I/O on backup disk assuming crash failure on disk being addressed (space redundancy).
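A hedged sketch combining the two examples above into one masking sequence; read_block, the disk handles and the retry count are hypothetical names, not from the slides.

```python
class UnrecoverableReadError(Exception):
    """Raised when masking fails and the failure must be propagated upward."""

def masked_read(primary, backup, block_no, read_block, retries=3):
    # 1. Masking attempt: retry the same disk (time redundancy),
    #    assuming omission failures on the bus.
    for _ in range(retries):
        data = read_block(primary, block_no)
        if data is not None:
            return data
    # 2. Masking attempt: switch to the backup disk (space redundancy),
    #    assuming the addressed disk has crashed.
    data = read_block(backup, block_no)
    if data is not None:
        return data
    # 3. Masking impossible: recover a consistent state and propagate upward.
    raise UnrecoverableReadError(block_no)
```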

Page 12:

Failure Masking: Group Masking

• Implement service S by a group of redundant, independent servers so as to ensure continuity of service.

• A group is said to mask the failure of a member m if the group response stays correct despite m’s failure.

• Group response = f(member responses)

Examples of group responses:
– Response of fastest member
– Response of "primary" member
– Majority vote of member responses

• A server group able to mask from its clients any k concurrent member failures is said to be k-fault-tolerant.
– k = 1: single-fault tolerant
– k > 1: multiple-fault tolerant
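A minimal sketch (not from the slides) of one of the group responses listed above, majority voting, which can mask up to k arbitrary member failures when the group has at least 2k + 1 members.

```python
from collections import Counter

def group_response(member_responses):
    """Return the value reported by a strict majority, or None if no majority
    exists (the group can no longer mask the failures)."""
    value, votes = Counter(member_responses).most_common(1)[0]
    return value if votes > len(member_responses) // 2 else None

# A 3-member group (k = 1): one bad answer is outvoted.
print(group_response([4, 4, 5]))    # -> 4
# Two concurrent bad answers exceed k and can no longer be masked.
print(group_response([4, 5, 6]))    # -> None
```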

Page 13:

Group Masking Enforcement

• The complexity of group management mechanism is a function of the failure semantics assumed for group members

Example:

Assume memory components with read omission failure semantics

– Simply "oring" the read outputs is sufficient to mask member failure

– Cheaper and faster group management

– More expensive members (stores with error correcting code)

[Diagram: duplexed memory modules M and M'; writes go to both modules, a read takes the OR of their outputs]
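A sketch of the "or-ing" idea above, with the hardware OR replaced by an illustrative Python function: a member with read-omission semantics simply produces no output, so whichever output is present is correct and no voter is needed.

```python
def duplexed_read(m_output, m_prime_output):
    # None models a member that omits its read output; or-ing the outputs
    # means taking whichever module actually responded.
    return m_output if m_output is not None else m_prime_output

print(duplexed_read(None, 0x2A))   # M omitted its output; the group still returns 42
```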

Page 14:

Group Masking Enforcement

• The complexity of group management mechanism is a function of the failure semantics assumed for group members

Example:

Assume memory components with arbitrary read failure semantics

– Needs a majority voter to mask the failure of a member
– More expensive and slower group management
– Cheaper members (no e.c.c. circuitry)

[Diagram: triplicated memory modules M, M' and M''; writes go to all three modules, reads pass through a majority voter]

Page 15:

Key Architectural Issues

• Strong server failure semantics - expensive

• Weak failure semantics - cheap

• Redundancy management for strong failure semantics - cheap

• Redundancy management for weak failure semantics - expensive

This implies:

Need to balance amount of failure detection, recovery and masking redundancy mechanisms at various levels of abstraction to obtain best overall cost/performance/dependability results

Example:

Error-detecting codes for memory systems at lower levels usually decrease the overall system cost; error correction for communication systems at lower levels may be overkill for non-real-time applications (known as the end-to-end argument in networking).

Page 16:

Hardware Architectural Issues

• Replaceable hardware unit:

A set of hardware servers packaged together so that the set is a physical unit of failure, replacement and growth. May be field replaceable (by field engineers) or customer replaceable.

• Goals:
– Allow for physical removal/insertion without disruption to higher level software servers (removals and insertions are masked)
– If masking is not possible or too expensive, ensure "nice" failure semantics such as crash or omission

• Coarse granularity architecture:

A replaceable unit includes several elementary servers, e.g., CPU, memory, I/O controller.

• Fine granularity architecture:

Elementary hardware servers are replaceable units.

Question: How are the replaceable units grouped, connected?

Page 17:

Coarse Granularity Example: Tandem Non-Stop

Page 18:

Coarse Granularity Example: DEC VAX cluster

Page 19:

Coarse Granularity Example: IBM Extended Recovery Facility

IBM XRF Architecture:

Page 20:

Fine Granularity Example: Stratus

Page 21:

Fine Granularity Example: Sequoia

Page 22:

O.S. Hardware Failure Semantics?

What failure semantics is specified for hardware replaceable units that is usually assumed by operating system software?

• CPU - crash
• Bus - omission
• Memory - read omission
• Disk - read/write omission
• I/O controller - crash
• Network - omission or performance failure

Page 23:

What failure detection mechanisms are used to implement the specified hardware replaceable unit's failure semantics?

Examples:

– Processors with crash failure semantics implemented by duplication and comparison in Stratus, Sequoia, Tandem CLX

– Crash failure semantics approximated by using error-detecting codes in IBM 370, Tandem TXP, VLX.
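A hedged sketch of the duplication-and-comparison idea in the first example above: execute the same operation on two replicas and halt on any disagreement, so the unit either answers correctly or not at all. The names are illustrative, not an actual Stratus, Sequoia, or Tandem interface.

```python
class CrashOnMismatch(Exception):
    """Mismatch detected: the pair halts instead of emitting a wrong result."""

def duplexed_execute(op, args, replica_a, replica_b):
    result_a = replica_a(op, *args)
    result_b = replica_b(op, *args)
    if result_a != result_b:
        raise CrashOnMismatch(op)   # fail-stop: arbitrary behavior becomes a crash
    return result_a
```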

Page 24:
Page 25:

At what level of abstraction are hardware replaceable units' failures masked?

• Masking at hardware level (e.g., Stratus)
– Redundancy at the hardware level.

– Duplexing CPU-servers with crash failure semantics provides single-fault tolerance.

– Increases mean time between failure for CPU service.

• Masking at operating system level (e.g., Tandem process groups)
– Redundancy at the O.S. level.

– Hierarchical masking hides single CPU failure from higher level software servers by restarting a process that ran on a failed CPU in a manner transparent to the server.

• Masking at application server level (e.g., IBM XRF, AAS)
– Redundancy at the application level

– Group masking hides CPU failure from users by using a group of redundant software servers running on distinct hardware hosts and maintaining global service state.

Page 26:

Software Architecture Issues

Software servers:

Analogous to hardware replaceable units (units of failure, replacement, growth)

Goals:
• Allow for removal/insertion without disrupting higher level users.
• If masking is impossible or not economical, ensure "nice" failure semantics (which will allow higher level users, possibly human, to use simple masking techniques such as "login and try again")

Page 27:

What failure semantics is specified for software servers?
• If service state is persistent (e.g., ATM), servers are typically required to implement omission (atomic transaction, at-most-once) failure semantics.

• If service state is not persistent (e.g., network topology management, virtual circuit management, low level I/O controller), then crash failure semantics is sufficient.

• To implement atomic transaction or crash failure semantics, the operations implemented by servers are assumed to be functionally correct; e.g., a deposit of $100 must not credit the customer's account with $1000.

Page 28:

How are software server failures masked?
• Functional redundancy (e.g., N-version programming, recovery blocks)
• Use of software server groups (e.g., IBM XRF, Tandem)

The use of software server groups raises a number of issues that are not well understood.
– How do clients address service requests to server groups?
– What group-to-group communication protocols are needed?

Page 29:

The Tandem Process-Pair Communication Protocol

Goal:

To achieve at-most-once semantics in the presence of process and communication performance failures.

1. C sends request to S; S assigns unique serial number.

2. Client-server session number 0 replicated in (C, C'), (S, S').

3. Current serial number SN1 replicated in (S, S').

Current message counter: records the Id of the transaction request. Since many requests from different clients may be processed simultaneously, each request has a unique Id.

Page 30:

S: if SN(message) = my session message counter
then do
  1. Increment current serial number to SN2
  2. Log locally the fact that SN1 was returned to request 0
  3. Increment session message counter
  4. Checkpoint (session message counter, log, new state) to S'
else
  return result saved for request 0

Normal Protocol Execution (1)

Page 31:

S' updates the session message count, records that SN1 was returned for request 0, adopts the new current serial number and sends an ack.

S, after receiving the ack confirming that S' is now in synch, sends the response for request 0 to C.

Normal Protocol Execution (2)

Page 32:

C updates its state with SN1, then checkpoints (session counter, new state) to C'

C' updates session counter and state, then sends ack to C.

Normal Protocol Execution (3)
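A hedged Python sketch of the server side of the normal execution just described. Class and method names (PrimaryServer, checkpoint, apply_request) are illustrative, not Tandem's interface, and the wait for the backup's acknowledgement before replying to C is noted only as a comment.

```python
class PrimaryServer:
    def __init__(self, backup):
        self.backup = backup          # S': the backup process receiving checkpoints
        self.session_counter = 0      # id of the next request expected in this session
        self.serial_number = "SN1"    # current serial number
        self.results_log = {}         # request id -> result already returned

    def handle(self, request_id, request, apply_request):
        if request_id != self.session_counter:
            # Duplicate: the client (or its backup) retried an old request,
            # so return the saved result instead of re-executing (at-most-once).
            return self.results_log[request_id]
        # New request: execute it, advance the serial number, log the result.
        result, new_state, next_serial = apply_request(request, self.serial_number)
        self.results_log[request_id] = result
        self.serial_number = next_serial
        self.session_counter += 1
        # Checkpoint (counter, log, new state) to S'; in the full protocol S waits
        # for S' to acknowledge before sending the result back to C.
        self.backup.checkpoint(self.session_counter, dict(self.results_log),
                               new_state, self.serial_number)
        return result
```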

Page 33:

Example 1: S crashes before checkpointing to S'

• C resends requests to (new) primary S

• S starts new backup S', initializes its state to (0, SN1) and interprets requests as before

Page 34:

Example 2: S' crashes before sending ack to S

S creates new backup S', initializes it and resends the checkpoint message to S'

Page 35:

Example 3: C crashes after sending request to S

C' becomes primary, starts new back-up and resends request to S

S performs the check on session number:

If SN(message) = my session message counter

then ...

else return result saved for request 0

Page 36:

Issues raised by the use of software server groups:
• State synchronization:

How should group members maintain consistency of their local states (including time) despite member failures and joins and communication failures?

• Group membership:

How should group members achieve agreement on which members are correctly working?

• Service availability:

How is it automatically ensured that the required number of members is maintained for each server group despite operating system, server and communication failures?

Page 37:

Example of Software Group Masking

• a fails: b and c agree that a failed; b becomes primary, c back-up.

• b fails: c agrees (trivially) that b failed, c becomes primary.

• a and b fail: c agrees that a and b failed, c becomes primary.

If all working processors agree on group state, then the service S can be made available in spite of two concurrent processor failures
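A minimal sketch of the take-over rule in this example, assuming the working members have already agreed on who has failed; the ranking list and function name are illustrative.

```python
RANKING = ["a", "b", "c"]          # a = primary, b = first back-up, c = second back-up

def choose_primary(agreed_failed):
    # Highest-ranked member that all working processors agree is still correct.
    for member in RANKING:
        if member not in agreed_failed:
            return member
    return None                     # whole group lost; service S unavailable

print(choose_primary({"a"}))        # -> "b"
print(choose_primary({"a", "b"}))   # -> "c"
```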

Page 38:

Problem of Disagreement on Service State

If a fails then S becomes unavailable despite the fact that enough hardware for running it exists.

Disagreement on state can cause unavailability when failure occurs:

Page 39:

If clocks synchronized to within 10 milliseconds are used to detect excessive message delay, then clocks that are out of synch (by, say, 10 minutes) can lead to a late system response going undetected.

Problem of Disagreement on Time
With disagreement on time, performance failures will not be detected.

Message m arriving 10 minutes late will still cause B to think that m took only 30 milliseconds for the trip, which could be within network performance bounds.
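A worked illustration of the numbers above (times in seconds; the exact values are my choices following the slide's 10 ms / 30 ms / 10 minute example): B estimates the trip time as its own receive timestamp minus A's send timestamp, so the clock skew absorbs the real delay.

```python
MAX_NETWORK_DELAY = 0.030               # assumed performance bound for the network

def looks_on_time(send_ts_on_A, recv_ts_on_B):
    return recv_ts_on_B - send_ts_on_A <= MAX_NETWORK_DELAY

send_on_A = 1000.00                     # m stamped with A's clock
arrival_real = send_on_A + 600.00       # m really takes 10 minutes to arrive
b_clock_behind = 599.98                 # B's clock is ~10 minutes behind A's
recv_on_B = arrival_real - b_clock_behind   # B's clock reads about 1000.02 on arrival
print(looks_on_time(send_on_A, recv_on_B))  # True: a 10-minute delay looks like ~20 ms
```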

Page 40:

Problem of Disagreement on Group Membership

Out-of-date membership information causes unavailability

Disagreement on membership can create confusion:

Page 41:

Server Group Synchronization Strategy

• Close state synchronization
– Each server interprets each request
– Result is sent by voting if members have arbitrary failure semantics; otherwise the result can be sent by all members or by the highest ranking member

• Loose state synchronization
– Members are ranked: primary, first back-up, second back-up, ...
– Primary maintains current service state and sends results
– Back-ups log requests (maybe results also) and periodically purge the log
– Applicable only when members have performance failure semantics

One solution is to use atomic broadcast with tight clock synchronization.

Page 42:

Requirements for Atomic Broadcast and Membership

Safety properties, e.g.:
• All group members agree on the group membership
• All group members agree on what messages were broadcast and the order in which they were sent

Timeliness properties, e.g.:
• There is a bound on the time required to complete an atomic broadcast
• There are bounds on the time needed to detect server failures and server reactivation
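One classical way to realize these properties, sketched below with assumed names and an assumed bound DELTA, is a clock-driven delivery rule: every correct member holds a received broadcast until its timestamp plus DELTA has passed on the local (tightly synchronized) clock, then delivers pending messages in (timestamp, sender) order, so all members deliver the same messages in the same order within a bounded time.

```python
import heapq

DELTA = 0.100   # assumed bound covering worst-case network delay plus clock skew

class AtomicDelivery:
    def __init__(self):
        self.pending = []            # min-heap ordered by (timestamp, sender)

    def on_receive(self, timestamp, sender, payload):
        heapq.heappush(self.pending, (timestamp, sender, payload))

    def deliverable(self, local_clock):
        """Deliver every message whose delivery time (timestamp + DELTA) has passed."""
        out = []
        while self.pending and self.pending[0][0] + DELTA <= local_clock:
            out.append(heapq.heappop(self.pending))
        return out
```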

Page 43:

Service Availability Issues

What server group availability strategy would ensure the required availability objective?
– Prescribe how many members a server group should have.

What mechanism should be used to automatically enforce a given availability strategy?
1. Direct mutual surveillance among group members
2. General service availability manager

Page 44:

Q & A