Basic concepts in fault tolerance - IIT Computer Science, iraicu/teaching/CS550-S11/lecture18.pdf

Jun 11, 2018


trannhan

• Basic concepts in fault tolerance
• Masking failure by redundancy
• Process resilience
• Reliable communication
  – One-one communication
  – One-many communication
• Distributed commit
  – Two phase commit
• Failure recovery
  – Checkpointing
  – Message logging

CS550: Advanced Operating Systems 2


• Single machine systems
  – Failures are all or nothing
    • OS crash, disk failures
• Distributed systems: multiple independent nodes
  – Partial failures are also possible (some nodes fail)
• Question: Can we automatically recover from partial failures?
  – Important issue since the probability of failure grows with the number of independent components (nodes) in the system
  – Prob(failure) = Prob(at least one component fails) = 1 - Prob(no component fails)
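As a worked example of the formula above, assume (this is not stated on the slide) that each of n components fails independently with the same probability p. Then Prob(no failure) = (1 - p)^n, so the probability that at least one component fails is 1 - (1 - p)^n. A minimal Python sketch; the function name is hypothetical:

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Assumption for illustration only: components fail independently,
    each with the same probability p.
    """
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    prob_no_failure = (1.0 - p) ** n
    return 1.0 - prob_no_failure
```

Under these assumptions, calling `prob_any_failure(0.01, n)` for growing n shows the point of the slide: the more independent nodes a system has, the more likely it is that some node is down.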


• Computing systems are not very reliable
  – OS crashes frequently (Windows), buggy software, unreliable hardware, software/hardware incompatibilities
  – Until recently: computer users were “tech savvy”
    • Could depend on users to reboot, troubleshoot problems
  – Growing popularity of Internet/World Wide Web
    • “Novice” users
    • Need to build more reliable/dependable systems
  – Example: what if your TV (or car) broke down every day?
    • Users don’t want to “restart” TV or fix it (by opening it up)
• Need to make computing systems more reliable


• Need to build dependable systems
• Requirements for dependable systems
  – Availability: system should be available for use at any given time (99.999% means ?)
  – Reliability: system should run continuously without failure (over a time interval)
  – Safety: temporary failures should not result in a catastrophe
  – Maintainability: a failed system should be easy to repair
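One way to approach the "99.999% means ?" question is to convert an availability figure into the downtime it permits per year. The sketch below assumes a 365-day year and treats availability as the fraction of time the system is up; the names are hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability: float) -> float:
    """Downtime per year permitted by an availability fraction (e.g. 0.99999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR
```

With these assumptions, `downtime_minutes_per_year(0.99999)` comes to roughly 5.26 minutes per year, while 99% availability allows about 5,256 minutes (around 3.65 days).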


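• Fault tolerance: system should provide services despite faults
• Three types:
  – Transient faults: occur once and then disappear
  – Intermittent faults: appear, vanish, and then reappear
  – Permanent faults: persist until the faulty component is replaced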


Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times


• Handling faulty processes:
  – Use a process group:
    • All processes perform the same operations
    • All messages are sent to all members of the group
    • A majority needs to agree on the result of an operation
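The majority rule for a process group can be sketched as a small voting function. This is an illustrative sketch, not code from the lecture: it assumes every group member returns a hashable reply and that a result is accepted only when strictly more than half of the replies agree.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def majority_vote(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replies, else None."""
    if not replies:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None
```

For example, three replicas answering `[42, 42, 7]` yield 42, so the single deviating replica is masked; replies with no strict majority yield `None`.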


Triple Modular Redundancy (TMR)
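To get a feel for what TMR buys, the following sketch computes the reliability of one triplicated stage. It rests on assumptions that are not on the slide: the three modules fail independently, each works with probability r, and the voter itself never fails. The stage then works when all three modules work or exactly two do, giving r^3 + 3r^2(1 - r).

```python
def tmr_reliability(r: float) -> float:
    """Reliability of one TMR stage with a perfect voter.

    Assumes three independent modules, each working with probability r.
    The stage works if at least two of the three modules work.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    all_three_work = r ** 3
    exactly_two_work = 3 * r ** 2 * (1.0 - r)
    return all_three_work + exactly_two_work
```

Under these assumptions a module reliability of 0.9 gives a stage reliability of about 0.972, whereas at r = 0.5 the triplicated stage is no better than a single module.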


Advantages and disadvantages?


• How should processes agree on results of a computation?
• k-fault tolerant: system can survive faults in k components and still function
• (1) If processes fail silently: k+1 components are needed
• (2) If failures are Byzantine: 2k+1 components are needed, so that the k+1 correct ones can outvote the k faulty ones
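The two group-size rules above can be written down directly. A minimal sketch with a hypothetical function name; it only encodes the k+1 and 2k+1 bounds stated on the slide:

```python
def min_group_size(k: int, byzantine: bool) -> int:
    """Smallest group that tolerates k faulty members.

    Fail-silent members: k + 1, so at least one member survives.
    Byzantine members: 2k + 1, so correct members form a majority.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1
```

For k = 1 this gives 2 members in the fail-silent case and 3 in the Byzantine case, which matches the three-module arrangement used in TMR.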


• Defend against Byzantine failures
• Components of a system fail in arbitrary ways
  – Not just by stopping or crashing but by processing requests incorrectly, corrupting their local state, and/or producing incorrect or inconsistent outputs
• Correct functionality assuming not too many Byzantine faulty components


• Two army problem:

– Each army coordinates with a messenger

– Messenger can be captured by the hostile army

– Can generals reach agreement?

– Conclusion: ?


• Make use of a reliable transport protocol (TCP) or handle at the application layer
• In Chapter 4, we summarized five different classes of failures in RPC systems:
  – Client unable to locate server
  – Lost request messages
  – Server crashes after receiving request
  – Lost reply messages
  – Client crashes after sending request
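Two of the failure classes above, lost requests and lost replies, are commonly handled at the application layer by retransmitting on timeout and letting the server filter duplicates by request ID. The sketch below simulates that idea in one process. All names (`Server`, `lossy_call`, `call_with_retry`) are hypothetical, message loss is simulated with a random-number source, and server or client crashes are not modelled.

```python
from typing import Dict, Optional

class Server:
    """Toy server that caches replies by request ID so that a
    retransmitted request is not executed a second time."""

    def __init__(self) -> None:
        self.balance = 0
        self.replies: Dict[int, int] = {}

    def handle(self, request_id: int, amount: int) -> int:
        if request_id in self.replies:       # duplicate: replay cached reply
            return self.replies[request_id]
        self.balance += amount               # non-idempotent operation
        self.replies[request_id] = self.balance
        return self.balance

def lossy_call(server: Server, request_id: int, amount: int,
               rng, loss: float) -> Optional[int]:
    """One request/reply exchange; None models a client-side timeout.
    rng is any object with a random() method returning a float in [0, 1)."""
    if rng.random() < loss:                  # request lost on the way
        return None
    reply = server.handle(request_id, amount)
    if rng.random() < loss:                  # reply lost on the way back
        return None
    return reply

def call_with_retry(server: Server, request_id: int, amount: int,
                    rng, loss: float = 0.3, max_tries: int = 50) -> int:
    """Retransmit the same request ID until a reply arrives."""
    for _ in range(max_tries):
        reply = lossy_call(server, request_id, amount, rng, loss)
        if reply is not None:
            return reply
    raise TimeoutError("no reply after %d tries" % max_tries)
```

Passing `random.Random(seed)` as `rng` gives a repeatable simulation. The request ID is what makes the retry safe here: when only the reply is lost, the retransmission hits the cache instead of applying the update twice.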


• Reliable multicast
  – Lost messages => need to retransmit
• Approaches:
  – ACK-based schemes
    • Problems?
  – NACK-based schemes
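A NACK-based receiver can be sketched as follows. This is an illustration under stated assumptions rather than the scheme from the lecture: the sender numbers messages 0, 1, 2, ..., the receiver stays silent when messages arrive in order, and it reports only the sequence numbers it finds missing. The class and method names are hypothetical.

```python
from typing import Dict, List

class NackReceiver:
    """Receiver that delivers messages in order and reports only gaps."""

    def __init__(self) -> None:
        self.next_expected = 0
        self.buffer: Dict[int, str] = {}   # out-of-order messages held back
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> List[int]:
        """Accept one message; return the sequence numbers to NACK."""
        if seq >= self.next_expected:      # ignore already-delivered duplicates
            self.buffer[seq] = payload
        # Deliver every message that is now in order.
        while self.next_expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1
        if not self.buffer:
            return []
        # Anything still buffered sits behind a gap; request the missing ones.
        highest = max(self.buffer)
        return [s for s in range(self.next_expected, highest)
                if s not in self.buffer]
```

One limitation of this sketch: a gap is noticed only when a later message arrives, so the loss of the last message in a stream goes undetected without some additional traffic.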


• Atomic multicast: either all processes receive the message or none do
• Solution: Group view & View change
  – Each msg is uniquely associated with a group of processes
  – View of the process group when message was sent
  – All procs in the group should have the same view
  – Virtual synchrony property


• Distributed commit: either all processes in a group perform an operation or none of them does
  – Examples:
    • Reliable multicast: Operation = delivery of a message
    • Distributed transaction: Operation = commit transaction
• Possible approaches
  – One phase commit
  – Two phase commit (2PC) [Gray 1978]
  – Three phase commit


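• Coordinator: coordinates the operation
• Involves two phases
  – Voting phase
  – Decision phase

[Figure: message exchange between coordinator and participant]
coordinator -> participant: VOTE_REQUEST
participant -> coordinator: VOTE_COMMIT (participant is ready to locally commit its part of the transaction)
coordinator: collects all votes from the participants
coordinator -> participant: GLOBAL_COMMIT (participant locally commits its part of the transaction)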
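The two phases can be sketched as a single-process simulation. The slide only shows the commit path; the VOTE_ABORT and GLOBAL_ABORT messages below come from the standard description of 2PC and are an addition here. The class and function names are hypothetical, and timeouts and crashes of the coordinator or participants are deliberately left out.

```python
from typing import List

VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"

class Participant:
    def __init__(self, name: str, can_commit: bool) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self) -> str:
        """Voting phase: say whether the local part can be committed."""
        if self.can_commit:
            self.state = "READY"
            return VOTE_COMMIT
        self.state = "ABORT"
        return VOTE_ABORT

    def on_decision(self, decision: str) -> None:
        """Decision phase: follow the coordinator's global decision."""
        self.state = "COMMIT" if decision == GLOBAL_COMMIT else "ABORT"

def two_phase_commit(participants: List[Participant]) -> str:
    """Coordinator logic: commit only if every participant votes to commit."""
    # Phase 1: send VOTE_REQUEST to everyone and collect all votes.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2: decide and broadcast the decision.
    if all(v == VOTE_COMMIT for v in votes):
        decision = GLOBAL_COMMIT
    else:
        decision = GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single VOTE_ABORT is enough to abort the whole transaction, which is what gives the all-or-nothing behaviour the distributed commit definition asks for.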


• Techniques thus far allow failure handling
• Recovery: operations that must be performed after a failure to recover to a correct state
• Techniques:
  – Backward recovery
  – Forward recovery
• Storage types:
  – RAM
  – Disk
  – Stable storage


• Steps:

– ?

• Key issue: consistent cut & recovery line
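The notion of a consistent cut can be made concrete with a small check. The sketch assumes a representation that is not in the slides: each process has a numbered local event history (starting at 0), a cut records how many events of each process it includes, and each message records the index of its send and receive events. A cut is consistent if every message whose receipt lies inside the cut also has its send inside the cut.

```python
from typing import Dict, List, NamedTuple

class Message(NamedTuple):
    sender: str
    send_event: int    # index of the send event in the sender's history
    receiver: str
    recv_event: int    # index of the receive event in the receiver's history

def is_consistent_cut(cut: Dict[str, int], messages: List[Message]) -> bool:
    """cut[p] is the number of events of process p included in the cut."""
    for m in messages:
        received_in_cut = m.recv_event < cut[m.receiver]
        sent_in_cut = m.send_event < cut[m.sender]
        if received_in_cut and not sent_in_cut:
            return False   # a receipt is recorded without its send
    return True
```

A message that was sent inside the cut but not yet received is acceptable (it is simply in transit); only a receipt recorded without the corresponding send makes the cut inconsistent.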


• Each process periodically checkpoints independently of other processes

• Upon a failure, work backwards to locate a consistent cut

• Problem: ?


• Checkpointing is expensive
  – All procs restart from previous consistent cut
  – Taking a snapshot is expensive
• Combine checkpointing (expensive) with message logging (cheap)
  – Take infrequent checkpoints
  – Log all msgs between checkpoints to local stable storage
  – To recover: simply replay msgs from previous checkpoint
    • Avoid recomputations from previous checkpoint
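The combination of infrequent checkpoints with a message log can be sketched for a single process. The example assumes the process is deterministic (the same messages in the same order always produce the same state) and uses in-memory attributes to stand in for stable storage; the class and its methods are hypothetical.

```python
import copy
from typing import Dict, List, Tuple

class LoggedProcess:
    """Toy process: state is a running total per key."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}               # volatile state
        self.checkpoint: Dict[str, int] = {}          # stands in for stable storage
        self.log: List[Tuple[str, int]] = []          # msgs since last checkpoint

    def _apply(self, key: str, amount: int) -> None:
        self.state[key] = self.state.get(key, 0) + amount

    def deliver(self, key: str, amount: int) -> None:
        self.log.append((key, amount))                # log first, then apply
        self._apply(key, amount)

    def take_checkpoint(self) -> None:
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                              # older msgs no longer needed

    def crash(self) -> None:
        self.state = {}                               # volatile state is lost

    def recover(self) -> None:
        """Restore the last checkpoint, then replay the logged messages."""
        self.state = copy.deepcopy(self.checkpoint)
        for key, amount in self.log:
            self._apply(key, amount)
```

After a crash, `recover()` brings the process back to the state it had just before the failure, provided every message delivered since the last checkpoint made it into the log.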


• Basic concepts in fault tolerance
• Reliable communication
  – One-one communication
  – One-many communication
• Distributed commit
  – Two phase commit
• Failure recovery
  – Checkpointing
  – Message logging
• Reading materials:
  – AST chapter 8
