Failure Tolerance Distributed Systems Santa Clara University

Failure Tolerancetschwarz/COEN 317/Failure.pdf · • Chandy and Lamport: Distributed snapshot ... • In the presence of byzantine failure

Jun 26, 2018


nguyendan
Transcript

Failure Tolerance
Distributed Systems

Santa Clara University


Distributed Checkpointing


Distributed Checkpointing

• Capture the global state of a distributed system
• Chandy and Lamport: Distributed snapshot
  • Reflects a consistent, global state
  • If process P has received a message from Q
    • Then global state should show that process Q sent a message to process P


Distributed Checkpointing

• Global state is represented by a cut
• Consistent cuts:
  • Messages shown received are shown sent
  • Messages shown sent are either
    • received or
    • in transit
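The consistency rule can be checked mechanically. The following is a minimal sketch, assuming each process numbers its local events from 0 and a cut is given as the number of events included per process; the `Message` record and both function names are illustrative, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_index: int   # position of the send event in the sender's history
    receiver: str
    recv_index: int   # position of the receive event in the receiver's history


def is_consistent(cut: dict[str, int], messages: list[Message]) -> bool:
    """cut[p] is the number of events of process p that lie inside the cut."""
    for m in messages:
        received = m.recv_index < cut[m.receiver]
        sent = m.send_index < cut[m.sender]
        if received and not sent:
            return False  # shown received but not shown sent
    return True


def in_transit(cut: dict[str, int], messages: list[Message]) -> list[Message]:
    """Messages shown sent but not yet shown received."""
    return [
        m for m in messages
        if m.send_index < cut[m.sender] and not m.recv_index < cut[m.receiver]
    ]
```

A cut that includes a receive event but not the matching send is rejected; a cut that includes only the send reports the message as in transit.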


Distributed Checkpointing


Distributed Checkpointing

• Represent distributed system as a system of processes connected by unidirectional point-to-point communication channels


Distributed Checkpointing

• Distributed snapshot
  • Anybody can start snapshot
  • Initiating process P records its own state
  • Process P sends a marker along all of its outgoing channels
  • Process Q upon receiving first marker
    • Records its state
    • Sends a marker to all of its neighbors
    • Starts recording all incoming channels
  • Process Q upon receiving subsequent markers
    • Stops recording on channel on which the marker arrived


Distributed Checkpointing

• Process Q upon receiving last marker
  • Sends
    • own state
    • messages on channels monitored
  • to the initiating process
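The marker rules can be traced in a small simulation. This is only a sketch under stated assumptions: channels are FIFO queues, the application state is an integer balance, and the `Process` and `Network` classes are hypothetical helpers. The channel on which the first marker arrives is recorded as empty, since recording on it starts and stops at the same moment.

```python
from collections import deque

MARKER = "MARKER"  # control message; every other message is an integer amount


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance      # application state
        self.recorded = None        # local state captured by the snapshot
        self.recording = {}         # incoming channel -> messages seen so far
        self.channel_state = {}     # incoming channel -> final recorded messages


class Network:
    def __init__(self):
        self.procs = {}
        self.channels = {}          # (src, dst) -> FIFO queue

    def add(self, name, balance):
        self.procs[name] = Process(name, balance)

    def connect(self, src, dst):
        self.channels[(src, dst)] = deque()

    def transfer(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        self._record(self.procs[name])

    def _record(self, p):
        p.recorded = p.balance
        for (src, dst), queue in self.channels.items():
            if dst == p.name:
                p.recording[src] = []      # start recording this incoming channel
            if src == p.name:
                queue.append(MARKER)       # marker on every outgoing channel

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded is None:
                self._record(p)            # first marker: record own state
            p.channel_state[src] = p.recording.pop(src)  # stop recording here
        else:
            p.balance += msg
            if src in p.recording:
                p.recording[src].append(msg)


net = Network()
net.add("P", 100)
net.add("Q", 50)
net.connect("P", "Q")
net.connect("Q", "P")
net.transfer("Q", "P", 10)   # still in transit when the snapshot starts
net.start_snapshot("P")
net.deliver("Q", "P")        # P receives the transfer after recording its state
net.deliver("P", "Q")        # Q sees its first marker
net.deliver("Q", "P")        # P sees the marker from Q and stops recording
```

In this run the recorded process states plus the recorded channel contents add up to the original total of 150, which is what a consistent snapshot should preserve.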


Distributed Checkpointing


Distributed Checkpointing

• Termination Detection:
  • Use snapshot protocol
  • If Q receives a marker for the first time
    • Sending process becomes its predecessor
  • If Q is done with the snapshot, sends a DONE message to predecessor
  • This still allows for messages in transit


Distributed Checkpointing

• Termination detection:
  • Need snapshot where all channels are empty
  • Q returns DONE only if
    • All of Q’s successors have returned a DONE message
    • Q has not received any message between the point it recorded its state and the point it had received the marker along each of its incoming channels
  • In all other cases, Q sends a CONTINUE message


Distributed Checkpointing

• Termination detection
  • When initiating process receives only DONE messages
    • No regular messages are in transit
    • Thus, computation is terminated
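The DONE / CONTINUE rule reduces to two small decisions. The sketch below assumes each process already knows its successors' replies and whether any regular message arrived while it was recording; the function names are illustrative.

```python
DONE, CONTINUE = "DONE", "CONTINUE"


def reply_to_predecessor(successor_replies, received_while_recording):
    """Reply that process Q sends to its predecessor."""
    if received_while_recording:
        return CONTINUE  # a regular message was still in transit
    if all(r == DONE for r in successor_replies):
        return DONE      # also covers a process without successors
    return CONTINUE


def computation_terminated(replies_at_initiator):
    """The initiator concludes termination only if every reply is DONE."""
    return all(r == DONE for r in replies_at_initiator)
```

If the initiator sees any CONTINUE, it would have to start another snapshot later; that retry loop is left out here.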


Failure Types

• Dependability consists of
  • Availability
    • System is ready to be used
  • Reliability
    • System can run continually without failure
  • Safety
    • In a failure condition, nothing catastrophic happens
  • Maintainability
    • How easily a failed system can be repaired


Failure Types

• Dependability:
  • System that breaks down for a millisecond every hour
    • Availability > 99.9999 %
    • Reliability is low
  • System breaks down only for two weeks every July
    • Availability ~ 96%
    • Reliability is high
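The two availability figures follow from the fraction of time the system is up. A minimal sketch of the arithmetic, assuming a 365-day year; the helper name is illustrative.

```python
def availability(downtime_seconds: float, period_seconds: float) -> float:
    """Fraction of the period during which the system is usable."""
    return 1.0 - downtime_seconds / period_seconds


HOUR = 3600.0
YEAR = 365 * 24 * HOUR

# One millisecond of downtime every hour.
a1 = availability(0.001, HOUR)
# Two weeks of downtime once a year.
a2 = availability(14 * 24 * HOUR, YEAR)

print(f"millisecond per hour: {a1:.7%}")
print(f"two weeks per year:   {a2:.2%}")
```

The first system fails about once an hour, so it rarely runs long without interruption even though it is almost always available. The second runs for roughly fifty weeks without failing but is unavailable for about 4% of the year.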


Failure Types

• Failure: system cannot meet its promises
• Error: part of the system state that may lead to a failure
• Fault: cause of an error


Failure Types

• Transient faults
  • Occur once and then disappear
  • If the operation is repeated, fault goes away
  • Example:
    • Bird flies through the beam of a microwave transmitter
    • and possibly gets roasted


Failure Types

• Intermittent fault
  • Fault occurs
  • Goes away
  • Fault returns


Failure Types

• Permanent fault
  • Fault appears
  • Continues to exist until the faulty component is repaired


Failure Types

• Crash failure
  • Server halts, but it is working correctly until it halts
• Omission failure
  • A server fails to respond to incoming messages
  • Receive omission
    • Server fails to receive incoming messages
  • Send omission
    • Server fails to send messages


Failure Types

• Timing failure
  • A server’s response lies outside the specified time interval
• Response failure
  • A server’s response is incorrect
  • Value failure
    • The value of the response is wrong
  • State transition failure
    • The server deviates from the correct flow of control
• Arbitrary / Byzantine failure
  • A server may produce arbitrary responses at arbitrary times


Failure Types

• Fail-stop failure
  • Fail-stop server stops producing output
  • Others can detect this state
• Fail-silent failure
  • Fail-silent server stops producing output
  • Others cannot distinguish this from a server that is slow
• Fail-safe failure:
  • Server acts arbitrarily
  • But other servers can recognize its output as false


Failure Masking

• Failure masking by redundancy
  • Erasure correcting codes
  • Replication


Failure Masking

Triple Modular Redundancy
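Triple modular redundancy masks a single faulty component by running three replicas and voting on their outputs. The voter below is a minimal sketch; the function name and the choice to raise an error when no two outputs agree are assumptions.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the output that at least two of the three replicas agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # More than one replica is faulty: the failure cannot be masked.
        raise ValueError("no majority among replica outputs")
    return value
```

With one faulty replica the wrong output is outvoted, for example `tmr_vote(1, 1, 0)` yields `1`.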


Process Resilience

• Organize processes into groups
• Groups can be
  • dynamic
  • run membership protocols
  • hierarchical


Process Resilience


Process Resilience

• Leader election
  • Bully algorithm
    • Process with highest ID wins
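The outcome of the bully algorithm can be sketched without modelling real messages. The assumptions here: process IDs are integers, `alive` is the set of processes that still respond, and each recursive call stands in for a higher process answering an ELECTION message and taking over the election.

```python
def bully_election(initiator: int, alive: set[int]) -> int:
    """Return the ID of the process that ends up as coordinator."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        # Nobody with a higher ID answered: the initiator wins.
        return initiator
    # Every higher live process answers and starts its own election.
    # (Recursive for clarity; only suitable for small groups.)
    return max(bully_election(p, alive) for p in higher)


# Process 7 was the leader and has crashed; process 2 notices first.
print(bully_election(2, {1, 2, 4, 5}))
```

Whichever live process starts the election, the live process with the highest ID ends up as coordinator.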


Process Resilience

• Leader Election using a ring


Process Resilience

• Agreement in Faulty Systems


Process Resilience

• Byzantine generals problem
  • In the presence of Byzantine failure
  • Can only decide on a single value if more than 2/3 of the participants are not faulty


Process Resilience

• Byzantine generals problem:
  • Lamport algorithm
    • Each process has to share a value with all others
    • But processes can lie and can misrepresent their value
    • Goal: All processes accept values from the non-faulty processes


Process Resilience

• Lamport algorithm (1982)
  • Each process sends its value to all other processes
  • Values are gathered into vectors
  • Each process sends these vectors to everybody else
  • Every process accepts values with a majority
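The vector exchange can be simulated for a handful of processes. This is a sketch under assumptions, not the published algorithm in full: processes are numbered from 0, and the `lie` helper is a hypothetical stand-in for Byzantine behaviour that sends a different bogus value to every receiver.

```python
from collections import Counter

UNKNOWN = None


def lie(sender, receiver, slot):
    # Hypothetical arbitrary behaviour: a distinct bogus value per receiver and slot.
    return f"lie:{sender}->{receiver}#{slot}"


def exchange(values, faulty):
    """Return the vector each non-faulty process accepts after the exchange."""
    n = len(values)
    # Each process r gathers one value per process into a vector.
    vectors = {
        r: [values[s] if s == r or s not in faulty else lie(s, r, s)
            for s in range(n)]
        for r in range(n)
    }
    results = {}
    for r in range(n):
        if r in faulty:
            continue
        # Process r receives the vector of every other process.
        received = [
            [lie(s, r, k) for k in range(n)] if s in faulty else vectors[s]
            for s in range(n) if s != r
        ]
        decided = []
        for k in range(n):
            # Accept slot k only if a strict majority of received vectors agree.
            value, count = Counter(vec[k] for vec in received).most_common(1)[0]
            decided.append(value if count > len(received) / 2 else UNKNOWN)
        results[r] = decided
    return results
```

With four processes and one liar, every non-faulty process accepts the same vector, with the liar's slot left unknown. With three processes and one liar no slot reaches a majority, which matches the requirement that more than two thirds of the participants be non-faulty.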


Process Resilience


Process Resilience


Reliable Group Communication

• Problem:
  • How to get messages to the members of a process group
• Reliable multicasting
  • Without process failures:
    • Problem assumes that there is a join and leave protocol for processes
    • Often: members receive messages in exactly the same order


Reliable Group Communication

Simple solution if all receivers are known and assumed to not fail


Reliable Group Communication

• Tradeoffs:
  • Explicit retransmission requests or retransmissions when acks are missing
  • Use multicast or point-to-point transmission for retransmissions
  • Use piggy-backing in order to save network bandwidth


Reliable Group Communication

• Scalability in Reliable Multicasting
  • Simple scheme cannot support large numbers of receivers
  • Optimization:
    • Get rid of acks
      • Only send retransmission requests
      • Difficult to get messages out of history buffer
    • Use cumulative acks


Reliable Group Communication

• Scalability in reliable multicasting
  • Feedback suppression
    • Implemented in Scalable Reliable Multicasting (SRM) by Floyd (97)
    • Never ack receipt of messages
    • Whenever a process sends a retransmission request (NACK), it multicasts to everyone
    • Servers that receive this multicast suppress their own NACK message
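Feedback suppression depends on receivers not sending their NACKs at the same moment. The sketch below assumes each receiver that missed a message has scheduled its NACK at some time (for example after a random delay) and that a multicast NACK reaches everyone after one fixed propagation delay; both the timer model and the function name are assumptions made for illustration.

```python
def nacks_actually_sent(timers: dict[str, float], propagation_delay: float) -> set[str]:
    """timers maps each receiver to the time at which it plans to multicast its NACK."""
    first = min(timers.values())
    # A receiver is suppressed only if the earliest NACK reaches it
    # before its own timer fires.
    return {p for p, t in timers.items() if t < first + propagation_delay}


timers = {"a": 0.10, "b": 0.12, "c": 0.50, "d": 0.90}
print(nacks_actually_sent(timers, propagation_delay=0.05))
```

Receivers whose timers fire close together still send duplicate NACKs, so the spread of the timers relative to the network delay determines how well suppression works.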


Reliable Group Communication


Reliable Group Communication

• Feedback suppression scales reasonably well
• Problems:
  • Receivers need to schedule feedback messages accurately
    • Otherwise, too many will send out their NACK anyway
  • Feedback still interrupts processes that received the message
    • Could form a separate multicast process for those that have not received
    • But that is difficult to do over a wide area network


Reliable Group Communication

• Hierarchical Feedback Control


Reliable Group Communication

• Atomic multicast (in the presence of failures)
  • Make a distinction between receiving and delivering a message


Reliable Group Communication

• Each message is associated with a group view
  • The processes on the delivery list
• Changes in group membership
  • Announced by a group view change message
• Problem:
  • Message based on old group view needs to be delivered before the group view change message is delivered


Reliable Group Communication

• Virtual Synchrony
  • Reliable multicast where a multicast message to a group view G is delivered to all non-faulty processes in G


Reliable Group Communication


Reliable Group Communication

• Gives several possibilities for ordering
  • Unordered multicasts
  • FIFO-ordered multicasts
  • Causally-ordered multicasts
  • Totally-ordered multicasts
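FIFO ordering is the simplest of these to sketch, and it also shows the difference between receiving and delivering a message. The assumptions: every sender numbers its multicasts from 0, and the receiver holds back anything that arrives early. The class name and fields are illustrative.

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number to deliver
        self.held = {}        # (sender, seq) -> payload received but not delivered
        self.delivered = []   # payloads handed to the application, in order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of an already delivered message
        self.held[(sender, seq)] = payload
        # Deliver as long as the next expected message is available.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected
```

A message that arrives out of order is received but stays in the hold-back buffer until the missing predecessor from the same sender shows up. Messages from different senders are not ordered relative to each other, which is what separates FIFO from causal and total ordering.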


Reliable Group Communication

• Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called
  • Atomic multicasting


Reliable Group Communication

• ISIS: Implementing atomic multicast
  • Builds on TCP as a reliable point-to-point communication
  • Assumes that messages sent out by a sender arrive in that order (TCP property)
  • Multicasting message with group view
    • Same as sending individual messages to all members in the group


Reliable Group Communication

• Processes keep a message m until they know that every other process has received m
  • In that case m is stable
  • ONLY STABLE MESSAGES ARE DELIVERED
  • This is also true for view-change messages
• Forwarding of messages guarantees that a message delivered to one non-faulty process is received by everyone in the group
  • Can require any process to send message to all members of the group


Reliable Group Communication


Reliable Group Communication

• Processing a group change
  • Process receives group change message
    • Forwards any unstable message for the old group to all processes in the new group and marks them as stable
    • ISIS / TCP assumes that these messages are never lost
    • All messages to the old group received by one process are therefore guaranteed to be received by all non-faulty processes in the old group


Reliable Group Communication

• When process P no longer has unstable messages:
  • Multicasts a flush message to the new group
  • When P receives flush messages from all members of the new group, it installs the new view


Reliable Group Communication


Reliable Group Communication

• When process Q receives message sent to the old group
  • If Q still believes itself to be in the old group:
    • Delivers message (unless it has already received it and considers it a duplicate)
  • If Q has received view change message
    • Forwards any unstable message
    • Then sends flush message to the new group


Reliable Group Communication

• Need more protocol in order to deal with failure during a view change

• Details in Birman’s book or the papers on ISIS


Checkpointing

• Recovery
  • Forward recovery
    • Bring system to a new, failure-free state
  • Backward recovery
    • Bring system back to an old, failure-free state and start over


Checkpointing

• Distributed snapshot to establish recovery line


Checkpointing

• Domino effect


Checkpointing

• Need to do coordinated checkpointing instead of individual checkpointing
• Simpler solution:
  • Two-phase blocking protocol
    • Coordinator broadcasts a CHECKPOINT_REQ
    • Processes receiving CHECKPOINT_REQ
      • create local checkpoint
      • queue messages from the application
      • block until they receive CHECKPOINT_DONE
    • Coordinator sends CHECKPOINT_DONE after receiving acks from everyone
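The two phases can be sketched with direct method calls standing in for CHECKPOINT_REQ, the acks, and CHECKPOINT_DONE. The `Worker` class, its fields, and the two phase functions are hypothetical, and the sketch assumes no process fails during the protocol.

```python
class Worker:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.checkpoint = None   # last local checkpoint
        self.blocked = False
        self.queued = []         # application messages held back while blocked
        self.sent = []           # application messages actually sent

    def on_checkpoint_req(self):
        self.checkpoint = self.state   # take the local checkpoint
        self.blocked = True
        return "ACK"

    def app_send(self, msg):
        # While blocked, outgoing application messages are queued, not sent.
        (self.queued if self.blocked else self.sent).append(msg)

    def on_checkpoint_done(self):
        self.blocked = False
        self.sent.extend(self.queued)
        self.queued.clear()


def phase_one(workers):
    """Coordinator broadcasts CHECKPOINT_REQ and collects the acks."""
    return all(w.on_checkpoint_req() == "ACK" for w in workers)


def phase_two(workers):
    """Coordinator broadcasts CHECKPOINT_DONE once everyone has acked."""
    for w in workers:
        w.on_checkpoint_done()
```

Because no application message leaves a process between its local checkpoint and CHECKPOINT_DONE, no checkpoint can show a message as received that another checkpoint does not show as sent, so the set of checkpoints forms a consistent cut.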


Checkpointing

• Techniques used to reduce checkpoints
  • Message logging
    • Can lead to orphans


Checkpointing

• Pessimistic logging protocols
  • Ensure that for each non-stable message there is at most one process depending on it
• Optimistic logging protocols
  • Any orphan process depending on some message is rolled back until it no longer depends on the message