CS 603 Review April 24, 2002
CS 603 Review April 24, 2002. Seminar Announcements Saurabh Bagchi, “Hierarchical Error Detection in a Distributed Software Implemented Fault Tolerance.

Dec 22, 2015

CS 603 Review

April 24, 2002

Seminar Announcements

• Saurabh Bagchi, “Hierarchical Error Detection in a Distributed Software Implemented Fault Tolerance (SIFT) Environment”
  – April 25, 10:30-11:30, MSEE 239
• Fabian E. Bustamante, “The Active Streams Approach to Adaptive Distributed Systems”
  – April 29, 10:30-11:30, CS 101

Review

• Why do we want distributed systems?
  – Scaling
  – Heterogeneity
  – Geographic Distribution
• What is a distributed system?
  – Transparency vs. Exposing Distribution
• Hardware Basics
  – Communication Mechanisms

Basic Software Concepts

• Hiding vs. Exposing
  – Distribution: Distributed OS
  – Location, but not distribution: Middleware
  – None: Network OS
• Concurrency Primitives
  – Semaphores
  – Monitors
• Distributed System Models
  – Client-Server
  – Multi-Tier
  – Peer to Peer

Communication Mechanisms

• Shared Memory
  – Enforcement of single-system view
  – Delayed consistency: δ-Common Storage
• Message Passing
  – Reliability and its limits

• Stream-oriented Communications

• Remote Procedure Call

• Remote Method Invocation

RPC Mechanisms

• DCE
  – Language / Platform Independent
  – Implementation Issues:
    • Data Conversion
    • Underlying Mechanisms
  – Fault Tolerance Approaches
• Java RMI
• SOAP
  – Interoperable
  – Language independent
  – Transport independent (anything that moves XML)

Naming Requirements

• Disambiguate only

• Access resource given the name

• Build a name to find a resource

• Do humans need to use the name?

• Static/Dynamic Resource

• Performance Requirements

Registry Example: X.500

• Goal: Global “white pages”
  – Look up anyone, anywhere
  – Developed by Telecommunications Industry
  – ISO standard directory for OSI networks
• Idea: Distributed Directory
  – Application uses Directory User Agent to access a Directory Access Point
• Basis for LDAP, ActiveDirectory

Directory Information Base (X.501)

• Tree structure
  – Root is entire directory
  – Levels are “groups”
    • Country
    • Organization
    • Individual
• Entry structure
  – Unique name
    • Built from tree
  – Attributes: Type/value pairs
  – Schema enforces type rules
• Alias entries
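A minimal sketch of how a unique name can be built from the tree: join the type/value pair chosen at each level on the path from the root down to the entry. The function name and the country and organization values are illustrative assumptions, not taken from the slides.

```python
def distinguished_name(path):
    """Build a unique name from (attribute type, value) pairs.

    `path` lists one pair per tree level, from the root down to the entry.
    """
    return ", ".join(f"{attr}={value}" for attr, value in path)


# Hypothetical path: country -> organization -> individual.
path = [("C", "US"), ("O", "Purdue University"), ("CN", "Chris Clifton")]
print(distinguished_name(path))
```

Two entries can share a common name (CN) and still be unique, because the levels above them differ.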

X.500

• Directory Entry:
  – Organization level: CN=Purdue University, L=West Lafayette
  – Person level: CN=Chris Clifton, SN=Clifton, TITLE=Associate Professor
• Directory Operations
  – Query, Modify
• Authorization / Access control
  – To directory
  – Directory as mechanism to implement for others

X.500 – Distributed Directory

• Directory System Agent
• Referrals
• Replication
  – Cache vs. Shadow copy
  – Access control
  – Modifications at Master only
  – Consistency
    • Each entry must be internally consistent
    • DSA giving copy must identify as copy

Clock Synchronization

• Definition: All nodes agree on time
  – What do we mean by time?
  – What do we mean by agree?
• Lamport Definition: Events
  – Events partially ordered
  – Clock “counts” the order

Event-based definition (Lamport ’78)

Define partial order of events
• A → B: A “happened before” B: Smallest relation such that:
  1. If A and B are in the same process and A occurs first, A → B
  2. If A is the sending of a message and B is the receipt of that message, A → B
  3. If A → B and B → C, then A → C
• Clock: C(x) is the time at which x occurs:
  – C(x) = Ci(x) where x runs on node i
  – Clocks correct if ∀ a,b: a → b ⇒ C(a) < C(b)

Lamport Clock Implementation

• Node i increments Ci between any two successive events
• If event a is the sending of a message m from i to j:
  – m contains timestamp Tm = Ci(a)
  – Upon receiving m, set Cj ≥ current Cj and > Tm
• Can now define total ordering. a ⇒ b iff:
  – Ci(a) < Cj(b), or
  – Ci(a) = Cj(b) and Pi < Pj
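The rules above can be sketched as a small class. This is a minimal illustration, not a reference implementation: the class and method names are assumptions, and treating a receipt as an event that bumps the clock by one is one common way to satisfy "≥ current Cj and > Tm".

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock for one node (hypothetical helper, not from the slides)."""
    node_id: int
    time: int = 0

    def tick(self):
        # Increment between any two successive local events.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the message carries timestamp Tm = Ci(a).
        return self.tick()

    def receive(self, tm):
        # Move to a value >= the current clock and > Tm.
        self.time = max(self.time, tm) + 1
        return self.time

    def stamp(self):
        # (clock, node id): tuple comparison gives the total order,
        # with the node id breaking ties between equal clock values.
        return (self.time, self.node_id)


p1, p2 = LamportClock(1), LamportClock(2)
a = p1.send()        # event a on node 1
p2.tick()            # unrelated local event on node 2
b = p2.receive(a)    # event b: receipt of the message sent at a
print(a, b)
```

Because a → b here, the clock condition requires C(a) < C(b), which the `receive` rule enforces regardless of how far node 2's clock had advanced on its own.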

What if we want “wall clock” time?

• Ci must run at correct rate: ∃ κ << 1 such that | dCi(t)/dt – 1 | < κ
• Synchronized: ∃ small ε such that ∀ i,j: | Ci(t) – Cj(t) | < ε
• Assume transmission time between μ and μ+ξ
• Algorithm: Upon receiving message m, set Cj(t) = max(Cj(t), Tm+μ)
• Theorem: Assume every τ seconds a message with unpredictable delay ξ is sent over every arc. Then ∀ t ≥ t0 + τd, ε ≈ d(2κτ + ξ), where d is the diameter of the network
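A short worked sketch of the receive rule and the bound in the theorem. The function names and the sample numbers are assumptions chosen only for illustration.

```python
def on_receive(cj, tm, mu):
    """Receiver rule: Cj = max(Cj, Tm + mu), with mu the minimum transmission time."""
    return max(cj, tm + mu)


def skew_bound(d, kappa, tau, xi):
    """Approximate synchronization bound: eps ~= d * (2 * kappa * tau + xi)."""
    return d * (2 * kappa * tau + xi)


# Hypothetical numbers: diameter 3, drift 1e-6, a message every 60 s,
# and 1 ms of unpredictable delay.
print(on_receive(cj=10.000, tm=10.004, mu=0.002))
print(skew_bound(d=3, kappa=1e-6, tau=60.0, xi=0.001))
```

With these numbers the delay uncertainty ξ dominates the drift term 2κτ, which is why the next slide treats delay uncertainty as the limiting factor.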

Clock Synchronization: Limits

• Best Possible: Delay Uncertainty
  – Actually ε(1 – 1/n)
• Synchronization with Faults
  – Faulty clock
  – Communication Failure
  – Malicious processor
• Worst case: Can only synchronize if < 1/3 of processors faulty
  – Better if clocks can be authenticated

Process Synchronization

• Problem: Shared Resources
  – Model as sequential or parallel process
  – Assumes global state!
• Alternative: Mutual Exclusion when Needed
  – Coordinator approach
  – Token Passing
  – Timestamp

Mutual Exclusion

• Requirements
  – Does it guarantee mutual exclusion?
  – Does it prevent starvation?
  – Is it fair?
  – Does it scale?
  – Does it handle failures?

Mutual Exclusion: Colored Ticket Algorithm

• Goals:
  – Decentralized
  – Fair
  – Fault tolerant
  – Space Efficient
• Idea: Numbered Tickets
  – Next number gets resource
  – Problem: Unbounded Space
  – Solution: Reissue blocks

Multi-Resource Mutual Exclusion

• New Problem: Deadlock
  – Processes using all resources
  – Each needs additional resource to proceed
• Dining Philosophers Problem
  – Coordinated vs. truly distributed solutions
    • Problems with deterministic solutions
    • Probabilistic solution: Lehmann & Rabin
      – Starvation / fairness properties

Distributed Transactions

• ACID properties
• Issues:
  – Commit Protocols
  – Fault Tolerance
Why is this enough?
• Failure Models and Limitations
• Mechanisms:
  – Two-phase commit
  – Three-phase commit

Two-Phase Commit (Lamport ’76, Gray ’79)

• Central coordinator initiates protocol
  – Phase 1:
    • Coordinator asks if participants can commit
    • Participants respond yes/no
  – Phase 2:
    • If all votes yes, coordinator sends Commit
    • Participants respond when done
• Blocks on failure
  – Participants must replace coordinator
  – If participant and coordinator fail, wait for recovery
• While blocked, transaction must remain Isolated
  – Prevents other transactions from completing
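The two phases can be sketched in a few lines. This is a failure-free, in-memory illustration only: the `Participant` class and its fields are assumptions, and a real implementation would write each state change to stable storage and handle timeouts, which is exactly where the blocking problem arises.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant (no logging, no failures)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # Phase 1: vote. After voting YES the participant cannot abort on its own.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision and acknowledge.
        self.state = decision
        return True


def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(v is Vote.YES for v in votes) else "aborted"
    # Phase 2: broadcast the decision and wait for acknowledgements.
    for p in participants:
        p.finish(decision)
    return decision


print(two_phase_commit([Participant("a"), Participant("b")]))
print(two_phase_commit([Participant("a"), Participant("b", can_commit=False)]))
```

A participant in the "prepared" state is the one that blocks: it has promised to commit but cannot decide until it hears from the coordinator.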

Transaction Model

• Transaction Model
  – Global Transaction State
  – Reachable State Graph
    • Local states potentially concurrent if a reachable global state contains both local states
  – Concurrency set C(s) is all states potentially concurrent with s
    • Sender set S(s) = {local states t | t sends m and s can receive m}
• Failure Model
  – Site failure assumed when expected message not received in time
  – Independent Recovery

Problems with 2-PC

• Blocking on failure
  – 3-PC as solution
• Theorems on recovery limits
  – Independent recovery: No two-site failure
  – Non-independent recovery
    • Anything short of total failure okay
    • Recovery protocol for total failure

Data Replication

• Fault Tolerance
  – Hot backup
  – Catastrophic failure
• Performance
  – Parallelism
  – Decreased reliance on network
• Correctness criterion: Replication invisible
  – One-copy serializability (1SR)

Data Replication: How?

• Goal: Ensure one-copy serializability
• Write-all solution: All copies identical
  – Write goes to every site
  – Read from any site
  – Standard single-copy concurrency control
  – Guarantees 1SR
    • Single-copy concurrency control gives serializable execution
    • Equivalent to serial execution where all writes happen in one transaction

Problem: Site Failure

• Failure causes write to block
  – Must maintain locks
  – Clogs up entire system
Is this fault tolerance?
• What about “write all available”?
  – T0: w0[xA] w0[xB] w0[yC] c0
  – B fails
  – T1: r1[yC] w1[xA] c1
  – B recovers
  – T2: r2[xB] w2[yC] c2
• What is the serial equivalent order?
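One way to explore the question is to enumerate the serial one-copy orders and compare their reads-from relation with the replicated history. This is a sketch under stated assumptions: it uses reads-from equivalence as the test, places T0 first because both later transactions read values T0 wrote, and the dictionary encoding is purely illustrative.

```python
from itertools import permutations

# Logical (one-copy) view of each transaction: (operation, item) pairs.
TXNS = {
    "T1": [("r", "y"), ("w", "x")],
    "T2": [("r", "x"), ("w", "y")],
}

# Reads-from in the replicated history: T1 reads yC, written by T0.
# T2 reads xB, which missed T1's write while B was down, so it also reads from T0.
ACTUAL = {("T1", "y"): "T0", ("T2", "x"): "T0"}


def reads_from(order):
    """Reads-from relation of a serial one-copy execution that starts with T0."""
    last_writer = {"x": "T0", "y": "T0"}
    result = {}
    for txn in order:
        for op, item in TXNS[txn]:
            if op == "r":
                result[(txn, item)] = last_writer[item]
            else:
                last_writer[item] = txn
    return result


equivalent = [order for order in permutations(TXNS) if reads_from(order) == ACTUAL]
print(equivalent)  # prints []
```

Neither T0 T1 T2 nor T0 T2 T1 reproduces what the transactions actually read, so under this test the history has no equivalent serial one-copy order.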

Write All Available Fails Even if no recovery!

Solutions

• Validate availability on commit
  – Check if any failed writes now available
  – Check that all sites read or written still available
  – Enforces serializability for site failures

Doesn’t work with communication failures!

Formalisms for Relaxed consistency

• Goal: Relaxed consistency constraints
  – Meet application needs
  – Outperform true transparent replication
• How do we ensure constraints meet needs?
  – Formalisms to describe application needs
  – Methods to prove constraints adequate

Quasi-Copies (Alonso, Barbará, Garcia-Molina ’90)

• Data Caching
  – Each site keeps copy of data likely to be used locally
  – Propagation cost of writes high
• User-Defined Cache
• Controlled Divergence
  – Weak consistency constraints
  – Bounds on the differences between copies
  – User defines constraints

Assumptions

• Read-only copies
  – Updates sent to master copy
  – E.g., ORACLE Materialized View
• User Specified Coherency
  – Strict limits
  – “Hints”
• Example: Stock Purchase
  – Place order based on delayed price
  – Limit order to ensure price paid okay

Selection Conditions

• Identification clause
  – Select/Project Query
• Modifier Clause
  – Add / drop from cache
  – Compulsory or advisory cache
  – Static / Dynamic: As new objects meet the identification clause, are they cached?
    • Triggering delay on dynamic

Coherency Conditions

• Default (always enforced): Value was true once
• Delay W(x,α): Max time lag
• Version V(x): Number of updates
• Periodic P(x): Time for refresh
• Arithmetic A(x): Bounded Difference
• Combine conditions with logical operators
• Multi-object conditions
  – Consistency conditions on a group
  – Order of application in a group
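Three of the conditions above can be sketched as simple predicates over a cached copy. The `Copy` fields, the function names, and the bound parameters `alpha`, `beta`, and `eps` are assumptions made for illustration; the delay check is conservative because it measures time since the last refresh rather than time since the master actually changed.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    value: float    # cached value
    version: int    # master version that the cache reflects
    as_of: float    # time at which the cached value was last refreshed


def delay_ok(copy, now, alpha):
    # Delay condition: the copy lags the master by at most alpha time units.
    return now - copy.as_of <= alpha


def version_ok(copy, master_version, beta):
    # Version condition: at most beta updates have been missed.
    return master_version - copy.version <= beta


def arithmetic_ok(copy, master_value, eps):
    # Arithmetic condition: cached and master values differ by at most eps.
    return abs(master_value - copy.value) <= eps


quote = Copy(value=100.0, version=7, as_of=50.0)
# Conditions combine with ordinary logical operators.
usable = delay_ok(quote, now=55.0, alpha=10.0) and arithmetic_ok(quote, 101.5, eps=2.0)
print(usable)
```

In the stock purchase example from the earlier slide, a delayed quote is acceptable as long as a combined condition like `usable` holds.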

CS 603 Review

April 26, 2002

Remote Operation Mechanisms

• Client-Server Model
• Remote Procedure Call
Problem: Remote Site must already know what we want to do!
• Process consists of:
  – Code
  – Resources (files, devices, etc.)
  – Execution (data, stack, registers, etc.)
• Fork copies everything
  – Is this needed?
• Solution: Copy part of the process

So where are we?

• Models for Remote Processing
  – Server: Request documented service
  – RPC: Request execution of existing procedure

• What if operation we want isn’t offered remotely?

• Solution: Agents / Code Migration

Types of Code Migration

From Andrew Tanenbaum, Distributed Operating Systems, 1995.

Resource Binding

Rows: Process to Resource Binding. Columns: Resource to Machine Binding.

              Unattached        Fastened            Fixed
Identifier    Move              Global Reference    Global Reference
Value         Copy Value        Global Reference    Global Reference
Type          Rebind Locally    Rebind Locally      Rebind Locally

DCOM – What is it?

• Start with COM: Component Object Model
  – Language-independent object interface

• Add interprocess communication

DCOM: Distributed COM

• Looks like COM to the client
• Built on DCE RPC
  – Extends to support full COM functionality

DCOM Architecture

Locating Objects: Activation

• CoCreateInstance(Ex)(<CLSID>)
  – Interface pointer to uninitialized instance
  – Same as COM
• CoGetInstanceFromFile, FromStorage
  – Create new instance
• CoGetClassObject(<CLSID>)
  – Factory object that creates objects of <CLSID>
  – CoGetClassObjectFromURL
    • Downloads necessary code from URL and instantiates
• Can take server name as parameter
  – Or default to server specified in DCOM configuration on client machine
    [HKEY_CLASSES_ROOT\APPID\{<appid-guid>}] "RemoteServerName"="<DNS name>"
• Also store information in ActiveDirectory

DCOM vs. CORBA

CORBA
• Single interface name
• Multiple inheritance
• Dynamic Invocation Interface
• C++-style Exception Handling
• Explicit and Implicit reference counts
• Implemented by ORB with replaceable services

DCOM
• Distinction between Class and Instance Identifier
• Implement multiple interfaces
• Type libraries for on-demand marshaling
• 32 Bit Error Code
• Explicit reference count only
• Implemented by many independent services

What is .NET?

• Language for distributed computation
  – C#, VB.NET, JScript
• Protocols
  – SOAP, HTTP
• Run-time environment
  – Common Language Runtime (CLR)
  – ActiveDirectory
  – Web Servers (ASP.NET)

COM/DCOM .NET

DCOM
• IDL
• Name, Monikers
• Registry / ActiveDirectory
• C++, Visual Basic
• DCE RPC
• DCOM Network protocol (based on DCE standards)

.NET
• Web Services Description Language (WSDL)
• DISCO (URI grammar)
• Universal Description Discovery and Integration (UDDI)
• C#, VB.NET
• SOAP
• HTTP (presumed ubiquitous), SMTP (!?)

How .NET works

• Query UDDI directory to get service location

• Query service to get WSDL (interface specification)

• Build call (XML) based on WSDL spec.

• Make call using SOAP
• Parse XML results based on WSDL spec.
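The "build call" and "parse results" steps can be sketched with the standard library alone. This is an illustration under assumptions: the service namespace, the `GetQuote` operation, and the `symbol` field are hypothetical, the element names would normally come from the WSDL, and the transport step (for example an HTTP POST) is left out.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:stock"                           # hypothetical service namespace


def build_call(operation, params):
    """Wrap an operation and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_field(xml_text, operation, field):
    """Pull one field out of the Body of a SOAP message, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(f"{{{SOAP_NS}}}Body/{{{SVC_NS}}}{operation}/{{{SVC_NS}}}{field}")
    return node.text if node is not None else None


message = build_call("GetQuote", {"symbol": "XYZ"})
print(message)
print(parse_field(message, "GetQuote", "symbol"))
```

Because the message is plain XML, the same envelope could be carried over any transport, which is the "transport independent" property noted on the RPC Mechanisms slide.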

Jini: Java Middleware

• Tools to construct federation
  – Multiple devices, each with Java Virtual Machine
  – Multiple services
• Uses (doesn’t replace) Java RMI
• Adds infrastructure to support distribution
  – Registration
  – Lookup
  – Security

Service

• Basic “unit” of Jini system
  – Members provide services
  – Federate to share access to services
• Services combined to accomplish tasks
• Communicate using service protocol
  – Initial set defined
  – Add more on the fly

Infrastructure: Key Components

• RMI
  – Basic communication model
• Distributed Security System
  – Integrated with RMI
  – Extends JVM security model
• Discovery/join protocol
  – How to register and advertise services
• Lookup services
  – Returns object implementing service (really a local proxy)

Programming Model

• Lookup

• Leasing
  – Extends Java reference with notion of time
• Events
  – Extends JavaBeans event model
  – Adds third-party transfer, delivery and timeliness guarantees, possibility of delay

• Transaction Interfaces

Jini Component Categories

• Infrastructure: Base features
• Programming Model: How you use them
• Services: What you build

Java / Jini Comparison

Failure Models

• Failure: System doesn’t give desired behavior
  – Component-level failure (can compensate)
  – System-level failure (incorrect result)
• Fault: Cause of failure (component-level)
  – Transient: Not repeatable
  – Intermittent: Repeats, but (apparently) independent of system operations
  – Permanent: Exists until component repaired
• Failure Model: How the system behaves when it doesn’t behave properly

Failure Model (Flaviu Cristian, 1991)

• Dependency
  – Proper operation of Database depends on proper operation of processor, disk
• Failure Classification
  – Type of response to failure
• Failure semantics
  – State of system after given class of failure
• Failure masking
  – High-level operations succeed even if they depend on failed services

Failure Classification

• Correct
  – In response to inputs, behaves in a manner consistent with the service specification
• Omission Failure
  – Doesn’t respond to input
    • Crash: After first omission failure, subsequent requests result in omission failure
• Timing failure (early, late)
  – Correct response, but outside required time window
• Response failure
  – Value: Wrong output for inputs
  – State Transition: Server ends in wrong state
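The classification can be sketched as a function a client might apply to a single request. The function name, argument names, and label strings are assumptions; state-transition failures are omitted because they cannot be seen from one reply alone.

```python
def classify(response, expected, elapsed, window):
    """Classify one request/reply pair.

    response: the reply, or None if the server never answered
    expected: the reply required by the service specification
    elapsed:  time the reply took
    window:   (earliest, latest) acceptable reply times
    """
    earliest, latest = window
    if response is None:
        return "omission"
    if response != expected:
        return "response (value)"
    # From here on the value is correct; only the timing can be wrong.
    if elapsed < earliest:
        return "timing (early)"
    if elapsed > latest:
        return "timing (late)"
    return "correct"


print(classify(None, 4, 0.0, (0.1, 1.0)))
print(classify(4, 4, 2.0, (0.1, 1.0)))
```

A crash would show up here as an unbroken run of "omission" results after the first one.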

Crash Failure types (based on recovery behavior)

• Amnesia
  – Server recovers to predefined state independent of operations before crash
• Partial amnesia
  – Some part of state is as before crash, rest to predefined state
• Pause
  – Recovers to state before omission failure
• Halting
  – Never restarts

Failure Semantics

• Specification for service must include
  – Failure-free (normal) semantics
  – Failure semantics (likely failure behaviors)
• Multiple semantics
  – Combine to give (weaker) semantics
  – Arbitrary failure semantics: Weakest possible
• Choice of failure semantics
  – Is class of failure likely?
    • Probability of type of failure
  – What is the cost of failure?
    • Catastrophic?

Failure Masking

• Hierarchical failure masking
  – Dependency: Higher level gets (at best) failure semantics of lower level
  – Can compensate for lower level failure to improve this
• Group Failure Masking
  – Redundant servers
  – Allows failure semantics of group to be higher than individuals
• k-fault tolerant
  – Group can mask k concurrent group member failures from client
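A minimal sketch of group failure masking, assuming the members fail only by crashing or not answering: the client-side stub tries members in turn, so a group of k+1 servers hides up to k such failures. The helper names are hypothetical, and members that can return wrong values would need voting rather than simple failover.

```python
class ServerDown(Exception):
    """Raised when a group member does not answer (crash or omission)."""


def make_server(name, up=True):
    # Hypothetical server: uppercases the request when it is running.
    def handle(request):
        if not up:
            raise ServerDown(name)
        return f"{name}:{request.upper()}"
    return handle


def call_group(servers, request):
    """Try members in turn; the client sees a failure only if all of them fail."""
    for server in servers:
        try:
            return server(request)
        except ServerDown:
            continue
    raise ServerDown("all group members failed")


# Three members, two of them down: the group masks k = 2 failures.
group = [make_server("s1", up=False), make_server("s2", up=False), make_server("s3")]
print(call_group(group, "ping"))
```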

Fault Tolerance

• A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which:
  1. At any configuration where P holds, T also holds (i.e., P ⇒ T)
  2. Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F)
  3. Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds
• If a program A tolerates faults from a fault class F for invariant P, we say that A is F-tolerant for P.

Forms of fault tolerance

• For each entry, determine:
  – F: Fault class handled
  – T: Set of states that can be reached

            Live          Not live
Safe        Masking       Fail safe
Not safe    Nonmasking    none

Reliable Multicast

• Classes:

• Sender-initiated: Acknowledge all packets
  – Scales poorly in normal operation
• Receiver-initiated: Request missing packets
  – Sender doesn’t need receiver list
  – Scales poorly on failure (cascading failure?)

• Tree-based, Ring-based protocols
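The receiver-initiated class can be sketched as a receiver that notices gaps in sequence numbers and asks only for what is missing. The class and its fields are assumptions for illustration; a real protocol would also need timers to detect a lost final packet and some way to suppress duplicate requests from many receivers.

```python
class Receiver:
    """Receiver-initiated reliability: deliver in order, NACK the gaps."""

    def __init__(self):
        self.next_seq = 0      # next sequence number to deliver
        self.buffer = {}       # out-of-order packets waiting for a gap to fill
        self.delivered = []

    def on_packet(self, seq, data):
        """Accept a packet and return the sequence numbers to request again."""
        if seq >= self.next_seq:
            self.buffer[seq] = data
        # Deliver everything that is now contiguous.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        if not self.buffer:
            return []
        # Anything below the highest buffered packet that we lack is missing.
        return [s for s in range(self.next_seq, max(self.buffer)) if s not in self.buffer]


r = Receiver()
print(r.on_packet(0, "a"))   # nothing missing
print(r.on_packet(2, "c"))   # packet 1 is missing
print(r.on_packet(1, "b"))   # gap filled
print(r.delivered)
```

The sender keeps no per-receiver state here, which is why it does not need a receiver list; the cost appears when a loss near the sender makes every receiver request the same packet at once.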

Disaster Recovery

• Problem: complete failure at single site
  – Must have multiple sites
  – Thus a distributed problem
• Two examples
  – Distributed Storage: Palladio
    • Think wide-area RAID
  – Distributed Transactions: Epoch algorithm

Epoch Algorithm (Garcia-Molina, Polyzois, and Hagmann 1990)

• 1-Safe backup
  – No performance penalty
• Multiple transaction streams
  – Use distribution to improve performance
• Multiple Logs
  – Avoid single bottleneck

Algorithm Overview

• Idea: Transactions that can be committed together grouped into epochs

• Primaries write marker in log
  – Must agree when safe to write marker
  – Keep track of current epoch number
  – Master broadcasts when to end epoch

• Backups commit epoch when all backups have received marker
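The backup-side rule can be sketched as follows. This is a simplified illustration under assumptions: each backup reports the highest epoch marker in its log, log entries are tagged with the epoch they fall in, and the extra conditions for transactions that were only prepared when the marker arrived are ignored. All names are hypothetical.

```python
def committable_epoch(markers_seen):
    """Highest epoch whose marker every backup has received.

    markers_seen maps backup site -> highest epoch marker in its log.
    """
    return min(markers_seen.values())


def transactions_to_install(log, epoch):
    """Transactions in one backup's log that fall in a committable epoch.

    log is a list of (transaction id, epoch number) entries in log order.
    """
    return [txn for txn, e in log if e <= epoch]


markers = {"b1": 3, "b2": 2, "b3": 4}
log_b1 = [("T1", 1), ("T2", 2), ("T3", 3)]
epoch = committable_epoch(markers)
print(epoch, transactions_to_install(log_b1, epoch))
```

Site b1 already holds the marker for epoch 3 but must hold back T3 until the slowest backup catches up, which is what keeps the backups mutually consistent.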

Correctness Criteria

• Atomicity: If any writes of a transaction appear at backup, all must appear
  – If W(Tx, d) at backup then ∀ W(Tx, d’), W(Tx, d’) exists at backup
• Consistency: If Ti → Tj at primary, then
  – Local: Tj installed at backup ⇒ Ti installed at backup
  – Mutual: If W(Ti, d) and W(Tj, d), then W(Ti, d) → W(Tj, d)
• Minimum Divergence: If Tj is at the backup and does not depend on a missing transaction, then it should be installed at the backup

Single-Mark Algorithm

• Problem: Is it locally safe to mark when broadcast received?
  – Might be in the middle of a transaction
• Solution: Share epoch at commit
  – Prepare to commit includes local epoch number
  – If received number greater than local, end epoch
• At Backup: When all sites have epoch ○n, commit transactions where
  – C(Ti) → ○n
  – P(Ti) → ○n, local site is not coordinator, and coordinator has C(Ti) → ○n

Test Basics

• Mechanics: Open book/notes
  – No electronic aids
• Two questions
  – Each multi-part
  – Will include scoring suggestions
• Underlying question: Do you understand the material?
  – No need to regurgitate “best in literature” answer
    • Reasonable self-designed solution fine
  – Key: Do you really understand your answer?
    • Can you build CORRECT distributed systems?