CS 603 Review April 24, 2002. Seminar Announcements Saurabh Bagchi, “Hierarchical Error Detection in a Distributed Software Implemented Fault Tolerance.

CS 603 Review

April 24, 2002

Seminar Announcements

• Saurabh Bagchi, “Hierarchical Error Detection in a Distributed Software Implemented Fault Tolerance (SIFT) Environment”
  – April 25, 10:30-11:30, MSEE 239

• Fabian E. Bustamante, “The Active Streams Approach to Adaptive Distributed Systems”
  – April 29, 10:30-11:30, CS 101

Review

• Why do we want distributed systems?
  – Scaling
  – Heterogeneity
  – Geographic Distribution

• What is a distributed system?
  – Transparency vs. Exposing Distribution

• Hardware Basics
  – Communication Mechanisms

Basic Software Concepts

• Hiding vs. Exposing
  – Distribution – Distributed OS
  – Location, but not distribution – Middleware
  – None – Network OS

• Concurrency Primitives
  – Semaphores
  – Monitors

• Distributed System Models
  – Client-Server
  – Multi-Tier
  – Peer to Peer

Communication Mechanisms

• Shared Memory
  – Enforcement of single-system view
  – Delayed consistency: δ-Common Storage

• Message Passing
  – Reliability and its limits

• Stream-oriented Communications

• Remote Procedure Call

• Remote Method Invocation

RPC Mechanisms

• DCE
  – Language / Platform Independent
  – Implementation Issues:
    • Data Conversion
    • Underlying Mechanisms
  – Fault Tolerance Approaches

• Java RMI

• SOAP
  – Interoperable
  – Language independent
  – Transport independent (anything that moves XML)

Naming Requirements

• Disambiguate only

• Access resource given the name

• Build a name to find a resource

• Do humans need to use name?

• Static/Dynamic Resource

• Performance Requirements

Registry Example: X.500

• Goal: Global “white pages”
  – Lookup anyone, anywhere
  – Developed by Telecommunications Industry
  – ISO standard directory for OSI networks

• Idea: Distributed Directory
  – Application uses Directory User Agent to access a Directory Access Point

• Basis for LDAP, ActiveDirectory

Directory Information Base (X.501)

• Tree structure
  – Root is entire directory
  – Levels are “groups”
    • Country
    • Organization
    • Individual

• Entry structure
  – Unique name
    • Build from tree
  – Attributes: Type/value pairs
  – Schema enforces type rules

• Alias entries

ftp://ftp.bull.com/pub/OSIdirectory/4thEditionTexts/X501%204thEditionDraftV6.pdf

X.500

• Directory Entry:
  – Organization level – CN=Purdue University, L=West Lafayette
  – Person level – CN=Chris Clifton, SN=Clifton, TITLE=Associate Professor

• Directory Operations
  – Query, Modify

• Authorization / Access control
  – To directory
  – Directory as mechanism to implement for others

X.500 – Distributed Directory

• Directory System Agent
• Referrals
• Replication
  – Cache vs. Shadow copy
  – Access control
  – Modifications at Master only
  – Consistency
    • Each entry must be internally consistent
    • DSA giving copy must identify as copy

Clock Synchronization

• Definition: All nodes agree on time
  – What do we mean by time?
  – What do we mean by agree?

• Lamport Definition: Events
  – Events partially ordered
  – Clock “counts” the order

Event-based definition (Lamport ’78)

Define partial order of events
• A → B: A “happened before” B: Smallest relation such that:
  1. If A and B are in the same process and A occurs first, A → B
  2. If A is the sending of a message and B is the receipt of that message, A → B
  3. If A → B and B → C, then A → C

• Clock: C(x) is time x occurs:
  – C(x) = Ci(x) where x running on node i.
  – Clocks correct if ∀ a,b: a → b ⇒ C(a) < C(b)

Lamport Clock Implementation

• Node i increments Ci between any two successive events

• If event a is sending of a message m from i to j,
  – m contains timestamp Tm = Ci(a)
  – Upon receiving m, set Cj ≥ current Cj and > Tm

• Can now define total ordering. a ⇒ b iff either:
  – Ci(a) < Cj(b)
  – Ci(a) = Cj(b) and Pi < Pj
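The clock rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the class name `LamportClock`, its methods, and the helper `total_order_key` are hypothetical names chosen here, and the receive rule is implemented as `max(Cj, Tm) + 1`, which is one way to satisfy "Cj ≥ current Cj and > Tm".

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock Ci for one node (hypothetical helper for illustration)."""
    node_id: int
    time: int = 0

    def tick(self) -> int:
        # Increment Ci between any two successive local events.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the returned value is the timestamp Tm.
        return self.tick()

    def receive(self, tm: int) -> int:
        # Advance Cj so it is no smaller than before and strictly greater than Tm.
        self.time = max(self.time, tm) + 1
        return self.time


def total_order_key(timestamp: int, node_id: int) -> tuple:
    # Total order: compare timestamps first, break ties by process id.
    return (timestamp, node_id)


p1, p2 = LamportClock(node_id=1), LamportClock(node_id=2)
p1.tick()              # local event on node 1, C1 = 1
tm = p1.send()         # send event on node 1, Tm = 2
p2.tick()              # unrelated local event on node 2, C2 = 1
recv = p2.receive(tm)  # receipt on node 2, C2 = max(1, 2) + 1 = 3
```

Because the send happened before the receipt, the clock condition requires `tm < recv`, which holds here. Sorting events by `total_order_key` gives the total order from the last bullet.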

What if we want “wall clock” time?

• Ci must run at correct rate: ∃ κ << 1 such that | dCi(t)/dt – 1 | < κ

• Synchronized: ∃ small ε such that ∀ i,j: | Ci(t) – Cj(t) | < ε

• Assume transmission time between μ and μ+ξ

• Algorithm: Upon receiving message m, set Cj(t) = max(Cj(t), Tm+μ)

• Theorem: Assume every τ seconds a message with unpredictable delay ξ is sent over every arc. Then ∀ t ≥ t0 + τd, ε ≈ d(2κτ + ξ), where d is the diameter of the network
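As a worked example of the receive rule and the bound ε ≈ d(2κτ + ξ), the sketch below plugs in illustrative numbers. The function names and all parameter values are assumptions made for this example, and d is taken to be the network diameter as in Lamport's statement of the theorem.

```python
def on_receive(local_clock: float, tm: float, mu: float) -> float:
    """Receive rule from the slide: Cj(t) = max(Cj(t), Tm + mu)."""
    return max(local_clock, tm + mu)


def sync_bound(d: int, kappa: float, tau: float, xi: float) -> float:
    """Approximate skew bound: eps ~ d * (2 * kappa * tau + xi)."""
    return d * (2 * kappa * tau + xi)


# Assumed values: diameter 3, drift rate 1e-6, a message every 10 s,
# delay uncertainty of 5 ms.
bound = sync_bound(d=3, kappa=1e-6, tau=10.0, xi=0.005)

# A receiver at 100.000 s gets a message stamped 100.010 s with minimum delay 2 ms,
# so its clock is pushed forward to Tm + mu.
advanced = on_receive(local_clock=100.000, tm=100.010, mu=0.002)
```

With these numbers the drift term 2κτ is 0.00002 s, so the bound (about 15 ms) is dominated by the delay uncertainty ξ multiplied by the diameter.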

Clock Synchronization: Limits

• Best Possible: Delay Uncertainty
  – Actually ε(1 – 1/n)

• Synchronization with Faults
  – Faulty clock
  – Communication Failure
  – Malicious processor

• Worst case: Can only synchronize if < 1/3 processors faulty
  – Better if clocks can be authenticated

Process Synchronization

• Problem: Shared Resources
  – Model as sequential or parallel process
  – Assumes global state!

• Alternative: Mutual Exclusion when Needed
  – Coordinator approach
  – Token Passing
  – Timestamp

Mutual Exclusion

• Requirements
  – Does it guarantee mutual exclusion?
  – Does it prevent starvation?
  – Is it fair?
  – Does it scale?
  – Does it handle failures?

Mutual Exclusion: Colored Ticket Algorithm

• Goals:
  – Decentralized
  – Fair
  – Fault tolerant
  – Space Efficient

• Idea: Numbered Tickets
  – Next number gets resource
  – Problem: Unbounded Space
  – Solution: Reissue blocks

Multi-Resource Mutual Exclusion

• New Problem: Deadlock
  – Processes using all resources
  – Each needs additional resource to proceed

• Dining Philosophers Problem
  – Coordinated vs. truly distributed solutions

• Problems with deterministic solutions
• Probabilistic solution – Lehmann & Rabin
  – Starvation / fairness properties

Distributed Transactions

• ACID properties
• Issues:
  – Commit Protocols
  – Fault Tolerance

Why is this enough?

• Failure Models and Limitations
• Mechanisms:
  – Two-phase commit
  – Three-phase commit

Two-Phase Commit (Lamport ’76, Gray ’79)

• Central coordinator initiates protocol
  – Phase 1:
    • Coordinator asks if participants can commit
    • Participants respond yes/no
  – Phase 2:
    • If all votes yes, coordinator sends Commit
    • Participants respond when done

• Blocks on failure
  – Participants must replace coordinator
  – If participant and coordinator fail, wait for recovery

• While blocked, transaction must remain Isolated
  – Prevents other transactions from completing
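The failure-free message flow above can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: the `Participant` class and `two_phase_commit` function are hypothetical names, messages are modeled as direct method calls, and the abort branch (taken when any vote is "no") is the usual complement of the commit case rather than something spelled out on the slide. Failures and blocking are not modeled.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical participant; can_commit stands in for its local decision."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> Vote:
        # Phase 1: answer the coordinator's "can you commit?" request.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def finish(self, decision: str) -> bool:
        # Phase 2: apply the coordinator's decision and acknowledge.
        self.state = decision
        return True


def two_phase_commit(participants) -> str:
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    # Commit only if all votes are yes (abort otherwise: assumed complement).
    decision = "committed" if all(v is Vote.YES for v in votes) else "aborted"
    # Phase 2: broadcast the decision and wait for every acknowledgement.
    acks = [p.finish(decision) for p in participants]
    if not all(acks):
        raise RuntimeError("missing acknowledgement")
    return decision


ok = two_phase_commit([Participant("a"), Participant("b")])
bad = two_phase_commit([Participant("a"), Participant("b", can_commit=False)])
```

A participant that has voted yes sits in the "prepared" state until `finish` arrives. That window is where the blocking described above occurs if the coordinator fails.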

Transaction Model

• Transaction Model
  – Global Transaction State
  – Reachable State Graph
    • Local states potentially concurrent if a reachable global state contains both local states
  – Concurrency set C(s) is all states potentially concurrent with s
    • Sender set S(s) = {local states t | t sends m and s can receive m}

• Failure Model
  – Site failure assumed when expected message not received in time
  – Independent Recovery

Problems with 2-PC

• Blocking on failure
  – 3-PC as solution

• Theorems on recovery limits
  – Independent recovery: No two-site failure
  – Non-independent recovery
    • Anything short of total failure okay
    • Recovery protocol for total failure

Data Replication

• Fault Tolerance
  – Hot backup
  – Catastrophic failure

• Performance
  – Parallelism
  – Decreased reliance on network

• Correctness criterion: Replication invisible
  – One-copy serializability (1SR)

Data Replication: How?

• Goal: Ensure one-copy serializability
• Write-all solution: All copies identical
  – Write goes to every site
  – Read from any site
  – Standard single-copy concurrency control
  – Guarantees 1SR
    • Single-copy concurrency control gives serializable execution
    • Equivalent to serial execution where all writes happen in one transaction

Problem: Site Failure

• Failure causes write to block
  – Must maintain locks
  – Clogs up entire system

Is this fault tolerance?

• What about “write all available”?
  – T0: w0[xA] w0[xB] w0[yC] c0
  – B-fails
  – T1: r1[yC] w1[xA] c1
  – B-recovers
  – T2: r2[xB] w2[yC] c2

• What is the serial equivalent order?

Write All Available Fails
Even if no recovery!
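One way to answer "what is the serial equivalent order?" is to try every serial order on a single logical copy and compare who reads from whom. The sketch below does that by brute force. The dictionary layout and function name are assumptions made for this example; the transactions and the observed reads (T1 reads y as written by T0, T2 reads the stale copy xB written by T0) are taken from the schedule on the slide.

```python
from itertools import permutations

# One-copy view of each transaction: logical items read and written.
TXNS = {
    "T0": {"reads": [], "writes": ["x", "y"]},
    "T1": {"reads": ["y"], "writes": ["x"]},
    "T2": {"reads": ["x"], "writes": ["y"]},
}

# Reads-from relation observed in the replicated execution:
# T1 read yC (written by T0); T2 read xB, which still held T0's value.
OBSERVED = {("T1", "y"): "T0", ("T2", "x"): "T0"}


def reads_from(order):
    """Reads-from relation of a serial execution on a single copy."""
    last_writer = {}
    result = {}
    for txn in order:
        for item in TXNS[txn]["reads"]:
            result[(txn, item)] = last_writer.get(item)
        for item in TXNS[txn]["writes"]:
            last_writer[item] = txn
    return result


# Serial orders whose reads-from relation matches the observed execution.
equivalent = [order for order in permutations(TXNS) if reads_from(order) == OBSERVED]
```

No permutation matches: T1 must precede T2 (it read y before T2 overwrote it) and T2 must precede T1 (it read x without T1's write), so `equivalent` is empty and the execution is not one-copy serializable.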

Solutions

• Validate availability on commit
  – Check if any failed writes now available
  – Check that all sites read or written still available
  – Enforces serializability for site failures

Doesn’t work with communication failures!

Formalisms for Relaxed consistency

• Goal: Relaxed consistency constraints
  – Meet application needs
  – Outperform true transparent replication

• How do we ensure constraints meet needs?
  – Formalisms to describe application needs
  – Methods to prove constraints adequate

Quasi-Copies (Alonso, Barbará, Garcia-Molina ’90)

• Data Caching
  – Each site keeps copy of data likely to be used locally
  – Propagation cost of writes high

• User-Defined Cache
• Controlled Divergence
  – Weak consistency constraints
  – Bounds on the differences between copies
  – User defines constraints

Assumptions

• Read-only copies
  – Updates sent to master copy
  – E.g., ORACLE Materialized View

• User Specified Coherency
  – Strict limits
  – “Hints”

• Example: Stock Purchase
  – Place order based on delayed price
  – Limit order to ensure price paid okay

Selection Conditions

• Identification clause
  – Select/Project Query

• Modifier Clause
  – Add / drop from cache
  – Compulsory or advisory cache
  – Static / Dynamic: As new objects meet the identification clause, are they cached?
    • Triggering delay on dynamic

Coherency Conditions

• Default (always enforced): Value was true once
• Delay W(x,α): Max time lag
• Version V(x): Number of updates
• Periodic P(x): Time for refresh
• Arithmetic A(x): Bounded Difference
• Combine conditions with logical operators
• Multi-object conditions
  – Consistency conditions on a group
  – Order of application in a group
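To make the conditions concrete, the sketch below expresses three of them (delay, version, arithmetic) as predicates over a single cached object and combines them with logical operators. The `QuasiCopy` record, its fields, and the threshold parameters are hypothetical names for illustration, not the notation of the Quasi-Copies paper. The periodic and multi-object conditions are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuasiCopy:
    """Hypothetical bookkeeping for one cached object x."""
    cached_value: float
    master_value: float
    stale_since: Optional[float]  # time the master first diverged; None if current
    missed_updates: int           # master updates not yet applied to the copy


def delay_ok(copy: QuasiCopy, now: float, alpha: float) -> bool:
    # Delay condition: the copy lags the master by at most alpha time units.
    return copy.stale_since is None or now - copy.stale_since <= alpha


def version_ok(copy: QuasiCopy, beta: int) -> bool:
    # Version condition: at most beta updates behind the master.
    return copy.missed_updates <= beta


def arithmetic_ok(copy: QuasiCopy, eps: float) -> bool:
    # Arithmetic condition: bounded difference between copy and master values.
    return abs(copy.cached_value - copy.master_value) <= eps


# Stock-quote style example with assumed numbers.
quote = QuasiCopy(cached_value=101.0, master_value=102.5,
                  stale_since=50.0, missed_updates=2)

ok_loose = delay_ok(quote, now=55.0, alpha=10.0) and arithmetic_ok(quote, eps=2.0)
ok_strict = version_ok(quote, beta=1) or arithmetic_ok(quote, eps=1.0)
```

Here the copy satisfies the looser combined constraint (5 time units stale, 1.5 off in value) but violates the stricter one, so under the stricter constraint it would have to be refreshed before use.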

CS 603 Review

April 26, 2002

Remote Operation Mechanisms

• Client-Server Model:
• Remote Procedure Call

Problem: Remote Site must already know what we want to do!

• Process consists of:
  – Code
  – Resources (files, devices, etc.)
  – Execution (data, stack, registers, etc.)

• Fork copies everything
  – Is this needed?

• Solution: Copy part of the process

So where are we?

• Models for Remote Processing
  – Server: Request documented service
  – RPC: Request execution of existing procedure

• What if operation we want isn’t offered remotely?

• Solution: Agents / Code Migration

Types of Code Migration

From Andrew Tanenbaum, Distributed Operating Systems, 1995.

Resource Binding

Rows: Process to Resource Binding. Columns: Resource to Machine Binding.

              Unattached        Fastened           Fixed
Identifier    Move              Global Reference   Global Reference
Value         Copy Value        Global Reference   Global Reference
Type          Rebind Locally    Rebind Locally     Rebind Locally

DCOM – What is it?

• Start with COM – Component Object Model
  – Language-independent object interface

• Add interprocess communication

DCOM: Distributed COM

• Looks like COM to the client
• Built on DCE RPC
  – Extends to support full COM functionality

DCOM Architecture

Locating Objects: Activation

• CoCreateInstance(Ex)(<CLSID>)
  – Interface pointer to uninitialized instance
  – Same as COM

• CoGetInstanceFromFile, FromStorage
  – Create new instance

• CoGetClassObject(<CLSID>)
  – Factory object that creates objects of <CLSID>
  – CoGetClassObjectFromURL
    • Downloads necessary code from URL and instantiates

• Can take server name as parameter
  – Or default to server specified in DCOM configuration on client machine
    [HKEY_CLASSES_ROOT\APPID\{<appid-guid>}] "RemoteServerName"="<DNS name>"

• Also store information in ActiveDirectory

DCOM vs. CORBA

CORBA
• Single interface name
• Multiple inheritance
• Dynamic Invocation Interface
• C++-style Exception Handling
• Explicit and Implicit reference counts
• Implemented by ORB with replaceable services

DCOM
• Distinction between Class and Instance Identifier
• Implement multiple interfaces
• Type libraries for on-demand marshaling
• 32 Bit Error Code
• Explicit reference count only
• Implemented by many independent services

What is .NET?

• Language for distributed computation
  – C#, VB.NET, JScript

• Protocols
  – SOAP, HTTP

• Run-time environment
  – Common Language Runtime (CLR)
  – ActiveDirectory
  – Web Servers (ASP.NET)

COM/DCOM .NET

DCOM
• IDL
• Name, Monikers
• Registry / ActiveDirectory
• C++, Visual Basic
• DCE RPC
• DCOM Network protocol (based on DCE standards)

.NET
• Web Services Description Language (WSDL)
• DISCO (URI grammar)
• Universal Description Discovery and Integration (UDDI)
• C#, VB.NET
• SOAP
• HTTP (presumed ubiquitous), SMTP (!?)

How .NET works

• Query UDDI directory to get service location

• Query service to get WSDL (interface specification)

• Build call (XML) based on WSDL spec.

• Make call using SOAP

• Parse XML results based on WSDL spec.

Jini: Java Middleware

• Tools to construct federation
  – Multiple devices, each with Java Virtual Machine
  – Multiple services

• Uses (doesn’t replace) Java RMI
• Adds infrastructure to support distribution
  – Registration
  – Lookup
  – Security

Service

• Basic “unit” of Jini system
  – Members provide services
  – Federate to share access to services

• Services combined to accomplish tasks

• Communicate using service protocol
  – Initial set defined
  – Add more on the fly

Infrastructure: Key Components

• RMI
  – Basic communication model

• Distributed Security System
  – Integrated with RMI
  – Extends JVM security model

• Discovery/join protocol
  – How to register and advertise services

• Lookup services
  – Returns object implementing service (really a local proxy)

Programming Model

• Lookup

• Leasing
  – Extends Java reference with notion of time

• Events
  – Extends JavaBeans event model
  – Adds third-party transfer, delivery and timeliness guarantees, possibility of delay

• Transaction Interfaces

Jini Component Categories

• Infrastructure – Base features
• Programming Model – How you use them
• Services – What you build

Java / Jini Comparison

Failure Models

• Failure: System doesn’t give desired behavior
  – Component-level failure (can compensate)
  – System-level failure (incorrect result)

• Fault: Cause of failure (component-level)
  – Transient: Not repeatable
  – Intermittent: Repeats, but (apparently) independent of system operations
  – Permanent: Exists until component repaired

• Failure Model: How the system behaves when it doesn’t behave properly

Failure Model (Flaviu Cristian, 1991)

• Dependency
  – Proper operation of Database depends on proper operation of processor, disk

• Failure Classification
  – Type of response to failure

• Failure semantics
  – State of system after given class of failure

• Failure masking
  – High-level operation succeeds even if it depends on failed services

Failure Classification

• Correct
  – In response to inputs, behaves in a manner consistent with the service specification

• Omission Failure
  – Doesn’t respond to input
    • Crash: After first omission failure, subsequent requests result in omission failure

• Timing failure (early, late)
  – Correct response, but outside required time window

• Response failure
  – Value: Wrong output for inputs
  – State Transition: Server ends in wrong state

Crash Failure types (based on recovery behavior)

• Amnesia
  – Server recovers to predefined state independent of operations before crash

• Partial amnesia
  – Some part of state is as before crash, rest to predefined state

• Pause
  – Recovers to state before omission failure

• Halting
  – Never restarts

Failure Semantics

• Specification for service must include
  – Failure-free (normal) semantics
  – Failure semantics (likely failure behaviors)

• Multiple semantics
  – Combine to give (weaker) semantics
  – Arbitrary failure semantics: Weakest possible

• Choice of failure semantics
  – Is class of failure likely?
    • Probability of type of failure
  – What is the cost of failure
    • Catastrophic?

Failure Masking

• Hierarchical failure masking
  – Dependency: Higher level gets (at best) failure semantics of lower level
  – Can compensate for lower level failure to improve this

• Group Failure Masking
  – Redundant servers
  – Allows failure semantics of group to be higher than individuals

• k-fault tolerant
  – Group can mask k concurrent group member failures from client
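Group failure masking for crash failures can be sketched as a client stub that tries group members in turn and hides individual crashes. This is an illustrative sketch under stated assumptions: `ServerCrashed`, `make_server`, and `group_call` are hypothetical names, crash failures are simulated by raising an exception, and only crash (not value or timing) failures are considered. Under that failure model a group of k + 1 members can mask k concurrent member failures.

```python
class ServerCrashed(Exception):
    """Simulated crash failure of one group member."""


def make_server(name: str, crashed: bool = False):
    # Build a toy server: it either answers correctly or behaves as crashed.
    def handle(request: str) -> str:
        if crashed:
            raise ServerCrashed(name)
        return request.upper()
    return handle


def group_call(servers, request: str):
    """Try members in order; return (reply, number of member failures masked)."""
    masked = 0
    for server in servers:
        try:
            return server(request), masked
        except ServerCrashed:
            masked += 1  # hide this member's failure from the client
    # Every member failed: the group can no longer mask the failure.
    raise ServerCrashed("all group members failed")


# A group of 3 members masks k = 2 concurrent crash failures.
group = [make_server("s1", crashed=True),
         make_server("s2", crashed=True),
         make_server("s3")]
reply, masked = group_call(group, "ping")
```

The client sees a correct reply even though two members crashed. With a third crash the group's failure becomes visible, which is the limit the k-fault tolerant bullet describes.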

Fault Tolerance

• A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which:
  1. At any configuration where P holds, T also holds (i.e., P ⇒ T)
  2. Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F)
  3. Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds

• If a program A tolerates faults from a fault class F for invariant P, we say that A is F-tolerant for P.

Forms of fault tolerance

• For each entry, determine:
  – F: Fault class handled
  – T: Set of states that can be reached

            Live          Not live
Safe        Masking       Fail safe
Not safe    Nonmasking    none

Reliable Multicast

• Classes:

• Sender-initiated: Acknowledge all packets
  – Scales poorly in normal operation

• Receiver-initiated: Request missing packets
  – Sender doesn’t need receiver list
  – Scales poorly on failure (cascading failure?)

• Tree-based, Ring-based protocols
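The receiver-initiated class can be illustrated with a receiver that detects gaps in sequence numbers and asks only for what is missing. This is a minimal sketch, assuming packets carry consecutive integer sequence numbers starting at 0. The class name `NackReceiver` and its method are hypothetical, and no real network I/O is performed.

```python
class NackReceiver:
    """Hypothetical receiver that requests missing packets instead of acking all."""

    def __init__(self):
        self.received = {}  # sequence number -> payload

    def on_packet(self, seq: int, payload: str):
        """Store the packet and return the sequence numbers to request (NACK)."""
        self.received[seq] = payload
        highest = max(self.received)
        # Any number below the highest one seen that has not arrived is a gap.
        return [s for s in range(highest + 1) if s not in self.received]


receiver = NackReceiver()
nacks = [
    receiver.on_packet(0, "a"),  # in order, nothing to request
    receiver.on_packet(3, "d"),  # packets 1 and 2 are missing
    receiver.on_packet(1, "b"),  # retransmission fills one gap
    receiver.on_packet(2, "c"),  # all gaps filled
]
```

The sender never needs a receiver list, as the slide notes. The weakness is also visible: a lost final packet produces no gap, and after a shared loss every receiver sends the same request at once.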

Disaster Recovery

• Problem: complete failure at single site
  – Must have multiple sites
  – Thus a distributed problem

• Two examples
  – Distributed Storage: Palladio
    • Think wide-area RAID
  – Distributed Transactions: Epoch algorithm

Epoch Algorithm (Garcia-Molina, Polyzois, and Hagmann 1990)

• 1-Safe backup
  – No performance penalty

• Multiple transaction streams
  – Use distribution to improve performance

• Multiple Logs
  – Avoid single bottleneck

Algorithm Overview

• Idea: Transactions that can be committed together grouped into epochs

• Primaries write marker in log
  – Must agree when safe to write marker
  – Keep track of current epoch number
  – Master broadcasts when to end epoch

• Backups commit epoch when all backups have received marker

Correctness Criteria

• Atomicity: If any writes of a transaction appear at backup, all must appear
  – If W(Tx, d) at backup then ∀ W(Tx, d’), W(Tx, d’) exists at backup

• Consistency: If Ti → Tj at primary, then
  – Local: Tj installed at backup ⇒ Ti installed at backup
  – Mutual: If W(Ti, d) and W(Tj, d), then W(Ti, d) → W(Tj, d)

• Minimum Divergence: If Tj is at the backup and does not depend on a missing transaction, then it should be installed at the backup

Single-Mark Algorithm

• Problem: Is it locally safe to mark when broadcast received?
  – Might be in the middle of a transaction

• Solution: Share epoch at commit
  – Prepare to commit includes local epoch number
  – If received number greater than local, end epoch

• At Backup: When all sites have epoch ○n, Commit transactions where
  – C(Ti) → ○n
  – P(Ti) → ○n, local site is not coordinator, and coordinator has C(Ti) → ○n

Test Basics

• Mechanics: Open book/notes
  – No electronic aids

• Two questions
  – Each multi-part
  – Will include scoring suggestions

• Underlying question: Do you understand the material?
  – No need to regurgitate “best in literature” answer
    • Reasonable self-designed solution fine
  – Key: Do you really understand your answer
    • Can you build CORRECT distributed systems?
