Virtual Synchrony


Justin W. Hart

CS 614

11/17/2005

Papers

The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.

Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.

Background
  Chandy-Lamport
  Logical Clocks
  Consistent Cuts
  Distributed Snapshots
  Publish/Subscribe
  Fail-Stop

Fail-Stop
  Group Membership Service
  Processes appear to fail by halting
  How does this affect the FLP result?

Motivation
  Information Backplane
  Customization
  Hierarchical Structure
  Fault-Tolerance
  Reliability

Process Groups
  Types of groups
    Anonymous groups
    Explicit groups

Implementation Requirements
  Group communication
  Group membership as input
  Synchronization

Anonymous Groups
  Group addressing
  Messages are delivered exactly once, to all recipients or to none
  Ordering
  Logging

Explicit Groups

Group members cooperate directly

May execute algorithms based on membership knowledge

Communication is sensitive to membership changes

Building groups over conventional technology
  Conventional message passing technologies
  Group addressing
  Logical time & causal dependency
  Message delivery ordering
  State transfer
  Fault tolerance

Close Synchrony

100% lock-step execution model

A synchronous execution

[Figure: timeline of processes p, q, r, s, t, u]

With true synchrony, executions run in genuine lock-step.

So… what’s wrong with that?

Under close synchrony, execution is limited by the slowest process in the group!

Virtual Synchrony
  Relax synchronization requirements where possible
  Benefit by allowing for asynchronous interactions
  Do this where the result is identical to close synchrony

A few protocols…
  fbcast
  cbcast
  abcast
  gbcast

Four protocols!?!?
  …but Justin. The paper only discussed 2 protocols… you’re getting off-topic!

A few protocols…
  fbcast
    Simple protocol upon which we’ll build the others.
    Delivery is FIFO ordered, with respect to the original sender
    Accomplished easily with a logical timestamp
  cbcast
  abcast
  gbcast

Single updater
  If p is the only update source, the need is a bit like the TCP “FIFO” ordering
  fbcast is a good choice for this case

[Figure: timeline of processes p, r, s, t; updates numbered 1, 2, 3, 4]
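The FIFO rule for fbcast can be sketched with a per-sender sequence number standing in for the slide's logical timestamp. This is a minimal single-process sketch, not the ISIS implementation; the names `FifoReceiver` and `receive` are hypothetical, and the network is simulated by calling `receive` in an arbitrary order.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order that sender numbered them."""

    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)   # next expected number, per sender
        self.pending = defaultdict(dict)         # sender -> {seq: payload}
        self.delivered = []                      # (sender, seq, payload) in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return                               # duplicate of something already delivered
        self.pending[sender][seq] = payload
        # Deliver for as long as the next expected message is already buffered.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, n, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1


r = FifoReceiver()
for seq in (2, 1, 4, 3):                         # p's updates 1..4 arrive out of order
    r.receive("p", seq, f"update-{seq}")
print([seq for _, seq, _ in r.delivered])
```

Messages from different senders are not ordered relative to each other here, which is why this is enough only in the single-updater case.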

A few protocols…
  fbcast
  cbcast
    Receipt is causally ordered
    Protocol in paper uses token passing
    Another simple protocol uses vector timestamps
  abcast
  gbcast

Causally ordered updates
  Simple protocol based on token passing

Causally ordered updates
  Example: messages from p and s arrive out of order at t

[Figure: timeline of processes p, r, s, t with messages a, b, c]

VT(a) = [0,0,0,1]
VT(b) = [1,0,0,1]
VT(c) = [1,0,1,1]

c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1]: clearly we are missing one message from s

When b arrives, we can deliver both it and message c, in order
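The delivery test behind the vector-timestamp example can be sketched as follows, using the slide's own timestamps. The names (`CausalReceiver`, `receive`) are hypothetical. The sender index of each message is taken to be the vector entry that message incremented, which is an assumption, since the figure that fixes the process-to-index mapping is not reproduced here. This is a sketch of the standard vector-timestamp rule, not the token-passing protocol from the paper.

```python
class CausalReceiver:
    """Buffers a message until everything it causally depends on has been delivered."""

    def __init__(self, vt):
        self.vt = list(vt)       # receiver's current vector timestamp
        self.pending = []        # (name, sender_index, vt) not yet deliverable
        self.delivered = []      # message names in delivery order

    def _deliverable(self, sender, vt):
        # Next message from its sender, and no dependency on anything we have not seen.
        return vt[sender] == self.vt[sender] + 1 and all(
            vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender
        )

    def receive(self, name, sender, vt):
        self.pending.append((name, sender, vt))
        progress = True
        while progress:                          # one delivery may unblock others
            progress = False
            for msg in list(self.pending):
                m_name, m_sender, m_vt = msg
                if self._deliverable(m_sender, m_vt):
                    self.pending.remove(msg)
                    self.vt[m_sender] = m_vt[m_sender]
                    self.delivered.append(m_name)
                    progress = True


t = CausalReceiver(vt=[0, 0, 0, 1])              # VT(t) from the slide
t.receive("c", sender=2, vt=[1, 0, 1, 1])        # early: entry 0 is ahead of VT(t)
held_back = list(t.delivered)                    # still empty at this point
t.receive("b", sender=0, vt=[1, 0, 0, 1])        # fills the gap, then c follows
print(held_back, t.delivered, t.vt)
```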

Causally ordered updates
  Each thread corresponds to a different lock
  In effect: red “events” never conflict with green ones!

[Figure: timeline of processes p, r, s, t; one thread of events numbered 1 to 5 and another numbered 1 to 2]

Hey… that sped things up! Now I get it!

Processes only have to wait for processes that they depend on. Not the slowest in the group!

A few protocols…
  fbcast
  cbcast
  abcast
    Atomic delivery ordering
      With respect to other abcasts
    More costly than cbcast, but with a stronger ordering property
    ISIS builds abcast over cbcast
  gbcast
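To make the abcast ordering property concrete, here is a minimal sketch of total ordering using a fixed sequencer. This is a swapped-in technique chosen for brevity, not the ISIS protocol, which the slide says is built over cbcast. All names (`Sequencer`, `TotalOrderReceiver`, `on_data`, `on_order`) are hypothetical.

```python
import itertools


class Sequencer:
    """Hypothetical ordering process: hands out consecutive slot numbers."""

    def __init__(self):
        self._slots = itertools.count(1)

    def assign(self, msg_id):
        return next(self._slots), msg_id


class TotalOrderReceiver:
    """Delivers a message only when its slot is the next one in the agreed order."""

    def __init__(self):
        self.payloads = {}     # msg_id -> payload, as received from senders
        self.slots = {}        # slot number -> msg_id, as announced by the sequencer
        self.next_slot = 1
        self.delivered = []

    def on_data(self, msg_id, payload):
        self.payloads[msg_id] = payload
        self._drain()

    def on_order(self, slot, msg_id):
        self.slots[slot] = msg_id
        self._drain()

    def _drain(self):
        # Deliver while both the ordering decision and the payload are present.
        while (self.next_slot in self.slots
               and self.slots[self.next_slot] in self.payloads):
            self.delivered.append(self.payloads[self.slots[self.next_slot]])
            self.next_slot += 1


seq = Sequencer()
decisions = [seq.assign("m-from-p"), seq.assign("m-from-s")]

r1, r2 = TotalOrderReceiver(), TotalOrderReceiver()
r1.on_data("m-from-p", "x=1")                    # r1 sees everything in order
r1.on_data("m-from-s", "x=2")
for slot, mid in decisions:
    r1.on_order(slot, mid)
for slot, mid in reversed(decisions):            # r2 sees everything reversed
    r2.on_order(slot, mid)
r2.on_data("m-from-s", "x=2")
r2.on_data("m-from-p", "x=1")
print(r1.delivered == r2.delivered)
```

Both receivers end up with the same delivery sequence regardless of arrival order, at the cost of an extra ordering message per multicast, which is one way to see why abcast is more costly than cbcast.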

A few protocols…
  fbcast
  cbcast
  abcast
  gbcast
    Atomic delivery ordering
      With respect to everything

Three Round Multicast

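As a time-line picture

[Figure: two-phase commit timeline across processes p, q, r, s, t. Phase 1: the 2PC initiator asks “Vote?” and all vote “commit”. Phase 2: the initiator sends “Commit!”]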

Just one more…

Flush protocol
  We say that a message is unstable if some receiver has it but (perhaps) others don’t
  For example, q’s message is unstable at process r
  If q fails we want to “flush” unstable messages out of the system
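The idea of an unstable message and of flushing it can be sketched as below. This is a deliberately simplified, single-process model under stated assumptions: survivors simply pool the messages they hold, whereas a real flush protocol exchanges messages and waits for acknowledgements before the new membership is installed. The names `Member`, `is_unstable`, and `flush` are hypothetical.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.received = {}          # msg_id -> payload


def is_unstable(msg_id, members):
    """True if some, but not all, of the given members hold the message."""
    holders = sum(msg_id in m.received for m in members)
    return 0 < holders < len(members)


def flush(failed_name, members):
    """Survivors pool what they hold so each ends up with the same set of messages."""
    survivors = [m for m in members if m.name != failed_name]
    pooled = {}
    for m in survivors:
        pooled.update(m.received)
    for m in survivors:
        m.received.update(pooled)
    return survivors


p, q, r = Member("p"), Member("q"), Member("r")
q.received["m1"] = "update"         # q multicasts m1 ...
r.received["m1"] = "update"         # ... which reaches r but not p before q fails

before = is_unstable("m1", [p, q, r])
survivors = flush("q", [p, q, r])
after = is_unstable("m1", survivors)
print(before, after)
```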

Styles of groups
  Peer Groups
    Processes cooperate closely
  Client-Server Groups
    Group acts as a server
    Client multicasts repeatedly to the group
  Diffusion Groups
    Group serves information
    Clients connect to receive data from group
  Hierarchical Groups
    Offer scalability through a hierarchy of connected groups

Historical Aside
  Two major classes of real systems
  Virtual synchrony
    Weaker properties – not quite “FLP consensus”
    Much higher performance (orders of magnitude)
    Requires that majority of system remain connected. Partitioning failures force protocols to wait for repair
  Quorum-based state machine protocols are
    Closer to FLP definition of consensus
    Slower (by orders of magnitude)
    Sometimes can make progress in partitioning situations where virtual synchrony can’t

Names of some famous systems
  Isis was the first practical virtual synchrony system
  Later followed by Transis, Totem, Horus
  Today: Best options are JGroups, Spread, Ensemble
  Technology is now used in IBM WebSphere and Microsoft Windows Clusters products!
  Paxos was the first major state machine system
  BASE and other Byzantine Quorum systems are now getting attention from the security community

(End of Historical aside)

Sounds good… what’s wrong with it?
  Tries to solve state problems at communication level
  This violates the end-to-end argument!
  Consistency requirements are typically stated with respect to application state

Stable vs Durable
  Stable – messages are buffered until received by all group members
  Durable – message will be delivered, even if the sender dies

Ordering semantics
  Incidental Ordering
  Semantic Ordering
  Prescriptive Ordering

The problem with CATOCS (causally and totally ordered communication support)
  It can’t say “for sure”
  It can’t say the “whole story”
  It can’t say “together”
  It can’t say it efficiently

It can’t say “for sure”
  Processes communicating over a “hidden” channel
    Common database
    Shared memory
  Two threads reacting to external event

It can’t say “together”
  Standard solution – locking
  Transaction models allow for abort and rollback
  Higher level conditions… what happens if a message arrives, but is not successfully processed?
  Stock trading example

Can’t say the “whole story”
  Not everything can be expressed through the “happens-before” relationship
  Semantic ordering constraints
    Causal memory, the weakest of these, cannot be expressed in causal multicast
    Total ordering helps some of these, but is far too expensive
    Inexpensive, state-level protocols with logical clocks can solve these
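One way to read “state-level protocols with logical clocks” is a version number kept with each piece of application state, so that ordering is checked where the state lives rather than in the communication layer. The sketch below is an illustration of that reading under stated assumptions, not a protocol taken from either paper; `VersionedStore`, `read`, and `write` are hypothetical names.

```python
class VersionedStore:
    """Each key carries a version (a per-object logical clock) next to its value."""

    def __init__(self):
        self.data = {}                           # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, based_on, value):
        """Apply the update only if it was computed from the current version."""
        version, _ = self.read(key)
        if based_on != version:
            return False                         # stale: the writer must re-read and retry
        self.data[key] = (version + 1, value)
        return True


store = VersionedStore()
v0, _ = store.read("IBM")
first = store.write("IBM", v0, 100)              # accepted, version becomes 1
second = store.write("IBM", v0, 101)             # rejected, it was based on version 0
v1, _ = store.read("IBM")
retry = store.write("IBM", v1, 101)              # accepted after re-reading
print(first, second, retry, store.read("IBM"))
```

The check works no matter which channel the conflicting update travelled over, including a shared database or shared memory that the multicast layer never sees.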

It can’t say it efficiently
  False causality
    Potential causality != Actual causality
  Memory requirements for buffering “unstable” messages
  Ordering information during transmission and reception

And… what of the end to end argument?
  All of this considers our communication channels… isn’t the application-level check far more important?

Classes of distributed applications
  Data dissemination
    Netnews
    Trading application example
  Global predicate evaluation
  Transactional applications
  Replicated data
  Replication in the large
  Distributed real-time applications

Implementing only part of the messaging?
  Can you cut down on overhead by implementing only part of the messaging using CATOCS?

Semantics
  Are the semantics of state-based approaches superior to those of virtual synchrony?

Scalability
  N processes
  Time T to propagate a message across the system
  Grows roughly proportional with the square root of the number of processes
  Arcs in the active causal graph grow quadratically

Quadratic causal graph

Buffering grows
  Quadratic arcs
  Linear communication of causal dependencies
  Linear growth in required buffering

Changing topologies doesn’t help
  CATOCS would require separate process groups for read and write to accomplish optimization of updates vs queries

Group membership protocols
  Must enforce atomic delivery semantics
  Run our most expensive protocol… gbcast
  Failures increase with the size of the system, increasing load on the GMS

Who uses ISIS?
  Brokerage
  Database replication and triggers

ISIS-based utilities
  NEWS
    A pub/sub application that will replay histories
  NMGR
    Manages batch-style jobs and performs load sharing
  Parallel make

ISIS-based utilities
  DECEIT
    NFS compatible file system
  META/LOMITA
    Sensors & actuators
    Abstract sensors
    Specify control actions in high-level terms
  SPOOLER/LONG-HAUL FACILITY

Now… somewhat supported

  ISIS/Horus/Ensemble/QuickSilver
  JGroups
  Spread
  Totem
  Transis
  WebSphere & Windows Cluster (internally)

…and people actually use it.
  NYSE
  French ATC System
  AEGIS

An ongoing debate
  The effort continues here at Cornell with the QuickSilver effort

You’ve been presented the options… what are your conclusions?

References

Some slides borrowed from Ken Birman’s CS 614 slide sets on Virtual Synchrony http://www.cs.cornell.edu/courses/cs514/2005sp/Slide%20Sets.htm

Images have been borrowed from The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.

Images have been borrowed from Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.

Statements and ideas have been borrowed verbatim from both papers, including section headings, and statements in notes. This has been mostly for coherence between the slides and papers

Also sourced data from http://www.cs.cornell.edu/ken/
