Virtual Synchrony Justin W. Hart CS 614 11/17/2005
Dec 31, 2015
Papers The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.
Background Chandy-Lamport Logical Clocks Consistent Cuts Distributed Snapshots Publish/Subscribe Fail-Stop
Fail Stop Group Membership Service Processes appear to fail by halting How does this affect the FLP
result?
Motivation Information Backplane Customization Hierarchical Structure Fault-Tolerance Reliability
Process Groups Types of groups Anonymous groups Explicit groups
Implementation Requirements
Group communication Group membership as
input Synchronization
Anonymous Groups Group addressing Messages sent exactly once to all
or no recipients Ordering Logging
Explicit Groups Group members cooperate directly
May execute algorithms based on membership knowledge
Communication is sensitive to membership changes
Building groups over conventional technology Conventional message passing
technologies Group addressing Logical time & causal dependency Message delivery ordering State transfer Fault tolerance
Close Synchrony
100% lock-step execution model
[Figure: a synchronous execution, with timelines for processes p, q, r, s, t, u]
With true synchrony executions run in genuine lock-step.
So… what’s wrong with that?
Under close synchrony, execution is limited by the slowest process in the group!
Virtual Synchrony Relax synchronization
requirements where possible Benefit by allowing for
asynchronous interactions Do this where the result is identical
to close synchrony
A few protocols… fbcast cbcast abcast gbcast
Four protocols!?!? …but Justin. The paper only
discussed 2 protocols… you’re getting off-topic!
A few protocols… fbcast
Simple protocol upon which we’ll build the others.
Delivery is FIFO ordered, with respect to the original sender
Accomplished easily with a logical timestamp cbcast abcast gbcast
Single updater If p is the only update source, the
need is a bit like the TCP “fifo” ordering
fbcast is a good choice for this case
[Figure: timelines for processes p, r, s, t, with updates numbered 1 to 4]
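The FIFO delivery rule can be sketched as follows. This is a minimal illustration, not the ISIS implementation: it assumes each sender numbers its own messages from 1, and the names `FifoReceiver`, `receive`, and `delivered` are hypothetical.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order that sender sent them."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number expected (assumed to start at 1)
        self.pending = {}    # sender -> {seq: payload} held back until deliverable
        self.delivered = []  # (sender, payload) pairs in delivery order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.setdefault(sender, 1)
        if seq < expected:
            return  # duplicate of something already delivered
        held = self.pending.setdefault(sender, {})
        held[seq] = payload
        # Deliver as long as the next expected message from this sender is on hand.
        while self.next_seq[sender] in held:
            n = self.next_seq[sender]
            self.delivered.append((sender, held.pop(n)))
            self.next_seq[sender] = n + 1


# Updates from the single updater p arrive out of order at one receiver.
r = FifoReceiver()
for seq in (2, 1, 4, 3):
    r.receive("p", seq, f"update-{seq}")
print(r.delivered)
```

Messages from different senders are never compared with each other here, which is why this ordering is cheap and why it suits the single-updater case.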
A few protocols… fbcast cbcast
Receipt is causally ordered Protocol in paper uses token passing Another simple protocol uses vector
timestamps abcast gbcast
Causally ordered updates Simple protocol
based on token passing
Causally ordered updates Example: messages from p and s
arrive out of order at t
[Figure: timelines for processes p, r, s, t]
VT(a) = [0,0,0,1]
VT(b) = [1,0,0,1]
VT(c) = [1,0,1,1]
c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1], so t is clearly missing one message from p.
When b arrives, we can deliver both it and message c, in order.
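The vector-timestamp delivery test in this example can be sketched as below. It is a simplified illustration rather than the protocol from the paper, and it assumes the vector entries are ordered [p, r, s, t]; `CausalReceiver` and its methods are hypothetical names.

```python
class CausalReceiver:
    """Holds back a message until everything it causally depends on is delivered."""

    def __init__(self, n, initial=None):
        self.vt = list(initial) if initial else [0] * n  # this process's vector time
        self.pending = []    # (sender_index, vt_msg, name) not yet deliverable
        self.delivered = []  # message names in delivery order

    def _deliverable(self, sender, vt_msg):
        for k, count in enumerate(vt_msg):
            if k == sender:
                # Must be the very next message from the sender.
                if count != self.vt[k] + 1:
                    return False
            elif count > self.vt[k]:
                # The sender had seen a message from k that we have not.
                return False
        return True

    def receive(self, sender, vt_msg, name):
        self.pending.append((sender, vt_msg, name))
        progress = True
        while progress:  # one delivery may unblock other held messages
            progress = False
            for item in list(self.pending):
                s, vt, nm = item
                if self._deliverable(s, vt):
                    self.pending.remove(item)
                    self.vt[s] = vt[s]
                    self.delivered.append(nm)
                    progress = True


P, R, S, T = range(4)                       # assumed index order
t = CausalReceiver(4, initial=[0, 0, 0, 1]) # t has already sent message a
t.receive(S, [1, 0, 1, 1], "c")             # c arrives first and is held back
t.receive(P, [1, 0, 0, 1], "b")             # b arrives; b and then c are delivered
print(t.delivered)
```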
Causally ordered updates Each thread corresponds to a different
lock
In effect: red “events” never conflict with green ones!
[Figure: timelines for processes p, r, s, t, with numbered events 1 to 5 and 1 to 2]
Hey… that sped things up! Now I get it!
Processes only have to wait for processes that they depend on. Not the slowest in the group!
A few protocols… fbcast cbcast abcast
Atomic delivery ordering With respect to other abcasts
More costly than cbcast, but with a stronger ordering property
ISIS builds abcast over cbcast gbcast
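One common way to obtain a total order is a fixed sequencer. The sketch below uses that technique purely as an illustration of what "atomic delivery ordering" buys; it is an assumption made for this example and not how ISIS layers abcast over cbcast. `Sequencer` and `TotalOrderReceiver` are hypothetical names.

```python
import itertools


class Sequencer:
    """Assigns one global sequence number to every abcast (assumed single sequencer)."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, payload):
        return next(self._counter), payload


class TotalOrderReceiver:
    """Delivers stamped messages strictly in global sequence order."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.delivered = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


seq = Sequencer()
m1 = seq.stamp("x from p")
m2 = seq.stamp("y from s")

a, b = TotalOrderReceiver(), TotalOrderReceiver()
a.receive(*m1); a.receive(*m2)   # network order: m1, m2
b.receive(*m2); b.receive(*m1)   # network order: m2, m1
print(a.delivered == b.delivered)
```

Every receiver waits on the global counter rather than only on causal predecessors, which is one way to see why this ordering costs more than cbcast.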
A few protocols… fbcast cbcast abcast gbcast
Atomic delivery ordering With respect to everything
Three Round Multicast
As a time-line picture
[Figure: time-line of a 2PC initiator among processes p, q, r, s, t. Phase 1: “Vote?”, all vote “commit”. Phase 2: “Commit!”]
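The two phases in the picture can be sketched as follows. This is a minimal, failure-free illustration of two-phase commit, with no timeouts, logging, or recovery; `Participant` and `run_2pc` are hypothetical names.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit  # assumed local decision, fixed up front
        self.state = "init"

    def vote(self):
        # Phase 1: answer the initiator's "Vote?" request.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, decision):
        # Phase 2: apply the initiator's decision.
        self.state = decision


def run_2pc(participants):
    votes = [p.vote() for p in participants]             # Phase 1
    decision = "committed" if all(votes) else "aborted"  # unanimous yes required
    for p in participants:                               # Phase 2
        p.finish(decision)
    return decision


group = [Participant(n) for n in "pqrst"]
print(run_2pc(group))
```

A single "no" vote is enough to abort, which is the all-or-nothing outcome the atomic delivery guarantee relies on.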
Just one more…
Flush protocol
We say that a message is unstable if some receiver has it but (perhaps) others don’t
For example, q’s message is unstable at process r
If q fails we want to “flush” unstable messages out of the system
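The idea of flushing can be sketched as below: after a failure, the survivors make sure every message that any of them holds reaches all of them. This is a deliberately simplified model that ignores how the view change itself is coordinated; `flush` and the log layout are assumptions made for illustration.

```python
def flush(survivor_logs):
    """Bring every survivor up to the union of what any survivor received.

    survivor_logs maps member name -> set of message ids it has received.
    Returns member name -> set of message ids that had to be forwarded to it.
    The logs are updated in place.
    """
    all_msgs = set().union(*survivor_logs.values())
    forwarded = {}
    for member, log in survivor_logs.items():
        missing = all_msgs - log   # unstable messages this member lacks
        forwarded[member] = missing
        log |= missing
    return forwarded


# q multicast q1 and q2, then failed; q2 reached only p and s.
logs = {"p": {"q1", "q2"}, "r": {"q1"}, "s": {"q1", "q2"}}
print(flush(logs))
```

After the call every survivor holds the same set of messages, so no message from the failed process remains unstable.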
Styles of groups Peer Groups
Processes cooperate closely Client-Server Groups
Group acts as a server Client multicasts repeatedly to the group
Diffusion Groups Group serves information Clients connect to receive data from group
Hierarchical Groups Offer scalability through a hierarchy of
connected groups
Historical Aside
Two major classes of real systems
Virtual synchrony
Weaker properties – not quite “FLP consensus”
Much higher performance (orders of magnitude)
Requires that majority of system remain connected. Partitioning failures force protocols to wait for repair
Quorum-based state machine protocols are
Closer to FLP definition of consensus
Slower (by orders of magnitude)
Sometimes can make progress in partitioning situations where virtual synchrony can’t
Names of some famous systems
Isis was the first practical virtual synchrony system
Later followed by Transis, Totem, Horus
Today: Best options are JGroups, Spread, Ensemble
Technology is now used in IBM WebSphere and Microsoft Windows Clusters products!
Paxos was the first major state machine system
BASE and other Byzantine Quorum systems are now getting attention from the security community
(End of Historical aside)
Sounds good… what’s wrong with it? Tries to solve state problems at
communication level This violates the end-to-end
argument! Consistency requirements are
typically stated with respect to application state
Stable vs Durable Stable – messages are buffered
until received by all group members
Durable – message will be delivered, even if the sender dies
Ordering semantics Incidental Ordering Semantic Ordering Prescriptive Ordering
The problem with CATOCS It can’t say “for sure” It can’t say the “whole story” It can’t say “together” It can’t say it efficiently
It can’t say “for sure” Processes
communicating over a “hidden” channel Common database Shared memory
Two threads reacting to external event
It can’t say “together”
Standard solution – locking Transaction models allow for abort
and rollback Higher level conditions… what
happens if a message arrives, but is not successfully processed
Stock trading example
Can’t say the “whole story” Not everything can be expressed
through the “happens-before” relationship
Semantic ordering constraints Causal memory, the weakest of these,
cannot be expressed in causal multicast Total ordering helps some of these, but
is far too expensive Inexpensive, state-level protocols with
logical clocks can solve these
It can’t say it efficiently False causality
Potential causality != Actual causality Memory requirements for buffering
“unstable” messages Ordering information during
transmission and reception
And… what of the end to end argument? All of this considers our
communication channels… isn’t the application-level check far more important?
Classes of distributed applications Data dissemination
Netnews Trading application example
Global predicate evaluation Transactional applications Replicated data Replication in the large Distributed real-time applications
Implementing only part of the messaging? Can you cut down on overhead by
implementing only part of the messaging using CATOCS?
Semantics Are the semantics of state-based
approaches superior to those of virtual synchrony?
Scalability N Processes Time T to propagate a message
across the system Grows roughly proportional with the
square root of the number of processes
Arcs in the active causal graph grow quadratically
Quadratic causal graph
Buffering grows Quadratic arcs Linear communication of causal
dependencies Linear growth in required buffering
Changing topologies doesn’t help CATOCS would require separate process
groups for read and write to accomplish optimization of updates vs queries
Group membership protocols Must enforce atomic delivery
semantics Run our most expensive protocol…
gbcast Failures increase with the size of
the system, increasing load on the GMS
Who uses ISIS? Brokerage Database replication and triggers
ISIS-based utilities NEWS
A pub/sub application that will replay histories
NMGR Manages batch-style jobs and
performs load sharing Parallel make
ISIS-based utilities DECEIT
NFS compatible file system META/LOMITA
Sensors & actuators Abstract sensors Specify control actions in high-level
terms SPOOLER/LONG-HAUL FACILITY
Now… somewhat supported
ISIS/Horus/Ensemble/QuickSilver JGroups Spread Totem Transis WebSphere & Windows Cluster
(internally)
…and people actually use it. NYSE French ATC System AEGIS
An ongoing debate The effort continues here at
Cornell with the QuickSilver effort
You’ve been presented the options… what are your conclusions?
References Some slides borrowed from Ken Birman’s CS 614 slide sets on
Virtual Synchrony http://www.cs.cornell.edu/courses/cs514/2005sp/Slide%20Sets.htm
Images have been borrowed from The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
Images have been borrowed from Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.
Statements and ideas have been borrowed verbatim from both papers, including section headings, and statements in notes. This has been mostly for coherence between the slides and papers
Also sourced data from http://www.cs.cornell.edu/ken/