Tackling Challenges of Scale in Highly Available Computing Systems
Ken Birman
Dept. of Computer Science, Cornell University
Jan 13, 2016
Members of the group
Ken Birman, Robbert van Renesse, Einar Vollset, Krzysztof Ostrowski, Mahesh Balakrishnan, Maya Haridasan, Amar Phanishayee
Our topic
Computing systems are growing… larger, and more complex, and we are hoping to use them in a more and more “unattended” manner.
Peek under the covers of the toughest, most powerful systems that exist.
Then ask: can we discern a research agenda?
Some “factoids”
Companies like Amazon, Google, and eBay are running data centers with tens of thousands of machines. Credit card companies, banks, brokerages, and insurance companies are close behind.
The rate of growth is staggering.
Meanwhile, a new rollout of wireless sensor networks is poised to take off.
How are big systems structured?
Typically a “data center” of web servers: some human-generated traffic, some automatic traffic from WS clients.
The front-end servers are connected to a pool of clustered back-end application “services”.
All of this is load-balanced and multi-ported.
Extensive use of caching for improved performance and scalability; publish-subscribe is very popular.
A glimpse inside eStuff.com
Pub-sub combined with point-to-point communication technologies like TCP.
[Diagram: “front-end applications” connect through load balancers (LB) to a row of back-end services.]
Hierarchy of sets
A set of data centers, each having
  a set of services, each structured as
  a set of partitions, each consisting of
  a set of programs running in a clustered manner on
  a set of machines
… raising the obvious question: how well do platforms support hierarchies of sets?
A RAPS of RACS (Jim Gray)
RAPS: A reliable array of partitioned subservices
RACS: A reliable array of cloned server processes
[Diagram: a RAPS composed of a set of RACS. Ken Birman searches for “digital camera”; the pmap for partition “B-C” returns {x, y, z} (equivalent replicas). Here, y gets picked, perhaps based on load.]
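The slides show no code, but the routing step just described, partition the key space, find the RACS for the partition, then pick one equivalent replica by load, is easy to sketch. Everything below (the Racs class, the boundaries, the load numbers) is illustrative, not an actual QuickSilver interface:

```python
import bisect

class Racs:
    """A Reliable Array of Cloned Servers: equivalent replicas for one partition."""
    def __init__(self, replicas):
        self.replicas = replicas              # list of (name, current load)

    def pick(self):
        # Any replica can serve the request; pick the least loaded one.
        return min(self.replicas, key=lambda r: r[1])[0]

def partition_of(key, boundaries):
    """Map a key to a partition index via sorted key-range boundaries."""
    return bisect.bisect_right(boundaries, key)

# A RAPS: one RACS per key-range partition.
boundaries = ["b", "e"]                       # partitions: [..b), [b..e), [e..]
raps = [Racs([("u", 0.3)]),
        Racs([("x", 0.7), ("y", 0.2), ("z", 0.5)]),
        Racs([("w", 0.1)])]

query = "digital camera"
racs = raps[partition_of(query, boundaries)]  # the pmap step
print("route query to replica:", racs.pick()) # -> "y", the least-loaded clone
```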
RAPS of RACS in Data Centers
[Diagram: a query source and an update source send requests to services hosted at data centers A and B. Each data center runs a server pool; pmaps give the logical partitioning of services, and an l2P map takes logical services to the physical resource pool, perhaps many to one.]
Services are hosted at data centers but accessible system-wide.
Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.
Technology needs?
Programs will need a way to:
  Find the “members” of the service.
  Apply the partitioning function to find contacts within a desired partition.
  Do dynamic resource management: adaptation of RACS size and mapping to hardware.
  Do fault detection.
Within a RACS we also need to:
  Replicate data for scalability and fault tolerance.
  Load balance or parallelize tasks.
Scalability makes this hard!
Membership: within a RACS; of the service; of services in data centers.
Communication: point-to-point; multicast.
Resource management: pool of machines; set of services; subdivision into RACS.
Fault-tolerance. Consistency.
… hard in what sense?
Sustainable workload often drops at least linearly in system size.
And this happens because overheads grow worse than linearly (quadratic is common).
Reasons vary… but share a pattern:
  The frequency of “disruptive” events rises with scale.
  Protocols have the property that the whole system is impacted when these events occur.
QuickSilver project
We’ve been building a scalable infrastructure addressing these needs. It consists of:
  Some existing technologies, notably Astrolabe and gossip “repair” protocols.
  Some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications.
Gossip 101
Suppose that I know something. I’m sitting next to Fred, and I tell him. Now 2 of us “know”.
Later, he tells Mimi and I tell Anne. Now 4.
This is an example of a push epidemic. Push-pull occurs if we exchange data.
Gossip scales very nicely
Participants’ loads are independent of size.
Network load is linear in system size.
Information spreads in log(system size) time.
[Plot: % infected, rising from 0.0 to 1.0, versus time.]
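Those scaling claims are easy to check with a toy simulation. A minimal sketch, assuming the classic push model in which every informed node contacts one uniformly random peer per round (the function name and parameters are mine, not from the talk):

```python
import math
import random

def push_gossip_rounds(n, seed=0):
    """Rounds until a rumor started at one node reaches all n nodes,
    when each informed node pushes to one uniformly random peer per round."""
    random.seed(seed)
    infected = {0}                 # one node starts out knowing the rumor
    rounds = 0
    while len(infected) < n:
        # Every infected node tells one random peer (possibly a duplicate).
        infected |= {random.randrange(n) for _ in infected}
        rounds += 1
    return rounds

for n in (100, 1_000, 10_000):
    print(n, push_gossip_rounds(n), "rounds; log2(n) =", round(math.log2(n), 1))
```

Per-participant work stays at one message per round regardless of n, while the number of rounds grows roughly logarithmically, exactly the behavior the slide claims.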
Gossip in distributed systems
We can gossip about membership: we need a bootstrap mechanism, but can then discuss failures and new members.
We can gossip to repair faults in replicated data: “I have 6 updates from Charlie.”
And if we aren’t in a hurry, we can gossip to replicate data too.
Bimodal Multicast
ACM TOCS 1999
[Diagram: a push-pull gossip exchange between two processes:]
  “The gossip source has a message from Mimi that I’m missing. And he seems to be missing two messages from Charlie that I have.”
  “Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi’s 7th message?”
  “Mimi’s 7th message was: ‘The meeting of our Q exam study group will start late on Wednesday…’”
Send multicasts to report events. Some messages don’t get through. Periodically, but not synchronously, gossip about messages (see the sketch below).
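A minimal sketch of that repair loop, under assumptions of mine: in-memory message logs, and direct reads of a peer’s log standing in for the digest/solicit exchange. The class and method names are illustrative, not the TOCS 1999 implementation:

```python
import random

class Process:
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.log = {}                          # seqno -> message body

    def deliver(self, seqno, body):
        """Best-effort multicast delivery (may simply never happen)."""
        self.log[seqno] = body

    def gossip_round(self, processes):
        """Compare digests with one random peer; repair in both directions."""
        peer = processes[random.choice(self.peers)]
        for s in set(peer.log) - set(self.log):   # "send me Mimi's 7th message"
            self.log[s] = peer.log[s]
        for s in set(self.log) - set(peer.log):   # "here are Charlie's messages"
            peer.log[s] = self.log[s]

procs = {i: Process(i, [j for j in range(5) if j != i]) for i in range(5)}
for s in range(3):                               # three multicasts, lossy network
    for p in procs.values():
        if random.random() > 0.3:
            p.deliver(s, "msg-%d" % s)
for _ in range(4):                               # a few asynchronous gossip rounds
    for p in procs.values():
        p.gossip_round(procs)
# With high probability every process now holds all three messages:
print([sorted(p.log) for p in procs.values()])
```

Any message that reached at least one process spreads epidemically to all of them; that is the source of the protocol’s probabilistic (bimodal) delivery guarantee.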
Stock Exchange Problem: reliable multicast is too “fragile”
Most members are healthy… but one is slow.
The problem gets worse as the system scales up.
[Plot: “Virtually synchronous Ensemble multicast protocols”. Average throughput on non-perturbed members versus perturb rate (0 to 0.9), for group sizes 32, 64, and 96; throughput falls as the perturb rate rises, and falls faster in larger groups.]
[Plot: “Low bandwidth comparison of pbcast performance at faulty and correct hosts”. Average throughput versus perturb rate, for traditional multicast and pbcast, each measured at the perturbed host.]
[Plot: “High bandwidth comparison of pbcast performance at faulty and correct hosts”. Average throughput versus perturb rate for traditional multicast and pbcast, at both perturbed and unperturbed hosts.]
Bimodal multicast with perturbed processes: bimodal multicast scales well, while traditional multicast throughput collapses under stress.
Bimodal Multicast
Imposes a constant overhead on participants.
Many optimizations and tricks are needed, but nothing that isn’t practical to implement.
The hardest issues involve “biased” gossip to handle LANs connected by WAN long-haul links.
Reliability is easy to analyze mathematically using epidemic theory:
  Use the theory to derive optimal parameter settings.
  Theory also lets us predict behavior. Despite the simplified model, the predictions work!
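For reference, here is the standard epidemic recurrence behind such an analysis (this is the textbook push-gossip model, not necessarily the exact model of the TOCS paper). If p_t is the fraction of processes holding a message after round t, a process that lacks it escapes all N·p_t pushes with probability roughly e^(-p_t), so:

```latex
% One round of push gossip: a susceptible process escapes all N p_t pushes
% with probability (1 - 1/N)^{N p_t} \approx e^{-p_t}.
p_{t+1} \;=\; 1 - (1 - p_t)\, e^{-p_t}, \qquad p_0 = \tfrac{1}{N}.
```

Iterating from a single initial holder drives p_t to nearly 1 within O(log N) rounds, and the distribution of delivery counts is bimodal: almost always nearly everyone delivers, and very rarely almost no one does, hence the protocol’s name.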
Kelips
A distributed “index”: Put(“name”, value), Get(“name”).
Kelips can do lookups with one RPC and is self-stabilizing after disruption.
Kelips
Take a collection of “nodes”. [Diagram: nodes 30, 110, 202, 230 scattered about.]
Kelips
Map nodes to affinity groups. Affinity groups: peer membership through a consistent hash; nodes hash into groups 0, 1, …, √N − 1, with about √N members per affinity group.
[Diagram: nodes 30, 110, 202, 230 placed into groups 0, 1, 2.]
Kelips
Each node keeps an affinity group view: 110 knows about the other members of its group, 230, 30, …

  Affinity group view at node 110:
  id    hbeat  rtt
  30    234    90ms
  230   322    30ms
Kelips
Nodes also keep contact pointers into the other affinity groups. Here, 202 is a “contact” for 110 in group 2.

  Contacts at node 110:
  group  contactNode
  …      …
  2      202
Kelips
A gossip protocol replicates data cheaply. “cnn.com” maps to group 2, so node 110 tells group 2 to “route” inquiries about cnn.com to it.

  Resource tuples held in group 2:
  resource  info
  …         …
  cnn.com   110
Kelips
To look up “cnn.com”, just ask some contact in group 2. It returns “110” (or forwards your request).
IP2P, ACM TOIS (submitted)
Kelips
Per-participant loads are constant, and the space required grows as O(√N).
It finds an object in “one hop”, where most other DHTs need log(N) hops.
And it isn’t disrupted by churn, either; most other DHTs are seriously disrupted when churn occurs and might even “fail”.
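The lookup path is short enough to sketch end to end. A minimal sketch assuming N nodes and √N affinity groups; the function names and the in-memory tables are mine, and real Kelips maintains these soft-state tables by gossip rather than shared memory:

```python
import hashlib
import math

def group_of(name, num_groups):
    """Consistent hash of a name into one of the affinity groups."""
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h % num_groups

N = 9                                     # nodes in the system
num_groups = math.isqrt(N)                # sqrt(N) affinity groups

# Resource tuples that hash into each group, e.g. "cnn.com" -> home node.
resource_tuples = {g: {} for g in range(num_groups)}

def insert(name, home_node):
    """The home node tells the responsible group to route inquiries to it."""
    resource_tuples[group_of(name, num_groups)][name] = home_node

def lookup(name):
    """One RPC: ask any contact in the responsible group."""
    return resource_tuples[group_of(name, num_groups)].get(name)

insert("cnn.com", 110)
print(lookup("cnn.com"))                  # -> 110, in one hop
```

Because the responsible group is computable locally and a contact for it is already in hand, a query costs one RPC regardless of N.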
Astrolabe: Distributed Monitoring
  Name      Load  Weblogic?  SMTP?  Word Version  …
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

A row can have many columns; the total size should be k-bytes, not megabytes.
A configuration certificate determines what data is pulled into the table (and can change).
ACM TOCS 2003
State Merge: Core of Astrolabe epidemic

  swift.cs.cornell.edu’s copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

  cardinal.cs.cornell.edu’s copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  .67   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0
State Merge: Core of Astrolabe epidemic
[The two agents gossip. Here, swift sends its own row (swift, 2011, 2.0) and cardinal sends its own (cardinal, 2201, 3.5), the rows for which each has the larger Time.]
State Merge: Core of Astrolabe epidemic
After the merge, each copy keeps, row by row, whichever version carried the larger Time:

  swift.cs.cornell.edu’s copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

  cardinal.cs.cornell.edu’s copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0
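The merge rule itself fits in a few lines. A minimal sketch, assuming rows keyed by machine name and carrying the Time field shown above (the dict layout is mine):

```python
def merge(mine, theirs):
    """Astrolabe state merge: per row, the copy with the larger Time wins."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["time"] > merged[name]["time"]:
            merged[name] = row
    return merged

# swift.cs.cornell.edu's copy before the gossip exchange:
swift_view = {
    "swift":    {"time": 2011, "load": 2.0},
    "falcon":   {"time": 1971, "load": 1.5},
    "cardinal": {"time": 2004, "load": 4.5},
}
# Rows received from cardinal.cs.cornell.edu during the exchange:
received = {"cardinal": {"time": 2201, "load": 3.5}}

swift_view = merge(swift_view, received)
print(swift_view["cardinal"])   # {'time': 2201, 'load': 3.5}: the fresher row won
```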
Scaling up… and up…
With a stack of domains, we don’t want every system to “see” every domain: the cost would be huge. So instead, we’ll see a summary.
[Diagram: many copies of the same leaf-domain table, one per machine (e.g., cardinal.cs.cornell.edu), collapsing into a summary.]
Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers.

  San Francisco leaf domain:
  Name      Load  Weblogic?  SMTP?  Word Version  …
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

  New Jersey leaf domain:
  Name      Load  Weblogic?  SMTP?  Word Version  …
  gazelle   1.7   0          0      4.5
  zebra     3.2   0          1      6.2
  gnu       .5    1          0      6.2

  Inner (summary) domain:
  Name   Avg Load  WL contact    SMTP contact
  SF     2.6       123.45.61.3   123.45.61.17
  NJ     1.8       127.16.77.6   127.16.77.11
  Paris  3.1       14.66.71.8    14.66.71.12
An SQL query “summarizes” the data, and the dynamically changing query output is visible system-wide.
(1) The query goes out… (2) compute locally… (3) results flow to the top level of the hierarchy.
[The diagram repeats the tables above, annotated with these three steps; a sketch of the summarization step follows.]
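To make that summarization step concrete, here is a sketch of what one zone’s aggregation might look like, using SQLite in place of Astrolabe’s built-in SQL engine. The table and column names are mine, chosen to mirror the San Francisco leaf domain above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sf (name TEXT, load REAL, weblogic INT, smtp INT)")
db.executemany("INSERT INTO sf VALUES (?,?,?,?)", [
    ("swift", 2.0, 0, 1), ("falcon", 1.5, 1, 0), ("cardinal", 4.5, 1, 0)])

# The zone's aggregate row summarizes its children: average load plus a
# representative contact for each service.
row = db.execute("""
    SELECT AVG(load),
           (SELECT name FROM sf WHERE weblogic = 1 LIMIT 1),
           (SELECT name FROM sf WHERE smtp = 1 LIMIT 1)
    FROM sf""").fetchone()
print(row)   # (2.666..., 'falcon', 'swift'): the makings of the 'SF' summary row
```

The tuple printed is essentially the “SF” row of the inner domain (the real system publishes contact addresses rather than names), recomputed continuously as the leaf rows change.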
Hierarchy is virtual… data is replicated
[Diagram: the same leaf and summary tables as above.]
ACM TOCS 2003
Astrolabe
Load on participants, in the worst case, grows as log_rsize(N) (logarithm base rsize, the region size). Most participants see a constant, low load.
Incredibly robust and self-repairing.
Information is visible in log time, and we can reconfigure or change the aggregation query in log time, too.
Well matched to data mining.
QuickSilver: Current work
One goal is to offer scalable support for:
  Publish(“topic”, data)
  Subscribe(“topic”, handler)
Each topic is associated with a protocol stack and properties. Many topics… hence many protocol stacks (communication groups).
Quicksilver scalable multicast is running now and demonstrates this capability in a web services framework. The primary developer is Krzys Ostrowski.
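A toy, single-process rendering of that programming model; only the Publish/Subscribe signatures come from the slide. The Bus class is mine, and in QuickSilver each topic is backed by its own multicast group and protocol stack rather than a local loop:

```python
from collections import defaultdict

class Bus:
    """Local stand-in for a pub-sub message bus: topic -> handler fan-out."""
    def __init__(self):
        self.handlers = defaultdict(list)   # one communication group per topic

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, data):
        for h in self.handlers[topic]:      # in QSM: a multicast in topic's group
            h(data)

bus = Bus()
bus.subscribe("quotes/IBM", lambda d: print("tick:", d))
bus.publish("quotes/IBM", {"bid": 94.1})
```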
Tempest
This project seeks to automate a new drag-and-drop style of clustered application development. The emphasis is on time-critical response.
You start with a relatively standard web service application having good timing properties (inheriting from our data class). Tempest automatically clones services, places them, load-balances, and repairs faults.
It uses the Ricochet protocol for time-critical multicast.
Ricochet
The core protocol underlying Tempest. It delivers a multicast with:
  Probabilistically strong timing properties: three orders of magnitude faster than the prior record!
  Probability-one reliability, if desired.
The key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups.
Available for download from Cornell as a library (coded in Java).
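To illustrate the FEC half of that idea only (this is generic XOR forward error correction, not Ricochet’s actual packet format or its cross-group repair scheme): after every r data packets, a sender emits an XOR parity packet, so a receiver that lost one of the r can reconstruct it locally instead of waiting for a retransmission.

```python
def xor_bytes(packets):
    """XOR equal-length packets together; used to build and use parity."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

data = [b"pkt-0000", b"pkt-0001", b"pkt-0002"]   # r = 3 data packets
repair = xor_bytes(data)                          # the XOR parity packet

# Receiver got packets 0 and 2 but lost packet 1:
recovered = xor_bytes([data[0], data[2], repair])
assert recovered == data[1]
print("recovered:", recovered)
```

Recovery is purely local arithmetic, which is what makes FEC attractive when delivery latency, not just reliability, is the goal.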
Our system will be used in…
Massive data centers, distributed data mining, sensor networks, grid computing, and the Air Force “Services Infosphere”.
Our platform in a datacenter
[Diagram, as before: a query source and an update source send requests to services hosted at data centers A and B; pmaps give the logical partitioning of services, and an l2P map takes logical services to a physical server pool, perhaps many to one.]
Services are hosted at data centers but accessible system-wide; one application can be a source of both queries and updates.
Two Astrolabe hierarchies monitor the system: one tracks the logical services, and the other tracks the physical server pool.
Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.
Next major project?
We’re starting a completely new effort. The goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication.
Stefan Pleisch has worked on the mobile query problem. Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…
Summary
Our project builds software: software that real people will end up running. But we tell users when it works, and prove it!
The focus lately is on scalability and QoS: theory, engineering, experiments, and simulation.
For scalability, set probabilistic goals and use epidemic protocols. But the outcome will be real systems that we believe will be widely used.