Availability and Performance in Wide-Area Service Composition
Bhaskaran Raman
EECS, U.C. Berkeley
July 2002
Problem Statement (Continued)
• Poor availability of wide-area (inter-domain) Internet paths
– BGP recovery can take several tens of seconds
• Why does it matter?
• Streaming applications
– Real-time
• Session-oriented applications
– Client sessions lasting several minutes to hours
• Composed applications
Service Composition: Motivation
[Figure: two example service-level paths. An email repository (Provider R) feeds a text-to-speech service (Provider Q), which delivers audio to a cellular phone; a video-on-demand server (Provider A) feeds a transcoder (Provider B), which delivers video to a thin client.]
Other examples: ICEBERG, IETF OPES’00
Goals, Assumptions and Non-goals
• Goals
– Availability: detect and handle failures quickly
– Performance: choose set of service instances
– Scalability: Internet-scale operation
• Operational model:
– Service providers deploy different services at various network locations
– Next-generation portals compose services
– Code is NOT mobile (mutually untrusting service providers)
• We do not address the service-interface issue
• Assume that service instances have no persistent state
– Not very restrictive [OPES’00]
Related Work
• Other efforts have addressed:
– Semantics and interface definitions
• OPES (IETF), COTS (Stanford)
– Fault-tolerant composition within a single cluster
• TACC (Berkeley)
– Performance-constrained choice of service, but not for composed services
• SPAND (Berkeley), Harvest (Colorado), Tapestry/CAN (Berkeley), RON (MIT)
• None address wide-area network performance or failure issues for long-lived composed sessions
Outline
• Architecture for robust service composition
– Failure detection in wide-area Internet paths
• Evaluation of effectiveness/overheads
– Scaling
– Algorithms for load-balancing
– Wide-area experiments demonstrating availability
• Text-to-speech composed application
Requirements to achieve goals
• Failure detection/liveness tracking
– Server and network failures
• Performance information collection
– Load, network characteristics
• Service location
• Global information is required
– Hop-by-hop approach will not work
Design challenges
• Scalability and global information
– Information about all service instances, and the network paths in between, should be known
• Quick failure detection and recovery
– Internet dynamics: intermittent congestion
Failure detection: trade-off
• What is a “failure” on an Internet path?
– Outage periods happen for varying durations
[Figure: heartbeat timelines. Path liveness is monitored with periodic keep-alive heartbeats; a failure is detected when no heartbeat arrives within the timeout period. A false positive occurs when a failure is detected incorrectly (heartbeats resume just after the timeout), causing unnecessary recovery overhead.]
There’s a trade-off between time-to-detection and rate of false-positives
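To make the trade-off concrete, below is a minimal sketch of a timeout-based liveness monitor (illustrative names, not the system's actual code); shrinking `timeout` detects real outages sooner but misclassifies more short congestion gaps as failures.

```python
import time

class HeartbeatMonitor:
    """Declares a path failed if no keep-alive arrives within `timeout` sec."""

    def __init__(self, timeout=1.9):   # ~1.8-2 s, as studied on later slides
        self.timeout = timeout
        self.last_heard = time.monotonic()

    def on_heartbeat(self):
        # Called whenever a keep-alive heartbeat arrives from the peer.
        self.last_heard = time.monotonic()

    def is_failed(self):
        # True once the silence has exceeded the timeout period.
        return time.monotonic() - self.last_heard > self.timeout
```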
Is “quick” failure detection possible?
• Study outage periods using traces
– 12 pairs of hosts
• Berkeley, Stanford, UIUC, CMU, TU-Berlin, UNSW
• Some trans-oceanic links, some within the US (including Internet2 links)
– Periodic UDP heartbeat, every 300 ms
– Measure “gaps” between receive times: outage periods
– Plot CDF of gap periods (see the sketch after this list)
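A sketch of this post-processing step, assuming the raw trace is just a list of heartbeat receive timestamps in seconds (names are illustrative):

```python
def gap_cdf(recv_times):
    """Empirical CDF of gaps between consecutive heartbeat receive times."""
    gaps = sorted(b - a for a, b in zip(recv_times, recv_times[1:]))
    n = len(gaps)
    # Each point: (gap length, fraction of gaps <= that length).
    return [(g, (i + 1) / n) for i, g in enumerate(gaps)]

# Heartbeats every ~0.3 s, with one ~2.1 s outage in the middle:
print(gap_cdf([0.0, 0.3, 0.6, 2.7, 3.0]))
```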
CDF of gap distributions (continued)
• Failure detection close to the ideal case
• For a timeout of about 1.8-2 sec
– False-positive rate is about 50%
• Is this bad?
– Depends on:
• Effect on application
• Effect on system stability, absolute rate of occurrence
Towards an Architecture
• Service execution platforms
– For providers to deploy services
– First-party or third-party service platforms
• Overlay network of such execution platforms
– Collect performance information
– Exploit redundancy in Internet paths
Architecture
[Figure: service clusters (compute clusters capable of running services) peer with one another to exchange performance information, forming an overlay network over the Internet. Three views of the same system: the hardware platform (service clusters), the logical platform (peering relations, overlay network), and the application plane (composed services from source to destination).]
Key Design Points
• Overlay size:
– Could grow much slower than #services, or #clients
– How many nodes?
• A comparison: Akamai cache servers
• O(10,000) nodes for Internet-wide operation
• Overlay network is virtual-circuit based:
– “Switching-state” at each node (see the sketch after this list)
• E.g., source/destination of RTP stream, in transcoder
– Failure information need not propagate for recovery
• Problem of service location separated from that of performance and liveness
• Cluster process/machine failures handled within
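A minimal sketch of what such per-session switching-state might look like at one overlay node; the field names are assumptions for illustration, not the system's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class SwitchingState:
    """Per-session state held at an overlay node on a service-level path.

    Because this state lives at the nodes (virtual-circuit style), a node
    next to a failed overlay link can re-route the session locally, so
    failure information need not propagate network-wide before recovery.
    """
    session_id: str
    upstream: str     # previous overlay node (or the source)
    downstream: str   # next overlay node (or the destination)
    service: str      # service instance run at this node

# A transcoder node's view of one video session:
state = SwitchingState("sess-42", upstream="vod-server",
                       downstream="thin-client", service="transcoder")
```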
Software Architecture
[Figure: layered functionality at the cluster-manager. Service-Composition Layer: service-level path creation, maintenance, and recovery; finding overlay entry/exit; location of service replicas. Link-State Layer: link-state propagation. Peer-Peer Layer: at-least-once UDP, performance measurement, liveness detection.]
Functionalities at the Cluster-Manager
Layers of Functionality
• Why link-state?
– Need full graph information
– Also, quick propagation of failure information
– Link-state flood overheads?
• Service-Composition layer:
– Algorithm for service composition (sketched after this list)
• Modified version of Dijkstra’s
– To accommodate constraints in the service-level path
• Additive metric (latency)
• Load-balancing metric
– Computational overheads?
– Signaling for path creation, recovery
• Downstream to upstream
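The slides leave the modification to Dijkstra's unspecified; one standard way to enforce "visit services s1..sk in order" is to search a layered graph, where layer i means "the first i services have been applied". A sketch under that assumption, consistent with the O(k*E*log N) cost quoted on the next slide:

```python
import heapq

def service_path_cost(graph, services, src, dst, hosts):
    """Min-latency service-level path: src -> services[0] -> ... -> dst.

    graph[u] = [(v, latency), ...]; hosts[s] = set of nodes running s.
    Search state (node, layer): layer i = first i services applied.
    """
    k = len(services)
    dist = {(src, 0): 0.0}
    pq = [(0.0, src, 0)]
    while pq:
        d, u, layer = heapq.heappop(pq)
        if (u, layer) == (dst, k):
            return d
        if d > dist.get((u, layer), float("inf")):
            continue  # stale queue entry
        steps = [(v, layer, d + w) for v, w in graph[u]]  # traverse overlay link
        if layer < k and u in hosts[services[layer]]:
            steps.append((u, layer + 1, d))               # apply next service here
        for v, l, nd in steps:
            if nd < dist.get((v, l), float("inf")):
                dist[(v, l)] = nd
                heapq.heappush(pq, (nd, v, l))
    return None  # no feasible service-level path
```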
Link-State Overheads
• Link-state floods:
– Twice for each failure
– For a 1,000-node graph
• Estimate #edges = 10,000
– Failures (>1.8 sec outage): O(once an hour) in the worst case
– Only about 6 floods/second in the entire network! (arithmetic below)
• Graph computation:
– O(k*E*log(N)) computation time; k = #services composed
– For a 6,510-node network, this takes 50 ms
– Huge overhead, but: path caching helps
– Memory: a few MB
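A quick check of the flood-rate claim from the numbers above:

```python
# Worst case: every edge fails (>1.8 s outage) about once an hour, and
# each failure triggers two floods (link down, then link back up).
edges = 10_000
floods_per_edge_per_hour = 1 * 2
rate = edges * floods_per_edge_per_hour / 3600
print(f"{rate:.1f} floods/second")  # ~5.6, i.e. about 6
```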
Evaluation: Scaling
• Scaling bottleneck:
– Simultaneous recovery of all client sessions on a failed overlay link
• Parameter
– Load: number of client sessions with a single overlay node as exit node
• Metric
– Average time-to-recovery of all paths failed and recovered
Evaluation: Emulation Testbed
• Idea: Use real implementation, emulate the wide-area network behavior (NistNET)
• Opportunity: Millennium cluster
[Figure: emulation testbed. Application and library instances run on cluster nodes 1-4; all inter-node traffic passes through a NistNET emulator, which applies a separate delay/loss rule for each node pair (e.g., rules for 1→2, 1→3, 3→4, 4→3).]
Scaling Evaluation Setup
• 20-node overlay network
– Created over a 6,510-node physical network
– Physical network generated using GT-ITM
• Latency variation: according to [Acharya & Saltz 1995]
• Load per cluster-manager (CM)
– Varied from 25 to 500
• Paths set up using the latency metric
• 12 different runs
– Deterministic failure of the link with the maximum #client paths
– Worst case for a single-link failure
Path creation: load-balancing metric
• So far, used a latency metric
– In combination with the modified Dijkstra’s algorithm
– Not good for balancing load
• How to balance load across service instances?
– During path creation and path recovery
• QoS literature:
– Sum(1/available-bandwidth) for bandwidth balancing
• Applying this to server load balancing:
– Metric: Sum(1/(max_load − curr_load))
– Study interaction with
• Link-state update interval
• Failure recovery
Dealing with load variation
• Decreasing the link-state update interval
– More messages
– Could lead to instability
• Use path-setup messages to update load (sketched after this list)
– Do it all along the path
• Each node that sees the path-setup message
– Adds its load info to the message
– Records all load info collected so far
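A sketch of the piggybacking idea with illustrative names: each node stamps its own load into the setup message and caches everything the message has gathered so far.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    current_load: int
    load_cache: dict = field(default_factory=dict)  # freshest loads seen

def forward_setup(node, setup_msg):
    """Process a path-setup message at one node along the path."""
    setup_msg["loads"][node.name] = node.current_load  # add own load info
    node.load_cache.update(setup_msg["loads"])         # record all seen so far
    return setup_msg

msg = {"loads": {}}
for n in [Node("A", 3), Node("B", 7), Node("C", 1)]:
    msg = forward_setup(n, msg)
print(msg["loads"])  # {'A': 3, 'B': 7, 'C': 1}
```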
Fixing the long-path effect
• Metric: Sum_services(1/(max_load − curr_load)) + Sum_noop(0.1/(max_load − curr_load))
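A sketch of this metric, assuming a path is encoded as a list of hops flagged as service or no-op (pass-through) nodes; discounting no-op hops by 0.1 still penalizes needlessly long paths without letting them dominate the service terms.

```python
def path_cost(hops):
    """hops: list of (is_service, max_load, curr_load) per overlay node."""
    cost = 0.0
    for is_service, max_load, curr_load in hops:
        weight = 1.0 if is_service else 0.1   # no-op hops are discounted
        cost += weight / (max_load - curr_load)
    return cost

# One transcoder hop plus one pass-through hop:
print(path_cost([(True, 10, 6), (False, 10, 2)]))  # 0.25 + 0.0125 = 0.2625
```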
Wide-Area experiments: setup
• 8 nodes:
– Berkeley, Stanford, UCSD, CMU
– Cable modem (Berkeley)
– DSL (San Francisco)
– UNSW (Australia), TU-Berlin (Germany)
• Text-to-speech composed sessions
– Half with destinations at Berkeley, half at CMU
– Half with the recovery algorithm enabled, the other half disabled
– 4 paths in the system at any time
– Duration of a session: 2 min 30 sec
– Run for 4 days
• Metric: loss-rate measured over 5-sec intervals
Improvement in Availability
Availability % (Client at Berkeley)
Day     Without recovery   With recovery
Day 1 99.58 99.63
Day 2 99.65 99.67
Day 3 99.65 99.65
Day 4 99.86 99.91
Day 5 99.87 99.92
Day 6 99.63 99.69
Day 7 99.84 99.88
Day 8 99.71 99.80
Day 9 99.79 99.93
Day 10 99.10 99.23
Day 11 99.86 99.88
Availability % (Client at CMU)
Day     Without recovery   With recovery
Day 1 99.59 99.59
Day 2 99.73 99.96
Day 3 99.79 99.98
Day 4 100.00 100.00
Day 5 99.45 99.45
Day 6 98.29 98.67
Day 7 95.79 96.21
Day 8 97.43 97.45
Day 9 98.98 98.99
Day 10 97.98 97.96
Day 11 98.69 98.74
Split of recovery time
• Text-to-Speech application
• Two possible places of failure
[Figure: text-to-speech session: text source → text-to-audio service → end client, with the two overlay legs (leg-1, leg-2) as the two possible places of failure. Legend: request-response protocol; data (text, or RTP audio); keep-alive soft-state refresh; application soft-state (for restart on failure).]
Split of Recovery Time (continued)
• Recovery time:
– Failure detection time
– Signaling time to set up the alternate path
– State restoration time
• Experiment using the tts application, using emulation
– Recovery time = 3,300 ms
– 1,800 ms failure detection time
– 700 ms signaling
– 450 ms for state restoration
• New tts engine has to re-process the current sentence