1
A Research Program in Reliable Adaptive Distributed Systems (RADS)
Armando Fox*, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
University of California, Berkeley and *Stanford University
2
Presentation Outline
• Why We Need a New Approach to Networked Systems
• New Design Philosophy for RADS
• Applying the Philosophy: Early Experience with Specific Approaches
– Approaches for Software and Hardware Dependability
– Approaches for Networking
– Approaches for Security
– Applying SLT to dependability problems
• Elements of a unified Experimental Prototype
• Summary and Conclusions
3
New Approach for RADS (Reliable Adaptive Distributed Systems)
Dramatically improve the trustworthiness of networked systems
• Observe: design observation points throughout system
• Analyze: SLT (statistical learning theory) as an enabling technology
– Respond: detect anomalous behavior vs. baseline
– Learn: use observations to modify responses to future observations
• Act:
– Reactive: use control points in system for rapid recovery if something wrong is detected
– Proactive/protective: prophylactically act on system to prevent predicted impending failure
4
Today’s Systems are Too Brittle
• Fragile, easily broken, yielding poor trustworthiness (dependability and security).
– Amazon: Revenue $3.1B, Downtime Costs $600,000 per hour
• Why? Overly focused on performance, performance, and cost-performance
• Systems based on fundamentally incorrect assumptions
– Humans are perfect
– Software will eventually be bug free
– Maintenance is “free”
• People/HW/SW failures are facts, not problems
“If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”)
5
If Failure is Inevitable...then Design for Rapid Adaptation
• Encompasses rapid server recovery, network rerouting, prophylactic/protective actions...
• Blurs distinction between “normal operation” and “recovery”
• Elements of the solution
– Programming paradigms for robust recovery
– Crash-only software design for rapid server recovery
– Network protocols designed for rapid detection of assertion violations
– Instrumentation and SLT for online analysis, anomaly detection, and diagnosis of failure
• Recovery benchmarks to measure progress
– What you can’t measure, you can’t improve
– Collect real failure data to drive benchmarks
6
RADS Conceptual Architecture
[Figure: architecture diagram. Labeled elements: Client, Server, Distributed Middleware, SLT Services, Edge Network, PNE, Router, Commodity Internet & IP networks, Application-Specific Overlay Network, Operator, User.]
• Prototype Applications: E-voting, Messaging, E-Mail, etc.
• Programming Abstractions for Roll-back (Necula)
• Crash-Only Middleware & Servers, System O&C Infrastructure (Fox)
• Protocols Enabling Fast Detection & Route Recovery, Network O&C Infrastructure (Katz, Stoica)
• Online Statistical Learning Algorithms (Jordan)
• Benchmarks, Tools for Human Operators (Patterson)
• Reduction to practice of online SLT and observe/analyze/act infrastructure
• Reusable embeddable components
8
Crash-Only Software: Dramatically Simplifying Recovery
• Since robust systems must be crash-safe, make crashes the only supported form of shutdown/restart
• Software components’ external “power switch” is independent of misbehaving component
• Recovery becomes inexpensive/safe to try
– Simplifies failure detection, since can be overly aggressive
– Simplifies recovery, since only 1 type of recovery action and always safe to try
– Idea: if something looks anomalous, it’s probably wrong
• Can machine learning and statistical monitoring approaches be applied during online operations?
9
Crash-Only Software: Practical to Build
• Case studies: two crash-only state-storage subsystems (for session state and durable state)
– OK to crash any node at any time for any reason
– Recovery is highly predictable, doesn’t impact online performance
– Replication provides probabilistic durability & capacity during recovery
– Access pattern of workload exploited for consistency guarantees
• 9 “activity” & “state” statistics monitored per storage brick
– Metrics compared against those of “peer” bricks
– Basic idea: Changes in workload tend to affect all bricks equally
– Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
– Anomaly in 6 or more (out of 9) metrics => reboot brick
– Simple thresholding and substring-frequency used to determine “anomalous”
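The peer-comparison rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's monitoring code: the slide only states that 9 metrics are compared against peer bricks and that 6 or more anomalous metrics trigger a reboot. The function names, the metric dictionaries, and the median-based deviation test (with its `tolerance` parameter) are assumptions made for the example.

```python
from statistics import median

REBOOT_VOTES = 6  # from the slide: anomaly in 6 or more (out of 9) metrics


def anomalous_metrics(brick, peers, tolerance=3.0):
    """Return the names of metrics where `brick` deviates from its peers.

    Hypothetical rule: a metric is anomalous when it lies more than
    `tolerance` median absolute deviations from the peer median.
    """
    flagged = []
    for name, value in brick.items():
        peer_values = [p[name] for p in peers]
        center = median(peer_values)
        spread = median(abs(v - center) for v in peer_values)
        # Fall back to a small relative band when all peers agree exactly.
        limit = tolerance * max(spread, 0.01 * abs(center), 1e-9)
        if abs(value - center) > limit:
            flagged.append(name)
    return flagged


def should_reboot(brick, peers):
    """Apply the 6-of-9 voting rule to decide whether to reboot a brick."""
    return len(anomalous_metrics(brick, peers)) >= REBOOT_VOTES
```

Because every brick is judged against its peers rather than against a fixed baseline, a workload shift that moves all bricks together does not raise the vote count.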
10
Supporting Crash-Only in Middleware
• Add observation & control points to Java application middleware
– Observe: capture paths taken through system by user request
– Analyze: look for highly unlikely, anomalous (therefore probably buggy) paths
– Act: micro-reboot suspected-faulty J2EE components transparently to rest of system
• Result: fast recovery improves overall performability
– micro-reboot is 2-3 orders of magnitude faster than full application reboot
– Improves performability (total amount of work per unit time in presence of faults)
– Minimizes disruption to users of other (non-faulty) parts of system
Fast, cheap micro-reboots (uRBs) + statistical monitoring provide a degree of application-generic failure detection & recovery
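The observe/analyze/act cycle described on this slide might look like the sketch below. It is a deliberately crude stand-in: the slide does not say how path likelihood is computed, so a plain relative-frequency test is assumed here, and `PathMonitor`, `handle_request`, and the `micro_reboot` callback are hypothetical names. Rebooting every component on a suspicious path is also a simplification.

```python
from collections import Counter


class PathMonitor:
    """Track how often each component path is taken by requests."""

    def __init__(self, min_probability=0.01):
        self.counts = Counter()
        self.total = 0
        self.min_probability = min_probability  # assumed threshold

    def observe(self, path):
        """Observe: record one request's path through the components."""
        self.counts[tuple(path)] += 1
        self.total += 1

    def is_anomalous(self, path):
        """Analyze: a path is suspicious if it is rarely or never seen."""
        if self.total == 0:
            return False
        return self.counts[tuple(path)] / self.total < self.min_probability


def handle_request(monitor, path, micro_reboot):
    """Act: micro-reboot the components on an anomalous path."""
    if monitor.is_anomalous(path):
        for component in path:
            micro_reboot(component)
        return True
    monitor.observe(path)
    return False
```

The design relies on micro-reboots being cheap: a false alarm costs little, so the detector can afford to be aggressive.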
11
Crash-Oriented Software: Systematic Approach
• Needed: Systematic mechanism for determining when micro-reboots are safe
– Programming-language level support for rollback and state tracking
• Needed: Better integration with SLT
– Which clustering/analysis techniques best correlate anomalous paths to particular observed failure types? (current prototype uses very simple data clustering techniques)
– Are these techniques suitable for online use? (current prototype does offline analysis)
13
Research Challenges
• No protection against DoS attacks
– MS Blaster inflicted Internet packet loss > 20%
• Routing protocols blindly believe routes advertised by neighbors
– BGP router misconfigurations
» 200-1200 prefixes affected every day
» C&W’s (AS3561) misconfiguration caused an outage for > 5000 prefixes for 2 hours (April 2001)
– Malicious routers: huge potential threat
» Drop packets and render a destination unreachable
» Eavesdrop on the traffic to a given destination
» Impersonate the destination
14
Observe, Analyze, Act
• Observe:
– Use multiple vantage points to monitor the network
– Design protocols whose behaviors can be verified
• Analyze:
– Learn from protocol behavior
– Identify bogus information
• Act:
– Contain misbehaving components
– Raise flags for network operators
– Empower end-hosts (e.g., enable end-hosts to stop unwanted packets in the network infrastructure)
» End-hosts know better when under attack (flash crowds vs. DoS attacks)
» End-hosts can react faster than infrastructure
[Figure: sender and receiver]
15
Case Study: BGP (Listen & Whisper)
• Whisper:
– Use redundancy to check route advertisements for consistency
• Listen:
– Monitor TCP flow progress to detect reachability problems
• Results:
– Whisper: reduce the region of Internet vulnerable to an isolated adversary to 5%
» Scalable implementation: can handle 10 times today’s BGP load
– Listen: detect reachability problems
» Probability of false positives ~ 1%
» Vulnerable to port scans; plan to use SLT
16
Programmable Network Elements
• Enabling Technology
– Edge network elements for IDS, firewall, traffic shaping, etc.
– Next generation: exposed APIs for 3rd party programming
– Location for efficient network-level monitoring and control
» Observe: rapid detection of route failure or network attack
» Act: e.g., filter intrusions, quarantine propagating worms
» Avoid configuration and “latest patch not installed” errors
[Figure: two edge networks connected through routers across the commodity Internet & IP networks; element stages labeled In-Port, Classify, Transform, Out-Port]
18
Research Challenge: Self-sensing and Reactive Systems
• Internet scale attacks are fundamentally different from host scale attacks
• Traditional “Intrusion Detection Systems” (IDS) have had some success with host scale attacks, but also many false positives
• Internet scale attacks offer opportunity (more evidence of wide scale attack) but also more challenge (integrating data from a large number of disparate sources)
19
Observe, Analyze, Act
• Observe: what to monitor, how to monitor
• Analyze: Learning from patterns of messages (not parsing their contents)
• Act:
– How to exchange minimal information (in system under attack)
– Rapidly evolving security protocols (for resilience to attack)
• Applications: Worm detection, spam detection
• Ultimate challenge: beyond detection and into response
20
Security of Networked Systems: Technical Approach
• Mechanisms to learn, share, repair against potential threats to dependability
• Strengthen assurance of shared information via lightweight authentication and encryption
– TESLA authentication system: replaces public-key crypto with lightweight symmetric-key cryptography; uses time asymmetry to provide assurance
» Messages initially protected with a key that is not yet public; verification keys revealed later, which prevents attacker from using a received key to forge messages
» Variations provide instant authentication
– Athena system: generate random instances of secure protocols
» Ultra-fast checking software: model-checking & proof-theoretic techniques to verify protocols against stated requirements
» Intelligently generate most efficient secure protocol satisfying requirements or a random instance of a secure protocol satisfying a given set of requirements
– Apply to SLT systems to exchange information more quickly
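The "time asymmetry" idea behind TESLA can be illustrated with a small sketch. It assumes the commonly described construction: a one-way hash chain of keys, a message authentication code (MAC) computed with a key that has not yet been disclosed, and later disclosure of that key. The function names are hypothetical, SHA-256 and HMAC are choices made for the example, and the clock synchronization and timing checks that a real deployment needs are omitted.

```python
import hashlib
import hmac


def make_key_chain(seed: bytes, length: int) -> list:
    """Return keys K_0..K_length with K_{i-1} = H(K_i); K_0 is the public commitment."""
    chain = [seed]
    for _ in range(length):
        chain.append(hashlib.sha256(chain[-1]).digest())
    chain.reverse()  # chain[0] is the commitment, chain[i] is the key for interval i
    return chain


def tag(key: bytes, message: bytes) -> bytes:
    """MAC a message with an interval key that has not been disclosed yet."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(commitment: bytes, interval: int, disclosed_key: bytes,
           message: bytes, mac: bytes) -> bool:
    """Check a buffered message once the sender discloses the interval key."""
    # Hashing the disclosed key `interval` times must reach the commitment.
    k = disclosed_key
    for _ in range(interval):
        k = hashlib.sha256(k).digest()
    if not hmac.compare_digest(k, commitment):
        return False
    return hmac.compare_digest(tag(disclosed_key, message), mac)
```

A receiver buffers `(message, mac)` pairs and calls `verify` only after the key arrives; a key that is already public is useless for forging messages from earlier intervals, provided the receiver rejects messages that arrive after disclosure.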
22
Statistical Learning Theory
• Toolbox for design/analysis of adaptive systems
– Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making
• Classification algorithms
– Recent scaling breakthroughs: 10K+ features, millions of data points
– Kernel machines; functional analysis and convex optimization
» Generalized inner product: similarities among data point pairs
» Defined for many data types
» Classical linear statistical algorithms “kernelized” for state-of-the-art nonlinear SLT algorithms in many areas
23
Statistical Learning Theory
• Novelty Detection Problem
– Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug
» E.g. intrusion detection, machine diagnostics
– Second-order cone program; a convex optimization problem with an efficient solution method
» Given cloud of data in a high-dimensional feature space, place a boundary around these to guarantee that only a small fraction falls outside
• Basic problem: find a boundary that encloses a desired fraction of the data, and is as tight as possible
– can be done using the generalized Chebyshev inequality
– using kernels, this is a convex problem
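For a linear boundary, one standard form of the generalized Chebyshev bound shows where the second-order cone constraint comes from. Whether the slide has exactly this form in mind is an assumption; the symbols a, b (the boundary a^T x = b), μ, Σ (mean and covariance of the data), and α (the fraction of data to enclose) are introduced here for illustration only.

```latex
% Worst case over all distributions of x with mean \mu and covariance \Sigma,
% for a boundary with a^{\top}\mu \ge b:
\sup_{x \sim (\mu, \Sigma)} \Pr\left(a^{\top} x \le b\right) = \frac{1}{1 + d^{2}},
\qquad d^{2} = \frac{(a^{\top}\mu - b)^{2}}{a^{\top}\Sigma\, a}.

% Requiring at least a fraction \alpha of the data on the side a^{\top} x \ge b:
\frac{1}{1 + d^{2}} \le 1 - \alpha
\;\Longleftrightarrow\;
a^{\top}\mu - b \ge \kappa(\alpha)\,\sqrt{a^{\top}\Sigma\, a},
\qquad \kappa(\alpha) = \sqrt{\frac{\alpha}{1 - \alpha}}.
```

The final inequality is linear on the left and a norm on the right, which is the shape of a second-order cone constraint; the guarantee holds for any data distribution with the given mean and covariance.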
24
Example: Statistical Bug-finding
• Programs are buggy, yet many people use them
– Instrument programs to take samples of program state at runtime
– Collect information over the Internet from many users’ runs
– Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
• Example: finding a bug in Unix bc utility
– 2908 features instrumented
– All top features indicate indx being unusually large in more_arrays subroutine:
storage.c:176: more_arrays(): indx > optopt
storage.c:176: more_arrays(): indx > opterr
storage.c:176: more_arrays(): indx > use_math
– Indeed, array overrun bug in re-allocation routine more_arrays() found to cause memory corruption and sometimes an eventual crash
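The idea of ranking instrumented predicates by how strongly they separate failed runs from successful ones can be sketched as below. This is a simplified stand-in for the classifier and feature selection mentioned on the slide, not the method actually used: `rank_predicates` and its scoring rule (difference in how often a predicate is observed true in failed versus successful runs) are assumptions made for illustration.

```python
def rank_predicates(runs):
    """Rank predicates by association with failure.

    `runs` is a list of (failed, true_predicates) pairs, where `failed`
    is a bool and `true_predicates` is the set of predicate strings
    observed to be true during that run.
    """
    failed = [preds for did_fail, preds in runs if did_fail]
    passed = [preds for did_fail, preds in runs if not did_fail]
    names = set().union(*(preds for _, preds in runs))
    scores = {}
    for name in names:
        # Fraction of failed (resp. successful) runs where the predicate held.
        in_failed = sum(name in p for p in failed) / max(len(failed), 1)
        in_passed = sum(name in p for p in passed) / max(len(passed), 1)
        scores[name] = in_failed - in_passed
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))
```

Predicates that are true mostly in failing runs float to the top of the list, which is the behavior the bc example describes for the `indx > ...` predicates.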
25
Example III: Diagnosis
• A probabilistic graphical model with 600 disease nodes, 4000 finding nodes
• Node probabilities p(f_i | d) were assessed from an expert (Shwe, et al., 1991)
• Want to compute posteriors: p(d_j | f)
• Is this tractable?
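Writing the posterior out shows why tractability is in question. The expression below is a sketch under the assumption (not stated on the slide) that each of the 600 disease nodes is binary, so the joint disease vector d has 2^600 configurations.

```latex
p(d_j \mid f)
  = \frac{\sum_{d \setminus d_j} p(f \mid d)\, p(d)}
         {\sum_{d} p(f \mid d)\, p(d)},
\qquad d \in \{0, 1\}^{600}
```

Evaluated naively, the denominator alone sums over 2^600 (roughly 10^180) terms, so exact computation depends on exploiting structure in the model or on approximation.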
26
Case Study: Medical Diagnosis
• Symbolic complexity:
– symbolic expressions fill dozens of pages
– would take years to compute
• Numerical simplicity:
– Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second
27
Challenge for SLT
• Challenge: “on-line” versions of the best algorithms have yet to be developed
– update the learning system’s state based on small sets of data
– Available for some “kernelized” problems
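To make "update the learning system's state based on small sets of data" concrete, here is the simplest possible case: a running mean and variance maintained one observation at a time (Welford's method). It is only an illustration of the online style of computation, not one of the kernelized algorithms the slide refers to, and the class name is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the state in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 with fewer than 2 points)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

No past observations are stored, which is the property an online monitor embedded in a running system needs.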
29
System Prototype
• Comprehensive system architecture
• Reduction of SLT to practical software components embedded within a distributed systems context
• Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms.
30
Messaging as an Application
• E-mail is now a mission-critical application
– Organizational storage capacity shifting from financial databases to email (email is fastest growing storage)
– Loss of email more critical to continuing operation of organization than telephony (imagine if gov’t had no email for a week)
• Instant Messaging is now a mission-critical application
– In a crisis, many communication schemes will be used: land-based telephony, cellular telephony, instant messaging, email, …
– Coordination among first-responders during crisis response in field (administrators & operators)
• Demands for dependability, resistance to attack, establishment of trust among interacting entities
– Despite attempts by hackers, terrorists, …
31
Measuring Success
• Build email/IM prototype using RADS design principles and tools
• Put realistic performance workload on prototype
• Subject prototype to increasingly difficult failure workloads and attack workloads
– E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, …
• Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, …
• Compare results to conventional email/IM systems under similar performance, failure, and attack workloads
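Some of the measurements listed above could be computed from experiment logs along these lines. This is a minimal sketch with assumed input formats: `detection_rates` takes a list of (fault present, alarm raised) pairs and `availability` takes uptime and downtime in seconds. The slides do not specify how the data would be recorded.

```python
def detection_rates(events):
    """Return (false_positive_rate, accuracy) for a list of
    (fault_present, alarm_raised) boolean pairs."""
    tp = sum(f and a for f, a in events)
    fp = sum((not f) and a for f, a in events)
    tn = sum((not f) and (not a) for f, a in events)
    # False positive rate: alarms raised among fault-free intervals.
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    # Accuracy: fraction of intervals where the detector was right.
    accuracy = (tp + tn) / len(events) if events else 0.0
    return false_positive_rate, accuracy


def availability(uptime_seconds, downtime_seconds):
    """Fraction of total time the prototype was able to serve requests."""
    total = uptime_seconds + downtime_seconds
    return uptime_seconds / total if total else 0.0
```

Running the same functions over logs from a conventional email/IM system under the same workloads would give the comparison the last bullet calls for.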
32
Disaster Response Messaging Application
[Figure: network diagram. Labeled elements: DHS/Federal Network; Coalition Internet; Allies Networks; Local Police, Fire, State Police; Adversary; Active Adversary Service Attacks; Compromised Network With Embedded Adversaries; Trust Relations; Net Failure. Traffic shown: Incident Reports, Responder Locations, GIS Data, etc.]
34
Old Science vs. New Science
• First 50 years of computer science
– manually-engineered systems
– lack of adaptability, robustness, and security
– no concern with closing the loop with the environment
• Next 50 years of computer science
– statistical learning systems throughout the infrastructure
– self-configuring, adaptive, sentient systems
– perception, reasoning, decision-making cycle
– systems are “always” recovering because of this ongoing automatic and dynamic adaptation
• New way to think about and design adaptive systems
– Makes continuous monitoring and reaction a first-class goal
– Provides point of leverage for applying SLT and related techniques
35
Scientific Foundation For “Self-*” Systems
• New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
• New metrics and benchmarks for evaluating self-adapting networked systems
• Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems
37
Statistical Learning Theory
• “Super kernels”: combine heterogeneous data via multiple kernels
– Semidefinite programs, convex optimization problems with efficient solutions involving efficient decomposition techniques
– Useful in fusing evidence at distributed nodes
• Problems of interest require combined parameter estimation and optimization
– Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space leading to performance improvements
– Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a “behavior” or “policy”; successfully developed policies for nonlinear control problems involving high degrees of freedom
38
Statistical Machine Learning
• Kernel methods
– neural network heritage
– convex optimization algorithms
– kernels available for strings, trees, graphs, vectors, etc.
– state-of-the-art performance in many problem domains
– frequentist theoretical foundations
• Graphical models
– marriage of graph theory and probability theory
– recursive algorithms on graphs
– modular design
– state-of-the-art performance in many problem domains
– Bayesian theoretical foundations
39
Self-Verifiable Protocols: BGP Whisper
• AS1 advertises its address prefix
– Chooses a secret key “x”, and sends y = h(x)
– h(): well-known one-way hash function
• Every router replaces y with h(y) before forwarding the advertisement
• AS4 performs consistency check: does y1 = h(y2)? (y1 arrives over a 3-AS path and y2 over a 2-AS path, so hashing y2 once more should reproduce y1)
– If yes, assume both routes are correct
– If no, at least one route is incorrect (but don’t know which), so raise a flag
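[Figure: AS1 chooses secret key “x” and advertises its prefix along two routes toward AS4. Route via AS2 and AS3: (AS1, y1 = h(x)), then (AS1, AS2, y1 = h^2(x)), then (AS1, AS2, AS3, y1 = h^3(x)). Route via AS3: (AS1, y2 = h(x)), then (AS1, AS3, y2 = h^2(x)).]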
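The hash-chain check in this example can be sketched as follows. It assumes, as in the figure, that the origin sends h(x) and each AS on the path hashes the value once more, so a path of k ASes delivers h applied k times to x. SHA-256 stands in for the unspecified one-way function h(), and the function names are hypothetical.

```python
import hashlib


def h(value: bytes) -> bytes:
    """One-way hash; SHA-256 is an assumed stand-in for the slide's h()."""
    return hashlib.sha256(value).digest()


def advertise(secret: bytes, path_length: int) -> bytes:
    """Value y that arrives after the advertisement crosses `path_length` ASes."""
    y = secret
    for _ in range(path_length):
        y = h(y)
    return y


def consistent(y_a: bytes, len_a: int, y_b: bytes, len_b: int) -> bool:
    """Check two advertisements of the same prefix against each other."""
    # Hash the value from the shorter path until the hop counts match.
    if len_a > len_b:
        y_a, len_a, y_b, len_b = y_b, len_b, y_a, len_a
    for _ in range(len_b - len_a):
        y_a = h(y_a)
    return y_a == y_b
```

With the figure's values, `consistent(advertise(x, 3), 3, advertise(x, 2), 2)` holds, while a route whose value was not derived from x fails the check. As the slide notes, a failed check only says that at least one route is wrong, not which one.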
40
Enabling Technology: Edge Services by Network Appliances
In-the-Network Processing: the Computer IS THE Network
• F5 Networks BIG-IP LoadBalancer: Web server load balancer
• Packeteer PacketShaper: Traffic monitor and shaper
• Ingrian i225: SSL offload appliance
• Network Appliance NetCache: Localized content delivery platform
• Nortel Alteon Switched Firewall: CheckPoint firewall and L7 switch
• Cisco IDS 4250-XL: Intrusion detection system
• Cisco SN 5420: IP-SAN storage gateway
• Extreme Networks SummitPx1: L2-L7 application switch
• NetScreen 500: Firewall and VPN
41
Self-Verifiable Protocols: Status and Future Plans
• Two examples:
– BGP verifications (Listen & Whisper)
» Can trigger alarms and contain malicious routers
» Minimal changes to BGP; incrementally deployable (Listen)
– Self-verifiable CSFQ
» Per-flow isolation without maintaining per-flow state
» Detect and contain malicious flows
• Ultimate goal: develop distributed system able to self-diagnose and self-repair
– Eliminate faulty components
– At minimum, raise a flag in case of misconfigurations and attackers
– Develop set of principles and techniques for robust protocols
42
Enabling Technology: Programmable Networks
• Problem
– Common programming/control environment for diverse network elements to realize full power of “inside the network” services and applications
• Approach
– Software toolkit and VM architecture for PNEs, with retargetable optimized backend for diverse appliance-specific architectures
• Current Focus
– Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining
• Potential Impact
– Open framework for multi-platform appliances, enabling third party service development
– Provable application properties and invariants; avoidance of configuration and “latest patch not installed” errors
43
Enabling Technology: Programmable Networks
• Generalized PNE programming and control model
– Generalized “virtual machine” model for this class of devices
– Retargetable for different underlying implementations
• Edge services of interest
– Network measurement and monitoring supporting model formation and statistical anomaly detection
» Framework for inside-the-network “protocol listening”
» Selective blocking/filtering/quarantining of traffic
– Application-specific routing
» Faster detection and recovery from routing failures than is possible from existing Internet protocols
» Implementation of self-verifiable protocols
44
Crash-Only + Statistical Monitoring = Resilience to Real-World Transients
• Simple fault model: observed anomalies “coerced” into crash faults
• Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline
[Figure: Offered Load vs. Goodput, AI/MD Admission Control. x-axis: Number of machines (0 to 30); y-axis: Number of requests per second (0 to 7000); series: Offered Load, Goodput]
45
Self-Verifiable Protocols: Statement of the Problem
• Problem: Detect and contain network effects of misconfigurations and faulty/malicious components
• Approach: design network protocols so each component verifies correct behavior of the protocol
• Examples:
– e2e protocols
– routing (BGP) protocols
[Figure: sender and receiver, showing data/control flow and verification protocol]
46
Self-Verifiable Protocols Case Study: BGP
• Propagating invalid BGP routes can bring the Internet down
• Multiple causes
– Router misconfigurations: happen daily, yielding outages lasting hours
– Malicious routers: huge potential threat
» Routers with default passwords
» Possible to “buy” routers’ passwords on darknets
• Existing solutions
– Hard to deploy (e.g., Secure-BGP), or insufficient security
• Our solution:
– Whisper: verify the correctness of router advertisements
– Listen: verify the reachability on the data plane
47
Self-Verifiable Protocols: BGP Whisper
• Use redundancy to check consistency of peer’s information
• Whisper game:
– Group sits in a circle; person whispers secret phrase to neighbors
– Person at other end concludes:
» Phrase is correct if same phrase arrives from both neighbors
» Otherwise, at least one phrase is incorrect
48
Self-Verifiable Protocols: BGP Listen
• Monitor progress of TCP flows
• If TCP flow doesn’t make progress, might be because route is incorrect
• Use heuristics to reduce number of false positives and negatives
– Still difficult to handle traffic patterns like port scanners
• Use SLT techniques to improve the detection accuracy?
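A passive monitor in the spirit of Listen might be sketched as below. It is an assumption-laden illustration: the slide does not describe the heuristics used, so this version simply flags a destination prefix when at least `min_flows` observed TCP flows started and none of them made progress. `ListenMonitor` and its method names are hypothetical.

```python
from collections import defaultdict


class ListenMonitor:
    """Flag destination prefixes whose observed TCP flows never progress."""

    def __init__(self, min_flows=3):
        self.min_flows = min_flows  # assumed evidence threshold
        self.flows = defaultdict(dict)  # prefix -> {flow_id: made_progress}

    def on_syn(self, prefix, flow_id):
        """Record a flow that started toward `prefix`."""
        self.flows[prefix].setdefault(flow_id, False)

    def on_progress(self, prefix, flow_id):
        """Record that a flow exchanged data beyond the handshake."""
        self.flows[prefix][flow_id] = True

    def suspect_prefixes(self):
        """Prefixes with enough flows observed and no flow making progress."""
        return sorted(
            prefix for prefix, flows in self.flows.items()
            if len(flows) >= self.min_flows and not any(flows.values())
        )
```

A port scanner produces many flows that never progress, so a rule this simple would flag healthy prefixes. That is the false-positive problem noted above and the motivation for asking whether SLT techniques can do better.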