1 A Research Program in Reliable Adaptive Distributed Systems (RADS) Armando Fox*, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica,

1

A Research Program in Reliable Adaptive Distributed Systems (RADS)

Armando Fox*, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar

University of California, Berkeley and *Stanford University

2

Presentation Outline

• Why We Need a New Approach to Networked Systems
• New Design Philosophy for RADS
• Applying the Philosophy: Early Experience with Specific Approaches
  – Approaches for Software and Hardware Dependability
  – Approaches for Networking
  – Approaches for Security
  – Applying SLT to dependability problems
• Elements of a unified Experimental Prototype
• Summary and Conclusions

3

New Approach for RADS (Reliable Adaptive Distributed Systems)

Dramatically improve the trustworthiness of networked systems

• Observe: design observation points throughout system
• Analyze: statistical learning theory (SLT) as an enabling technology
  – Respond: detect anomalous behavior vs. baseline
  – Learn: use observations to modify responses to future observations
• Act:
  – Reactive: use control points in the system for rapid recovery if something wrong is detected
  – Proactive/protective: prophylactically act on the system to prevent a predicted impending failure

4

Today’s Systems are Too Brittle

• Fragile, easily broken, yielding poor trustworthiness (dependability and security).

– Amazon: Revenue $3.1B, Downtime Costs $600,000 per hour

• Why? Overly focused on performance, performance, and cost-performance

• Systems based on fundamentally incorrect assumptions
  – Humans are perfect
  – Software will eventually be bug free
  – Maintenance is “free”
• People/HW/SW failures are facts, not problems
  “If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time” -- Shimon Peres (“Peres’s Law”)

5

If Failure is Inevitable...then Design for Rapid Adaptation

• Encompasses rapid server recovery, network rerouting, prophylactic/protective actions...

• Blurs distinction between “normal operation” and “recovery”

• Elements of the solution
  – Programming paradigms for robust recovery
  – Crash-only software design for rapid server recovery
  – Network protocols designed for rapid detection of assertion violations
  – Instrumentation and SLT for online analysis, anomaly detection, and diagnosis of failure
• Recovery benchmarks to measure progress
  – What you can’t measure, you can’t improve
  – Collect real failure data to drive benchmarks

6

RADS Conceptual Architecture

[Figure: RADS conceptual architecture. Labels: Client; Server; Distributed Middleware; SLT Services; Edge Network; PNE; Router; Commodity Internet & IP networks; Application-Specific Overlay Network; Operator; User. Prototype applications: e-voting, messaging, e-mail, etc.]

Research components shown in the figure:
• Programming abstractions for roll-back (Necula)
• Crash-only middleware & servers, system O&C infrastructure (Fox)
• Protocols enabling fast detection & route recovery, network O&C infrastructure (Katz, Stoica)
• Online statistical learning algorithms (Jordan)
• Benchmarks, tools for human operators (Patterson)

• Reduction to practice of online SLT and observe/analyze/act infrastructure
• Reusable embeddable components

7






8

Crash-Only Software: Dramatically Simplifying Recovery

• Since robust systems must be crash-safe, make crashes the only supported form of shutdown/restart
• Each software component’s external “power switch” is independent of the misbehaving component
• Recovery becomes inexpensive/safe to try
  – Simplifies failure detection, since it can be overly aggressive
  – Simplifies recovery, since there is only 1 type of recovery action and it is always safe to try
  – Idea: if something looks anomalous, it’s probably wrong

• Can machine learning and statistical monitoring approaches be applied during online operations?

9

Crash-Only Software: Practical to Build

• [[refocus on JAGR, talk about relevance of middleware]]
• Case studies: two crash-only state-storage subsystems (for session state and durable state)
  – OK to crash any node at any time for any reason
  – Recovery is highly predictable, doesn’t impact online performance
  – Replication provides probabilistic durability & capacity during recovery
  – Access pattern of workload exploited for consistency guarantees
• 9 “activity” & “state” statistics monitored per storage brick
  – Metrics compared against those of “peer” bricks
  – Basic idea: changes in workload tend to affect all bricks equally
  – Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
  – Anomaly in 6 or more (out of 9) metrics => reboot brick
  – Simple thresholding and substring-frequency used to determine “anomalous”
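The peer-comparison rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide says "simple thresholding" was used but does not give the test, so the relative-deviation-from-peer-median check, the `tolerance` parameter, and the function names below are all hypothetical. Only the "6 or more out of 9 metrics" vote comes from the slide.

```python
from statistics import median

ANOMALY_VOTES = 6  # from the slide: 6 or more (out of 9) anomalous metrics => reboot


def anomalous_metric_count(brick, peers, tolerance=0.5):
    """Count metrics of `brick` that deviate from the peer median.

    ASSUMPTION: a metric is "anomalous" when its relative deviation from the
    median of the peer bricks exceeds `tolerance`. The real system's
    thresholding rule is not given in the slides.
    """
    count = 0
    for i, value in enumerate(brick):
        peer_median = median(p[i] for p in peers)
        scale = abs(peer_median) or 1.0  # avoid dividing by zero
        if abs(value - peer_median) / scale > tolerance:
            count += 1
    return count


def bricks_to_reboot(metrics, tolerance=0.5):
    """Return names of bricks whose metrics disagree with their peers.

    `metrics` maps a brick name to its list of per-brick statistics.
    """
    reboot = []
    for name, values in metrics.items():
        peers = [v for other, v in metrics.items() if other != name]
        if anomalous_metric_count(values, peers, tolerance) >= ANOMALY_VOTES:
            reboot.append(name)
    return reboot
```

Because each brick is judged against its peers rather than against a fixed baseline, a workload change that scales every brick's metrics together does not trigger a reboot, which matches the "changes in workload tend to affect all bricks equally" idea.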

10

Supporting Crash-Only in Middleware

• Add observation & control points to Java application middleware
  – Observe: capture paths taken through the system by user requests
  – Analyze: look for highly unlikely, anomalous (therefore probably buggy) paths
  – Act: micro-reboot suspected-faulty J2EE components transparently to the rest of the system
• Result: fast recovery improves overall performability
  – Micro-reboot is 2-3 orders of magnitude faster than full application reboot
  – Improves performability (total amount of work per unit time in presence of faults)
  – Minimizes disruption to users of other (non-faulty) parts of system

Fast, cheap micro-reboots (µRBs) + statistical monitoring provide a degree of application-generic failure detection & recovery
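The observe/analyze steps above can be illustrated with a small sketch. It is not the prototype's analysis (the deck only says the prototype uses "very simple data clustering techniques"); it assumes a hypothetical model in which a request path is a list of component names, transition frequencies are learned from past paths, and the caller on any rare transition is treated as the micro-reboot candidate. The `min_prob` cutoff and all names are illustrative.

```python
from collections import Counter


def learn_transitions(paths):
    """Count component-to-component transitions seen in historical paths."""
    counts = Counter()  # (src, dst) -> number of times src called dst
    totals = Counter()  # src -> number of outgoing calls observed
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
            totals[src] += 1
    return counts, totals


def suspicious_components(path, counts, totals, min_prob=0.05):
    """Return components that made a rarely seen (or never seen) call.

    ASSUMPTION: the calling component is blamed for an unlikely transition;
    a real system would need a more careful localization step.
    """
    suspects = []
    for src, dst in zip(path, path[1:]):
        prob = counts[(src, dst)] / totals[src] if totals[src] else 0.0
        if prob < min_prob:
            suspects.append(src)
    return suspects
```

A monitor built this way would pass each suspect to whatever control point performs the micro-reboot; since micro-reboots are cheap and safe, an occasional false positive costs little.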

11

Crash-Oriented Software: Systematic Approach

• Needed: Systematic mechanism for determining when micro-reboots are safe
  – Programming-language level support for rollback and state tracking
• Needed: Better integration with SLT
  – Which clustering/analysis techniques best correlate anomalous paths to particular observed failure types? (current prototype uses very simple data clustering techniques)
  – Are these techniques suitable for online use? (current prototype does offline analysis)

12






13

Research Challenges

• No protection against DoS attacks
  – MS Blaster inflicted Internet packet loss > 20%
• Routing protocols blindly believe routes advertised by neighbors
  – BGP router misconfigurations
    » 200-1200 prefixes affected every day
    » C&W’s (AS3561) misconfiguration caused an outage for > 5000 prefixes for 2 hours (April 2001)
  – Malicious routers: huge potential threat
    » Drop packets and render a destination unreachable
    » Eavesdrop on the traffic to a given destination
    » Impersonate the destination

14

Observe, Analyze, Act

• Observe:
  – Use multiple vantage points to monitor the network
  – Design protocols whose behaviors can be verified
• Analyze:
  – Learn from protocol behavior
  – Identify bogus information
• Act:
  – Contain misbehaving components
  – Raise flags for network operators
  – Empower end-hosts (e.g., enable end-hosts to stop unwanted packets in the network infrastructure)
    » End-hosts know better when under attack (flash crowds vs. DoS attacks)
    » End-hosts can react faster than infrastructure

[Figure: sender, receiver]

15

Case Study: BGP (Listen & Whisper)

• Whisper:
  – Use redundancy to check route advertisements for consistency
• Listen:
  – Monitor TCP flow progress to detect reachability problems
• Results:
  – Whisper: reduces the region of the Internet vulnerable to an isolated adversary to 5%
    » Scalable implementation: can handle 10 times today’s BGP load
  – Listen: detects reachability problems
    » Probability of false positives ~ 1%
    » Vulnerable to port scans; plan to use SLT

16

Programmable Network Elements

• Enabling Technology
  – Edge network elements for IDS, firewall, traffic shaping, etc.
  – Next generation: exposed APIs for 3rd party programming
  – Location for efficient network-level monitoring and control
    » Observe: rapid detection of route failure or network attack
    » Act: e.g., filter intrusions, quarantine propagating worms
    » Avoid configuration and “latest patch not installed” errors

[Figure labels: Commodity Internet & IP networks; Router; Edge Network; In-Port; Classify; Transform; Out-Port]

17






18

Research Challenge: Self-sensing and Reactive Systems

• Internet-scale attacks are fundamentally different from host-scale attacks
• Traditional “Intrusion Detection Systems” (IDS) have had some success with host-scale attacks, but also many false positives
• Internet-scale attacks offer opportunity (more evidence of a wide-scale attack) but also more challenge (integrating data from a large number of disparate sources)

19

Observe, Analyze, Act

• Observe: what to monitor, how to monitor
• Analyze: learning from patterns of messages (not parsing their contents)
• Act:
  – How to exchange minimal information (in a system under attack)
  – Rapidly evolving security protocols (for resilience to attack)
• Applications: worm detection, spam detection
• Ultimate challenge: beyond detection and into response

20

Security of Networked Systems

Technical Approach

• Mechanisms to learn, share, repair against potential threats to dependability
• Strengthen assurance of shared information via lightweight authentication and encryption
  – TESLA authentication system: replaces public-key crypto with lightweight symmetric encryption; uses time asymmetry to provide assurance
    » Messages initially encrypted, verification keys revealed later; this prevents an attacker from using a received key to forge messages
    » Variations provide instant authentication
  – Athena system: generate random instances of secure protocols
    » Ultra-fast checking software: model-checking & proof-theoretic techniques to verify protocols against stated requirements
    » Intelligently generate the most efficient secure protocol satisfying requirements, or a random instance of a secure protocol satisfying a given set of requirements
  – Apply to SLT systems to exchange information more quickly
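The "time asymmetry" idea behind TESLA can be sketched with a one-way key chain and delayed key disclosure. This is a heavily simplified illustration, not the TESLA protocol itself: it assumes SHA-256 as the one-way function and HMAC as the symmetric primitive, uses the chain key directly as the MAC key, and omits the clock-synchronization check a receiver needs to confirm that a packet arrived before its key was disclosed. All function names are hypothetical.

```python
import hashlib
import hmac


def H(key: bytes) -> bytes:
    """One-way function used to build the key chain (SHA-256 assumed)."""
    return hashlib.sha256(key).digest()


def make_key_chain(n: int, seed: bytes) -> list:
    """Return [K_0, ..., K_n] with K_{i-1} = H(K_i).

    K_0 is the commitment the sender publishes (authenticated once, out of
    band); K_i is used in time interval i and disclosed only afterwards.
    """
    chain = [seed]
    for _ in range(n):
        chain.append(H(chain[-1]))
    chain.reverse()
    return chain


def mac(key: bytes, message: bytes) -> bytes:
    """Symmetric authentication tag for a message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def key_is_authentic(key: bytes, interval: int, commitment: bytes) -> bool:
    """Hash a disclosed key `interval` times and compare with K_0."""
    for _ in range(interval):
        key = H(key)
    return hmac.compare_digest(key, commitment)


def verify(message, tag, disclosed_key, interval, commitment) -> bool:
    """Check a buffered message once the key for its interval is disclosed."""
    return key_is_authentic(disclosed_key, interval, commitment) and \
        hmac.compare_digest(mac(disclosed_key, message), tag)
```

A receiver buffers `(message, tag, interval)` until the sender discloses the key for that interval. Because the chain is one-way, a disclosed key lets anyone check earlier keys against the commitment but reveals nothing about the keys for later intervals.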

21






22

Statistical Learning Theory

• Toolbox for design/analysis of adaptive systems
  – Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making
• Classification algorithms
  – Recent scaling breakthroughs: 10K+ features, millions of data points
  – Kernel machines; functional analysis and convex optimization
    » Generalized inner product: similarities among data point pairs
    » Defined for many data types
    » Classical linear statistical algorithms “kernelized” for state-of-the-art nonlinear SLT algorithms in many areas

23


• Novelty Detection Problem
  – Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug
    » E.g. intrusion detection, machine diagnostics
  – Second-order cone program; a convex optimization problem with an efficient solution method
    » Given a cloud of data in a high-dimensional feature space, place a boundary around these to guarantee that only a small fraction falls outside
• Basic problem: find a boundary that encloses a desired fraction of the data, and is as tight as possible
  – Can be done using the generalized Chebyshev inequality
  – Using kernels, this is a convex problem
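One way to see why a Chebyshev-type bound leads to a second-order cone constraint is the half-space case, sketched below. The notation is assumed for illustration and does not appear on the slide: the data vector x has mean \bar{x} and covariance \Sigma, the boundary is the hyperplane a^T x = b, and \alpha is the desired fraction of data inside. The infimum is taken over all distributions with that mean and covariance.

```latex
\inf_{x \sim (\bar{x},\, \Sigma)} \Pr\{\, a^{\top} x \ge b \,\} \;\ge\; \alpha
\quad\Longleftrightarrow\quad
a^{\top}\bar{x} - b \;\ge\; \kappa(\alpha)\,\sqrt{a^{\top} \Sigma\, a},
\qquad
\kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}} .
```

In one dimension this reduces to the one-sided Chebyshev (Cantelli) inequality, \Pr\{x - \bar{x} \le -t\} \le \sigma^2 / (\sigma^2 + t^2): setting t = \bar{x} - b and requiring t^2 / (\sigma^2 + t^2) \ge \alpha gives t \ge \sigma \sqrt{\alpha / (1 - \alpha)}. The right-hand condition is linear in (a, b) on one side and a norm on the other, which is the form of a second-order cone constraint.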

24

Example: Statistical Bug-finding

• Programs are buggy, yet many people use them
  – Instrument programs to take samples of program state at runtime
  – Collect information over the Internet from many users’ runs
  – Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
• Example: finding a bug in Unix bc utility
  – 2908 features instrumented
  – All top features indicate indx being unusually large in more_arrays subroutine:
      storage.c:176: more_arrays(): indx > optopt
      storage.c:176: more_arrays(): indx > opterr
      storage.c:176: more_arrays(): indx > use_math
  – Indeed, array overrun bug in re-allocation routine more_arrays() found to cause memory corruption and sometimes an eventual crash
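The idea of ranking instrumented predicates by how strongly they are associated with failed runs can be sketched as follows. Note the substitution: the slide describes learning a classifier with feature selection, whereas this sketch uses a much simpler score (failure rate among runs where the predicate was observed true, minus the overall failure rate). The data layout and function name are hypothetical.

```python
def rank_predicates(runs):
    """Rank predicates by association with failure.

    `runs` is a list of (true_predicates, failed) pairs, where
    `true_predicates` is the set of instrumented predicates observed true
    in that run and `failed` is a bool.

    ASSUMPTION: score = P(fail | predicate true) - P(fail). This stands in
    for the classifier plus feature selection described on the slide.
    """
    if not runs:
        return []
    base_rate = sum(1 for _, failed in runs if failed) / len(runs)
    seen = {}
    failed_with = {}
    for predicates, failed in runs:
        for p in predicates:
            seen[p] = seen.get(p, 0) + 1
            if failed:
                failed_with[p] = failed_with.get(p, 0) + 1
    scores = {p: failed_with.get(p, 0) / seen[p] - base_rate for p in seen}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With reports shaped like the bc example, a predicate such as `indx > optopt` that is true mostly in crashing runs would rise to the top of the ranking, while predicates that are true in most runs regardless of outcome score near zero.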

25

Example III: Diagnosis

• A probabilistic graphical model with 600 disease nodes, 4000 finding nodes

• Node probabilities p(f_i | d) were assessed from an expert (Shwe, et al., 1991)

• Want to compute posteriors: p(d_j | f)

• Is this tractable?

26

Case Study: Medical Diagnosis

• Symbolic complexity:
  – Symbolic expressions fill dozens of pages
  – Would take years to compute
• Numerical simplicity:
  – Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second
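To make the tractability question concrete, the sketch below computes the posteriors p(d_j | f) exactly by enumerating every disease configuration in a toy network. It assumes a noisy-OR form for p(f_i | d), which the slides do not state, and all parameters are made up for illustration. The point is the cost: the loop visits 2^n configurations, so 600 disease nodes would mean about 4 × 10^180 terms for this naive approach.

```python
from itertools import product


def finding_prob(f_value, d, q_row, leak):
    """p(f_i = f_value | d) under an ASSUMED noisy-OR model.

    q_row[j] is the probability that disease j alone causes the finding;
    `leak` is the probability the finding appears with no disease present.
    """
    p_absent = 1.0 - leak
    for dj, qij in zip(d, q_row):
        if dj:
            p_absent *= 1.0 - qij
    return p_absent if f_value == 0 else 1.0 - p_absent


def posteriors(priors, q, leaks, findings):
    """Exact p(d_j = 1 | findings) by brute-force enumeration (2^n terms)."""
    n = len(priors)
    marginals = [0.0] * n
    evidence = 0.0
    for d in product((0, 1), repeat=n):
        joint = 1.0
        for dj, pj in zip(d, priors):
            joint *= pj if dj else 1.0 - pj
        for fi, q_row, leak in zip(findings, q, leaks):
            joint *= finding_prob(fi, d, q_row, leak)
        evidence += joint
        for j, dj in enumerate(d):
            if dj:
                marginals[j] += joint
    return [m / evidence for m in marginals]
```

For example, `posteriors([0.1], [[0.8]], [0.1], [1])` handles one disease and one positive finding; the same call shape with hundreds of diseases is what becomes infeasible and motivates approximate methods such as the variational approach cited above.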

27

Challenge for SLT

• Challenge: “on-line” versions of the best algorithms have yet to be developed
  – Update the learning system’s state based on small sets of data
  – Available for some “kernelized” problems
  – On-line versions of the best algorithms have yet to be developed!
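What "update the learning system's state based on small sets of data" means in practice can be shown with a deliberately simple online learner. The sketch below is not one of the kernel algorithms discussed in this deck; it keeps a running mean and variance (Welford's method) as a baseline and flags values far from it. The class name and the `k` standard-deviation cutoff are assumptions for illustration.

```python
import math


class OnlineBaseline:
    """Running mean/variance updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        """Fold one new observation into the state in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        """Sample standard deviation of everything seen so far."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x: float, k: float = 3.0) -> bool:
        """ASSUMED rule: flag values more than k standard deviations away."""
        s = self.stddev()
        return self.n > 1 and s > 0 and abs(x - self.mean) > k * s
```

The property that matters for online use is that `update` never revisits old data; the open problem named on the slide is obtaining the same property for the more powerful learning algorithms.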

28






29

System Prototype

• Comprehensive system architecture
• Reduction of SLT to practical software components embedded within a distributed systems context
• Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms

30

Messaging as an Application

• E-mail is now a mission-critical application
  – Organizational storage capacity shifting from financial databases to email (email is the fastest-growing storage)
  – Loss of email more critical to continuing operation of organization than telephony (imagine if gov’t had no email for a week)
• Instant Messaging is now a mission-critical application

– In a crisis, many communication schemes will be used: land-based telephony, cellular telephony, instant messaging, email, …

– Coordination among first-responders during crisis response in field (administrators & operators)

• Demands for dependability, resistance to attack, establishment of trust among interacting entities

– Despite attempts by hackers, terrorists, …

31

Measuring Success

• Build email/IM prototype using RADS design principles and tools

• Put realistic performance workload on prototype

• Subject prototype to increasingly difficult failure workloads and attack workloads

– E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, …

• Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, …

• Compare results to conventional email/IM systems under similar performance, failure, and attack workloads

32

Disaster Response Messaging Application

[Figure labels: DHS/Federal Network; Coalition Internet; Allies Networks (Local Police, Fire, State Police); Trust Relations; Incident Reports, Responder Locations, GIS Data, etc.; Active Adversary Service Attacks; Compromised Network with Embedded Adversaries; Net Failure; Adversary]

33






34

Old Science vs. New Science

• First 50 years of computer science
  – manually-engineered systems
  – lack of adaptability, robustness, and security
  – no concern with closing the loop with the environment
• Next 50 years of computer science
  – statistical learning systems throughout the infrastructure
  – self-configuring, adaptive, sentient systems
  – perception, reasoning, decision-making cycle
  – systems are “always” recovering because of this ongoing automatic and dynamic adaptation
• New way to think about and design adaptive systems
  – Makes continuous monitoring and reaction a first-class goal
  – Provides point of leverage for applying SLT and related techniques

35

Scientific Foundation For “Self-*” Systems

• New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations

• New metrics and benchmarks for evaluating self-adapting networked systems

• Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems

36

BACKUP SLIDES

37


• “Super kernels”: combine heterogeneous data via multiple kernels

– Semidefinite programs, convex optimization problems with efficient solutions involving efficient decomposition techniques

– Useful in fusing evidence at distributed nodes

• Problems of interest require combined parameter estimation and optimization

– Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space leading to performance improvements

– Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a “behavior” or “policy”; successfully developed policies for nonlinear control problems involving high degrees of freedom

38

Statistical Machine Learning

• Kernel methods
  – neural network heritage
  – convex optimization algorithms
  – kernels available for strings, trees, graphs, vectors, etc.
  – state-of-the-art performance in many problem domains
  – frequentist theoretical foundations
• Graphical models
  – marriage of graph theory and probability theory
  – recursive algorithms on graphs
  – modular design
  – state-of-the-art performance in many problem domains
  – Bayesian theoretical foundations

39

Self-Verifiable Protocols: BGP Whisper

• AS1 advertises its address prefix
  – Chooses a secret key “x”, and sends y = h(x)
  – h(): well-known one-way hash function
• Every router forwards y = h(y)
• AS4 performs consistency check: (y1)3 = (y2)3 ?
  – If yes, assume both routes are correct
  – If no, at least one route is incorrect (but don’t know which); raise a flag

[Figure: AS1 chooses secret key “x” and its advertisement reaches AS4 along two routes. First route: (AS1, y1=h(x)), (AS1,AS2, y1=h^2(x)), (AS1,AS2,AS3, y1=h^3(x)). Second route: (AS1, y2=h(x)), (AS1,AS3, y2=h^2(x)).]
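The hash-chain mechanics on this slide can be sketched as follows. The exact form of AS4's check is not recoverable from the slide text, so this sketch assumes one plausible reading: each hop hashes the value once, and the receiver hashes the value from the shorter route the extra number of times before comparing. SHA-256 stands in for h(), and the function names are hypothetical.

```python
import hashlib


def h(value: bytes) -> bytes:
    """Well-known one-way hash function (SHA-256 assumed here)."""
    return hashlib.sha256(value).digest()


def originate(asn: str, secret: bytes):
    """Origin AS advertises its prefix with y = h(x)."""
    return ([asn], h(secret))


def forward(advert, asn: str):
    """Each router appends itself to the path and forwards y = h(y)."""
    path, y = advert
    return (path + [asn], h(y))


def consistent(advert_a, advert_b) -> bool:
    """ASSUMED check: align the two hash chains by path length, then compare.

    Returns False when at least one route is suspect; the check cannot
    tell which one.
    """
    (path_a, y_a), (path_b, y_b) = advert_a, advert_b
    if len(path_a) < len(path_b):
        (path_a, y_a), (path_b, y_b) = (path_b, y_b), (path_a, y_a)
    for _ in range(len(path_a) - len(path_b)):
        y_b = h(y_b)
    return y_a == y_b
```

In the slide's topology, AS4 would compare the advertisement that travelled AS1, AS2, AS3 with the one that travelled AS1, AS3. A router that invents a route to AS1's prefix without knowing x cannot produce a value on the same hash chain, so the comparison fails and a flag is raised.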

40

Enabling Technology: Edge Services by Network Appliances

In-the-Network Processing: the Computer IS THE Network

• F5 Networks BIG-IP LoadBalancer: web server load balancer
• Packeteer PacketShaper: traffic monitor and shaper
• Ingrian i225: SSL offload appliance
• Network Appliance NetCache: localized content delivery platform
• Nortel Alteon Switched Firewall: CheckPoint firewall and L7 switch
• Cisco IDS 4250-XL: intrusion detection system
• Cisco SN 5420: IP-SAN storage gateway
• Extreme Networks SummitPx1: L2-L7 application switch
• NetScreen 500: firewall and VPN

41

Self-Verifiable Protocols: Status and Future Plans

• Two examples:
  – BGP verifications (Listen & Whisper)
    » Can trigger alarms and contain malicious routers
    » Minimal changes to BGP; incrementally deployable (Listen)
  – Self-verifiable CSFQ
    » Per-flow isolation without maintaining per-flow state
    » Detect and contain malicious flows
• Ultimate goal: develop distributed systems able to self-diagnose and self-repair
  – Eliminate faulty components
  – At a minimum, raise a flag in case of misconfigurations and attackers
  – Develop set of principles and techniques for robust protocols

42

Enabling Technology: Programmable Networks

• Problem
  – Common programming/control environment for diverse network elements to realize full power of “inside the network” services and applications
• Approach
  – Software toolkit and VM architecture for PNEs, with retargetable optimized backend for diverse appliance-specific architectures
• Current Focus
  – Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining
• Potential Impact
  – Open framework for multi-platform appliances, enabling third party service development
  – Provable application properties and invariants; avoidance of configuration and “latest patch not installed” errors

43

Enabling Technology: Programmable Networks

• Generalized PNE programming and control model
  – Generalized “virtual machine” model for this class of devices
  – Retargetable for different underlying implementations
• Edge services of interest
  – Network measurement and monitoring supporting model formation and statistical anomaly detection
    » Framework for inside-the-network “protocol listening”
    » Selective blocking/filtering/quarantining of traffic
  – Application-specific routing
    » Faster detection and recovery from routing failures than is possible from existing Internet protocols
    » Implementation of self-verifiable protocols

44

Crash-Only + Statistical Monitoring = Resilience to Real-World Transients

• Simple fault model: observed anomalies “coerced” into crash faults
• Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline

[Figure: “Offered Load vs. Goodput, AI/MD Admission Control”. x-axis: number of machines (0 to 30); y-axis: number of requests per second (0 to 7000); series: Offered Load, Goodput.]

45

Self-Verifiable Protocols: Statement of the Problem

• Problem: Detect and contain network effects of misconfigurations and faulty/malicious components
• Approach: design network protocols so each component verifies correct behavior of the protocol
• Examples:
  – e2e protocols
  – routing (BGP) protocols

[Figure labels: sender; receiver; data/control flow; verif. protocol]

46

Self-Verifiable Protocols, Case Study: BGP

• Propagating invalid BGP routes can bring the Internet down
• Multiple causes
  – Router misconfigurations: happen daily, yielding outages lasting hours
  – Malicious routers: huge potential threat
    » Routers with default passwords
    » Possible to “buy” routers’ passwords on darknets
• Existing solutions
  – Hard to deploy (e.g., Secure-BGP), or insufficient security
• Our solution:
  – Whisper: verify the correctness of router advertisements
  – Listen: verify the reachability on the data plane

47

Self-Verifiable Protocols: BGP Whisper

• Use redundancy to check consistency of peer’s information
• Whisper game:
  – Group sits in a circle; one person whispers a secret phrase to both neighbors
  – Person at other end concludes:
    » Phrase is correct if same phrase from both neighbors
    » Otherwise, at least one phrase is incorrect

48

Self-Verifiable Protocols: BGP Listen

• Monitor progress of TCP flows
• If TCP flow doesn’t make progress, might be because route is incorrect
• Use heuristics to reduce number of false positives and negatives
  – Still difficult to handle traffic patterns like port scanners
• Use SLT techniques to improve the detection accuracy?

49

Military Messaging Application

[Figure labels: US Forces Network; Coalition Internet; Allies Networks; Trust Relations; SitReps; Active Adversary Service Attacks; Compromised Network with Embedded Adversaries; Net Failure; Adversary]
