Page 1: Large-Scale Distributed Systems

Large-Scale Distributed Systems

Andrew Whitaker, CSE451

Page 2: Large-Scale Distributed Systems

Textbook Definition

“A distributed system is a collection of loosely coupled processors interconnected by a communication network”

Typically, the nodes run software to create an application/service
e.g., 1000s of Google nodes work together to build a search engine

Page 3: Large-Scale Distributed Systems

Why Not to Build a Distributed System (1)

Must handle partial failures

System must stay up, even when individual components fail

[Image: Amazon.com]

Page 4: Large-Scale Distributed Systems

Why Not to Build a Distributed System (2)

No global state

Machines can only communicate with messages

This makes it difficult to agree on anything:
“What time is it?”
“Which happened first, A or B?”

Theory: consensus is slow and cannot be guaranteed in the presence of failures
So, we try to avoid needing to agree in the first place

[Diagram: machines A and B exchanging messages]

Page 5: Large-Scale Distributed Systems

Reasons to Build a Distributed System (1)

The application or service is inherently distributed

[Image: Andrew Whitaker and Joan Whitaker]

Page 6: Large-Scale Distributed Systems

Reasons to Build a Distributed System (2)

Application requirements:
Must scale to millions of requests/sec
Must be available despite component failures

This is why Amazon, Google, eBay, etc. are all large distributed systems

Page 7: Large-Scale Distributed Systems

Internet Service Requirements

Basic goal: build a site that satisfies every user request

Detailed requirements:
Handle billions of transactions per day
Be available 24/7
Handle load spikes that are 10x normal capacity
Do it with a random selection of mismatched hardware

Page 8: Large-Scale Distributed Systems

An Overview of HotMail (Jim Gray)
~7,000 servers
100 backend stores with 300TB (cooked)
Many data centers
Links to: Internet mail gateways, ad-rotator, Passport
~5B messages per day
350M mailboxes, 250M active, ~1M new per day
New software every 3 months (small changes weekly)

Page 9: Large-Scale Distributed Systems

Availability Strategy #1: Perfect Hardware

Pay extra $$$ for components that do not fail

People have tried this: “fault-tolerant computing”

This isn’t practical for Amazon / Google:
It’s impossible to get rid of all faults
Software and administrative errors still exist

Page 10: Large-Scale Distributed Systems

Availability Strategy #2: Over-provision

Step 1: buy enough hardware to handle your workload
Step 2: buy more hardware

Replicate, replicate, replicate

Page 11: Large-Scale Distributed Systems

Benefits of Replication

Scalability
Guards against hardware failures
Guards against software failures (bugs)

Page 12: Large-Scale Distributed Systems

Replication Meets Probability

p is the probability that a single machine fails
The probability that all N replicas fail is p^N, so site availability is 1 − p^N

[Chart: site unavailability on a log scale (10^-6 to 10^-1) vs. number of replicas (1 to 7)]
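A quick back-of-the-envelope sketch of the formula above (assuming replicas fail independently; the failure probability and replica counts below are made-up illustrative values):

def site_availability(p, n):
    # Probability that at least one of n replicas is up, assuming
    # independent failures with per-machine failure probability p.
    return 1 - p ** n

p = 0.01  # hypothetical: each machine is down 1% of the time
for n in range(1, 8):
    print(n, "replica(s): availability =", round(site_availability(p, n), 8))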

Page 13: Large-Scale Distributed Systems

Availability in the Real World

Phone network: five 9’s (99.999% available)

ATMs: four 9’s (99.99% available)

What about Internet services? Not very good…
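To make the “nines” concrete, here is a small illustrative sketch converting an availability figure into expected downtime per year:

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability):
    # Expected minutes of downtime per year at a given availability level.
    return (1 - availability) * MINUTES_PER_YEAR

print("five 9s:", round(downtime_minutes_per_year(0.99999), 1), "min/year")  # ~5 min
print("four 9s:", round(downtime_minutes_per_year(0.9999), 1), "min/year")   # ~53 min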

Page 14: Large-Scale Distributed Systems

2006: typical availability was 97.48% (roughly nine days of downtime per year)

Source: Jim Gray

Page 15: Large-Scale Distributed Systems

Netcraft’s Crisis-of-the-Day

Page 16: Large-Scale Distributed Systems

What Gives?

Why isn’t simple redundancy enough to give very high availability?

Page 17: Large-Scale Distributed Systems

Failure Modes

Fail-stop failure:
A component fails by stopping
It’s totally dead: doesn’t respond to input or output
Ideally, this happens fast
Like a light bulb

Byzantine failure:
Component fails in an arbitrary way
Produces unpredictable output

Page 18: Large-Scale Distributed Systems

Byzantine Generals

Basic goal: reach consensus in the presence of arbitrary failures

Results:
More than 2/3 of the nodes must be “loyal”: 3t + 1 nodes tolerate t traitors
Consensus is possible, but expensive: lots of messages, many rounds of communication

In practice, people assume that failures are fail-stop, and hope for the best…
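The 3t + 1 bound is easy to check in code; the sketch below is just the arithmetic from the slide, not a consensus protocol:

def min_nodes_for_byzantine_consensus(traitors):
    # Byzantine agreement needs more than 2/3 loyal nodes: n >= 3t + 1.
    return 3 * traitors + 1

for t in range(1, 5):
    print(t, "traitor(s) -> at least", min_nodes_for_byzantine_consensus(t), "nodes")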

Page 19: Large-Scale Distributed Systems

Example of a Non-Fail-Stop Failure

[Diagram: Internet → load balancer → five servers]

The load balancer uses a “least connections” policy
A server fails by returning HTTP error 400
Net result: the “failed” server becomes a black hole

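A minimal simulation sketch of this failure mode (all numbers are hypothetical): the failed server answers instantly with an error, so it always has the fewest open connections, and “least connections” keeps sending it traffic.

import random

NUM_SERVERS = 5
SERVICE_TIME = 50          # ticks a healthy server holds each connection
FAILED = 0                 # server 0 returns an error immediately

requests = [0] * NUM_SERVERS
open_conns = [[] for _ in range(NUM_SERVERS)]   # finish times of open connections

for tick in range(10_000):                      # one incoming request per tick
    # Drop connections that have completed.
    for s in range(NUM_SERVERS):
        open_conns[s] = [t for t in open_conns[s] if t > tick]
    # Least-connections policy: pick a server with the fewest open connections.
    fewest = min(len(c) for c in open_conns)
    target = random.choice([s for s in range(NUM_SERVERS) if len(open_conns[s]) == fewest])
    requests[target] += 1
    if target != FAILED:                        # healthy servers hold the connection
        open_conns[target].append(tick + SERVICE_TIME)

print("requests per server:", requests)         # the “failed” server absorbs most traffic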

Page 20: Large-Scale Distributed Systems

Correlated Failures

In practice, components often fail at the same time:
Natural disasters
Security vulnerabilities
Correlated manufacturing defects
Human error…

Page 21: Large-Scale Distributed Systems

Human Error

Human operator error is the leading cause of dependability problems in many domains

Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.

Sources of Failure:
Public Switched Telephone Network: Operator 59%, Hardware 22%, Software 8%, Overload 11%
Average of 3 Internet sites: Operator 51%, Hardware 15%, Software 34%, Overload 0%

Page 22: Large-Scale Distributed Systems

Understanding Human Error

Administrator actions tend to involve many nodes at once:
Upgrade from Apache 1.3 to Apache 2.0
Change the root DNS server
Network / router misconfiguration

This can lead to (highly) correlated failures

Page 23: Large-Scale Distributed Systems

Learning to Live with Failures

If we can’t prevent failures outright, how can we make their impact less severe?

Understanding availability:
MTTF: mean time to failure
MTTR: mean time to repair
Availability = MTTF / (MTTF + MTTR)
Unavailability ≈ MTTR / MTTF (when MTTR is much smaller than MTTF)

Note: recovery time is just as important as failure time!
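A tiny sketch of the availability formula with made-up MTTF/MTTR values, showing that cutting repair time helps as much as stretching time-to-failure:

def availability(mttf_hours, mttr_hours):
    # Fraction of time the service is up.
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(1000, 10))   # ~0.990
print(availability(2000, 10))   # ~0.995  double the MTTF...
print(availability(1000, 5))    # ~0.995  ...or halve the MTTR: same effect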

Page 24: Large-Scale Distributed Systems

Summary

Large distributed systems are built from many flaky components
Key challenge: don’t let component failures become system failures
Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once
Try to decouple failures
Try to avoid single points of failure
Try to fail fast

Availability is affected as much by recovery time as by error frequency