Page 1: Large-Scale Distributed Systems

Large-Scale Distributed Systems

Andrew Whitaker, CSE451

Page 2: Large-Scale Distributed Systems

Textbook Definition

“A distributed system is a collection of loosely coupled processors interconnected by a communication network”

Typically, the nodes run software to create an application/service
e.g., 1000s of Google nodes work together to build a search engine

Page 3: Large-Scale Distributed Systems

Why Not to Build a Distributed System (1)

Must handle partial failures

System must stay up, even when individual components fail

[Image: Amazon.com]

Page 4: Large-Scale Distributed Systems

Why Not to Build a Distributed System (2)

No global state

Machines can only communicate with messages

This makes it difficult to agree on anything:
“What time is it?”
“Which happened first, A or B?”

Theory: consensus is slow and cannot be guaranteed in the presence of failures
So, we try to avoid needing to agree in the first place

[Diagram: machines A and B exchanging messages]

Page 5: Large-Scale Distributed Systems

Reasons to Build a Distributed System (1)

The application or service is inherently distributed

[Image: Andrew Whitaker and Joan Whitaker]

Page 6: Large-Scale Distributed Systems

Reasons to Build a Distributed System (2)

Application requirements:
Must scale to millions of requests/sec
Must be available despite component failures

This is why Amazon, Google, eBay, etc. are all large distributed systems

Page 7: Large-Scale Distributed Systems

Internet Service Requirements

Basic goal: build a site that satisfies every user request

Detailed requirements:
Handle billions of transactions per day
Be available 24/7
Handle load spikes that are 10x normal capacity
Do it with a random selection of mismatched hardware

Page 8: Large-Scale Distributed Systems

An Overview of HotMail (Jim Gray)
~7,000 servers
100 backend stores with 300TB (cooked)
Many data centers
Links to: Internet mail gateways, ad-rotator, Passport
~5B messages per day
350M mailboxes, 250M active, ~1M new per day
New software every 3 months (small changes weekly)

Page 9: Large-Scale Distributed Systems

Availability Strategy #1: Perfect Hardware

Pay extra $$$ for components that do not fail

People have tried this: “fault-tolerant computing”

This isn’t practical for Amazon / Google:
It’s impossible to get rid of all faults
Software and administrative errors still exist

Page 10: Large-Scale Distributed Systems

Availability Strategy #2: Over-provision

Step 1: buy enough hardware to handle your workload
Step 2: buy more hardware

Replicate, replicate, replicate

Page 11: Large-Scale Distributed Systems

Benefits of Replication

Scalability
Guards against hardware failures
Guards against software failures (bugs)

Page 12: Large-Scale Distributed Systems

Replication Meets Probability

p is the probability that a single machine fails
The probability that all N replicas fail is p^N, so site availability is 1 − p^N

[Chart: site unavailability on a log scale (10^-6 to 10^-1) vs. number of replicas (1 to 7)]
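A quick back-of-the-envelope sketch of the formula above (assuming replicas fail independently; the failure probability and replica counts below are made-up illustrative values):

def site_availability(p, n):
    # Probability that at least one of n replicas is up, assuming
    # independent failures with per-machine failure probability p.
    return 1 - p ** n

p = 0.01  # hypothetical: each machine is down 1% of the time
for n in range(1, 8):
    print(n, "replica(s): availability =", round(site_availability(p, n), 8))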

Page 13: Large-Scale Distributed Systems

Availability in the Real World

Phone network: five 9’s (99.999% available)

ATMs: four 9’s (99.99% available)

What about Internet services? Not very good…
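To make the “nines” concrete, here is a small illustrative sketch converting an availability figure into expected downtime per year:

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability):
    # Expected minutes of downtime per year at a given availability level.
    return (1 - availability) * MINUTES_PER_YEAR

print("five 9s:", round(downtime_minutes_per_year(0.99999), 1), "min/year")  # ~5 min
print("four 9s:", round(downtime_minutes_per_year(0.9999), 1), "min/year")   # ~53 min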

Page 14: Large-Scale Distributed Systems

2006: typical availability was 97.48% (roughly nine days of downtime per year)

Source: Jim Gray

Page 15: Large-Scale Distributed Systems

Netcraft’s Crisis-of-the-Day

Page 16: Large-Scale Distributed Systems

What Gives?

Why isn’t simple redundancy enough to give very high availability?

Page 17: Large-Scale Distributed Systems

Failure Modes

Fail-stop failure:
A component fails by stopping
It’s totally dead: doesn’t respond to input or output
Ideally, this happens fast
Like a light bulb

Byzantine failure:
Component fails in an arbitrary way
Produces unpredictable output

Page 18: Large-Scale Distributed Systems

Byzantine Generals

Basic goal: reach consensus in the presence of arbitrary failures

Results:
More than 2/3 of the nodes must be “loyal”: 3t + 1 nodes tolerate t traitors
Consensus is possible, but expensive: lots of messages, many rounds of communication

In practice, people assume that failures are fail-stop, and hope for the best…
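The 3t + 1 bound is easy to check in code; the sketch below is just the arithmetic from the slide, not a consensus protocol:

def min_nodes_for_byzantine_consensus(traitors):
    # Byzantine agreement needs more than 2/3 loyal nodes: n >= 3t + 1.
    return 3 * traitors + 1

for t in range(1, 5):
    print(t, "traitor(s) -> at least", min_nodes_for_byzantine_consensus(t), "nodes")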

Page 19: Large-Scale Distributed Systems

Example of a Non-Fail-Stop Failure

[Diagram: Internet → load balancer → five servers]

The load balancer uses a “least connections” policy
A server fails by returning HTTP error 400
Net result: the “failed” server becomes a black hole

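A minimal simulation sketch of this failure mode (all numbers are hypothetical): the failed server answers instantly with an error, so it always has the fewest open connections, and “least connections” keeps sending it traffic.

import random

NUM_SERVERS = 5
SERVICE_TIME = 50          # ticks a healthy server holds each connection
FAILED = 0                 # server 0 returns an error immediately

requests = [0] * NUM_SERVERS
open_conns = [[] for _ in range(NUM_SERVERS)]   # finish times of open connections

for tick in range(10_000):                      # one incoming request per tick
    # Drop connections that have completed.
    for s in range(NUM_SERVERS):
        open_conns[s] = [t for t in open_conns[s] if t > tick]
    # Least-connections policy: pick a server with the fewest open connections.
    fewest = min(len(c) for c in open_conns)
    target = random.choice([s for s in range(NUM_SERVERS) if len(open_conns[s]) == fewest])
    requests[target] += 1
    if target != FAILED:                        # healthy servers hold the connection
        open_conns[target].append(tick + SERVICE_TIME)

print("requests per server:", requests)         # the “failed” server absorbs most traffic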

Page 20: Large-Scale Distributed Systems

Correlated Failures

In practice, components often fail at the same time:
Natural disasters
Security vulnerabilities
Correlated manufacturing defects
Human error…

Page 21: Large-Scale Distributed Systems

Human Error

Human operator error is the leading cause of dependability problems in many domains

Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.

Sources of Failure:
Public Switched Telephone Network: Operator 59%, Hardware 22%, Software 8%, Overload 11%
Average of 3 Internet sites: Operator 51%, Hardware 15%, Software 34%, Overload 0%

Page 22: Large-Scale Distributed Systems

Understanding Human Error

Administrator actions tend to involve many nodes at once:
Upgrade from Apache 1.3 to Apache 2.0
Change the root DNS server
Network / router misconfiguration

This can lead to (highly) correlated failures

Page 23: Large-Scale Distributed Systems

Learning to Live with Failures

If we can’t prevent failures outright, how can we make their impact less severe?

Understanding availability:
MTTF: mean time to failure
MTTR: mean time to repair
Availability = MTTF / (MTTF + MTTR)
Unavailability ≈ MTTR / MTTF (when MTTR is much smaller than MTTF)

Note: recovery time is just as important as failure time!
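A tiny sketch of the availability formula with made-up MTTF/MTTR values, showing that cutting repair time helps as much as stretching time-to-failure:

def availability(mttf_hours, mttr_hours):
    # Fraction of time the service is up.
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(1000, 10))   # ~0.990
print(availability(2000, 10))   # ~0.995  double the MTTF...
print(availability(1000, 5))    # ~0.995  ...or halve the MTTR: same effect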

Page 24: Large-Scale Distributed Systems

Summary

Large distributed systems are built from many flaky components
Key challenge: don’t let component failures become system failures
Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once
Try to decouple failures
Try to avoid single points of failure
Try to fail fast

Availability is affected as much by recovery time as by error frequency