Consistency in Distributed Systems: II Mike Miller Co-Founder, Chief Scientist @mlmilleratmit

Consistency in Distributed Systems, Part 2

Jun 25, 2015


Technology

DATAVERSITY

Building on the introductory work of a past webinar, we take a deep dive into the locking, replication, and failure modes of leading NoSQL databases. We focus on three main areas critical for modern developers and architects:

Industry survey of ACID compliance.
Best practices to store 1-many and many-many relationships in Riak, Cassandra, MongoDB, DynamoDB, and Cloudant.
Consistency between primary and secondary indexes (an often neglected subject) and implications for immutable data models.
Transcript
Page 1: Consistency in Distributed Systems, Part 2

Consistency in Distributed Systems: II

Mike Miller

Co-Founder, Chief Scientist @mlmilleratmit

Page 2: Consistency in Distributed Systems, Part 2

2014-07-31

AMP on Consistency

https://amplab.cs.berkeley.edu/tag/consistency/

Page 3: Consistency in Distributed Systems, Part 2

Topics For Today

‣ Brief review of Part I (2014-06-12)

‣ Why is Consistency hard?

‣ What should you really care about?
  • Single object/row/document operations
  • Multi-part transactions
  • Primary/secondary indexes

‣ Dirty little (ACID) secrets: Results from industry survey

‣ Failure modes, strategies, and gotchas

Page 4: Consistency in Distributed Systems, Part 2

Motivation

Page 5: Consistency in Distributed Systems, Part 2

Mobile

Big Data

=> Stress models for consistency, transactional reasoning

Page 6: Consistency in Distributed Systems, Part 2

This is your problem when…

… data doesn’t fit on one server.
… data replicated between servers (e.g. read slaves).
… data spread between data centers.
… state spread across more than one device (mobile!).
… mixed workloads with concurrency.
… state spread across more than one process.

Page 7: Consistency in Distributed Systems, Part 2

This is now everyone’s problem

Page 8: Consistency in Distributed Systems, Part 2

Good news: the market has responded with NewSQL, NoSQL, Cloud, …

Page 9: Consistency in Distributed Systems, Part 2


ships with a mobile strategy

Page 10: Consistency in Distributed Systems, Part 2

{Write: ‘Local’, Sync: ‘Later’}

Embedded, Edge, Satellites

Desktop, Browser

Cloud

Page 11: Consistency in Distributed Systems, Part 2

NoSQL Taxonomy

Page 12: Consistency in Distributed Systems, Part 2


Page 13: Consistency in Distributed Systems, Part 2


…http://www.bailis.org/papers/ramp-sigmod2014.pdf

Fundamental reason: CAP Theorem

Page 14: Consistency in Distributed Systems, Part 2

You do need to understand your datastore.

Page 15: Consistency in Distributed Systems, Part 2

Why is Consistency Hard?

Page 16: Consistency in Distributed Systems, Part 2

1. The network is reliable.

2. Latency is zero.

(Fallacies of Distributed Computing, P. Deutsch)

Page 17: Consistency in Distributed Systems, Part 2

MySQL, MongoDB, CouchDB, SOLR, …

Dynamo, Cloudant, Cassandra, Riak, …

Page 18: Consistency in Distributed Systems, Part 2

Perfect Network

[Diagram: timeline with Client, Primary, and Secondary, and a Replication link between Primary and Secondary. Client: w(x=1) → success. Client: r(x) → x=1.]

Page 19: Consistency in Distributed Systems, Part 2

Network Partition: Primary Only

[Diagram: timeline with Client, Primary, and Secondary, and a Replication link between Primary and Secondary. Client: w(x=1) → success. Client: r(x) → x=Null.]

Available, temporarily inconsistent

Page 20: Consistency in Distributed Systems, Part 2

Network Partition: Primary+Secondary

[Diagram: timeline with Client, Primary, and Secondary, and a Replication link between Primary and Secondary. Client: w(x=1) → success.]

Consistent

Page 21: Consistency in Distributed Systems, Part 2

Network Partition: Primary+Secondary

[Diagram: timeline with Client, Primary, and Secondary, and a Replication link between Primary and Secondary. Client: w(x=1) → failure.]

Not Available
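The outcomes in the diagrams above can be reproduced with a small in-memory model. This is only a sketch under assumptions: the class `ReplicatedPair`, the `require_secondary_ack` flag, and the `Unavailable` exception are hypothetical names, and reading the slide titles "Primary Only" and "Primary+Secondary" as "which nodes must acknowledge a write" is an interpretation, not something the slides state. The "Consistent" success case corresponds here to replication getting through.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class Unavailable(Exception):
    """The write could not collect the acknowledgements its policy requires."""


@dataclass
class ReplicatedPair:
    # Hypothetical knob: must the secondary acknowledge before "success"?
    require_secondary_ack: bool = False
    # True while the link between primary and secondary is partitioned.
    partitioned: bool = False
    primary: Dict[str, int] = field(default_factory=dict)
    secondary: Dict[str, int] = field(default_factory=dict)

    def write(self, key: str, value: int) -> str:
        if self.require_secondary_ack and self.partitioned:
            # Consistent but not available: refuse instead of diverging.
            raise Unavailable("secondary unreachable")
        self.primary[key] = value
        if not self.partitioned:
            self.secondary[key] = value  # replication gets through
        return "success"

    def read_from_secondary(self, key: str) -> Optional[int]:
        # None plays the role of x=Null in the diagram.
        return self.secondary.get(key)


def demo() -> dict:
    perfect = ReplicatedPair()
    perfect.write("x", 1)

    primary_only = ReplicatedPair(partitioned=True)
    primary_only.write("x", 1)

    both = ReplicatedPair(require_secondary_ack=True, partitioned=True)
    try:
        outcome = both.write("x", 1)
    except Unavailable:
        outcome = "failure"

    return {
        "perfect_read": perfect.read_from_secondary("x"),
        "partition_primary_only_read": primary_only.read_from_secondary("x"),
        "partition_both_write": outcome,
    }
```

In this model the only choice under a partition is between accepting the write and serving a stale read later, or rejecting the write outright.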

Page 22: Consistency in Distributed Systems, Part 2

Partition Failures Dominate

‣ 2011 (AWS):
  • misconfiguration => 12 hour outage

‣ 2011 Survey (Microsoft):
  • 13,300 customer-impacting network failures
  • Median 60,000 packets lost per failure
  • mean 41 link failures per day (95% of 136)
  • median time to repair of 5 minutes (up to a week)
  • Redundant networks only reduce failure impact by 40%

‣ HP Managed Enterprise Networks
  • 28% of customer tickets due to network problems
  • 39% of all support tickets due to network problems
  • Median incident duration: 114-188 minutes

http://queue.acm.org/detail.cfm?id=2655736

Page 23: Consistency in Distributed Systems, Part 2

Latency

Network health really depends on your latency tolerance.

A slow network can be just as bad as a broken network.

The tails matter.

Page 24: Consistency in Distributed Systems, Part 2

Median Latencies

[Figure panels: Same AZ, Different AZs, Different Regions]

http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Page 25: Consistency in Distributed Systems, Part 2

99.99% Latencies

[Figure panels: Same AZ, Different AZs, Different Regions]

http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Page 26: Consistency in Distributed Systems, Part 2

Latency Summary

‣ Distributed, coordinated operations:
  ‣ rate ~ 1/latency

‣ Real world latencies are substantial, with long tails

‣ At scale, 0.01% events happen constantly

‣ Picture actually much worse due to systematic fluctuations

‣ 99.99% Latencies:
  ‣ Same AZ: ~50 ms
  ‣ Same Region: ~80 ms
  ‣ Inter-Region: 200-400 ms!
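To make "rate ~ 1/latency" and "0.01% events happen constantly" concrete, here is a small arithmetic sketch. It assumes strictly sequential coordinated operations that each pay one full round of latency; the function names and the 10,000 requests per second load are illustrative assumptions, while the latency figures are the 99.99% numbers quoted above.

```python
def max_serial_rate(latency_ms: float) -> float:
    """Upper bound on strictly sequential coordinated operations per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def tail_events_per_second(requests_per_second: float,
                           tail_fraction: float = 1e-4) -> float:
    """Expected number of requests per second that land in the tail."""
    return requests_per_second * tail_fraction


# 99.99% latencies from the summary above, as (low, high) in milliseconds.
TAIL_LATENCY_MS = {
    "same AZ": (50.0, 50.0),
    "same region": (80.0, 80.0),
    "inter-region": (200.0, 400.0),
}

if __name__ == "__main__":
    for name, (low, high) in TAIL_LATENCY_MS.items():
        print(name, max_serial_rate(high), "to", max_serial_rate(low), "ops/s")
    # Assumed load of 10,000 requests per second.
    print(tail_events_per_second(10_000), "tail events per second")
```

Under these assumptions, a chain of coordinated operations that pays the 99.99% latency each time tops out near 20 per second within an AZ and between 2.5 and 5 per second across regions, and a service handling 10,000 requests per second sees about one 0.01% event every second.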

Page 27: Consistency in Distributed Systems, Part 2

Thank god for ACID (New)SQL, right?

… not so fast

Page 28: Consistency in Distributed Systems, Part 2

ACID in the Wild

http://arxiv-web3.library.cornell.edu/abs/1302.0309v1

Page 29: Consistency in Distributed Systems, Part 2

Beware the Marketing

http://arxiv-web3.library.cornell.edu/abs/1302.0309v1

Page 30: Consistency in Distributed Systems, Part 2


Wow!

Page 31: Consistency in Distributed Systems, Part 2

So… What do we use? What should we worry about?

Page 32: Consistency in Distributed Systems, Part 2

Distinguishing Characteristics

1. Locks / Concurrency
2. Relationships / Foreign Keys
3. Inter-index consistency

Page 33: Consistency in Distributed Systems, Part 2

Subjective Classification

| | Cassandra | Cloudant | MongoDB | Riak |
|---|---|---|---|---|
| Locking | Minimal | None | Writes and Reads | Minimal |
| Consistency | Quorum, Optional Paxos | Quorum | Single-document Locks | Quorum, Optional Paxos |
| Relationships, “JOINs” | De-normalize, Materialized Views | Normalize, Materialized Views | De-normalize, Application Joins | De-normalize or Link Walking |
| Leading Strategies | Immutability | Immutability | Fat Documents | Immutability |
| “Intention” | HA, Shared Nothing, Many Servers | HA, Shared Nothing, Many Servers | Master/Slave, Single Server | HA, Shared Nothing, Many Servers |
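The "Quorum" entries can be illustrated with a toy model. The slides do not spell out the rule; the sketch below uses the commonly cited condition that a read of R replicas is guaranteed to see a write acknowledged by W replicas out of N when R + W > N. The helper names are hypothetical, replicas are plain Python tuples, and the read deliberately consults the replicas least likely to have the write so the worst case is visible.

```python
from typing import List, Optional, Tuple

Versioned = Tuple[int, Optional[str]]  # (version, value)


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read set of size r intersects every write set of size w."""
    return r + w > n


def quorum_write(replicas: List[Versioned], w: int, version: int, value: str) -> None:
    """Apply the write to the first w replicas only; the rest lag behind."""
    for i in range(w):
        replicas[i] = (version, value)


def quorum_read(replicas: List[Versioned], r: int) -> Versioned:
    """Consult the last r replicas (worst case for overlap) and keep the newest."""
    return max(replicas[-r:], key=lambda pair: pair[0])


if __name__ == "__main__":
    strict = [(0, None)] * 3
    quorum_write(strict, w=2, version=1, value="a")
    print(quorums_overlap(3, 2, 2), quorum_read(strict, r=2))

    loose = [(0, None)] * 3
    quorum_write(loose, w=1, version=1, value="a")
    print(quorums_overlap(3, 1, 1), quorum_read(loose, r=1))
```

With N=3, W=2, R=2 the read set always shares a replica with the write set, so the newest version wins. With W=1, R=1 the read can land entirely on replicas that never saw the write.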

Page 34: Consistency in Distributed Systems, Part 2

De-normalization

It happens in all NoSQL systems. Is it the application's responsibility or the database's?

Page 35: Consistency in Distributed Systems, Part 2

Relationships as Single Documents

Natural fit for some applications

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Page 36: Consistency in Distributed Systems, Part 2

Relationships as Single Documents

Duplication sucks, pathologically

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
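A minimal sketch of why duplication hurts in single-document ("fat") models: once an author's name is embedded in every post and comment, a rename has to find and rewrite every copy. The document shape, field names, and sample data below are hypothetical and are not taken from the linked post.

```python
from typing import Dict, List


def sample_posts() -> List[Dict]:
    """Hypothetical fat documents: each copy of an author carries its own name."""
    ann = {"id": "u1", "name": "Ann"}
    bob = {"id": "u2", "name": "Bob"}
    return [
        {"_id": "p1", "author": dict(ann),
         "comments": [{"author": dict(bob), "text": "Nice"}]},
        {"_id": "p2", "author": dict(bob),
         "comments": [{"author": dict(ann), "text": "Thanks"}]},
        {"_id": "p3", "author": dict(bob), "comments": []},
    ]


def rename_author(posts: List[Dict], author_id: str, new_name: str) -> int:
    """Rewrite every embedded copy of the author; return how many documents changed."""
    touched = 0
    for post in posts:
        changed = False
        if post["author"]["id"] == author_id:
            post["author"]["name"] = new_name
            changed = True
        for comment in post["comments"]:
            if comment["author"]["id"] == author_id:
                comment["author"]["name"] = new_name
                changed = True
        if changed:
            touched += 1
    return touched
```

Each touched document is a separate write, so readers can observe the old name in one document and the new name in another until the last copy is updated.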

Page 37: Consistency in Distributed Systems, Part 2

Materialized Views Rule

Cassandra, Cloudant: “JOINs” via materialized views

Page 38: Consistency in Distributed Systems, Part 2

Review: Cassandra

‣ Highly Available

‣ CQL eases pain of de-normalization

‣ 1-many, many-many relationships via inserts into multiple column families at update

‣ Eventual consistency as those updates propagate

‣ Can appeal to Paxos API with latency, availability hit
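The "inserts into multiple column families at update" pattern can be sketched without a database. The model below uses plain Python dictionaries in place of column families; the class `MembershipTables` and the table names are hypothetical, and no Cassandra driver or CQL is involved.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class MembershipTables:
    """Two hypothetical query tables, one per access path of a many-many relation."""

    def __init__(self) -> None:
        self.groups_by_user: DefaultDict[str, Set[str]] = defaultdict(set)
        self.users_by_group: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_membership(self, user_id: str, group_id: str) -> None:
        # One logical update becomes one insert per table.
        self.groups_by_user[user_id].add(group_id)
        self.users_by_group[group_id].add(user_id)

    def groups_of(self, user_id: str) -> List[str]:
        return sorted(self.groups_by_user.get(user_id, set()))

    def members_of(self, group_id: str) -> List[str]:
        return sorted(self.users_by_group.get(group_id, set()))
```

Nothing in this sketch ties the two inserts together. In a distributed store a reader can see one table updated before the other, which is the window the slide describes as eventual consistency.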

Page 39: Consistency in Distributed Systems, Part 2

Review: Cloudant

‣ Highly Available

‣ Normalize document structure, include foreign keys to other documents.

‣ Manage foreign key integrity yourself

‣ 1-many, many-many relationships via materialized views

‣ Eventual consistency between primary-index and (batch updated) materialized view
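The gap between a primary index and a batch-updated materialized view can be shown with a toy model. This is a sketch, not the Cloudant API: `BatchView`, `comments_by_post`, and the document fields are hypothetical, and the view is refreshed explicitly rather than by any background process.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, str]
MapFn = Callable[[Doc], Iterable[Tuple[str, str]]]


class BatchView:
    """A view that only reflects the primary index as of its last refresh."""

    def __init__(self, map_fn: MapFn) -> None:
        self.map_fn = map_fn
        self.rows: List[Tuple[str, str]] = []

    def refresh(self, primary: Dict[str, Doc]) -> None:
        rows: List[Tuple[str, str]] = []
        for doc in primary.values():
            rows.extend(self.map_fn(doc))
        self.rows = sorted(rows)

    def query(self, key: str) -> List[str]:
        return [value for row_key, value in self.rows if row_key == key]


def comments_by_post(doc: Doc) -> Iterable[Tuple[str, str]]:
    """Emit (foreign key, document id) for every comment document."""
    if doc.get("type") == "comment":
        yield (doc["post_id"], doc["_id"])
```

A comment written to the primary index after the last `refresh` is readable by its own id straight away but does not appear in `query` until the next refresh. Because the comment stores only a foreign key, nothing stops it from pointing at a post that does not exist.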

Page 40: Consistency in Distributed Systems, Part 2

Review: MongoDB

‣ Understand when MongoDB locks

‣ Go as far as you can with “fat”, de-normalized documents

‣ Beware the consistency subtleties of replica sets, de-normalization

Page 41: Consistency in Distributed Systems, Part 2

Review: Riak

‣ Highly Available

‣ Include foreign keys to other documents.

‣ Manage foreign key integrity yourself

‣ one-way (“graphy”) relationships via link-walking API

‣ Can appeal to Paxos API with latency, availability hit
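One-way link relationships can be sketched as objects that carry tagged links to other keys. The code below is an in-memory illustration only and does not use Riak's link-walking API; `walk`, `sample_store`, and the object layout are hypothetical.

```python
from typing import Dict, List


def sample_store() -> Dict[str, dict]:
    """Hypothetical objects; each link is a (target key, tag) pair."""
    return {
        "alice": {"name": "Alice", "links": [("bob", "friend"),
                                             ("carol", "friend"),
                                             ("ghost", "friend"),
                                             ("acme", "employer")]},
        "bob": {"name": "Bob", "links": []},
        "carol": {"name": "Carol", "links": []},
        "acme": {"name": "Acme", "links": []},
    }


def walk(store: Dict[str, dict], start_key: str, tag: str) -> List[dict]:
    """Follow one hop of links with the given tag, skipping dangling targets."""
    source = store[start_key]
    return [
        store[target]
        for target, link_tag in source.get("links", [])
        if link_tag == tag and target in store
    ]
```

The link from "alice" to "ghost" points at a missing object and is silently skipped, which is the foreign key integrity the application has to manage itself. Walking from "bob" finds nothing, because links only run one way.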

Page 42: Consistency in Distributed Systems, Part 2

My Final $0.02

‣ Time to market should be your #1 concern.

‣ You will probably run both SQL and NoSQL.

‣ We’ve focused on the database, but all new apps need a mobile strategy.

‣ You’ll never engineer a perfect network
  • Focus on Availability and Partition Tolerance

‣ You will need to become advanced/expert in data modeling for your choice of DB

Page 43: Consistency in Distributed Systems, Part 2

cloudant.com

@mlmilleratmit

#Cloudant

IRC

Thanks!

Page 44: Consistency in Distributed Systems, Part 2
