Runaway complexity in Big Data... and a plan to stop it

Nathan Marz (@nathanmarz)

Sep 08, 2014
Transcript
Page 1: Runaway complexity in Big Data... and a plan to stop it

Runaway complexity in Big Data

Nathan Marz (@nathanmarz)

And a plan to stop it

Page 2: Runaway complexity in Big Data... and a plan to stop it

Agenda

• Common sources of complexity in data systems
• Design for a fundamentally better data system

Page 3: Runaway complexity in Big Data... and a plan to stop it

What is a data system?

A system that manages the storage and querying of data

Page 4: Runaway complexity in Big Data... and a plan to stop it

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years

Page 5: Runaway complexity in Big Data... and a plan to stop it

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist

Page 6: Runaway complexity in Big Data... and a plan to stop it

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure

Page 7: Runaway complexity in Big Data... and a plan to stop it

What is a data system?

A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure, and every human mistake ever made

Page 8: Runaway complexity in Big Data... and a plan to stop it

Common sources of complexity

Lack of human fault-tolerance

Schemas done wrong

Conflation of data and queries

Page 9: Runaway complexity in Big Data... and a plan to stop it

Lack of human fault-tolerance

Page 10: Runaway complexity in Big Data... and a plan to stop it

Human fault-tolerance

• Bugs will be deployed to production over the lifetime of a data system
• Operational mistakes will be made
• Humans are part of the overall system, just like your hard disks, CPUs, memory, and software
• Must design for human error like you’d design for any other fault

Page 11: Runaway complexity in Big Data... and a plan to stop it

Human fault-tolerance

Examples of human error

• Deploy a bug that increments counters by two instead of by one
• Accidentally delete data from a database
• Accidental DoS on an important internal service

Page 12: Runaway complexity in Big Data... and a plan to stop it

The worst consequence is data loss or data corruption

Page 13: Runaway complexity in Big Data... and a plan to stop it

As long as an error doesn’t lose or corrupt good data,

you can fix what went wrong

Page 14: Runaway complexity in Big Data... and a plan to stop it

Mutability

• The U and D in CRUD
• A mutable system updates the current state of the world
• Mutable systems inherently lack human fault-tolerance
• Easy to corrupt or lose data

Page 15: Runaway complexity in Big Data... and a plan to stop it

Immutability

• An immutable system captures a historical record of events
• Each event happens at a particular time and is always true

Page 16: Runaway complexity in Big Data... and a plan to stop it

Capturing change with a mutable data model

Person | Location
Sally  | Philadelphia
Bob    | Chicago

Sally moves to New York, and her row is updated in place:

Person | Location
Sally  | New York
Bob    | Chicago

Page 17: Runaway complexity in Big Data... and a plan to stop it

Capturing change with an immutable data model

Person | Location     | Time
Sally  | Philadelphia | 1318358351
Bob    | Chicago      | 1327928370

Sally moves to New York, and a new record is appended:

Person | Location     | Time
Sally  | Philadelphia | 1318358351
Bob    | Chicago      | 1327928370
Sally  | New York     | 1338469380
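The immutable model can be sketched as an append-only list of timestamped facts, with current state derived by a function over the history (a minimal illustration; the function names are made up):

```python
import time

# Append-only list of facts; each fact records what was true at a moment.
facts = []

def record(person, location, timestamp=None):
    """Append a new fact; existing facts are never updated or deleted."""
    facts.append({
        "person": person,
        "location": location,
        "time": timestamp if timestamp is not None else int(time.time()),
    })

def current_location(person):
    """Derive current state from the history: the latest fact wins."""
    history = [f for f in facts if f["person"] == person]
    return max(history, key=lambda f: f["time"])["location"] if history else None

record("Sally", "Philadelphia", 1318358351)
record("Bob", "Chicago", 1327928370)
record("Sally", "New York", 1338469380)  # Sally moves; the old fact remains

print(current_location("Sally"))  # New York
```

A buggy write here can only add a wrong fact, which can later be removed; it cannot destroy the Philadelphia record the way an in-place update can.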

Page 18: Runaway complexity in Big Data... and a plan to stop it

Immutability greatly restricts the range of errors that can cause data loss or data corruption

Page 19: Runaway complexity in Big Data... and a plan to stop it

Vastly more human fault-tolerant

Page 20: Runaway complexity in Big Data... and a plan to stop it

Immutability
Other benefits

• Fundamentally simpler
• CR instead of CRUD
• Only write operation is appending new units of data
• Easy to implement on top of a distributed filesystem
  • File = list of data records
  • Append = add a new file into a directory

Basing a system on mutability is like pouring gasoline on your house (but don’t worry, I checked all the wires carefully to make sure there won’t be any sparks). When someone makes a mistake, who knows what will burn.
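The “file = list of data records, append = add a new file into a directory” idea can be sketched on a local filesystem (a stand-in for a distributed filesystem like HDFS; the function names are made up):

```python
import json
import os
import tempfile
import uuid

# A directory of files as an append-only store:
# file = list of records, append = write a new file into the directory.
def append_records(data_dir, records):
    path = os.path.join(data_dir, f"{uuid.uuid4().hex}.jsonl")
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def read_all(data_dir):
    """Reading all data = reading every file in the directory."""
    out = []
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            out.extend(json.loads(line) for line in f)
    return out

data_dir = tempfile.mkdtemp()
append_records(data_dir, [{"person": "Sally", "location": "Philadelphia"}])
append_records(data_dir, [{"person": "Sally", "location": "New York"}])
print(len(read_all(data_dir)))  # 2
```

Note that no file is ever rewritten: the only write path is creating a new file, which is exactly why this model is so hard to corrupt.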

Page 21: Runaway complexity in Big Data... and a plan to stop it

Immutability

Please watch Rich Hickey’s talks to learn more about the enormous benefits of immutability

Page 22: Runaway complexity in Big Data... and a plan to stop it

Conflation of data and queries

Page 23: Runaway complexity in Big Data... and a plan to stop it

Conflation of data and queries
Normalization vs. denormalization

ID | Name   | Location ID
1  | Sally  | 3
2  | George | 1
3  | Bob    | 3

Location ID | City      | State | Population
1           | New York  | NY    | 8.2M
2           | San Diego | CA    | 1.3M
3           | Chicago   | IL    | 2.7M

Normalized schema

Page 24: Runaway complexity in Big Data... and a plan to stop it

Join is too expensive, so denormalize...

Page 25: Runaway complexity in Big Data... and a plan to stop it

ID | Name   | Location ID | City     | State
1  | Sally  | 3           | Chicago  | IL
2  | George | 1           | New York | NY
3  | Bob    | 3           | Chicago  | IL

Location ID | City      | State | Population
1           | New York  | NY    | 8.2M
2           | San Diego | CA    | 1.3M
3           | Chicago   | IL    | 2.7M

Denormalized schema

Page 26: Runaway complexity in Big Data... and a plan to stop it

Obviously, you prefer all data to be fully normalized

Page 27: Runaway complexity in Big Data... and a plan to stop it

But you are forced to denormalize for performance

Page 28: Runaway complexity in Big Data... and a plan to stop it

Because the way data is modeled, stored, and queried is complected

Page 29: Runaway complexity in Big Data... and a plan to stop it

We will come back to how to build data systems in which these are

disassociated

Page 30: Runaway complexity in Big Data... and a plan to stop it

Schemas done wrong

Page 31: Runaway complexity in Big Data... and a plan to stop it

Schemas have a bad rap

Page 32: Runaway complexity in Big Data... and a plan to stop it

Schemas

• Hard to change
• Get in the way
• Add development overhead
• Require annoying configuration

Page 33: Runaway complexity in Big Data... and a plan to stop it

I know! Use a schemaless database!

Page 34: Runaway complexity in Big Data... and a plan to stop it

This is an overreaction

Page 35: Runaway complexity in Big Data... and a plan to stop it

Confuses the poor implementation of schemas with the value that schemas provide

Page 36: Runaway complexity in Big Data... and a plan to stop it

What is a schema exactly?

Page 37: Runaway complexity in Big Data... and a plan to stop it

function(data unit)

Page 38: Runaway complexity in Big Data... and a plan to stop it

That says whether this data is valid or not

Page 39: Runaway complexity in Big Data... and a plan to stop it

This is useful

Page 40: Runaway complexity in Big Data... and a plan to stop it

Value of schemas

• Structural integrity
• Guarantees on what can and can’t be stored
• Prevents corruption

Page 41: Runaway complexity in Big Data... and a plan to stop it

Otherwise you’ll detect corruption issues at read-time

Page 42: Runaway complexity in Big Data... and a plan to stop it

Potentially long after the corruption happened

Page 43: Runaway complexity in Big Data... and a plan to stop it

With little insight into the circumstances of the corruption

Page 44: Runaway complexity in Big Data... and a plan to stop it

Much better to get an exception where the mistake is made, before it corrupts the database

Page 45: Runaway complexity in Big Data... and a plan to stop it

Saves enormous amounts of time

Page 46: Runaway complexity in Big Data... and a plan to stop it

Why are schemas considered painful?

• Changing the schema is hard (e.g., adding a column to a table)
• Schema is overly restrictive (e.g., cannot do nested objects)
• Require translation layers (e.g., ORM)
• Require more typing (development overhead)

Page 47: Runaway complexity in Big Data... and a plan to stop it

None of these are fundamentally linked with function(data unit)

Page 48: Runaway complexity in Big Data... and a plan to stop it

These are problems in the implementation of schemas, not in schemas themselves

Page 49: Runaway complexity in Big Data... and a plan to stop it

Ideal schema tool

• Data is represented as maps
• Schema tool is a library that helps construct the schema function:
  • Concisely specify required fields and types
  • Insert custom validation logic for fields (e.g., ages are between 0 and 200)
• Built-in support for evolving the schema over time
• Fast and space-efficient serialization/deserialization
• Cross-language

This is easy to use and gets out of your way.

I use Apache Thrift, but it lacks the custom validation logic. I think it could be done better with a Clojure-like data-as-maps approach.

Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it.
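No such tool is assumed to exist; as a sketch of what schema-as-function(data unit) could look like with data as maps and custom validation logic (`make_schema` and its field spec are hypothetical, not a real library):

```python
# A schema is just function(data unit) -> valid or not.
# Required fields/types are specified concisely; validators carry
# custom per-field logic (e.g., ages are between 0 and 200).
def make_schema(fields, validators=()):
    def schema(data):
        for name, ftype in fields.items():
            if name not in data:
                return False, f"missing field: {name}"
            if not isinstance(data[name], ftype):
                return False, f"bad type for {name}"
        for check, message in validators:
            if not check(data):
                return False, message
        return True, "ok"
    return schema

person_schema = make_schema(
    {"name": str, "age": int},
    validators=[(lambda d: 0 <= d["age"] <= 200, "age out of range")],
)

print(person_schema({"name": "Sally", "age": 30}))  # (True, 'ok')
print(person_schema({"name": "Bob", "age": 999}))   # (False, 'age out of range')
```

Run at write time, this rejects the bad record immediately, where the mistake is made, instead of letting it corrupt the store.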

Page 50: Runaway complexity in Big Data... and a plan to stop it

Let’s get provocative

Page 51: Runaway complexity in Big Data... and a plan to stop it

The relational database will be a footnote in history

Page 52: Runaway complexity in Big Data... and a plan to stop it

Not because of SQL, restrictive schemas, or scalability issues

Page 53: Runaway complexity in Big Data... and a plan to stop it

But because of fundamental flaws in the RDBMS approach to managing data

Page 54: Runaway complexity in Big Data... and a plan to stop it

Mutability

Page 55: Runaway complexity in Big Data... and a plan to stop it

Conflating the storage of data with how it is queried

Back in the day, these flaws were features, because space was at a premium. The landscape has changed, and storage is no longer the constraint it once was. So these properties of mutability and conflating data and queries are now major, glaring flaws, because there are better ways to design data systems.

Page 56: Runaway complexity in Big Data... and a plan to stop it

“NewSQL” is misguided

Page 57: Runaway complexity in Big Data... and a plan to stop it

Let’s use our ability to cheaply store massive amounts of data

Page 58: Runaway complexity in Big Data... and a plan to stop it

To do data right

Page 59: Runaway complexity in Big Data... and a plan to stop it

And not inherit the complexities of the past

Page 60: Runaway complexity in Big Data... and a plan to stop it

If SQL’s wrong, and NoSQL isn’t SQL, then NoSQL must be right...

I know! Use a NoSQL database!

Page 61: Runaway complexity in Big Data... and a plan to stop it

NoSQL databases are generally not a step in the right direction

Page 62: Runaway complexity in Big Data... and a plan to stop it

Some aspects are, but not the ones that get all the attention

Page 63: Runaway complexity in Big Data... and a plan to stop it

Still based on mutability and not general purpose

Page 64: Runaway complexity in Big Data... and a plan to stop it

Let’s start from scratch

Let’s see how you design a data system that doesn’t suffer from these complexities

Page 65: Runaway complexity in Big Data... and a plan to stop it

What does a data system do?

Page 66: Runaway complexity in Big Data... and a plan to stop it

Retrieve data that you previously stored?

Get / Put

Page 67: Runaway complexity in Big Data... and a plan to stop it

Not really...

Page 68: Runaway complexity in Big Data... and a plan to stop it

Counterexamples

Store location information on people

Where does Sally live?

What are the most populous locations?

How many people live in a particular location?

Page 69: Runaway complexity in Big Data... and a plan to stop it

Counterexamples

Store pageview information

How many unique visitors over time?

How many pageviews on September 2nd?

Page 70: Runaway complexity in Big Data... and a plan to stop it

Counterexamples

Store transaction history for bank account

How much money do people spend on housing?

How much money does George have?

Page 71: Runaway complexity in Big Data... and a plan to stop it

What does a data system do?

Query = Function(All data)
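The counterexamples above all fit this formulation. A minimal sketch of queries as pure functions over all data (the data and function names here are illustrative):

```python
# Query = Function(All data): every query is a pure function over the
# full dataset, whether it retrieves, aggregates, or transforms.
all_data = [
    {"person": "Sally", "location": "Chicago"},
    {"person": "Bob", "location": "Chicago"},
    {"person": "George", "location": "New York"},
]

def where_lives(data, person):
    """Pure retrieval: 'Where does Sally live?'"""
    return next(d["location"] for d in data if d["person"] == person)

def population_by_location(data):
    """Aggregation: 'How many people live in each location?'"""
    counts = {}
    for d in data:
        counts[d["location"]] = counts.get(d["location"], 0) + 1
    return counts

print(where_lives(all_data, "Sally"))    # Chicago
print(population_by_location(all_data))  # {'Chicago': 2, 'New York': 1}
```

Both queries take the same input, all data; only the function differs.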

Page 72: Runaway complexity in Big Data... and a plan to stop it

Sometimes you retrieve what you stored

Page 73: Runaway complexity in Big Data... and a plan to stop it

Oftentimes you do transformations, aggregations, etc.

Page 74: Runaway complexity in Big Data... and a plan to stop it

Queries as pure functions that take all data as input is the most general formulation

Page 75: Runaway complexity in Big Data... and a plan to stop it

Example query

Total number of pageviews to a URL over a range of time

Page 76: Runaway complexity in Big Data... and a plan to stop it

Example query

Implementation

Page 77: Runaway complexity in Big Data... and a plan to stop it

Too slow: “all data” is petabyte-scale

Page 78: Runaway complexity in Big Data... and a plan to stop it

On-the-fly computation

All data → Query

Page 79: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed view → Query

Page 80: Runaway complexity in Big Data... and a plan to stop it

Precomputed view

Example query

[Diagram: All data (individual pageview records) → Precomputed view → Query result: 2930]

Page 81: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed view → Query

Page 82: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → [Function] → Precomputed view → [Function] → Query

Page 83: Runaway complexity in Big Data... and a plan to stop it

Data system

All data → [Function] → Precomputed view → [Function] → Query

Two problems to solve

Page 84: Runaway complexity in Big Data... and a plan to stop it

Data system

All data → [Function] → Precomputed view → [Function] → Query

How to compute views

Page 85: Runaway complexity in Big Data... and a plan to stop it

Data system

All data → [Function] → Precomputed view → [Function] → Query

How to compute queries from views

Page 86: Runaway complexity in Big Data... and a plan to stop it

Computing views

All data → [Function] → Precomputed view

Page 87: Runaway complexity in Big Data... and a plan to stop it

Function that takes in all data as input

Page 88: Runaway complexity in Big Data... and a plan to stop it

Batch processing

Page 89: Runaway complexity in Big Data... and a plan to stop it

MapReduce

Page 90: Runaway complexity in Big Data... and a plan to stop it

MapReduce is a framework for computing arbitrary functions on

arbitrary data

Page 91: Runaway complexity in Big Data... and a plan to stop it

Expressing those functions

Cascalog

Scalding
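Cascalog and Scalding compile down to MapReduce jobs. As a toy illustration (not Cascalog or Scalding syntax, and single-process rather than distributed), here is the pageview-count view expressed as a map phase and a reduce phase, with hypothetical record fields:

```python
from collections import defaultdict

# Minimal single-process MapReduce sketch: pageviews per (url, hour).
# A real framework runs map and reduce tasks across a cluster.
def map_phase(pageviews):
    for pv in pageviews:
        # Emit key -> 1 for each view; key buckets by URL and hour.
        yield (pv["url"], pv["timestamp"] // 3600), 1

def reduce_phase(pairs):
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value  # sum the counts within each bucket
    return dict(groups)

pageviews = [
    {"url": "/about", "timestamp": 100},
    {"url": "/about", "timestamp": 200},
    {"url": "/home",  "timestamp": 4000},
]
view = reduce_phase(map_phase(pageviews))
print(view)  # {('/about', 0): 2, ('/home', 1): 1}
```

Answering "total pageviews to a URL over a range of time" then just sums a handful of hourly buckets from the view instead of scanning petabytes.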

Page 92: Runaway complexity in Big Data... and a plan to stop it

MapReduce precomputation

All data → [MapReduce workflow] → Batch view #1
All data → [MapReduce workflow] → Batch view #2

Page 93: Runaway complexity in Big Data... and a plan to stop it

Batch view database

Need a database that...

• Is batch-writable from MapReduce

• Has fast random reads

• Examples: ElephantDB, Voldemort

Page 94: Runaway complexity in Big Data... and a plan to stop it

Batch view database

No random writes required!

Page 95: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Simple: ElephantDB is only a few thousand lines of code

Page 96: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Scalable

Page 97: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Highly available

Page 98: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Can be heavily optimized (b/c no random writes)

Page 99: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Normalized

Page 100: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

“Denormalized”

Not exactly denormalization, because you’re doing more than just retrieving data that you stored (you can do aggregations).

You’re able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases.

This is because the batch view is a pure function of all data: it’s hard to get out of sync, and if there’s ever a problem (like a bug in your code that computes the wrong batch view) you can recompute.

It’s also easy to debug problems, since you have the input that produced the batch view. This is not true in a mutable system based on incremental updates.

Page 101: Runaway complexity in Big Data... and a plan to stop it

So we’re done, right?

Page 102: Runaway complexity in Big Data... and a plan to stop it

Not quite...

• A batch workflow is too slow
• Views are out of date

[Timeline: everything up to a few hours before now is absorbed into the batch views; the most recent data is not absorbed]

Just a few hours of data!

Page 103: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

Eventually consistent

Page 104: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

(without the associated complexities)

Page 105: Runaway complexity in Big Data... and a plan to stop it

Properties

All data → [Function] → Batch view

(such as divergent values, vector clocks, etc.)

Page 106: Runaway complexity in Big Data... and a plan to stop it

What’s left?

Precompute views for last few hours of data

Page 107: Runaway complexity in Big Data... and a plan to stop it

Realtime views

Page 108: Runaway complexity in Big Data... and a plan to stop it

NoSQL databases

New data stream → [Stream processor] → Realtime view #1
New data stream → [Stream processor] → Realtime view #2

Page 109: Runaway complexity in Big Data... and a plan to stop it

Application queries

Batch view + Realtime view → [Merge] → result
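A sketch of the merge step, with made-up view contents: the batch view holds counts computed up to a few hours ago, the realtime view holds counts since the last batch run, and the query sums them.

```python
# Query-time merge of the two layers (contents are illustrative).
batch_view = {"/about": 2930, "/home": 1200}   # computed by the batch layer
realtime_view = {"/about": 14, "/contact": 3}  # last few hours only

def query_pageviews(url):
    """Total pageviews = batch count + realtime count for the gap."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query_pageviews("/about"))    # 2944
print(query_pageviews("/contact"))  # 3
```

Merging by addition works here because counts are associative; other view types need a merge function appropriate to their structure.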

Page 110: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed view → Query

Page 111: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed batch view → Query
New data stream → Precomputed realtime view → Query

“Lambda Architecture”

Page 112: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed batch view → Query
New data stream → Precomputed realtime view → Query

Most complex part of system

Page 113: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed batch view → Query
New data stream → Precomputed realtime view → Query

Random-write databases are much more complex. This is where things like vector clocks have to be dealt with if you’re using an eventually consistent NoSQL database.

Page 114: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed batch view → Query
New data stream → Precomputed realtime view → Query

But it only represents a few hours of data

Page 115: Runaway complexity in Big Data... and a plan to stop it

Precomputation

All data → Precomputed batch view → Query
New data stream → Precomputed realtime view → Query

If anything goes wrong, the system auto-corrects. You can continuously discard realtime views, keeping them small.

Page 116: Runaway complexity in Big Data... and a plan to stop it

CAP

The realtime layer decides whether to guarantee C or A:
• If it chooses consistency, queries are consistent
• If it chooses availability, queries are eventually consistent

All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer. If anything goes wrong, it’s *auto-corrected*.

CAP is now a choice, as it should be, rather than a complexity burden. Making a mistake w.r.t. eventual consistency *won’t corrupt* your data.

Page 117: Runaway complexity in Big Data... and a plan to stop it

Eventual accuracy

Sometimes hard to compute exact answer in realtime

Page 118: Runaway complexity in Big Data... and a plan to stop it

Eventual accuracy

Example: unique count

Page 119: Runaway complexity in Big Data... and a plan to stop it

Eventual accuracy

Can compute exact answer in batch layer and approximate answer in realtime layer

Though for functions which can be computed exactly in the realtime layer (e.g. counting), you can achieve full accuracy
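A sketch of the unique-count split, with made-up data: the batch layer computes the exact answer over history, while the realtime layer uses a toy Flajolet-Martin-style estimate (real systems would use something like HyperLogLog; all names here are illustrative).

```python
import hashlib

# Batch layer: exact unique count over the full historical dataset.
def exact_uniques(visitors):
    return len(set(visitors))

# Realtime layer: probabilistic estimate for the last few hours,
# using constant memory instead of storing every visitor.
def approx_uniques(visitors):
    max_zeros = 0
    for v in visitors:
        h = int(hashlib.md5(v.encode()).hexdigest(), 16)
        # Count trailing zero bits of the hash; rare long runs of
        # zeros indicate many distinct elements were seen.
        zeros = (h & -h).bit_length() - 1 if h else 64
        max_zeros = max(max_zeros, zeros)
    return int(2 ** max_zeros / 0.77351)  # Flajolet-Martin correction factor

historical = [f"user{i}" for i in range(1000)]
recent = [f"user{i}" for i in range(950, 1010)]
print(exact_uniques(historical))  # 1000
print(approx_uniques(recent))     # rough estimate, not exact
```

Each batch run replaces the approximate realtime contribution with an exact count, so any estimation error is temporary.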

Page 120: Runaway complexity in Big Data... and a plan to stop it

Eventual accuracy

Best of both worlds of performance and accuracy

Page 121: Runaway complexity in Big Data... and a plan to stop it

Tools

All data (Hadoop) → Precomputed batch view (ElephantDB, Voldemort) → Query
New data stream (Kafka) → [Storm] → Precomputed realtime view (Cassandra, Riak, HBase) → Query

“Lambda Architecture”

Page 122: Runaway complexity in Big Data... and a plan to stop it

Lambda Architecture

• Can discard batch views and realtime views and recreate everything from scratch
• Mistakes corrected via recomputation
• Data storage layer optimized independently from query resolution layer

What mistakes can be made?
• Wrote bad data? Remove the data and recompute the views.
• Bug in the functions that compute a view? Recompute the view.
• Bug in a query function? Just deploy the fix.

Page 123: Runaway complexity in Big Data... and a plan to stop it

Future

• Abstraction over batch and realtime
• More data structure implementations for batch and realtime views

Page 124: Runaway complexity in Big Data... and a plan to stop it

Learn more

http://manning.com/marz

Page 125: Runaway complexity in Big Data... and a plan to stop it

Questions?