Transcript
Page 1:

The Return of Big Iron?

Ben Stopford, Distinguished Engineer

RBS Markets

Page 2:

Much diversity

Page 3:

What does this mean?

• A change in what customers (we) value

• The mainstream is not serving customers (us) sufficiently

Page 4:

The Database field has problems

Page 5:

We Lose: Joe Hellerstein (Berkeley) 2001

“Databases are commoditised and cornered to slow-moving, evolving, structure-intensive applications that require schema evolution.” … “The internet companies are lost and we will remain in the doldrums of the enterprise space.” … “Databases are black boxes which require a lot of coaxing to get maximum performance.”

Page 6:

His question was: how do we win them back?

Page 7:

These new technologies also caused frustration

Page 8:

Backlash (2009)

• Not novel (dates back to the ’80s)
• Works at the physical level, not the logical level (messy?)
• Incompatible with tooling
• Lack of (referential) integrity & ACID
• MapReduce is brute force, ignoring indexing and skew

Page 9:

All points are reasonable

Page 10:

And they proved it too!

“A Comparison of Approaches to Large-Scale Data Analysis” – SIGMOD 2009

• Vertica vs. DBMS-X vs. Hadoop

• Vertica up to 7x faster than Hadoop across the benchmarks

Databases faster than Hadoop

Page 11:

But possibly missed the point?

Page 12:

Databases were traditionally designed to keep data safe

Page 13:

NoSQL grew from a need to scale

Page 14:

Page 15:

It’s more than just scale: they facilitate different practices

Page 16:

A Better Fit

They better match the way software is engineered today:
– Iterative development
– Fast feedback
– Frequent releases

Page 17:

Is NoSQL a Disruptive Technology?

Christensen’s observation: market leaders are displaced when markets shift in ways that the incumbent leaders are not prepared for.

Page 18:

Aside: MongoDB

• Impressive trajectory
• Slightly crappy product (from a traditional database standpoint)
• Most closely related to the relational database (of the NoSQLs)
• Plays to the agile mindset

Page 19:

Yet the NoSQL market is relatively small

• Currently worth around $600 million, but projected to grow strongly

• The database and systems-management market is worth around $34 billion

Page 20:

There is more to NoSQL than just scale; it sits better with the way we build software today.

Key Point

Page 21:

We have new building blocks to play with!

Page 22:

My Problem

• Sprawling application space, built over many years, grouped into both vertical and horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time-consuming and technically challenging

Page 23:

Traditional solutions (in chronological order)

– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation

Page 24:

Bringing data, applications, people together is hard

Page 25:

A popular choice is an EDW

Page 26:

EDW pattern is workable, but tough

– As soon as you take a ‘view’ on what the shape of the data is, it becomes harder to change.
  • Leave ‘taking a view’ to the last responsible moment.
– Multifaceted: shape, diversity of source, diversity of population, temporal change

Page 27:

Harder to do iteratively

Page 28:

Is this the only way?

Page 29:

The Google Approach

MapReduce

Google Filesystem

BigTable

Tenzing

Megastore

F1

Dremel

Spanner

Page 30:

And just one code base!

So no enterprise schema secret society!

Page 31:

The eBay Approach

Page 32:

The Partial-Schematic Approach

Often termed ‘CLOBs & cracking’
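
To make this concrete, here is a minimal sketch of the ‘CLOBs & cracking’ idea (the table and field names are invented, and SQLite is used purely for illustration): a few fields are ‘cracked’ out of each message into ordinary queryable columns, while the complete message is kept verbatim in a CLOB so nothing is thrown away.

```python
# Sketch of the partial-schematic ("CLOBs & cracking") approach.
# Hypothetical trade fields: a few columns are "cracked" out for querying,
# the full message is stored untouched in a CLOB.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trade (
        trade_id   TEXT PRIMARY KEY,  -- cracked-out, queryable columns
        book       TEXT,
        trade_date TEXT,
        document   TEXT               -- the complete message, kept as a CLOB
    )""")

message = {  # the full, poly-structured message as it arrived
    "tradeId": "T-42", "book": "RATES-LDN", "tradeDate": "2013-05-01",
    "legs": [{"notional": 1000000, "ccy": "GBP"}],  # detail we haven't modelled
}
conn.execute(
    "INSERT INTO trade VALUES (?, ?, ?, ?)",
    (message["tradeId"], message["book"], message["tradeDate"], json.dumps(message)),
)

# Query on a cracked column, then recover unmodelled detail from the CLOB.
row = conn.execute("SELECT document FROM trade WHERE book = ?", ("RATES-LDN",)).fetchone()
print(json.loads(row[0])["legs"][0]["ccy"])  # -> GBP
```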

Page 33:

Problems with solidifying a schematic representation

• Risk of throwing information away, keeping only what you think you need
  – OK if you create the data
  – Bad if you got the data from elsewhere

• Data tends to be poly-structured in programs and on the wire

• Early-binding slows down development

Page 34:

But schemas are good

• They guarantee a contract
• That contract spans the whole dataset
  – Similar to static typing in programming languages

Page 35:

Compromise positions

• Query schema can be a subset of data schema.

• Use schemaless databases to capture diversity early and evolve it as you build.

Page 36:

Common solutions today use multiple technologies

Page 37:

We use a late-bound schema, sitting over a schemaless store


Page 38:

Evolutionary Approach

• Late binding makes consolidation incremental
  – Schematic representation delivered at the ‘last responsible moment’ (schema on demand; see the sketch below)
  – A trade in this model has 4 mandatory nodes; a fully modelled trade has around 800.

• The system of record is the raw data, not our ‘view’ of it

• No schema migration! But this comes at a price.
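
As a rough illustration of ‘schema on demand’ (the store, the field names and the choice of mandatory nodes here are all hypothetical): the raw document stays the system of record, and a small query schema is projected out only at read time, so the projection can grow without migrating any stored data.

```python
# Sketch of a late-bound schema over a schemaless store.
# Raw documents are the system of record; a small query schema is bound at
# read time ("schema on demand") and can be grown without any migration.
import json

raw_store = {  # schemaless store: trade id -> the raw document, as received
    "T-42": json.dumps({
        "tradeId": "T-42", "book": "RATES-LDN",
        "counterparty": "ACME", "tradeDate": "2013-05-01",
        "salesPerson": "jb",  # ...plus many more optional nodes in practice
    }),
}

# The late-bound query schema: just the mandatory nodes, extended as needed.
QUERY_SCHEMA = ("tradeId", "book", "counterparty", "tradeDate")

def project(trade_id, fields=QUERY_SCHEMA):
    """Bind the schema at read time; absent fields simply come back as None."""
    doc = json.loads(raw_store[trade_id])
    return {f: doc.get(f) for f in fields}

print(project("T-42"))
print(project("T-42", QUERY_SCHEMA + ("salesPerson",)))  # schema grown, no migration
```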

Page 39:

Scaling

Page 40:

Key based access always scales


Page 41:

But queries (without the sharding key) always broadcast


Page 42:

As query complexity increases so does the overhead


Page 43:

Coarse-grained shards

Page 44:

Data Replicas provide hardware isolation


Page 45:

Scaling

• Key-based sharding is only sufficient for very simple workloads (see the routing sketch after this list)

• Coarse-grained shards help (but suffer from skew)

• Replication provides useful, if expensive, hardware isolation

• Workload management is less useful in my experience
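
The routing behaviour behind these points can be shown with a toy sharded store (the shard count and hashing scheme are made up): a read by the sharding key touches exactly one node, whereas a predicate on any other field has to be scattered to every shard and the results gathered back.

```python
# Toy sharded store: key-based reads hit one shard, other queries broadcast.
N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # each dict stands in for a node

def shard_for(key):
    return hash(key) % N_SHARDS             # route by the sharding key

def put(key, doc):
    shards[shard_for(key)][key] = doc

def get(key):
    return shards[shard_for(key)].get(key)  # touches exactly one shard

def query(predicate):
    # No sharding key in the predicate: every shard must be asked (scatter/gather).
    return [doc for shard in shards for doc in shard.values() if predicate(doc)]

put("T-1", {"id": "T-1", "book": "RATES"})
put("T-2", {"id": "T-2", "book": "FX"})
print(get("T-1"))                            # single-shard lookup
print(query(lambda d: d["book"] == "FX"))    # broadcast across all shards
```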

Page 46:

Weak consistency forces the problem onto the developer

Particularly bad for banks!

Page 47:

Scaling two-phase commit is hard to do efficiently

• Requires distributed lock/clock/counter

• Requires synchronisation of all readers & writers

Page 48:

Alternatives to traditional 2PC

• MVCC over explicit locking
• Timestamp-based strong consistency (e.g. Granola)
• Optimistic concurrency control (sketched below)
  – Leverage short-running transactions (avoid cross-network transactions)
  – Tolerate different temporal viewpoints to reduce synchronisation costs
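
As a minimal, single-process sketch of the optimistic route (not the implementation described in the talk): each record carries a version, and a writer re-reads and retries on conflict instead of holding locks across the network.

```python
# Minimal optimistic concurrency control: versioned records, check then retry.
class ConflictError(Exception):
    pass

store = {"acct": (0, 100)}  # key -> (version, value)

def read(key):
    return store[key]

def write(key, expected_version, new_value):
    version, _ = store[key]
    if version != expected_version:          # someone else committed first
        raise ConflictError(key)
    store[key] = (version + 1, new_value)    # commit bumps the version

def add(key, delta, retries=3):
    for _ in range(retries):
        version, value = read(key)
        try:
            write(key, version, value + delta)  # short transaction, no locks held
            return
        except ConflictError:
            continue                            # re-read and retry
    raise RuntimeError("too much contention")

add("acct", -30)
print(store["acct"])  # -> (1, 70)
```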

Page 49:

Immutable Data

• Safety
• ‘As was’ view (see the sketch below)
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
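
A small sketch of the append-only idea (the key naming and timestamps are invented): updates add a new timestamped version rather than overwriting, so an ‘as was’ view is just a read at an earlier timestamp, with storage growth as the price.

```python
# Append-only store: facts are never overwritten, so 'as was' views are cheap.
history = {}  # key -> list of (timestamp, value), appended in timestamp order

def assert_fact(key, timestamp, value):
    history.setdefault(key, []).append((timestamp, value))  # append, never update

def as_of(key, timestamp):
    """Return the value that was current at `timestamp` (the 'as was' view)."""
    value = None
    for ts, v in history.get(key, []):
        if ts <= timestamp:
            value = v
        else:
            break
    return value

assert_fact("trade:T-42/status", 1, "NEW")
assert_fact("trade:T-42/status", 5, "AMENDED")
assert_fact("trade:T-42/status", 9, "CANCELLED")

print(as_of("trade:T-42/status", 6))  # -> AMENDED (what we believed at t=6)
print(as_of("trade:T-42/status", 9))  # -> CANCELLED (the latest view)
```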

Page 50:

Use joins to avoid ‘over-aggregating’

Joins are OK, so long as they are:
– local
– via a unique key
(see the sketch below)

[Diagram: Trade – Party – Trader]
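
A toy version of such a join (the entity names come from the slide; the data and layout are invented): trades reference a party by its unique key and both live on the same node, so the join is one keyed lookup per trade, with no cross-shard traffic and no need to copy party data into every trade document.

```python
# Local join via a unique key: Trade rows reference Party by primary key and
# are co-located with it, so the join is a cheap lookup per trade.
parties = {  # partyId -> party (unique key)
    "P-1": {"partyId": "P-1", "name": "ACME Corp"},
    "P-2": {"partyId": "P-2", "name": "Globex"},
}
trades = [
    {"tradeId": "T-42", "partyId": "P-1", "notional": 1000000},
    {"tradeId": "T-43", "partyId": "P-2", "notional": 250000},
]

def trades_with_party():
    # Join locally: one keyed lookup per trade, no broadcast, and no
    # "over-aggregating" of party data into every trade document.
    return [{**t, "partyName": parties[t["partyId"]]["name"]} for t in trades]

for row in trades_with_party():
    print(row["tradeId"], row["partyName"])
```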

Page 51:

Memory/Disk Tradeoff

• Memory only (possibly overplayed)
• Pinned indexes (generally a good idea if you can afford the RAM; see the sketch below)
• Disk resident (the best general-purpose solution, and for very large datasets)
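
The ‘pinned index’ middle ground can be sketched in a few lines (the file layout and record format are invented for illustration): the index lives in RAM while the records stay on disk, so a lookup costs one in-memory hash probe plus one seek.

```python
# Sketch of a pinned in-memory index over disk-resident, append-only records.
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "trades.log")
index = {}  # tradeId -> byte offset, pinned in memory

with open(path, "ab") as f:  # the records themselves stay on disk
    for doc in ({"tradeId": "T-42", "book": "RATES"},
                {"tradeId": "T-43", "book": "FX"}):
        index[doc["tradeId"]] = f.tell()
        f.write((json.dumps(doc) + "\n").encode("utf-8"))

def lookup(trade_id):
    with open(path, "rb") as f:
        f.seek(index[trade_id])   # one in-memory lookup, one disk seek
        return json.loads(f.readline())

print(lookup("T-43"))  # -> {'tradeId': 'T-43', 'book': 'FX'}
```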

Page 52:

Balance flexibility and complexity

Page 53:

Supple at the front, more rigid at the back

Page 54:

Principles

• Record everything
• Grow a schema, don’t define it all upfront
• Avoid using a ‘view’ as your system of record
• Differentiate between sourced data (out of your control) and generated data (in your control)
• Use automated replication (for isolation) as well as sharding (for scale)
• Leverage asynchronicity to reduce transaction overheads

Page 55:

Consolidation means more trust, fewer impedance mismatches, and managing tighter couplings.

Page 56:

Target architectures are starting to look more like large applications built from cloud-enabled services than like heterogeneous application conglomerates.

Page 57:

Are we going back to the mainframe?

Page 58:

Thanks

http://www.benstopford.com