The Return of Big Iron?
Ben Stopford, Distinguished Engineer
RBS Markets
May 27, 2015
Much diversity
What does this mean?
• A change in what customers (we) value
• The mainstream is not serving customers (us) sufficiently
The Database field has problems
We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered to slow-moving, evolving, structure-intensive applications that require schema evolution.” … “The internet companies are lost and we will remain in the doldrums of the enterprise space.” … “Databases are black boxes which require a lot of coaxing to get maximum performance.”
His question: how do we win them back?
These new technologies also caused frustration
Backlash (2009)
– Not novel (dates back to the 80s)
– Physical level, not the logical level (messy?)
– Incompatible with tooling
– Lack of (referential) integrity & ACID
– MR is brute force, ignoring indexing and skew
All points are reasonable
And they proved it too!
“A Comparison of Approaches to Large-Scale Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs. Hadoop
• Vertica up to 7x faster than Hadoop across the benchmarks
Databases faster than Hadoop
But possibly missed the point?
Databases were traditionally designed to keep data safe
NoSQL grew from a need to scale
It’s more than just scale: they facilitate different practices
A Better Fit
They better match the way software is engineered today:
– Iterative development
– Fast feedback
– Frequent releases
Is NoSQL a Disruptive Technology?
Christensen’s observation: market leaders are displaced when markets shift in ways that the incumbents are not prepared for.
Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional database standpoint)
• Most closely related to relational DBs (of the NoSQLs)
• Plays to the agile mindset
Yet the NoSQL market is relatively small
• Currently around $600m, but projected to grow strongly
• The database and systems-management market is worth around $34 billion
There is more to NoSQL than just scale; it sits better with the way we build software today
Key Point
We have new building blocks to play with!
My Problem
• Sprawling application space, built over many years, grouped into both vertical and horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time-consuming and technically challenging
Traditional solutions (in chronological order)
– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation
Bringing data, applications and people together is hard
A popular choice is an EDW
EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the shape of the data is, it becomes harder to change; leave ‘taking a view’ to the last responsible moment
– Multifaceted: shape, diversity of source, diversity of population, temporal change
– Harder to do iteratively
Is this the only way?
The Google Approach
– MapReduce
– Google File System (GFS)
– BigTable
– Tenzing
– Megastore
– F1
– Dremel
– Spanner
And just one code base!
So no enterprise schema secret society!
The eBay Approach
The Partial-Schematic Approach
Often termed ‘CLOBs & cracking’
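As a rough illustration of the pattern (a minimal sketch, assuming a relational store and JSON blobs; the table and field names are invented): a few hot fields are ‘cracked’ out into real columns, while the full message is kept verbatim as a CLOB.

```python
import json
import sqlite3

# Hypothetical trade message; only a few fields are promoted to columns,
# the full document is preserved as a CLOB.
trade = {"id": "T1", "book": "RATES-1", "notional": 5_000_000,
         "legs": [{"pay": "FIXED"}, {"pay": "FLOAT"}]}

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE trades (
    id   TEXT PRIMARY KEY,  -- cracked: used for point lookups
    book TEXT,              -- cracked: used for filtering
    doc  TEXT               -- CLOB: the untouched source document
)""")
db.execute("INSERT INTO trades VALUES (?, ?, ?)",
           (trade["id"], trade["book"], json.dumps(trade)))

# Queries hit the cracked columns; the blob is only parsed on read.
row = db.execute("SELECT doc FROM trades WHERE book = 'RATES-1'").fetchone()
print(json.loads(row[0])["legs"])  # the unmodelled detail is still there
```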
Problems with solidifying a schematic representation
• Risk of throwing information away, keeping only what you think you need
  – OK if you create the data
  – Bad if you got the data from elsewhere
• Data tends to be poly-structured in programs and on the wire
• Early binding slows down development
But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
  – Similar to static typing in programming languages
Compromise positions
• The query schema can be a subset of the data schema
• Use schemaless databases to capture diversity early, and evolve a schema as you build (sketched below)
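A minimal sketch of the query-schema-as-subset idea, using plain dicts to stand in for schemaless documents (all field names invented): records of different vintages carry different fields, but a query touching only the agreed subset works across all of them.

```python
# Documents captured early, before the full shape was agreed; newer
# records carry extra fields the old ones lack.
trades = [
    {"id": "T1", "book": "RATES-1", "notional": 1_000_000},
    {"id": "T2", "book": "RATES-1", "notional": 2_000_000,
     "counterparty": "ACME", "csa": {"threshold": 0}},
]

QUERY_SCHEMA = ("id", "book", "notional")  # the agreed subset

def project(doc):
    """Apply the query schema; ignore whatever else the document holds."""
    return {field: doc[field] for field in QUERY_SCHEMA}

print([project(t) for t in trades])  # works for both vintages
```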
Common solutions today use multiple technologies
We use a late-bound schema, sitting over a schemaless store
Late-Bound Schema
Evolutionary Approach
• Late binding makes consolidation incremental
  – Schematic representation delivered at the ‘last responsible moment’ (schema on demand – see the sketch below)
  – A trade in this model has 4 mandatory nodes; a fully modelled trade has around 800
• The system of record is raw data, not our ‘view’ of it
• No schema migration! But this comes at a price
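A sketch of schema on demand under the assumptions above (the store, field names and choice of mandatory nodes are all illustrative): the raw document stays the system of record, and a schematic view is bound at read time.

```python
# The raw message is stored verbatim; nothing below ever rewrites it.
RAW_STORE = {"T1": {"id": "T1", "book": "RATES-1", "trade_date": "2015-05-27",
                    "notional": 5_000_000, "venue": "XOFF", "clearing": "LCH"}}

MANDATORY = ("id", "book", "trade_date", "notional")  # illustrative 4 nodes

def bind_schema(trade_id, fields=MANDATORY):
    """Bind a schematic view at read time (the 'last responsible moment')."""
    raw = RAW_STORE[trade_id]
    missing = [f for f in fields if f not in raw]
    if missing:
        raise ValueError(f"trade {trade_id} missing mandatory nodes: {missing}")
    return {f: raw[f] for f in fields}

# Richer views can be bound later without migrating stored data:
print(bind_schema("T1"))
print(bind_schema("T1", fields=MANDATORY + ("venue",)))
```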
Scaling
Key-based access always scales
But queries (without the sharding key) always broadcast
As query complexity increases, so does the overhead
Coarse-grained shards
Data replicas provide hardware isolation
Scaling
• Key-based sharding is only sufficient for very simple workloads (see the sketch below)
• Coarse-grained shards help (but suffer from skew)
• Replication provides useful, if expensive, hardware isolation
• Workload management is less useful in my experience
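A toy router makes the first two points concrete (node count and data are invented): a lookup carrying the sharding key touches one node, while a predicate query must scatter-gather across all of them.

```python
NODES = 4
shards = [dict() for _ in range(NODES)]  # one dict per (simulated) node

def put(key, value):
    shards[hash(key) % NODES][key] = value

def get(key):
    # Key-based access: the router knows exactly which node to ask.
    return shards[hash(key) % NODES].get(key)

def query(predicate):
    # No sharding key: scatter-gather across every node.
    return [v for shard in shards for v in shard.values() if predicate(v)]

put("T1", {"id": "T1", "book": "RATES-1"})
put("T2", {"id": "T2", "book": "FX-3"})
print(get("T1"))                             # touches 1 node
print(query(lambda t: t["book"] == "FX-3"))  # touches all 4 nodes
```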
Weak consistency forces the problem onto the developer
Particularly bad for banks!
Scaling two-phase commit is hard to do efficiently (sketched below)
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers & writers
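To see where the synchronisation cost comes from, here is a bare-bones 2PC coordinator over in-memory participants (no failure handling; purely illustrative): nobody commits until everybody has voted, so a single straggler stalls the whole group.

```python
class Participant:
    def __init__(self, name):
        self.name, self.staged = name, None
    def prepare(self, value):  # phase 1: stage the write and vote
        self.staged = value
        return True            # vote "yes" (a real participant could abort)
    def commit(self):          # phase 2: apply the staged write
        print(f"{self.name} committed {self.staged}")

def two_phase_commit(participants, value):
    # Phase 1: everyone must prepare before anyone may commit.
    if not all(p.prepare(value) for p in participants):
        return False  # any "no" vote aborts the whole transaction
    # Phase 2: only now can participants commit.
    for p in participants:
        p.commit()
    return True

two_phase_commit([Participant("node-A"), Participant("node-B")], {"trade": "T1"})
```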
Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp-based strong consistency – e.g. Granola
• Optimistic concurrency control (sketched below)
  – Leverage short-running transactions (avoid cross-network transactions)
  – Tolerate different temporal viewpoints to reduce synchronisation costs
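A minimal optimistic-concurrency sketch (the record shape and versioning scheme are invented): writers work without holding locks, then validate with a compare-and-swap on a version counter at commit time.

```python
# A record carries a version; no locks are held while working.
store = {"T1": {"version": 1, "notional": 1_000_000}}

def commit(key, expected_version, new_fields):
    """Compare-and-swap: succeed only if nobody committed in between."""
    record = store[key]
    if record["version"] != expected_version:
        return False  # conflict: caller re-reads and retries
    record.update(new_fields, version=expected_version + 1)
    return True

snapshot = dict(store["T1"])  # optimistic read, no lock taken
print(commit("T1", snapshot["version"], {"notional": 2_000_000}))  # True
print(commit("T1", snapshot["version"], {"notional": 3_000_000}))  # False: stale
```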
Immutable Data
• Safety
• ‘As was’ view (sketched below)
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
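A sketch of the ‘as was’ view over an append-only store (timestamps and record shape invented): updates append new versions rather than overwriting, so any historical state can be re-read.

```python
import bisect

# Append-only: every write adds a (timestamp, value) pair; nothing mutates.
history = {"T1": [(1, {"notional": 1_000_000}),
                  (5, {"notional": 2_000_000})]}

def as_of(key, t):
    """Return the value as it was at time t (the 'as was' view)."""
    versions = history[key]
    i = bisect.bisect_right([ts for ts, _ in versions], t)
    return versions[i - 1][1] if i else None

print(as_of("T1", 3))  # {'notional': 1000000} - the historical view
print(as_of("T1", 9))  # {'notional': 2000000} - the latest view
```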
Use joins to avoid ‘over-aggregating’
Joins are OK, so long as they are:
– Local
– Via a unique key
(e.g. a Trade joined to its Party and Trader – sketched below)
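A sketch of the rule using the slide’s Trade, Party and Trader entities (field names invented): both sides of each join are co-located on one shard and keyed uniquely, so the join never crosses the network.

```python
# All three tables for one trade live on the same (simulated) shard,
# so the join is a pair of local hash lookups on unique keys.
shard = {
    "trades":  {"T1": {"id": "T1", "party_id": "P9", "trader_id": "TR2"}},
    "parties": {"P9": {"id": "P9", "name": "ACME Corp"}},
    "traders": {"TR2": {"id": "TR2", "name": "J. Smith"}},
}

def local_join(trade_id):
    trade = shard["trades"][trade_id]
    return {**trade,
            "party":  shard["parties"][trade["party_id"]],   # unique key
            "trader": shard["traders"][trade["trader_id"]]}  # unique key

print(local_join("T1"))
```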
Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally a good idea if you can afford the RAM)
• Disk resident (the best general-purpose solution, and for very large datasets)
Balance flexibility and complexity
Supple at the front, more rigid at the back
Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record
• Differentiate between sourced data (out of your control) and generated data (in your control)
• Use automated replication (for isolation) as well as sharding (for scale)
• Leverage asynchronicity to reduce transaction overheads
Consolidation means more trust, fewer impedance mismatches, and managing tighter couplings
Target architectures are starting to look more like large applications built from cloud-enabled services than heterogeneous application conglomerates
Are we going back to the mainframe?
Thanks
http://www.benstopford.com