The Return of Big Iron?
Ben Stopford, Distinguished Engineer
RBS Markets
May 27, 2015
Much diversity
What does this mean?
• A change in what customers (we) value
• The mainstream is not serving customers (us) sufficiently
The Database field has problems
We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered to slow-moving, evolving, structure-intensive applications that require schema evolution.” … “The internet companies are lost and we will remain in the doldrums of the enterprise space.” … “Databases are black boxes which require a lot of coaxing to get maximum performance.”
His question: how do we win them back?
These new technologies also caused frustration
Backlash (2009)
– Not novel (dates back to the 80s)
– Physical level, not the logical level (messy?)
– Incompatible with tooling
– Lack of (referential) integrity & ACID
– MR is brute force, ignoring indexing and skew
All points are reasonable
And they proved it too!
“A Comparison of Approaches to Large-Scale Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs. Hadoop
• Vertica up to 7x faster than Hadoop across the benchmarks
Databases faster than Hadoop
But possibly missed the point?
Databases were traditionally designed to keep data safe
NoSQL grew from a need to scale
It’s more than just scale: they facilitate different practices
A Better Fit
They better match the way software is engineered today:
– Iterative development
– Fast feedback
– Frequent releases
Is NoSQL a Disruptive Technology?
Christensen’s observation: market leaders are displaced when markets shift in ways that the incumbents are not prepared for.
Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional database standpoint)
• Most closely related to relational DBs (of the NoSQLs)
• Plays to the agile mindset
Yet the NoSQL market is relatively small
• Currently around $600m, but projected to grow strongly
• The database and systems-management market is worth around $34 billion
There is more to NoSQL than just scale; it sits better with the way we build software today
Key Point
We have new building blocks to play with!
My Problem
• Sprawling application space, built over many years, grouped into both vertical and horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time-consuming and technically challenging
Traditional solutions (in chronological order)
– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation
Bringing data, applications and people together is hard
A popular choice is an EDW
EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the shape of the data is, it becomes harder to change; leave ‘taking a view’ to the last responsible moment
– Multifaceted: shape, diversity of source, diversity of population, temporal change
– Harder to do iteratively
Is this the only way?
The Google Approach
– MapReduce
– Google File System (GFS)
– BigTable
– Tenzing
– Megastore
– F1
– Dremel
– Spanner
And just one code base!
So no enterprise schema secret society!
The eBay Approach
The Partial-Schematic Approach
Often termed ‘CLOBs & cracking’
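As a rough illustration of the pattern (a minimal sketch, assuming a relational store and JSON blobs; the table and field names are invented): a few hot fields are ‘cracked’ out into real columns, while the full message is kept verbatim as a CLOB.

```python
import json
import sqlite3

# Hypothetical trade message; only a few fields are promoted to columns,
# the full document is preserved as a CLOB.
trade = {"id": "T1", "book": "RATES-1", "notional": 5_000_000,
         "legs": [{"pay": "FIXED"}, {"pay": "FLOAT"}]}

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE trades (
    id   TEXT PRIMARY KEY,  -- cracked: used for point lookups
    book TEXT,              -- cracked: used for filtering
    doc  TEXT               -- CLOB: the untouched source document
)""")
db.execute("INSERT INTO trades VALUES (?, ?, ?)",
           (trade["id"], trade["book"], json.dumps(trade)))

# Queries hit the cracked columns; the blob is only parsed on read.
row = db.execute("SELECT doc FROM trades WHERE book = 'RATES-1'").fetchone()
print(json.loads(row[0])["legs"])  # the unmodelled detail is still there
```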
Problems with solidifying a schematic representation
• Risk of throwing information away, keeping only what you think you need
  – OK if you create the data
  – Bad if you got the data from elsewhere
• Data tends to be poly-structured in programs and on the wire
• Early binding slows down development
But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
  – Similar to static typing in programming languages
Compromise positions
• The query schema can be a subset of the data schema
• Use schemaless databases to capture diversity early, and evolve a schema as you build (sketched below)
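A minimal sketch of the query-schema-as-subset idea, using plain dicts to stand in for schemaless documents (all field names invented): records of different vintages carry different fields, but a query touching only the agreed subset works across all of them.

```python
# Documents captured early, before the full shape was agreed; newer
# records carry extra fields the old ones lack.
trades = [
    {"id": "T1", "book": "RATES-1", "notional": 1_000_000},
    {"id": "T2", "book": "RATES-1", "notional": 2_000_000,
     "counterparty": "ACME", "csa": {"threshold": 0}},
]

QUERY_SCHEMA = ("id", "book", "notional")  # the agreed subset

def project(doc):
    """Apply the query schema; ignore whatever else the document holds."""
    return {field: doc[field] for field in QUERY_SCHEMA}

print([project(t) for t in trades])  # works for both vintages
```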
Common solutions today use multiple technologies
We use a late-bound schema, sitting over a schemaless store
Late-Bound Schema
Evolutionary Approach
• Late binding makes consolidation incremental
  – Schematic representation delivered at the ‘last responsible moment’ (schema on demand – see the sketch below)
  – A trade in this model has 4 mandatory nodes; a fully modelled trade has around 800
• The system of record is raw data, not our ‘view’ of it
• No schema migration! But this comes at a price
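A sketch of schema on demand under the assumptions above (the store, field names and choice of mandatory nodes are all illustrative): the raw document stays the system of record, and a schematic view is bound at read time.

```python
# The raw message is stored verbatim; nothing below ever rewrites it.
RAW_STORE = {"T1": {"id": "T1", "book": "RATES-1", "trade_date": "2015-05-27",
                    "notional": 5_000_000, "venue": "XOFF", "clearing": "LCH"}}

MANDATORY = ("id", "book", "trade_date", "notional")  # illustrative 4 nodes

def bind_schema(trade_id, fields=MANDATORY):
    """Bind a schematic view at read time (the 'last responsible moment')."""
    raw = RAW_STORE[trade_id]
    missing = [f for f in fields if f not in raw]
    if missing:
        raise ValueError(f"trade {trade_id} missing mandatory nodes: {missing}")
    return {f: raw[f] for f in fields}

# Richer views can be bound later without migrating stored data:
print(bind_schema("T1"))
print(bind_schema("T1", fields=MANDATORY + ("venue",)))
```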
Scaling
Key-based access always scales
But queries (without the sharding key) always broadcast
As query complexity increases, so does the overhead
Coarse-grained shards
Data replicas provide hardware isolation
Scaling
• Key-based sharding is only sufficient for very simple workloads (see the sketch below)
• Coarse-grained shards help (but suffer from skew)
• Replication provides useful, if expensive, hardware isolation
• Workload management is less useful in my experience
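A toy router makes the first two points concrete (node count and data are invented): a lookup carrying the sharding key touches one node, while a predicate query must scatter-gather across all of them.

```python
NODES = 4
shards = [dict() for _ in range(NODES)]  # one dict per (simulated) node

def put(key, value):
    shards[hash(key) % NODES][key] = value

def get(key):
    # Key-based access: the router knows exactly which node to ask.
    return shards[hash(key) % NODES].get(key)

def query(predicate):
    # No sharding key: scatter-gather across every node.
    return [v for shard in shards for v in shard.values() if predicate(v)]

put("T1", {"id": "T1", "book": "RATES-1"})
put("T2", {"id": "T2", "book": "FX-3"})
print(get("T1"))                             # touches 1 node
print(query(lambda t: t["book"] == "FX-3"))  # touches all 4 nodes
```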
Weak consistency forces the problem onto the developer
Particularly bad for banks!
Scaling two-phase commit is hard to do efficiently (sketched below)
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers & writers
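To see where the synchronisation cost comes from, here is a bare-bones 2PC coordinator over in-memory participants (no failure handling; purely illustrative): nobody commits until everybody has voted, so a single straggler stalls the whole group.

```python
class Participant:
    def __init__(self, name):
        self.name, self.staged = name, None
    def prepare(self, value):  # phase 1: stage the write and vote
        self.staged = value
        return True            # vote "yes" (a real participant could abort)
    def commit(self):          # phase 2: apply the staged write
        print(f"{self.name} committed {self.staged}")

def two_phase_commit(participants, value):
    # Phase 1: everyone must prepare before anyone may commit.
    if not all(p.prepare(value) for p in participants):
        return False  # any "no" vote aborts the whole transaction
    # Phase 2: only now can participants commit.
    for p in participants:
        p.commit()
    return True

two_phase_commit([Participant("node-A"), Participant("node-B")], {"trade": "T1"})
```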
Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp-based strong consistency – e.g. Granola
• Optimistic concurrency control (sketched below)
  – Leverage short-running transactions (avoid cross-network transactions)
  – Tolerate different temporal viewpoints to reduce synchronisation costs
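A minimal optimistic-concurrency sketch (the record shape and versioning scheme are invented): writers work without holding locks, then validate with a compare-and-swap on a version counter at commit time.

```python
# A record carries a version; no locks are held while working.
store = {"T1": {"version": 1, "notional": 1_000_000}}

def commit(key, expected_version, new_fields):
    """Compare-and-swap: succeed only if nobody committed in between."""
    record = store[key]
    if record["version"] != expected_version:
        return False  # conflict: caller re-reads and retries
    record.update(new_fields, version=expected_version + 1)
    return True

snapshot = dict(store["T1"])  # optimistic read, no lock taken
print(commit("T1", snapshot["version"], {"notional": 2_000_000}))  # True
print(commit("T1", snapshot["version"], {"notional": 3_000_000}))  # False: stale
```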
Immutable Data
• Safety
• ‘As was’ view (sketched below)
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
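A sketch of the ‘as was’ view over an append-only store (timestamps and record shape invented): updates append new versions rather than overwriting, so any historical state can be re-read.

```python
import bisect

# Append-only: every write adds a (timestamp, value) pair; nothing mutates.
history = {"T1": [(1, {"notional": 1_000_000}),
                  (5, {"notional": 2_000_000})]}

def as_of(key, t):
    """Return the value as it was at time t (the 'as was' view)."""
    versions = history[key]
    i = bisect.bisect_right([ts for ts, _ in versions], t)
    return versions[i - 1][1] if i else None

print(as_of("T1", 3))  # {'notional': 1000000} - the historical view
print(as_of("T1", 9))  # {'notional': 2000000} - the latest view
```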
Use joins to avoid ‘over-aggregating’
Joins are OK, so long as they are:
– Local
– Via a unique key
(e.g. a Trade joined to its Party and Trader – sketched below)
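A sketch of the rule using the slide’s Trade, Party and Trader entities (field names invented): both sides of each join are co-located on one shard and keyed uniquely, so the join never crosses the network.

```python
# All three tables for one trade live on the same (simulated) shard,
# so the join is a pair of local hash lookups on unique keys.
shard = {
    "trades":  {"T1": {"id": "T1", "party_id": "P9", "trader_id": "TR2"}},
    "parties": {"P9": {"id": "P9", "name": "ACME Corp"}},
    "traders": {"TR2": {"id": "TR2", "name": "J. Smith"}},
}

def local_join(trade_id):
    trade = shard["trades"][trade_id]
    return {**trade,
            "party":  shard["parties"][trade["party_id"]],   # unique key
            "trader": shard["traders"][trade["trader_id"]]}  # unique key

print(local_join("T1"))
```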
Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally a good idea if you can afford the RAM)
• Disk resident (the best general-purpose solution, and for very large datasets)
Balance flexibility and complexity
Supple at the front, more rigid at the back
Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record
• Differentiate between sourced data (out of your control) and generated data (in your control)
• Use automated replication (for isolation) as well as sharding (for scale)
• Leverage asynchronicity to reduce transaction overheads
Consolidation means more trust, fewer impedance mismatches, and managing tighter couplings
Target architectures are starting to look more like large applications built from cloud-enabled services than heterogeneous application conglomerates
Are we going back to the mainframe?
Thanks
http://www.benstopford.com