Jul 16, 2015
Email [email protected]
Twitter @PolymathicCoder
Finding Your Way In the Midst of the NoSQL Haze
Abdelmonaim Remani
JAX London 2014 London, UK
October 14, 2014
The Database
Any system whose primary purpose is to store data in a certain way so that it can be read back at a later time
The Story of Data
The Invention of Paper
Gutenberg’s Printing Press
Turing Machine
The Flat File Model
The Invention of Paper
The Flat File Model
A single binary or text file that contains data organized in a tabular format.
Each line corresponds to a record represented as a series of values formatted as fixed-length or variable-length fields. The latter requires one or more extra characters to function as the separator or delimiter.
A header record specifies the field names associated with the values given their positions.
The most primitive form of database
Data storage and retrieval
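A minimal sketch of the header-record idea, using Python's standard csv module. The file contents and field names are hypothetical; the point is only that the header record is what gives each positional value a name.

```python
import csv
import io

# Hypothetical flat file: a header record followed by one record per line.
FLAT_FILE = """id,name,city
1,Ada,London
2,Alan,Manchester
"""

def read_records(text):
    """Parse delimited text, using the header record to name each field."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(FLAT_FILE)
print(records[1]["city"])
```

Without the header record, the same values could only be addressed by position (column 0, 1, 2).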
The Flat File Model
Comma-Separated Values (CSV)
(One of the most popular flat file formats)
• Nothing describes a record beyond what is implied by the table structure of the file (a header record is the closest thing to metadata)
• No way to model associations between records or fields
eXtensible Markup Language (XML)
We needed a data schema and a way to represent more structured data
• Data schema can be expressed in DTD or XML Schema
• Associations are modeled by nesting tags or embedding URIs of other files
The cost is XML parsing/building overhead
A Double-Edged Sword
Simple
• Easy to share across the network (using ubiquitous protocols like FTP)
• Data access is deferred to the underlying OS
• Straightforward to manipulate (a file I/O API is part of the core library of almost every programming language)
• Easy to secure (file system permissions of the host OS)
But…
• Not always available: when a file is open for reading or writing it is locked, so there is no concurrent access
• Sequential: operations involve loading the entire dataset into memory (sorting, filtering, aggregations, etc…)
Relational Databases
Gutenberg’s Printing Press
Relational Databases
A system that manages data beyond basic storage and retrieval
A system that is always available, providing concurrent access to data at all times, while preserving data integrity and consistency
A solution that is generic enough to be applicable to any business case, yet accommodating enough to fulfill its very specific needs
The De Facto Standard
• Wide adoption and familiarity
• A plethora of tools and mature RDBMS solutions
The Relational Model
First introduced in 1969 by Edgar F. Codd in “Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks”
Data is stored according to a schema of two-dimensional tables (relations), each a collection of rows (tuples) of values assigned to columns (attributes). Data integrity and referential integrity are guaranteed within the schema by a set of explicitly defined constraints
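A small sketch of explicitly defined constraints, using Python's built-in sqlite3 module as a stand-in for an RDBMS. The authors/books tables are hypothetical and chosen only for illustration; the foreign key is the referential-integrity constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES authors(id))"
)
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Codd')")
conn.execute("INSERT INTO books (title, author_id) VALUES ('Relations', 1)")

def try_insert_orphan():
    """Try to reference an author that does not exist; True means rejected."""
    try:
        conn.execute("INSERT INTO books (title, author_id) VALUES ('Orphan', 99)")
    except sqlite3.IntegrityError:
        return True
    return False

rejected = try_insert_orphan()
book_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

The database itself refuses the row that would break referential integrity; no application code has to check for it.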
Normalization, SQL, and ACID Transactions
A new data model
Relational Concepts
The Relational Model
Normalization
Adhering to a set of recommendations while designing or evolving a database schema, with the goal of guaranteeing consistency by preventing addition, deletion, and modification anomalies
Normal Form
A schema is said to be normalized, or in normal form, when it satisfies all recommendations
There are many normal forms. The most notable is Boyce-Codd Normal Form (BCNF or 3.5 NF)
Most normal forms favor smaller tables with well-defined constraints and discourage redundancy
The Relational Model
SQL (Structured Query Language)
A very flexible and somewhat standardized language for managing data in relational databases
DDL (Data Definition Language)
• To define the database schema
DML (Data Manipulation Language)
• Basic CRUD on one table
• Operations across multiple tables (JOIN, etc…)
The Relational Model
ACID Transactions
A transaction is a logical unit of work that consists of multiple operations
When processing a transaction, the relational model guarantees all ACID properties
• Atomicity: all operations occur or none occurs
• Consistency: the database will always transition from one valid state to another valid state
• Isolation: concurrent operations will never force the database into an invalid state
• Durability: the data is permanently persisted once the transaction is committed
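To make atomicity and consistency concrete, here is a hedged sketch using Python's sqlite3 module. The accounts table, the names, and the CHECK constraint are assumptions made for the example. The transfer credits one account before debiting the other, so a failed debit must undo the credit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(amount):
    """Move money from alice to bob as a single unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    """Return the current balances as a dict of name to balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(500) fails the CHECK constraint on the debit, and the earlier credit to bob is rolled back with it: all occur or none occurs.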
The Relational Model
Designed under an important assumption
The user will directly interact with the RDBMS through a terminal console
Implication 1
Usability as a design goal (it must provide a reasonable and human-friendly interface)
• The system enforces data integrity and guarantees ACIDity
• SQL is English-like
Implication 2
The data access patterns are unknown and virtually infinite
• Data is structured in a way that is not biased towards any particular access pattern
• SQL is flexible enough to express virtually any query
Assuming the Logical Data Model below
Conversations (1) to (*) Messages (1) to (*) Medias
Direction A: Conversations to Messages to Medias
Direction B: Medias to Messages to Conversations
For a “Conversation n”: 🔑 CVn Unique Identifier / ID, 📜 CVn Associated Data
For a “Message n”: 🔑 MGn Unique Identifier / ID, 📜 MGn Associated Data
For a “Media n”: 🔑 MDn Unique Identifier / ID, 📜 MDn Associated Data
With the Following Access Patterns
Conversations
• Write 📜 CVn
• Read 📜 CVn by 🔑 CVn
• Read (Direction A): none
• Read (Direction B): 📜 CVm by 🔑 MGn, 📜 CVm by 🔑 MDn
Messages
• Write 📜 MGn
• Read 📜 MGn by 🔑 MGn
• Read (Direction A): 📜 MG* by 🔑 CVn
• Read (Direction B): 📜 MGm by 🔑 MDn
Medias
• Write 📜 MDn
• Read 📜 MDn by 🔑 MDn
• Read (Direction A): 📜 MD* by 🔑 MGn, 📜 MD* by 🔑 CVn
• Read (Direction B): none
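The Relational Physical Data Model Would Be
Conversations
• 🔑 CV1 / 📜 CV1
• 🔑 CV2 / 📜 CV2
Messages
• 🔑 MG1 / 📜 MG1
• 🔑 MG2 / 📜 MG2
• 🔑 MG3 / 📜 MG3
Medias
• 🔑 MD1 / 📜 MD1
• 🔑 MD2 / 📜 MD2
• 🔑 MD3 / 📜 MD3
• 🔑 MD4 / 📜 MD4
Conversation to Message associations
• 🔑 CV1 - 🔑 MG1
• 🔑 CV1 - 🔑 MG2
• 🔑 CV2 - 🔑 MG3
Message to Media associations
• 🔑 MG1 - 🔑 MD1
• 🔑 MG1 - 🔑 MD2
• 🔑 MG2 - 🔑 MD3
• 🔑 MG3 - 🔑 MD4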
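The relational layout above can be sketched with Python's sqlite3 module. Table and column names are assumptions made for the example; the rows mirror the keys on the slide. Each entity is stored exactly once, and both read directions are served by JOINs over the association tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversations (id TEXT PRIMARY KEY, data TEXT);
CREATE TABLE messages (id TEXT PRIMARY KEY, data TEXT);
CREATE TABLE medias (id TEXT PRIMARY KEY, data TEXT);
CREATE TABLE conversation_messages (
    cv_id TEXT REFERENCES conversations(id),
    mg_id TEXT REFERENCES messages(id));
CREATE TABLE message_medias (
    mg_id TEXT REFERENCES messages(id),
    md_id TEXT REFERENCES medias(id));
""")
conn.executemany("INSERT INTO conversations VALUES (?, ?)",
                 [(f"CV{i}", f"CV{i} data") for i in (1, 2)])
conn.executemany("INSERT INTO messages VALUES (?, ?)",
                 [(f"MG{i}", f"MG{i} data") for i in (1, 2, 3)])
conn.executemany("INSERT INTO medias VALUES (?, ?)",
                 [(f"MD{i}", f"MD{i} data") for i in (1, 2, 3, 4)])
conn.executemany("INSERT INTO conversation_messages VALUES (?, ?)",
                 [("CV1", "MG1"), ("CV1", "MG2"), ("CV2", "MG3")])
conn.executemany("INSERT INTO message_medias VALUES (?, ?)",
                 [("MG1", "MD1"), ("MG1", "MD2"), ("MG2", "MD3"), ("MG3", "MD4")])

def medias_of_conversation(cv_id):
    """Direction A: read MD* by CVn."""
    rows = conn.execute(
        "SELECT md.id FROM conversation_messages cm"
        " JOIN message_medias mm ON mm.mg_id = cm.mg_id"
        " JOIN medias md ON md.id = mm.md_id"
        " WHERE cm.cv_id = ? ORDER BY md.id", (cv_id,))
    return [row[0] for row in rows]

def conversation_of_media(md_id):
    """Direction B: read CVm by MDn."""
    row = conn.execute(
        "SELECT cm.cv_id FROM message_medias mm"
        " JOIN conversation_messages cm ON cm.mg_id = mm.mg_id"
        " WHERE mm.md_id = ?", (md_id,)).fetchone()
    return row[0] if row else None
```

Neither query was anticipated by the schema; the same five tables answer both.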
The Relational Model
The gist of it!
Redundancy is the root of all evil!
Database Darwinism
Survival of the fittest
RDBMS vs. The World Fight!
Let's get ready to rumble….
Michael Buffer
Users demand more elaborate user interfaces
Software evolved beyond data management to data processing
Relational databases evolved
More than a datastore
• Embedding business logic through stored procedures and triggers
• Supporting user management and security
• Etc…
Some morphed into full-blown application platforms, embedding and providing generic extensions of SQL and runtimes
(Oracle introduced PL/SQL and shipped with an implementation of the JVM)
RDBMS vs. The World 1 - 0
Software got more complex
As the complexity of data processing increased and the GUI (Graphical User Interface) became the norm, relational databases struggled to deliver adequate performance
No Silver Bullet
The database is designed to manage data, not to process it
Deploying code on the database
• Performance issues worsened
• Brittle deployment
• Data corruption risk increased
• Security concerns
Fueling Developer-DBA wars
An architecture where the UI and business logic are offloaded to external applications built on top of the database
RDBMS vs. The World 1 - 1
Software got even more complex
Persisting data in relational databases became difficult
O/R Impedance Mismatch
OOP (Object-Oriented Programming): interaction of hierarchical object structures, each encapsulating its own data and behavior
Mismatch!
• A single object does not necessarily map to a single row and vice versa
• OOP concepts like polymorphism and inheritance simply do not exist in the relational model
We dealt with it
• “The Third Manifesto” by Christopher J. Date
• Design patterns (Active Record)
• The development of ORM frameworks like Hibernate, etc…
RDBMS vs. The World 1 - 1
Hoarding data & asking questions…
Higher-throughput applications with bigger datasets and more complex queries
The increasing data volume adversely impacted the performance of relational databases
At the level of the Schema
Database tuning and optimization
• Tuning vendor-specific parameters
• Secondary indexes
• Optimizing queries
At the level of the RDBMS as a whole
Scaling Up/Vertically
• Buy the beefiest machine you can afford
• Expensive and certainly not sustainable
RDBMS vs. The World 2 - 1
Hoarding more data & asking more questions…
More intrusive measures
At the level of the Schema
• Periodically refreshed materialized views
• De-normalizing the schema
At the level of the RDBMS as a whole
Scaling Out/Horizontally: running on a cluster
Never designed to be that way
• Master/Slave Model
• Data Sharding
The Master/Slave Model
Writes are handled by a single node (the master)
Reads are handled by the rest of the nodes (the slaves)
The master propagates updates to slaves
• Improves reads only
• The dataset must fit in one machine
• Risks dirty reads from out-of-sync nodes
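A toy in-memory sketch of the master/slave model and of the dirty-read risk. The class and method names are hypothetical, and replication is triggered by hand so that the out-of-sync window is visible; real systems propagate updates in the background.

```python
class Node:
    """A single machine holding a full copy of the dataset."""
    def __init__(self):
        self.data = {}

class MasterSlaveCluster:
    """One master takes all writes; slaves serve reads in round-robin."""
    def __init__(self, slave_count):
        self.master = Node()
        self.slaves = [Node() for _ in range(slave_count)]
        self.pending = []   # updates the master has not propagated yet
        self._next = 0      # round-robin cursor over the slaves

    def write(self, key, value):
        self.master.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.data.get(key)

    def replicate(self):
        """Propagate all pending updates from the master to every slave."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()
```

A read issued between write() and replicate() returns the old value (or nothing), which is the dirty read the slide warns about.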
Data Sharding
Dividing up the dataset according to some criteria, called the shard key, into subsets, called partitions. A partition of data must be small enough to fit in a single node, and a good shard key is one that distributes data evenly across all partitions.
• Improves both reads and writes
• The dataset does not have to fit in one machine
• Applications must be aware of the sharding strategy
• Re-sharding sucks (possibly having to re-shuffle data)
• You can’t JOIN across partitions or enforce referential integrity
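One common way to pick a partition is to hash the shard key. The sketch below assumes that strategy (the slides do not prescribe one) and uses hypothetical names; partitions are plain dicts standing in for nodes.

```python
import hashlib

def shard_for(shard_key, partition_count):
    """Map a shard key to a partition index using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

class ShardedStore:
    """Routes every read and write to the partition that owns the key."""
    def __init__(self, partition_count):
        self.partitions = [dict() for _ in range(partition_count)]

    def put(self, key, value):
        index = shard_for(key, len(self.partitions))
        self.partitions[index][key] = value

    def get(self, key):
        index = shard_for(key, len(self.partitions))
        return self.partitions[index].get(key)
```

Because the partition index depends on the partition count, changing the number of partitions changes where most keys live, which is why re-sharding means re-shuffling data.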
Hoarding more data & asking more questions…
RDBMS vs. The World
What happened to “Redundancy is the root of all evil” !??
Time-out!
RDBMS vs. The World Forfeit
The World Wins!
The CAP Theorem
Eric Brewer on Distributed Systems
Pick two out of the three: Consistency, Availability, and Partition Tolerance
Think “The Iron Triangle” (The Project Management Triangle)
There is no good, cheap, and fast service
The CAP Theorem
The Relational Model?
It is a CA system
It favors Consistency and Availability
It can never be Partition Tolerant
It makes sense… it was designed to run on one machine!
It is not like we have a choice… Are there any successful distributed systems that are partition tolerant?
An AP System
DNS (Domain Name System)
Not all the nodes have the most up-to-date record set
You register a domain name and wait for some time for the rest of the DNS servers on the internet to be synced up eventually
Did we give up consistency altogether?
We settled for a lesser degree of consistency
Eventual Consistency as opposed to Immediate Consistency
BASE
• Basically Available
• Soft State
• Eventual Consistency
An AP Datastore
• Mohammed in Morocco changed his relationship status to single on a nearby edge node
• His cousin in Spain saw the status change immediately because they happened to get the data from the same node
• His secret admirer Sara, who lives across the Atlantic in the United States, could not see it until an hour later
• His brother in Japan got the update the next day
They all got it eventually!
Is that even possible?
Welcome to the curious case of NoSQL datastores!
NoSQL
Turing Machine
A wide range of specialized datastores with the goal of addressing the challenges of the relational model
“The whole point of seeking alternatives is that you need to solve a problem that relational
databases are a bad fit for” -Eric Evans
NoSQL
NoSQL does not mean anti-SQL or anti-relational
It is simply any datastore that is not relational
Since we are willing to drop consistency, why not…
It’s a slippery slope…
• Logical Schema? Well-defined and rigid in relational. Why not go commando?
• Physical Schema? B-Trees in relational. Why not use another data structure?
• Integrity Constraints? Who cares!
• A query language? That can wait!
• Security & User Management? Forget it!
SQL vs. NoSQL
SQL | NoSQL
Designed to run on a single machine | Designed to run on a cluster
CA | AP/CA/CP
Scales vertically | Scales horizontally
Full indexes | Indexes on keys mostly
Rigid schema | Flexible or no schema
Any queries | Pre-defined queries
It’s the wild west… There are many outliers and hybrid datastores!
The NoSQL Zoo
A wide variety!
A wide range of specialized datastores with the goal of addressing the challenges of the relational model
• Key-Value Datastores
• Columnar Datastores
• Document Datastores
• Graph Datastores
Document Datastores
Documents are nested structures of hashes and their values
Biggest Concern
• Complex JOINs across documents
• Querying against the root of aggregates
Biggest Advantage
• Very flexible schema
• Good queryability
• No impedance mismatch
• Very good leverage of Map/Reduce
• Great JSON support
Most Popular Solutions
• MongoDB
• CouchDB
The Document Physical Data Model Would Be
🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
        🔑 MD1 / 📜 MD1
        🔑 MD2 / 📜 MD2
    🔑 MG2 / 📜 MG2
        🔑 MD3 / 📜 MD3
🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
        🔑 MD4 / 📜 MD4
The Document Physical Data Model Would Be
Documents rooted at conversations
🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
        🔑 MD1 / 📜 MD1
        🔑 MD2 / 📜 MD2
    🔑 MG2 / 📜 MG2
        🔑 MD3 / 📜 MD3
🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
        🔑 MD4 / 📜 MD4
Documents rooted at messages
🔑 MG1 / 📜 MG1
    🔑 CV1 / 📜 CV1
    🔑 MD1 / 📜 MD1
    🔑 MD2 / 📜 MD2
🔑 MG2 / 📜 MG2
    🔑 CV1 / 📜 CV1
    🔑 MD3 / 📜 MD3
🔑 MG3 / 📜 MG3
    🔑 CV2 / 📜 CV2
    🔑 MD4 / 📜 MD4
Documents rooted at medias
🔑 MD1 / 📜 MD1
    🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
🔑 MD2 / 📜 MD2
    🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
🔑 MD3 / 📜 MD3
    🔑 CV1 / 📜 CV1
    🔑 MG2 / 📜 MG2
🔑 MD4 / 📜 MD4
    🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
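A sketch of the document approach with plain Python dicts standing in for JSON documents. The field names (_id, data, messages, medias) are assumptions. The conversation document serves Direction A by nesting; the media documents are redundant copies derived from it to serve Direction B without a JOIN.

```python
# Conversation-rooted document: messages and medias are embedded.
conversation_doc = {
    "_id": "CV1", "data": "CV1 data",
    "messages": [
        {"_id": "MG1", "data": "MG1 data",
         "medias": [{"_id": "MD1", "data": "MD1 data"},
                    {"_id": "MD2", "data": "MD2 data"}]},
        {"_id": "MG2", "data": "MG2 data",
         "medias": [{"_id": "MD3", "data": "MD3 data"}]},
    ],
}

def build_media_docs(conversation):
    """Derive one redundant document per media, embedding its parents."""
    docs = {}
    for message in conversation["messages"]:
        for media in message["medias"]:
            docs[media["_id"]] = {
                "_id": media["_id"],
                "data": media["data"],
                "message": {"_id": message["_id"], "data": message["data"]},
                "conversation": {"_id": conversation["_id"],
                                 "data": conversation["data"]},
            }
    return docs

media_docs = build_media_docs(conversation_doc)
```

Every read becomes a single lookup by key, at the price of storing the same data in several documents and keeping the copies in step on every write.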
Key-Value Datastores
A big distributed hash map or associative array
Biggest Concern
• Querying by anything other than the key (no secondary indexes, mostly)
Biggest Advantage
• A simple data model
• Very fast reads and writes
• Highly scalable
Most Popular Solutions
• Amazon DynamoDB
• Riak
• Redis
The Key-Value Physical Data Model Would Be
🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1
🔑 MG2 📜 MG2
🔑 MG3 📜 MG3
🔑 MD1 📜 MD1
🔑 MD2 📜 MD2
🔑 MD3 📜 MD3
🔑 MD4 📜 MD4
The Key-Value Physical Data Model Would Be
🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1
🔑 MG2 📜 MG2
🔑 MG3 📜 MG3
🔑 CV1#MG1 📜 MG1
🔑 CV1#MG2 📜 MG2
🔑 CV2#MG3 📜 MG3
🔑 MD1 📜 MD1
🔑 MD2 📜 MD2
🔑 MD3 📜 MD3
🔑 MD4 📜 MD4
🔑 MG1#MD1 📜 MD1
🔑 MG1#MD2 📜 MD2
🔑 MG2#MD3 📜 MD3
🔑 MG3#MD4 📜 MD4
🔑 CV1#MD1 📜 MD1
🔑 CV1#MD2 📜 MD2
🔑 CV1#MD3 📜 MD3
🔑 CV2#MD4 📜 MD4
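The composite keys above can be sketched with a plain dict standing in for the distributed hash map. Function names are hypothetical. The prefix lookup here scans every key, which is only acceptable in a toy; it assumes the real datastore offers ordered keys or some other way to enumerate keys by prefix.

```python
store = {}  # stand-in for a distributed hash map

def put_message(cv_id, mg_id, data):
    """Write the message under its own key and under a composite key."""
    store[mg_id] = data                  # serves: read MGn by MGn
    store[f"{cv_id}#{mg_id}"] = data     # redundant copy: read MG* by CVn

def messages_of(cv_id):
    """Direction A: collect every value whose key starts with 'CVn#MG'."""
    prefix = f"{cv_id}#MG"
    return sorted(value for key, value in store.items() if key.startswith(prefix))

put_message("CV1", "MG1", "MG1 data")
put_message("CV1", "MG2", "MG2 data")
put_message("CV2", "MG3", "MG3 data")
```

The access pattern is baked into the key, so each new way of reading the data means another redundant entry.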
Columnar Datastores
A table where data of the same column is stored together
Biggest Concern
• Key design is not trivial (need to know your access patterns beforehand)
Biggest Advantage
• Great for sparse data
• Very fast column operations (e.g. aggregation)
• Support for versioning and data compression
Most Popular Solutions
• Google Bigtable
• HBase
• Cassandra
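A toy in-memory illustration of "data of the same column is stored together". It is not how Bigtable, HBase, or Cassandra lay out data on disk, and the records and column names are made up. Each column maps row keys to values, so absent values cost nothing and an aggregation touches a single column.

```python
# Hypothetical sparse records: not every row has every field.
rows = [
    {"id": "MD1", "size_kb": 120, "kind": "image"},
    {"id": "MD2", "size_kb": 80},
    {"id": "MD3", "kind": "video", "duration_s": 30},
]

def to_columns(records):
    """Pivot row records into one {row_key: value} mapping per column."""
    columns = {}
    for record in records:
        row_key = record["id"]
        for name, value in record.items():
            if name != "id":
                columns.setdefault(name, {})[row_key] = value
    return columns

columns = to_columns(rows)
# A column operation reads only the column it needs.
total_size_kb = sum(columns["size_kb"].values())
```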
The Columnar Physical Data Model Would Be
🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1
🔑 MG2 📜 MG2
🔑 MG3 📜 MG3
🔑 CV1#MG1 📜 MG1
🔑 CV1#MG2 📜 MG2
🔑 CV2#MG3 📜 MG3
🔑 MD1 📜 MD1
🔑 MD2 📜 MD2
🔑 MD3 📜 MD3
🔑 MD4 📜 MD4
🔑 MG1#MD1 📜 MD1
🔑 MG1#MD2 📜 MD2
🔑 MG2#MD3 📜 MD3
🔑 MG3#MD4 📜 MD4
🔑 CV1#MD1 📜 MD1
🔑 CV1#MD2 📜 MD2
🔑 CV1#MD3 📜 MD3
🔑 CV2#MD4 📜 MD4
Graph Datastores
A graph data structure
Biggest Concern
• Does NOT scale horizontally
Biggest Advantage
• Perfect for interconnected data
• Allows for modeling explicit relationships
• Fine-grained graph traversal
• Supports ACID transactions
Most Popular Solutions
• Neo4j
The Graph Physical Data Model Would Be
Nodes
• 🔑 CV1 / 📜 CV1
• 🔑 CV2 / 📜 CV2
• 🔑 MG1 / 📜 MG1
• 🔑 MG2 / 📜 MG2
• 🔑 MG3 / 📜 MG3
• 🔑 MD1 / 📜 MD1
• 🔑 MD2 / 📜 MD2
• 🔑 MD3 / 📜 MD3
• 🔑 MD4 / 📜 MD4
The Graph Physical Data Model Would Be
Nodes: Conversation 1, Conversation 2, Message 1, Message 2, Message 3, Media 1, Media 2, Media 3, Media 4
Relationships: Has (7), Reply To (1), Belongs In (2)
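A small sketch of graph traversal over explicit relationships, using an adjacency map in plain Python rather than a graph database. Only the "Has" edges are modeled, with endpoints taken from the association pairs of the relational model; the function name and the id-prefix filter are assumptions.

```python
from collections import deque

# "Has" edges: conversation has messages, message has medias.
HAS = {
    "CV1": ["MG1", "MG2"],
    "CV2": ["MG3"],
    "MG1": ["MD1", "MD2"],
    "MG2": ["MD3"],
    "MG3": ["MD4"],
}

def descendants(start, id_prefix):
    """Breadth-first walk over "Has" edges, keeping ids with the given prefix."""
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in HAS.get(node, []):
            if child.startswith(id_prefix):
                found.append(child)
            queue.append(child)
    return found
```

Reading MD* by CVn is a two-hop walk along stored edges; no association table is joined and no redundant copy is kept.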
Implications
Concurrent Joins & Distributed Transactions in Code…
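One reading of "joins in code" is that the application fetches related records itself, in parallel, and stitches them together. The sketch below assumes that reading; the in-memory dicts stand in for remote datastores and every name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for two remote lookups the datastore cannot JOIN for us.
MESSAGES_BY_CONVERSATION = {"CV1": ["MG1", "MG2"], "CV2": ["MG3"]}
MESSAGES = {"MG1": "MG1 data", "MG2": "MG2 data", "MG3": "MG3 data"}

def fetch_message(mg_id):
    """Stand-in for one network round trip to the datastore."""
    return mg_id, MESSAGES[mg_id]

def join_conversation(cv_id):
    """Application-side join: look up the ids, then fetch them concurrently."""
    ids = MESSAGES_BY_CONVERSATION.get(cv_id, [])
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(fetch_message, ids))
```

The work a relational database does inside one JOIN now lives in application code, along with its failure handling.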
How to choose?
• Use a relational database unless you are expecting a lot of data
• Know your data: schema, density, the 3 Vs (Volume, Velocity, and Variety), etc…
• Know your access patterns: read/write ratio, frequency, the likelihood of them changing, etc…
• Know the associated development effort and cost
• Know your administration effort and cost
• Consider using Big Data technologies if needed (HDFS, Hadoop, Pig, Hive, etc…)
Go Polyglot
The idea that one data model will perfectly fit the complexity of the data and accommodate the variety of all the access patterns of a sufficiently elaborate application is absurd
Polyglot Persistence?
Leveraging multiple datastores based on the specific way the data is structured and the way it is accessed
Don’t go overboard!
• The learning curve can be very steep
• The dev effort can be significant
To be fair…
You can’t expect…
A flathead screwdriver to work on a Phillips screw as well as one with the matching Phillips blade
Relational Databases are so awesome
they deserve the title of “The Honey Badger of Datastores”