Jul 28, 2015
© 2014 MapR Technologies
HBase and Drill
How loosely typed SQL is ideal for NoSQL
Ted Dunning
June 10, 2015
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & Mahout
VP of Incubator at the Apache Software Foundation
Email [email protected] [email protected]
Twitter @ted_dunning
Hashtag today: #BDE2015
Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions
What Does Good Mean (for a DB)?
• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware
• Introspectable
  – Must be able to inspect the data and schema and gain understanding
© 2014 MapR Technologies 6
What is New Here
• Introspection is better when
  – A minimum of data entities are used to describe our model
  – No name overflow
  – Referential scoping helps narrow our focus to a simpler problem
  – Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either
© 2014 MapR Technologies 7
Older than Dirt
• Relational theory is old (1970)
  – Predates data structures
  – Predates mainstream recursive procedures
  – Predates lexical scoping
  – Predates logic programming
  – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection
© 2014 MapR Technologies 8
Contrast Relational and HBase-Style NoSQL
Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions
© 2014 MapR Technologies 9
Contrast Relational and HBase with Structuring
Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring
• Rows contain fields
• Fields contain primitive types
  – Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key
© 2014 MapR Technologies 10
Turtle Models for Databases
• Allows complex objects in field values
  – JSON-style lists and objects
• Allows references to objects via join
  – Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables, so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables
© 2014 MapR Technologies 11
Proviso and Warning
• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms
© 2014 MapR Technologies 13
Tables as Objects, Objects as Tables
Row-wise form: list of objects
Column-wise form: object containing lists
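The two forms carry the same information, which a short Python sketch can make concrete. The function names and the sample fields `t` and `v` are illustrative assumptions, not part of the talk, and the sketch assumes every column has the same length.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of objects (row-wise) into an object containing lists (column-wise)."""
    names: list[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # A field missing from a row becomes None so every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn an object containing lists back into a list of objects.

    Assumes all columns have the same length.
    """
    length = len(next(iter(columns.values()), []))
    return [{name: values[i] for name, values in columns.items()} for i in range(length)]


# Hypothetical sample data: two rows with a time field and a value field.
rows = [{"t": 0, "v": 1.5}, {"t": 1, "v": 1.7}]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

When no fields are missing, converting in one direction and back returns the original value, which is the sense in which the two forms are interchangeable.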
© 2014 MapR Technologies 14
Micro Columnar Formats
An entire table stored in columnar form can be a first-class value using these techniques.
This is very powerful for in-lining one-to-many relations.
© 2014 MapR Technologies 15
Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
• Thus, embedded first-class objects imply late discovery of schema information
© 2014 MapR Technologies 17
Column names as data
• When column names are not pre-defined, they can convey information
• Examples
  – Time offsets within a window for time series
  – Top-level domains for web crawlers
  – Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
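As a rough illustration of the time-series case, the sketch below models one row whose column names are offsets in seconds within a window. The row key, the `columns` field and the `samples` helper are hypothetical names chosen for this example, not an HBase or MapR DB API.

```python
# Hypothetical row for one sensor and one time window.
# Each column name is a time offset (seconds) within the window, so the
# names themselves carry data and cannot be declared ahead of time.
row = {
    "row_key": "sensor-17/window-1000",
    "columns": {"0": 20.1, "95": 20.2, "37": 20.4},
}


def samples(row, window_start):
    """Recover (absolute time, value) pairs by parsing the column names."""
    return sorted(
        (window_start + int(offset), value)
        for offset, value in row["columns"].items()
    )


points = samples(row, window_start=1000)
```

The set of column names differs from row to row, which is why a schema fixed in advance cannot describe this layout.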
© 2014 MapR Technologies 20
Table Design: Hybrid Point-by-Point + Sub-table
After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
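A minimal sketch of that restatement step, using plain Python dictionaries in place of a real table. The column family names `raw` and `packed`, the column name `window`, and both function names are assumptions made for illustration only.

```python
def restate_window(point_columns):
    """Pack point-by-point columns (offset -> value) into one columnar value."""
    pairs = sorted((int(name), value) for name, value in point_columns.items())
    return {
        "offset": [offset for offset, _ in pairs],
        "value": [value for _, value in pairs],
    }


def close_window(row):
    """Move a closed window from the point-by-point family to the packed family."""
    packed = dict(row["packed"])
    packed["window"] = restate_window(row["raw"])
    # The point-by-point columns are dropped once the packed value exists.
    return {"raw": {}, "packed": packed}


# Hypothetical row while the window is still open: one column per sample.
open_row = {"raw": {"0": 20.1, "95": 20.2, "37": 20.4}, "packed": {}}
closed_row = close_window(open_row)
```

Writes stay cheap while the window is open because each sample is a single column, and reads of closed windows touch one compact value instead of many columns.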
© 2014 MapR Technologies 21
Compression Results
Samples consist of a 64-bit time and a 16-bit sample value
Sampling is at 10 kHz
Sample time jitter makes it important to keep the original time-stamp
How much overhead is needed to retain the time-stamp?
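The uncompressed cost follows directly from the figures on this slide. The sketch below only does that arithmetic; it does not reproduce the compression results themselves, and the variable names are illustrative.

```python
TIME_BYTES = 64 // 8       # 64-bit time-stamp
VALUE_BYTES = 16 // 8      # 16-bit sample value
SAMPLE_RATE_HZ = 10_000    # 10 kHz sampling

# Uncompressed storage rate for one signal.
raw_bytes_per_second = SAMPLE_RATE_HZ * (TIME_BYTES + VALUE_BYTES)

# Fraction of each uncompressed sample that is time-stamp rather than signal.
timestamp_share = TIME_BYTES / (TIME_BYTES + VALUE_BYTES)
```

Before compression the time-stamp accounts for most of each sample, which is why the overhead question matters.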
© 2014 MapR Technologies 23
MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet
© 2014 MapR Technologies 35
Further Reductions
• All 86 link tables become properties on artists, releases and other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
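To make the reduction concrete, here is a sketch of what one artist might look like once tags, ratings and links are properties of the artist itself. Every field name and value is hypothetical and not taken from the actual MusicBrainz schema.

```python
# Hypothetical nested artist record; field names are assumptions.
artist = {
    "id": "artist-1",
    "name": "Elvis Presley",
    "alias": [{"name": "Elvis Aaron Presley"}, {"name": "The King"}],
    "tags": ["rock and roll", "rockabilly"],                # was a separate tag table
    "ratings": [5, 4, 5],                                   # was a separate rating table
    "links": [{"type": "member of", "target": "band-9"}],   # was a link table
}


def tag_rows(artists):
    """Rebuild the (artist id, tag) rows that a separate tag table would hold."""
    return [(a["id"], tag) for a in artists for tag in a["tags"]]


rows = tag_rows([artist])
```

The relational rows can still be derived on demand, so inlining the lists loses no information.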
© 2014 MapR Technologies 36
Is This Good?
• Expressivity
  – The JSON data model is at least as expressive as the original relational model
    • Many cases are easier to describe in nested data
    • No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)
© 2014 MapR Technologies 37
But How Can We Query This?
• Can’t use SQL
  – SQL is strongly typed
  – SQL is heavily tied into the original relational model
  – SQL generating tools require relational model
• Must use SQL
  – Vast numbers of tools and people understand how to write SQL
  – SQL is the lingua franca of databases
© 2014 MapR Technologies 38
Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
  – Uses standard syntax and semantics
• Drill extends SQL
  – First class treatment of objects, lists
  – Full support for destructuring, flattening
  – Full power of relational model can be applied to complex data
© 2014 MapR Technologies 40
Sample Query
• Find Elvis
  select distinct id, name, alias
  from (
    select id, name, flatten(alias.name) alias
    from artist
  )
  where alias like 'Elvis%Presley'
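What flattening does to a row can be modeled in a few lines of Python. This is a simplified model of the semantics under stated assumptions, not Drill's implementation: the sample rows are made up, `alias` is assumed to be a plain list of strings, and rows with an empty or missing list are assumed to produce no output.

```python
import re


def flatten(rows, column):
    """Emit one output row per element of the list in `column`, copying the other fields."""
    for row in rows:
        for element in row.get(column) or []:
            yield {**row, column: element}


def like(text, pattern):
    """Minimal SQL LIKE: % matches any run of characters, _ matches one character."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


# Made-up sample rows for illustration.
artists = [
    {"id": 1, "name": "Elvis Presley", "alias": ["Elvis Aaron Presley", "The King"]},
    {"id": 2, "name": "Elvis Costello", "alias": ["Declan MacManus"]},
]

matches = [r for r in flatten(artists, "alias") if like(r["alias"], "Elvis%Presley")]
```

After flattening, each alias sits in an ordinary scalar column, so a standard predicate such as LIKE applies to it without any special list syntax.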
© 2014 MapR Technologies 41
Example Query
• Find discs where Elvis was credited
  select distinct album_id, name
  from (
    -- assumes each element of credit is an artist id
    select id album_id, name, flatten(credit) artist_id
    from release
  ) albums
  join (
    select distinct artist_id
    from (
      select id artist_id, flatten(alias)
      from artist
      where name like 'Elvis%Presley'
    )
  ) artists
  using (artist_id)
© 2014 MapR Technologies 42
Summary
• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today
© 2014 MapR Technologies 44
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
© 2014 MapR Technologies 45
Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Free copies at book signing today
© 2014 MapR Technologies 47
Q & A