Page 1: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

© 2014 MapR Technologies

HBase and Drill

How loosely typed SQL is ideal for NoSQL

Ted Dunning

June 10, 2015

Page 2: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Contact Information

Ted Dunning

Chief Applications Architect at MapR Technologies

Committer & PMC member for Apache Drill, ZooKeeper & Mahout

VP of Incubator at the Apache Software Foundation

Email [email protected] [email protected]

Twitter @ted_dunning

Hashtag today: #BDE2015

Page 3: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Agenda

• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions

Page 4: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

What Does Good Mean (for a DB)?

• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware

Page 5: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

What Does Good Mean (for a DB)?

• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware
• Introspectable
  – Must be able to inspect the data and schema and gain understanding

Page 6: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

What is New Here

• Introspection is better when
  – A minimum of data entities are used to describe our model
  – No name overflow
  – Referential scoping helps narrow our focus to a simpler problem
  – Many-to-one relations can be in-lined

• Introspection was not a goal for the design of the relational model

• Introspection was therefore not a result either

Page 7: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Older than Dirt

• Relational theory is old (1970)
  – Predates data structures
  – Predates mainstream recursive procedures
  – Predates lexical scoping
  – Predates logic programming
  – Predates real functional programming (Church, McCarthy, Iverson, and Backus notwithstanding)

• Some updates are in order to enhance introspection

Page 8: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Contrast Relational and HBase-Style NoSQL

Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase / MapR DB
• Rows contain fields
• Fields are bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions

Page 9: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Contrast Relational and HBase with Structuring

Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase + Structuring
• Rows contain fields
• Fields contain primitive types
  – Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key

Page 10: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Turtle Models for Databases

• Allows complex objects in field values
  – JSON-style lists and objects
• Allows references to objects via join
  – Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables

Page 11: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Proviso and Warning

• This is not your father’s BLOB
• And not the same as arrays with lateral view joins

• Rationale to come as we talk about idioms

Page 12: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

A Catalog of NoSQL Idioms

Page 13: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Tables as Objects, Objects as Tables

Row-wise form

Column-wise form

List of objects

Object containing lists
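
The equivalence between the row-wise form (a list of objects) and the column-wise form (an object containing lists) can be sketched in a few lines of Python. The field names and values are illustrative assumptions, not data from the talk, and the sketch assumes every row has the same keys.

```python
# Hypothetical sample data; field names are illustrative only.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

def rows_to_columns(rows):
    """Turn a list of objects (row-wise) into an object of lists (column-wise).

    Assumes every row has the same keys; ragged rows would need a fill value.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

def columns_to_rows(columns):
    """Turn an object of lists (column-wise) back into a list of objects."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]

columns = rows_to_columns(rows)
```

Because the two functions invert each other, either form can be stored as a field value and converted on read.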

Page 14: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Micro Columnar Formats

An entire table stored in columnar form can be a first-class value using these techniques.

This is very powerful for in-lining one-to-many relations.
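
As a rough illustration of in-lining a one-to-many relation, the following Python sketch embeds each parent's child rows as a small columnar table held in a single field. The table names, field names, and the helper `inline_children` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical parent and child tables in classic relational form.
artists = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
albums = [
    {"artist_id": 1, "title": "First", "year": 1990},
    {"artist_id": 1, "title": "Second", "year": 1992},
    {"artist_id": 2, "title": "Only", "year": 2001},
]

def inline_children(parents, children, parent_key, foreign_key, field):
    """Embed each parent's child rows as a columnar table stored in one field."""
    grouped = defaultdict(lambda: defaultdict(list))
    for child in children:
        for column, value in child.items():
            if column != foreign_key:
                grouped[child[foreign_key]][column].append(value)
    return [
        {**parent, field: dict(grouped.get(parent[parent_key], {}))}
        for parent in parents
    ]

nested = inline_children(artists, albums, "id", "artist_id", "albums")
```

The foreign key disappears from the child data because the nesting itself now records which parent each child belongs to.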

Page 15: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Note

• If embedded tables are first-class, schema becomes data

• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible

• Thus, embedded first-class objects imply late discovery of schema information
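
One way to picture late schema discovery is a scan that learns field paths and types from the records themselves. This is a minimal Python sketch under that assumption; the function `discover_schema`, the path notation, and the sample records are all hypothetical and do not describe how any particular engine does it.

```python
def discover_schema(records):
    """Scan records and collect, per field path, the set of type names seen."""
    schema = {}

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for child in value:
                visit(child, path + "[]")
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        visit(record, "")
    return schema

# Hypothetical ragged records: fields and types differ from row to row.
records = [
    {"id": 1, "name": "A", "tags": ["rock"]},
    {"id": "x2", "aliases": [{"name": "B", "locale": "en"}]},
]
schema = discover_schema(records)
```

Note that the `id` field ends up with two observed types, which a schema fixed ahead of time could not express.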

Page 16: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

A first example: Time-series data

Page 17: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Column names as data

• When column names are not pre-defined, they can convey information

• Examples
  – Time offsets within a window for time series
  – Top-level domains for web crawlers
  – Vendor IDs for customer purchase profiles

• Predefined schema is impossible for this idiom
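
The time-series case of this idiom can be sketched in Python as follows. The one-hour window length, the row-key layout, and the function names are assumptions made for illustration; the talk does not specify them.

```python
WINDOW_SECONDS = 3600  # assumed window length; the talk does not specify one

def to_wide_rows(points, window=WINDOW_SECONDS):
    """Group (series, timestamp, value) points into one row per series and window.

    The column name is the time offset within the window, so the name itself
    carries data and cannot be declared in a schema ahead of time.
    """
    rows = {}
    for series, timestamp, value in points:
        window_start = timestamp - timestamp % window
        row_key = (series, window_start)
        rows.setdefault(row_key, {})[timestamp - window_start] = value
    return rows

def to_points(rows):
    """Recover the points by adding each column name back to the window start."""
    return sorted(
        (series, window_start + offset, value)
        for (series, window_start), columns in rows.items()
        for offset, value in columns.items()
    )

points = [("sensor-1", 7200, 0.5), ("sensor-1", 7205, 0.7), ("sensor-1", 10800, 0.9)]
rows = to_wide_rows(points)
```

Each row holds a different set of column names, which is exactly what a predefined schema cannot accommodate.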

Page 18: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Relational Model for Time-series

Page 19: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Table Design: Point-by-Point

Page 20: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Table Design: Hybrid Point-by-Point + Sub-table

After close of window, data in row is restated as column-oriented tabular value in different column family.
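
A minimal Python sketch of that restatement step, using nested dictionaries to stand in for column families. The family names `points` and `compact`, the column name `table`, and the function `close_window` are hypothetical.

```python
def close_window(row):
    """Restate a closed window's point columns as one columnar value.

    `row` maps column family -> {column name -> value}. The family names
    "points" and "compact" are hypothetical, chosen for this sketch.
    """
    points = row.get("points", {})
    offsets = sorted(points)
    compact = {
        "offsets": offsets,
        "values": [points[offset] for offset in offsets],
    }
    # The point-by-point columns are dropped once the sub-table is written.
    return {"compact": {"table": compact}}

# While the window is open, each sample lands in its own column.
open_row = {"points": {5: 0.7, 0: 0.5, 9: 0.6}}
closed_row = close_window(open_row)
```

Writes stay cheap while the window is open, and reads of closed windows fetch one value instead of many cells.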

Page 21: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Compression Results

Samples are 64-bit time, 16-bit sample

Sample time at 10 kHz

Sample time jitter makes it important to keep original time-stamp

How much overhead to retain time-stamp?
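
To make the question concrete: with a 64-bit time and a 16-bit sample, each raw point takes 8 + 2 = 10 bytes, so the time-stamps account for 80% of the uncompressed size. The Python sketch below shows one simple way to shrink them, storing each time-stamp as a 16-bit difference from the previous one. It uses synthetic jittered data and is not the scheme or the measurement reported in the talk.

```python
import random
import struct

SAMPLE_PERIOD_US = 100  # 10 kHz sampling means one sample every 100 microseconds

def make_timestamps(count, jitter_us=3, seed=0):
    """Generate jittered timestamps in microseconds (synthetic data)."""
    rng = random.Random(seed)
    return [i * SAMPLE_PERIOD_US + rng.randint(-jitter_us, jitter_us) for i in range(count)]

def encode_deltas(timestamps):
    """Store the first timestamp in 64 bits and each later one as a 16-bit delta."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return struct.pack(f"<q{len(deltas)}h", timestamps[0], *deltas)

def decode_deltas(blob):
    """Invert encode_deltas, restoring the exact original timestamps."""
    count = (len(blob) - 8) // 2
    first, *deltas = struct.unpack(f"<q{count}h", blob)
    timestamps = [first]
    for delta in deltas:
        timestamps.append(timestamps[-1] + delta)
    return timestamps

timestamps = make_timestamps(1000)
raw = struct.pack(f"<{len(timestamps)}q", *timestamps)
packed = encode_deltas(timestamps)
```

The original time-stamps, jitter included, are recovered exactly, while the encoded form is roughly a quarter of the raw size before any general-purpose compression is applied.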

Page 22: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

A second example: Music meta-data

Page 23: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

MusicBrainz on NoSQL

• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet

Page 24: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 25: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

180 tables not shown

Page 26: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

236 tables to describe 7 kinds of things

Page 27: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Can we do better?

Page 28: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 29: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 30: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 31: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 32: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 33: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

27 tables reduce to 4

Page 34: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

27 tables reduce to 4 so far

Page 35: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Further Reductions

• All 86 link tables become properties on artists, releases and other entities

• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references

• Current score: 162 tables become 4

• You get the idea
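
The folding of side tables into properties can be sketched in Python as below. All table names, field names, and the helper `fold_into_entities` are hypothetical; they are not the actual MusicBrainz schema.

```python
def fold_into_entities(entities, side_tables):
    """Attach rows from side tables to their owning entity as list properties.

    `side_tables` maps a property name to rows that carry an "entity_id" key.
    All table and field names are hypothetical, not the MusicBrainz schema.
    """
    documents = {entity["id"]: dict(entity) for entity in entities}
    for property_name, rows in side_tables.items():
        for row in rows:
            item = {k: v for k, v in row.items() if k != "entity_id"}
            documents[row["entity_id"]].setdefault(property_name, []).append(item)
    return list(documents.values())

artists = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
side_tables = {
    "tags": [{"entity_id": 1, "tag": "rock"}, {"entity_id": 1, "tag": "blues"}],
    "links": [{"entity_id": 2, "type": "member_of", "target": 1}],
}
documents = fold_into_entities(artists, side_tables)
```

Each side table disappears as a separate entity and survives only as a property name on the documents that actually use it.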

Page 36: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Is This Good?

• Expressivity
  – The JSON data model is at least as expressive as the original relational model
    • Many cases easier to describe in nested data
    • No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)

Page 37: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

But How Can We Query This?

• Can’t use SQL
  – SQL is strongly typed
  – SQL is heavily tied into the original relational model
  – SQL generating tools require relational model
• Must use SQL
  – Vast numbers of tools and people understand how to write SQL
  – SQL is the lingua franca of databases

Page 38: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Squaring the Circle

• Enter Apache Drill

• Drill is SQL compliant
  – Uses standard syntax and semantics
• Drill extends SQL
  – First class treatment of objects, lists
  – Full support for destructuring, flattening
  – Full power of relational model can be applied to complex data

Page 39: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Drill Provides Scalable and Extended SQL

Page 40: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Sample Query

• Find Elvis

    select distinct id, name, alias
    from (
      select id, name, flatten(alias.name) alias
      from artist
    )
    where alias like 'Elvis%Presley'
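
What flattening followed by a LIKE filter does can be mimicked in plain Python. This is only a sketch of the semantics, not Drill code: the helpers `flatten` and `like` are written here for illustration, the records are made up, and aliases are simplified to plain strings rather than objects with a name field.

```python
import re

def flatten(rows, field):
    """Emit one output row per element of the list stored in `field`."""
    for row in rows:
        for element in row.get(field, []):
            yield {**row, field: element}

def like(value, pattern):
    """Match a SQL LIKE pattern where % is any run of characters and _ is one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

# Hypothetical artist records; only the shape matters.
artists = [
    {"id": 1, "name": "Elvis Presley", "alias": ["Elvis Aaron Presley", "The King"]},
    {"id": 2, "name": "Elvis Costello", "alias": ["Declan MacManus"]},
]

matches = [row for row in flatten(artists, "alias") if like(row["alias"], "Elvis%Presley")]
```

After flattening, each alias is an ordinary scalar column, so ordinary relational predicates apply to it.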

Page 41: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Example Query

• Find discs where Elvis was credited

    select distinct album_id, name
    from (
      select id album_id, name, flatten(credit)
      from release
    ) albums
    join (
      select distinct artist_id
      from (
        select id artist_id, flatten(alias)
        from artist
        where name like 'Elvis%Presley'
      )
    ) artists
    using (artist_id)
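
The flatten-then-join pattern in this query can be imitated in Python as a rough sketch. The records are invented, and the sketch assumes each entry in a release's credit list is simply an artist id, which the slide does not state.

```python
def flatten(rows, field):
    """Emit one output row per element of the list stored in `field`."""
    for row in rows:
        for element in row.get(field, []):
            yield {**row, field: element}

# Hypothetical records; assumes each credit entry is an artist id.
releases = [
    {"id": 10, "name": "Disc A", "credit": [1, 3]},
    {"id": 11, "name": "Disc B", "credit": [2]},
    {"id": 12, "name": "Disc C", "credit": [1]},
]
artists = [
    {"id": 1, "name": "Elvis Presley"},
    {"id": 2, "name": "Elvis Costello"},
    {"id": 3, "name": "Scotty Moore"},
]

# Small side of the join: ids of artists whose name matches 'Elvis%Presley'.
wanted = {
    a["id"]
    for a in artists
    if a["name"].startswith("Elvis") and a["name"].endswith("Presley")
}

# Flatten credits so each (release, credited artist) pair is its own row, then join.
credited = sorted(
    {(row["id"], row["name"]) for row in flatten(releases, "credit") if row["credit"] in wanted}
)
```

Flattening turns the nested credit list into join keys, after which the join is an ordinary relational one.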

Page 42: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Summary

• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good

• Apache Drill gives very high performance execution for extended relational problems

• You can try this out today

Page 43: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Page 44: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Short Books by Ted Dunning & Ellen Friedman

• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

Page 45: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)

Free copies at book signing today

Page 46: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Thank You!

Page 47: HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies