© 2014 MapR Technologies #NoSQLNow @apachedrill
Drilling on JSON
Keshav Murthy
August 19th, 2014
[email protected] Twitter: @rkeshavmurthy
Senior Director, Product Management, MapR
NoSQL
We don't need no transaction
We don't need no ACID control
No schema in the tables
No limit to the scale out
DBA, leave them JSON alone
Hey DBA, leave them JSON alone
All in all it's just another data in the BASE
All in all it’s just another shard into cloud.
…With apologies to Roger Waters
NoSQL Landscape
Martin Fowler says: “aggregate-oriented”. What you're most likely to access as a unit.
• Key Value Store: Couchbase, Riak, Citrusleaf, Redis, BerkeleyDB, Membrain, ...
• Document: MongoDB, CouchDB, RavenDB, Couchbase, ...
• Graph: OrientDB, DEX, Neo4j, GraphBase, ...
• Wide Column: HBase, Hypertable, Cassandra, MapR-DB, ...
Data landscape is changing
New types of applications
• Social, mobile, Web, “Internet of Things”, Cloud…
• Iterative/Agile in nature
• More users, more data

New data models & data types
• Flexible Schema/Schema less
• Rapidly changing
• Semi-structured/Nested data
JSON
{
  "data": [
    {
      "id": "X999_Y999",
      "from": { "name": "Tom Brady", "id": "X12" },
      "message": "Looking forward to 2014!",
      "actions": [
        { "name": "Comment", "link": "http://www.facebook.com/X99/posts Y999" },
        { "name": "Like", "link": "http://www.facebook.com/X99/posts Y999" }
      ],
      "type": "status",
      "created_time": "2013-08-02T21:27:44+0000",
      "updated_time": "2013-08-02T21:27:44+0000"
    }
  ]
}
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications

40+ contributors
150+ years of experience building databases and distributed systems
Zero to Results in 2 Minutes (3 Commands)
Install
$ tar xzf apache-drill.tar.gz

Launch shell (embedded mode)
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query
0: jdbc:drill:zk=local> SELECT DISTINCT users.name as name, users.emails.work as email
FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND logs.errorLevel > 5;
+------------+------------+
| name | email |
+------------+------------+
| john | [email protected] |
| jack | [email protected] |
| Ronn | [email protected] |
| Pat | [email protected] |
...
35 rows selected (0.847 seconds)
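To make the query's meaning concrete, here is a minimal Python sketch of the same join, filter, and DISTINCT over two small in-memory lists. The records, the `example.com` addresses, and the function name `error_prone_users` are hypothetical stand-ins for the log files and `profiles.json`; this shows only what the SQL computes, not how Drill executes it.

```python
# Hypothetical stand-ins for dfs.logs.`/data/logs` and dfs.users.`/profiles.json`.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 1, "errorLevel": 9},
    {"uid": 2, "errorLevel": 3},
    {"uid": 3, "errorLevel": 6},
]
users = [
    {"id": 1, "name": "john", "emails": {"work": "john@example.com"}},
    {"id": 2, "name": "pat", "emails": {"work": "pat@example.com"}},
    {"id": 3, "name": "jack", "emails": {"work": "jack@example.com"}},
]


def error_prone_users(logs, users, min_level=5):
    """Join logs to users on uid = id, keep errorLevel > min_level, deduplicate."""
    users_by_id = {user["id"]: user for user in users}
    distinct_rows = set()  # a set gives SELECT DISTINCT semantics
    for entry in logs:
        user = users_by_id.get(entry["uid"])
        if user is not None and entry["errorLevel"] > min_level:
            # users.emails.work is a nested field, reached one level at a time
            work_email = user.get("emails", {}).get("work")
            distinct_rows.add((user["name"], work_email))
    return sorted(distinct_rows, key=lambda row: row[0])


rows = error_prone_users(logs, users)
for name, email in rows:
    print(name, email)
```

User 1 has two qualifying log entries but appears once in the result, which is the effect of DISTINCT in the original query.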
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
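The two records above carry their own field names, so a reader can work out a schema while scanning them. The following Python sketch illustrates that idea under stated assumptions: the records are rewritten as strictly valid JSON (the slide omits quotes), and `discover_schema` and `type_name` are hypothetical helpers, not part of Drill.

```python
import json

# The slide's two records, quoted so that they parse as valid JSON.
records = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]


def type_name(value):
    """Map a parsed JSON value to a coarse type label."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "list"
    return type(value).__name__


def discover_schema(lines):
    """Collect every top-level field and the types seen for it, record by record."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type_name(value))
    return schema


schema = discover_schema(records)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Note that `district` appears only in the first record and `preschool` only in the second; nothing had to be declared in advance for either field to show up in the discovered schema.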
Drill’s Data Model is Flexible
[Figure: data formats (HBase; JSON, BSON; CSV, TSV; Parquet, Avro) placed along two flexibility axes, from fixed schema to schema-less and from flat to complex.]

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
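One way to relate the two table shapes is to address nested fields by dotted paths, in the spirit of `users.emails.work` in the earlier query. The Python sketch below flattens a nested record into path-keyed columns; `flatten` is a hypothetical helper written for illustration and makes no claim about Drill's internal representation.

```python
def flatten(record, prefix=""):
    """Turn nested maps into a single-level dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested maps, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            # Scalars and lists are kept as-is under their path.
            flat[path] = value
    return flat


# The first record from the slide, written as a Python dict.
michael = {
    "name": {"first": "Michael", "last": "Smith"},
    "hobbies": ["ski", "soccer"],
    "district": "Los Altos",
}

row = flatten(michael)
for column in sorted(row):
    print(column, "=", row[column])
```

The list under `hobbies` stays a list: a flat table has no natural column for it, which is the kind of data the complex end of the flexibility axis is meant to cover.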
Core Modules within a Drillbit
[Diagram: components of a Drillbit]
• RPC Endpoint
• SQL Parser
• Logical Plan
• Optimizer
• Physical Plan
• Execution
• Storage Plugins: DFS, HBase, Hive, MongoDB, CouchBase, Cassandra, RDBMS
• Distributed Cache
[Diagram: data processing landscape]
• HADOOP: Disk & Storage (bits, bytes, blocks)
• Processing in files: MapReduce, generic file formats
• Rows/Columns in files (tables): Hive, Pig, etc.
• Query: Impala, Tez, Hive
• NoSQL: MongoDB, HBase, Cassandra, Riak, Redis
• RDBMS: OLTP, EDW
• Languages: ANSI-SQL, SQL++, R, etc.
• Structure labels: No Structure; Semi Structured & Self describing; Highly Structured Data
• Cost labels: $1K/TB; $10K/TB; $100K – $200K / TB
• Apache Drill
NoSQL NoETL
Drill, Baby, Drill: Self-Service Data Exploration using Apache Drill
Thursday, August 21st, 9:30 AM
Apache Drill