Analyzing Real-World Data with Apache Drill
Tomer Shiran, VP Product Management, MapR Technologies; Co-Founder, PMC Member and Committer, Apache Drill
November 20, 2014
© 2014 MapR Technologies
Transcript
Page 1: Analyzing Real-World Data with Apache Drill

Analyzing Real-World Data with Apache Drill

Tomer Shiran
VP Product Management, MapR Technologies
Co-Founder, PMC Member and Committer, Apache Drill

November 20, 2014

© 2014 MapR Technologies

Page 2: Analyzing Real-World Data with Apache Drill


Data is doubling in size every two years

Page 3: Analyzing Real-World Data with Apache Drill

IDC estimates that in 2020, there will be 44 zettabytes of data in the world

    Year   Data in the world
    2011   1.8 zettabytes
    2013   4.4 zettabytes
    2020   44 zettabytes

Source: IDC Digital Universe

Page 4: Analyzing Real-World Data with Apache Drill

Unstructured data will account for more than 80% of the data collected by organizations

[Chart: total data stored, 1980-2020, split into structured data and unstructured data]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Page 5: Analyzing Real-World Data with Apache Drill

NoSchema Datastores are Capturing this Data

[Timeline: 1980 to 2020]

                 RELATIONAL DATABASES                      "NOSCHEMA" DATASTORES
    Volume       MBs-GBs                                   TBs-PBs
    Database     Fixed schema                              Dynamic schema (schema-free)
                 DBA controls structure                    Application controls structure
    Structure    Structured                                Structured, semi-structured and unstructured
    Development  Planned (release cycle = months-years)    Iterative (release cycle = days-weeks)

Page 6: Analyzing Real-World Data with Apache Drill

SQL in the Big Data World

WANT
•  SQL
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
•  Scalability

DON’T WANT
•  Create and maintain schemas on:
   –  HDFS (Parquet, JSON, etc.)
   –  HBase
   –  MongoDB
•  Transform or copy data

We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores

Page 7: Analyzing Real-World Data with Apache Drill

APACHE DRILL

•  Schema-free scale-out query engine for Hadoop and NoSQL
•  Point-and-query vs. schema-first
•  Low latency
•  Extreme ease of use
•  Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

40+ contributors
150+ years of experience building databases and distributed systems

Page 8: Analyzing Real-World Data with Apache Drill

Evolution Towards Self-Service Data Exploration

    Approach                        Data Modeling and Transformation   Data Visualization
    Traditional BI w/ RDBMS         IT-driven                          IT-driven
    Self-Service BI w/ RDBMS        IT-driven                          Self-service
    SQL-on-Hadoop                   IT-driven                          Self-service
    Self-Service Data Exploration   Not needed                         Self-service

Self-Service Data Exploration: zero-day analytics

Page 9: Analyzing Real-World Data with Apache Drill


Page 10: Analyzing Real-World Data with Apache Drill

Drill’s Data Model is Flexible

[Diagram: data formats placed along two axes of flexibility, flat to complex and fixed schema to schema-less. Formats shown: HBase; JSON, BSON; CSV, TSV; Parquet, Avro]

RDBMS/SQL-on-Hadoop table:

    Name      Gender  Age
    Michael   M       6
    Jennifer  F       3

Apache Drill table:

    {
      name: {
        first: Michael,
        last: Smith
      },
      hobbies: [ski, soccer],
      district: Los Altos
    }
    {
      name: {
        first: Jennifer,
        last: Gates
      },
      hobbies: [sing],
      preschool: CCLC
    }

Page 11: Analyzing Real-World Data with Apache Drill

Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
•  Fixed schema
•  Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
•  Fixed schema, evolving schema or schema-less
•  Leverage schema in centralized repository or self-describing data
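To make "schema discovered on the fly" concrete, here is a minimal Python sketch that walks self-describing records and collects the field paths and value types it encounters. It is an illustration only, not Drill's implementation; the function name discover_schema is hypothetical, and the two records are the ones from the "Apache Drill table" example in this deck.

```python
def discover_schema(records):
    """Return {field path: set of type names} observed across all records."""
    schema = {}

    def walk(value, path):
        # Descend into nested maps; record the type of everything else.
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

No schema is declared up front: the fields district and preschool each appear in only one record, and both end up in the discovered schema.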

Page 12: Analyzing Real-World Data with Apache Drill

Native JSON

JSON query with Oracle:

    SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER)
    FROM j_purchaseorder;

JSON query with Drill:

    SELECT po_document.AllowPartialShipment
    FROM j_purchaseorder;

Relational databases cannot provide true schema-free JSON support.
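The dotted reference in the Drill query reads like ordinary nested-field access. A small Python sketch of that idea, under the assumption that the document is a nested map: the helper get_path and the PONumber field are hypothetical, and only the AllowPartialShipment field name comes from the slide.

```python
import json


def get_path(document, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


# Hypothetical purchase-order document; only AllowPartialShipment is from the slide.
po_document = json.loads('{"PONumber": 1600, "AllowPartialShipment": true}')
row = {"po_document": po_document}

allow_partial = get_path(row, "po_document.AllowPartialShipment")
print(allow_partial)
```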

Page 13: Analyzing Real-World Data with Apache Drill


Architecture

Page 14: Analyzing Real-World Data with Apache Drill

High Level Architecture

•  Cluster of commodity servers
   –  Daemon (drillbit) on each node
•  No dependency on other execution engines (MapReduce, Spark, Tez)
   –  Better performance and manageability
•  ZooKeeper maintains ephemeral cluster membership information
   –  drillbit uses ZooKeeper to find other drillbits in the cluster
   –  Client uses ZooKeeper to find drillbits
•  Data processing unit is columnar record batches
   –  Enables schema flexibility with negligible performance impact
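One way to picture a columnar record batch is as a map from column name to a list of values, where every batch carries its own set of columns. The Python sketch below illustrates that idea only; it is not Drill's in-memory format, and the function name to_columnar_batch is hypothetical.

```python
def to_columnar_batch(rows):
    """Pivot row dicts into {column: [values]}; missing fields become None."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return {name: [row.get(name) for row in rows] for name in columns}


# Two batches whose columns differ: each batch describes itself.
batch_1 = to_columnar_batch([{"name": "Michael", "district": "Los Altos"}])
batch_2 = to_columnar_batch([{"name": "Jennifer", "preschool": "CCLC"}])
mixed = to_columnar_batch([
    {"name": "Michael", "district": "Los Altos"},
    {"name": "Jennifer", "preschool": "CCLC"},
])

print(list(batch_1))
print(list(batch_2))
print(mixed["district"])
```

Because the column set travels with each batch, a change in the incoming data's fields affects only the batches that contain it.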

Page 15: Analyzing Real-World Data with Apache Drill

Drill Maximizes Data Locality

    Data Source        Best Practice
    HDFS or MapR-FS    drillbit on each DataNode
    HBase or MapR-DB   drillbit on each RegionServer
    MongoDB            drillbit on each mongod node (when using replicas, run it on the replica node)

[Diagram: each node runs a drillbit next to a DataNode/RegionServer/mongod, alongside a ZooKeeper ensemble]

Page 16: Analyzing Real-World Data with Apache Drill

SELECT* Query Execution

[Diagram: a client (JDBC, ODBC, REST), a ZooKeeper ensemble and three drillbits]

1.  Find drillbits (once per session)
2.  Submit query to drillbit
3.  Create logical and physical execution plans
4.  Farm out execution of fragments to cluster (completely distributed execution)
5.  Return results to client

* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4

Page 17: Analyzing Real-World Data with Apache Drill

Core Modules within drillbit

•  RPC Endpoint
•  SQL Parser
•  Optimizer
•  Logical Plan
•  Physical Plan
•  Execution
•  Distributed Cache
•  Storage Plugins: Hive, HBase, MongoDB, DFS

Page 18: Analyzing Real-World Data with Apache Drill


Example: Analyzing Real-World Data

Page 19: Analyzing Real-World Data with Apache Drill

Demo Plan

1.  Run Drill
2.  Configure DFS and MongoDB storage plugins
3.  Explore the data
    –  Basics
    –  Complex data
    –  Views

Page 20: Analyzing Real-World Data with Apache Drill


Run Drill

Page 21: Analyzing Real-World Data with Apache Drill

Run Drill in Embedded Mode (sqlline)

    $ tar xf apache-drill-0.7.0.tar.gz
    $ cd apache-drill-0.7.0
    $ bin/sqlline -u jdbc:drill:zk=local
    > SELECT *
      FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json`
      LIMIT 1;
    +---------------+------------+--------------+------------+------------+
    | yelping_since |   votes    | review_count |    name    |  user_id   |
    +---------------+------------+--------------+------------+------------+
    | 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee        | qtrmBGNqCvupHMHL_bKFgQ |

•  drillbit (Drill daemon) starts automatically in embedded mode
•  No ZooKeeper in embedded mode (hence zk=local)
•  Can’t use BI clients (JDBC/ODBC) in embedded mode

You can now access the Web UI: http://localhost:8047

Page 22: Analyzing Real-World Data with Apache Drill

Or Run Drill in Distributed Mode…

•  Make sure ZooKeeper (zkServer) is running:

    $ zkServer start

   –  Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
•  Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
•  Start drillbit:

    $ bin/drillbit.sh start

•  Access the Web UI: http://localhost:8047
•  Connect a client to the cluster (e.g., sqlline):

    $ bin/sqlline -u jdbc:drill:zk=localhost:2181

   –  Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
   –  If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
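As a rough sketch of the connection details on this slide, the Python helpers below assemble the JDBC connection string and check whether the ZooKeeper port accepts connections (the same check as the telnet tip). The function names and the cluster name "mycluster" are hypothetical, and joining several ZooKeeper hosts with commas is an assumption that goes beyond the single-host example shown.

```python
import socket


def drill_jdbc_url(zk_hosts, cluster_name=None):
    """Build a jdbc:drill URL from a list of ZooKeeper 'host:port' entries."""
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        # Needed only when several Drill clusters share one ZooKeeper ensemble.
        url += "/drill/" + cluster_name
    return url


def zookeeper_reachable(host="localhost", port=2181, timeout=2.0):
    """Return True if a TCP connection to the ZooKeeper port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


local_url = drill_jdbc_url(["localhost:2181"])
named_url = drill_jdbc_url(["localhost:2181"], cluster_name="mycluster")
print(local_url)
print(named_url)
```

The resulting string is what follows -u on the sqlline command line.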

Page 23: Analyzing Real-World Data with Apache Drill


Configure Storage Plugins

Page 24: Analyzing Real-World Data with Apache Drill


Enable MongoDB Storage Plugin

Page 25: Analyzing Real-World Data with Apache Drill

Define Workspaces in the DFS Storage Plugin
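The slide's screenshot is not part of this transcript, so the following is only a sketch of what a DFS storage plugin definition with workspaces might look like, written as a Python dict and printed as JSON. The workspace names root, tmp and demo appear in the queries in this deck; the demo location is inferred from the two equivalent review.json paths, and the tmp location and all writable flags are assumptions.

```python
import json
import posixpath

# Assumed shape of a DFS storage plugin definition; values are illustrative.
dfs_plugin = {
    "type": "file",
    "connection": "file:///",
    "workspaces": {
        "root": {"location": "/", "writable": False},
        "tmp": {"location": "/tmp", "writable": True},  # views are created here
        "demo": {"location": "/Users/tshiran/Development/demo/data",
                 "writable": False},
    },
}


def resolve(workspace, relative_path):
    """Expand a workspace-relative path to an absolute path."""
    location = dfs_plugin["workspaces"][workspace]["location"]
    return posixpath.join(location, relative_path)


print(json.dumps(dfs_plugin, indent=2))
print(resolve("demo", "yelp/review.json"))
```

A workspace is a named directory, so queries can use short relative paths in place of full ones.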

Page 26: Analyzing Real-World Data with Apache Drill


Explore the Data: Basics

Page 27: Analyzing Real-World Data with Apache Drill

Inventory: DFS Files

    {
      "votes": {"funny": 0, "useful": 2, "cool": 1},
      "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
      "review_id": "15SdjuK7DmYqUAj6rjGowg",
      "stars": 5,
      "date": "2007-05-17",
      "text": "dr. goldberg offers everything ...",
      "type": "review",
      "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
    }

Page 28: Analyzing Real-World Data with Apache Drill

Inventory: MongoDB Collections

    $ mongo
    MongoDB shell version: 2.6.5
    > show databases;
    admin  (empty)
    local  0.078GB
    yelp   0.453GB
    > use yelp
    > db.users.findOne()
    {
      "_id" : ObjectId("54566cdf3237149de181a92a"),
      "yelping_since" : "2012-02",
      "votes" : {
        "funny" : 1,
        "useful" : 5,
        "cool" : 0
      },
      "review_count" : 6,
      "name" : "Lee",
      "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
      "friends" : [ ]
    }

Page 29: Analyzing Real-World Data with Apache Drill

Let’s Go!

    > SELECT *
      FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
      WHERE stars = 1
      LIMIT 1;
    +------------+------------+------------+------------+------------+------------+------------+-------------+
    |   votes    |  user_id   | review_id  |   stars    |    date    |    text    |    type    | business_id |
    +------------+------------+------------+------------+------------+------------+------------+-------------+
    | {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1          | 2013-04-19 | I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review     | vcNAWiLM4dR7D2nwwJ7nCA |
    +------------+------------+------------+------------+------------+------------+------------+-------------+

Page 30: Analyzing Real-World Data with Apache Drill

Using Storage Plugins and Workspaces

    > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
    > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
    > SELECT * FROM mongo.yelp.users LIMIT 1;
    > USE mongo.yelp;
    > SELECT * FROM users LIMIT 1;

In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.

    Storage Plugin   Workspace   Table
    dfs              Path        Path relative to workspace
    mongo            Database    Collection
    hive             Database    Table
    hbase            Namespace   Table
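The three-part naming scheme can be illustrated with a small Python parser that splits a table reference into storage plugin, workspace and table. This is a sketch for the two reference styles used in this deck (a backtick-quoted path and a bare identifier), not Drill's SQL parser; the name parse_table_ref is hypothetical.

```python
import re

# plugin.workspace.`quoted path`  or  plugin.workspace.bare_name
TABLE_REF = re.compile(
    r"^(?P<plugin>\w+)\.(?P<workspace>\w+)\.(?:`(?P<quoted>[^`]+)`|(?P<bare>\w+))$"
)


def parse_table_ref(ref):
    """Split a three-part table reference into (plugin, workspace, table)."""
    match = TABLE_REF.match(ref)
    if match is None:
        raise ValueError(f"not a three-part table reference: {ref!r}")
    return match["plugin"], match["workspace"], match["quoted"] or match["bare"]


print(parse_table_ref("dfs.demo.`yelp/review.json`"))
print(parse_table_ref("mongo.yelp.users"))
```

After USE mongo.yelp, the first two parts are implied, which is why the last query in the slide can name the table as just users.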

Page 31: Analyzing Real-World Data with Apache Drill

Most Common User Names (MongoDB)

    > SELECT name, count(*) AS users
      FROM mongo.yelp.users
      GROUP BY name
      ORDER BY users DESC LIMIT 10;
    +------------+------------+
    |    name    |   users    |
    +------------+------------+
    | David      | 2453       |
    | John       | 2378       |
    | Michael    | 2322       |
    | Chris      | 2202       |
    | Mike       | 2037       |
    | Jennifer   | 1867       |
    | Jessica    | 1463       |
    | Jason      | 1457       |
    | Michelle   | 1439       |
    | Brian      | 1436       |
    +------------+------------+

Page 32: Analyzing Real-World Data with Apache Drill

Cities with the Most Businesses

    > SELECT state, city, count(*) AS businesses
      FROM dfs.demo.`/yelp/business.json`
      GROUP BY state, city
      ORDER BY businesses DESC LIMIT 10;
    +------------+------------+-------------+
    |   state    |    city    |  businesses |
    +------------+------------+-------------+
    | NV         | Las Vegas  | 12021       |
    | AZ         | Phoenix    | 7499        |
    | AZ         | Scottsdale | 3605        |
    | EDH        | Edinburgh  | 2804        |
    | AZ         | Mesa       | 2041        |
    | AZ         | Tempe      | 2025        |
    | NV         | Henderson  | 1914        |
    | AZ         | Chandler   | 1637        |
    | WI         | Madison    | 1630        |
    | AZ         | Glendale   | 1196        |
    +------------+------------+-------------+

Page 33: Analyzing Real-World Data with Apache Drill


Explore the Data: Complex Data

Page 34: Analyzing Real-World Data with Apache Drill

business.json (1)

    {
      "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
      "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
      "hours": {
        "Monday": {"close": "23:00", "open": "07:00"},
        "Tuesday": {"close": "23:00", "open": "07:00"},
        "Friday": {"close": "00:00", "open": "07:00"},
        "Wednesday": {"close": "23:00", "open": "07:00"},
        "Thursday": {"close": "23:00", "open": "07:00"},
        "Sunday": {"close": "23:00", "open": "07:00"},
        "Saturday": {"close": "00:00", "open": "07:00"}
      },
      "open": true,
      "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
      "city": "Las Vegas",
      "review_count": 4084,
      "name": "Mon Ami Gabi",
      "neighborhoods": ["The Strip"],
      "longitude": -115.172588519464,

Page 35: Analyzing Real-World Data with Apache Drill

business.json (2)

      "state": "NV",
      "stars": 4.0,
      "attributes": {
        "Alcohol": "full_bar",
        "Noise Level": "average",
        "Has TV": false,
        "Attire": "casual",
        "Ambience": {
          "romantic": true,
          "intimate": false,
          "touristy": false,
          "hipster": false,
          "classy": true,
          "trendy": false,
          "casual": false
        },
        "Good For": {"dessert": false, "latenight": false, "lunch": false,
                     "dinner": true, "breakfast": false, "brunch": false}
      }
    }

Page 36: Analyzing Real-World Data with Apache Drill

Which Places Are Open Right Now (22:00)?

    > SELECT name, b.hours
      FROM dfs.demo.`yelp/business.json` b
      WHERE b.hours.Saturday.`open` < '22:00' AND
            b.hours.Saturday.`close` > '22:00'
      LIMIT 2;
    +------------+------------+
    |    name    |   hours    |
    +------------+------------+
    | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
    | Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
    +------------+------------+
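The WHERE clause compares opening hours as strings, which works because zero-padded HH:MM values sort in time order. The Python sketch below mirrors that comparison and adds a variant for businesses that close at or after midnight; the function names are hypothetical, and the sample hours are taken from the Chang Jiang Chinese Kitchen row above and the Mon Ami Gabi record in business.json.

```python
def is_open(day_hours, now):
    """Mirror the slide's WHERE clause: open < now AND close > now."""
    return day_hours["open"] < now and day_hours["close"] > now


def is_open_past_midnight(day_hours, now):
    """Variant that treats close <= open as closing after midnight."""
    opens, closes = day_hours["open"], day_hours["close"]
    if closes <= opens:
        return now > opens or now < closes
    return opens < now < closes


chang_jiang_saturday = {"close": "22:30", "open": "11:00"}
mon_ami_gabi_saturday = {"close": "00:00", "open": "07:00"}

print(is_open(chang_jiang_saturday, "22:00"))
print(is_open(mon_ami_gabi_saturday, "22:00"))
print(is_open_past_midnight(mon_ami_gabi_saturday, "22:00"))
```

One thing the sketch makes visible: a plain string comparison skips a business whose closing time is "00:00", because "00:00" sorts before "22:00".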

Page 37: Analyzing Real-World Data with Apache Drill

It’s 10pm in Vegas and I Want Good Hummus!

    > SELECT name, stars, b.hours.Friday, categories
      FROM dfs.demo.`yelp/business.json` b
      WHERE b.hours.Friday.`open` < '22:00' AND
            b.hours.Friday.`close` > '22:00' AND
            REPEATED_CONTAINS(categories, 'Mediterranean') AND
            city = 'Las Vegas'
      ORDER BY stars DESC
      LIMIT 2;
    +------------+------------+------------+------------+
    |    name    |   stars    |   EXPR$2   | categories |
    +------------+------------+------------+------------+
    | Olives     | 4.0        | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
    | Marrakech Moroccan Restaurant | 4.0        | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
    +------------+------------+------------+------------+

Page 38: Analyzing Real-World Data with Apache Drill

Flatten Repeated Values

    > SELECT name, categories
      FROM dfs.demo.`yelp/business.json` LIMIT 3;
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
    | Pine Cone Restaurant | ["Restaurants"] |
    | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
    +------------+------------+

    > SELECT name, FLATTEN(categories) AS categories
      FROM dfs.demo.`yelp/business.json` LIMIT 5;
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | Eric Goldberg, MD | Doctors    |
    | Eric Goldberg, MD | Health & Medical |
    | Pine Cone Restaurant | Restaurants |
    | Deforest Family Restaurant | American (Traditional) |
    | Deforest Family Restaurant | Restaurants |
    +------------+------------+
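The effect of FLATTEN on these rows can be modeled in a few lines of Python: each input row is repeated once per element of the array field. This is a sketch of the observable behavior in the output above, not Drill's implementation, and the function name flatten is hypothetical.

```python
def flatten(rows, field):
    """Yield one copy of each row per element of row[field]."""
    for row in rows:
        for element in row[field]:
            yield {**row, field: element}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flattened = list(flatten(businesses, "categories"))
for row in flattened:
    print(row["name"], "|", row["categories"])
```

Three input rows with five category values in total become five output rows, matching the second result table.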

Page 39: Analyzing Real-World Data with Apache Drill

Most and Least Common Business Categories

    > SELECT category, count(*) AS businesses
      FROM (SELECT name, FLATTEN(categories) AS category
            FROM dfs.demo.`yelp/business.json`) c
      GROUP BY category ORDER BY businesses DESC;
    +------------+------------+
    |  category  | businesses |
    +------------+------------+
    | Restaurants | 14303      |
    …
    | Australian | 1          |
    | Boat Dealers | 1          |
    | Firewood   | 1          |
    +------------+------------+
    715 rows selected (3.439 seconds)

    > SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
    +------------+------------+
    |    name    | categories |
    +------------+------------+
    | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
    +------------+------------+
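Both queries on this slide can be approximated in Python on a handful of rows: counting flattened categories corresponds to the GROUP BY, and a membership test corresponds to REPEATED_CONTAINS. The four businesses below are taken from result rows in this deck; the counts they produce describe this tiny sample only, not the full data set.

```python
from collections import Counter

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
]

# FLATTEN + GROUP BY category + count(*), modeled as a Counter over all elements.
category_counts = Counter(
    category for business in businesses for category in business["categories"]
)

# REPEATED_CONTAINS(categories, 'Australian'), modeled as a membership test.
australian = [b["name"] for b in businesses if "Australian" in b["categories"]]

print(category_counts.most_common(1))
print(australian)
```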

Page 40: Analyzing Real-World Data with Apache Drill


Explore the Data: Views

Page 41: Analyzing Real-World Data with Apache Drill

Create a View for Name-Gender Mapping

    > CREATE VIEW dfs.tmp.`names` AS
      SELECT columns[0] AS name, columns[4] AS gender
      FROM dfs.demo.`names.csv`;
    > USE dfs.tmp;
    > CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
    > SELECT * FROM dfs.tmp.names WHERE name = 'John';
    +------------+------------+
    |    name    |   gender   |
    +------------+------------+
    | John       | Male       |
    +------------+------------+

[Screenshot: names.csv, with columns[0] and columns[4] highlighted]
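For a headerless CSV file, the view addresses fields by position through the columns array. The Python sketch below reads rows the same way; the file contents are hypothetical stand-ins, since the slide only shows that column 0 holds the name and column 4 the gender, and the filler values in columns 1 to 3 are invented placeholders.

```python
import csv
import io

# Hypothetical stand-in for names.csv; columns 1-3 are placeholder values.
names_csv = io.StringIO(
    "John,x,x,x,Male\n"
    "Jennifer,x,x,x,Female\n"
)

# Equivalent of: SELECT columns[0] AS name, columns[4] AS gender
names_view = [
    {"name": columns[0], "gender": columns[4]}
    for columns in csv.reader(names_csv)
]

john = [row for row in names_view if row["name"] == "John"]
print(john)
```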

Page 42: Analyzing Real-World Data with Apache Drill

Most Common Names (and their Genders) on Yelp

    > SELECT u.name, n.gender, count(*) AS number
      FROM mongo.yelp.users u, dfs.tmp.names n
      WHERE u.name = n.name
      GROUP BY u.name, n.gender
      ORDER BY number DESC LIMIT 10;
    +------------+------------+------------+
    |    name    |   gender   |   number   |
    +------------+------------+------------+
    | David      | Male       | 2453       |
    | John       | Male       | 2378       |
    | Michael    | Male       | 2322       |
    | Chris      | Unknown    | 2202       |
    | Mike       | Male       | 2037       |
    | Jennifer   | Female     | 1867       |
    | Jessica    | Female     | 1463       |
    | Jason      | Male       | 1457       |
    | Michelle   | Female     | 1439       |
    | Brian      | Male       | 1436       |
    +------------+------------+------------+

Page 43: Analyzing Real-World Data with Apache Drill

Who Rates Higher – Men or Women?

    > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
      FROM mongo.yelp.users u, dfs.tmp.names n
      WHERE u.name = n.name
      GROUP BY n.gender;
    +------------+------------+------------+
    |   gender   |   users    |   stars    |
    +------------+------------+------------+
    | Female     | 103684     | 3.77       |
    | Male       | 97430      | 3.696      |
    | Unknown    | 18409      | 3.727      |
    +------------+------------+------------+
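The join between the MongoDB users collection and the CSV-backed names view can be modeled in Python as a dictionary lookup followed by a group-and-average. The user rows and star values below are made up for illustration, and the lookup assumes each name appears once in the names view.

```python
from collections import defaultdict

# Hypothetical rows standing in for mongo.yelp.users and dfs.tmp.names.
users = [
    {"name": "David", "average_stars": 4.0},
    {"name": "David", "average_stars": 3.0},
    {"name": "Jennifer", "average_stars": 5.0},
    {"name": "Chris", "average_stars": 3.5},
    {"name": "Zed", "average_stars": 1.0},  # no match in names: dropped by the join
]
names = {"David": "Male", "Jennifer": "Female", "Chris": "Unknown"}

stars_by_gender = defaultdict(list)
for user in users:
    gender = names.get(user["name"])
    if gender is not None:  # inner join on u.name = n.name
        stars_by_gender[gender].append(user["average_stars"])

# count(*) and round(avg(average_stars), 2) per gender
summary = {
    gender: (len(stars), round(sum(stars) / len(stars), 2))
    for gender, stars in stars_by_gender.items()
}
print(summary)
```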

Page 44: Analyzing Real-World Data with Apache Drill

Who Writes More – Men or Women?

It takes a 3-way join to find out…

    > SELECT n.gender, round(avg(length(r.text))) AS review_length
      FROM dfs.demo.`yelp/review.json` r,
           mongo.yelp.users u,
           dfs.tmp.names n
      WHERE u.name = n.name AND r.user_id = u.user_id
      GROUP BY n.gender;
    +------------+---------------+
    |   gender   | review_length |
    +------------+---------------+
    | Male       | 665           |
    | Female     | 730           |
    | Unknown    | 711           |
    +------------+---------------+
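The 3-way join links a JSON file, a MongoDB collection and a view over a CSV file. A plain-Python sketch of the same chain of lookups (review to user by user_id, then user to gender by name) is shown below; all rows are invented for illustration, and the lookups assume user_id and name are unique keys.

```python
from collections import defaultdict

# Hypothetical rows standing in for review.json, mongo.yelp.users and dfs.tmp.names.
reviews = [
    {"user_id": "u1", "text": "Great food"},
    {"user_id": "u1", "text": "Okay"},
    {"user_id": "u2", "text": "Loved it!"},
    {"user_id": "u9", "text": "No matching user"},
]
users = [{"user_id": "u1", "name": "John"}, {"user_id": "u2", "name": "Jennifer"}]
names = {"John": "Male", "Jennifer": "Female"}

users_by_id = {user["user_id"]: user for user in users}

lengths_by_gender = defaultdict(list)
for review in reviews:
    user = users_by_id.get(review["user_id"])  # r.user_id = u.user_id
    if user is None:
        continue
    gender = names.get(user["name"])           # u.name = n.name
    if gender is None:
        continue
    lengths_by_gender[gender].append(len(review["text"]))

# round(avg(length(r.text))) per gender
review_length = {
    gender: round(sum(lengths) / len(lengths))
    for gender, lengths in lengths_by_gender.items()
}
print(review_length)
```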

Page 45: Analyzing Real-World Data with Apache Drill


Drill Tweets (@ApacheDrill)

Page 46: Analyzing Real-World Data with Apache Drill

Thank You

•  Learn: incubator.apache.org/drill/

•  Download: incubator.apache.org/drill/download/

•  Ask questions: [email protected]

•  Contact me: [email protected]

Page 47: Analyzing Real-World Data with Apache Drill


Thank You

@mapr maprtech

[email protected]

Tomer Shiran, VP Product Management

MapRTechnologies

maprtech

mapr-technologies