Scott Leberknight Cloudera's 7/9/2013

Cloudera Impala Overview (via Scott Leberknight)

Aug 20, 2015


Technology

Cloudera, Inc.
Transcript
Page 1: Cloudera Impala Overview (via Scott Leberknight)

Scott Leberknight

Cloudera's

7/9/2013

Page 2: Cloudera Impala Overview (via Scott Leberknight)

History lesson...

Page 3: Cloudera Impala Overview (via Scott Leberknight)

Google Map/Reduce paper (2004)

Cutting & Cafarella create Hadoop (2005)

Page 4: Cloudera Impala Overview (via Scott Leberknight)

Facebook creates Hive (2007)*

Google Dremel paper (2010)

Page 5: Cloudera Impala Overview (via Scott Leberknight)

Apache Drill proposal (August 2012)

Cloudera announces Impala (October 2012)

Hortonworks' Stinger (February 2013)

Page 6: Cloudera Impala Overview (via Scott Leberknight)

* Hive => "SQL on Hadoop"

Write SQL queries

Translate into Map/Reduce job(s)

Convenient & easy

High-latency (batch processing)
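
To make the "translate into Map/Reduce job(s)" step concrete, here is a minimal Python sketch of how a GROUP BY count could be expressed as map, shuffle, and reduce phases. It runs in a single process and is only an illustration of the idea, not Hive's actual planner; the function names and sample rows are invented for this example.

```python
from collections import defaultdict

def map_phase(rows, key_column):
    """Emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row[key_column], 1

def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key to get the per-group count."""
    return {key: sum(values) for key, values in grouped.items()}

def group_by_count(rows, key_column):
    """Rough equivalent of: select key_column, count(*) ... group by key_column."""
    return reduce_phase(shuffle_phase(map_phase(rows, key_column)))

# Hypothetical sample rows, not taken from the benchmark dataset
rows = [{"rating": "Good"}, {"rating": "Low Risk"}, {"rating": "Good"}]
counts = group_by_count(rows, "rating")
```

In a real cluster each phase is a separate distributed step with job startup and disk I/O in between, which is where the batch-style latency comes from.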

Page 7: Cloudera Impala Overview (via Scott Leberknight)

What is Impala?

In-memory, distributed SQL query engine (no Map/Reduce)

Native code (C++)

Distributed (on HDFS data nodes)

Page 8: Cloudera Impala Overview (via Scott Leberknight)

Why Impala?

Interactive data analysis

Low-latency response (roughly 4-100x faster than Hive)

Deploy on existing Hadoop clusters

Page 9: Cloudera Impala Overview (via Scott Leberknight)

Why Impala? (cont'd)

Data stored in HDFS avoids...

...duplicate storage

...data transformation

...moving data

Page 10: Cloudera Impala Overview (via Scott Leberknight)

Why Impala? (cont'd)

SPEED!

Page 11: Cloudera Impala Overview (via Scott Leberknight)

Overview

impalad daemon runs on HDFS nodes

Queries run on "relevant" nodes

Supports common HDFS file formats

statestored (for cluster metadata) & Hive metastore (for database metadata)

Page 12: Cloudera Impala Overview (via Scott Leberknight)

Overview (cont'd)

Does not use Map/Reduce

Not fault tolerant! (a query fails if any part of it fails on any node)

Submit queries via Hue/Beeswax, Thrift API, CLI, ODBC, JDBC

Page 13: Cloudera Impala Overview (via Scott Leberknight)

SQL Support (subset of Hive QL)

SELECT

Projection

UNION

INSERT OVERWRITE

INSERT INTO

ORDER BY (w/ LIMIT)

Aggregation

Subqueries (uncorrelated)

JOIN (equi-join only, subject to memory limitations)
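
One plausible reason for pairing ORDER BY with LIMIT is that a top-N result can be computed with bounded memory, which suits an in-memory engine. The Python sketch below illustrates that idea with the standard heapq module: each partition keeps only its own N best rows, and a final step merges those small lists. This is an assumption made for illustration, not Impala's actual implementation; `top_n` and `distributed_top_n` are hypothetical names.

```python
import heapq
from itertools import chain

def top_n(rows, n, key):
    """Return the n smallest rows by key without sorting the whole input."""
    return heapq.nsmallest(n, rows, key=key)

def distributed_top_n(partitions, n, key):
    """Compute a top-n per partition, then merge the small partial results."""
    local_results = [top_n(part, n, key) for part in partitions]
    return top_n(chain.from_iterable(local_results), n, key)

# Hypothetical rows of (last_name, id), split across two "nodes"
partitions = [
    [("Smith", 3), ("Adams", 1)],
    [("Jones", 2), ("Baker", 4)],
]
first_two = distributed_top_n(partitions, 2, key=lambda row: row[0])
```

No step ever holds more than N rows per partition plus the merged candidates, whereas an unbounded ORDER BY would need the full sorted result in memory.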

Page 14: Cloudera Impala Overview (via Scott Leberknight)

HBase Queries

Maps HBase tables via Hive metastore mapping

HBase scan translations:

Row key predicates => start/stop row

Non-row key predicates => SingleColumnValueFilter
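
The translation rule on this slide can be sketched as a small Python function that splits predicates into a row-key range and a list of column filters. This is a toy model under stated assumptions: the predicate tuples, the `plan_hbase_scan` name, and the dictionary layout are hypothetical, and only `>=` and `<` on the row key are handled. It does not call the HBase API.

```python
def plan_hbase_scan(predicates, row_key_column="row_key"):
    """Split (column, operator, value) predicates into scan bounds and filters."""
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column == row_key_column:
            if op == ">=":
                scan["start_row"] = value   # scan start row is inclusive
            elif op == "<":
                scan["stop_row"] = value    # scan stop row is exclusive
            else:
                raise NotImplementedError(
                    f"row-key operator {op!r} is outside this sketch")
        else:
            # Stand-in tuple for an HBase SingleColumnValueFilter
            scan["filters"].append(
                ("SingleColumnValueFilter", column, op, value))
    return scan

# Hypothetical predicates: a row-key range plus one column condition
scan = plan_hbase_scan([
    ("row_key", ">=", "user100"),
    ("row_key", "<", "user200"),
    ("cf:state", "=", "VA"),
])
```

Row-key bounds narrow which rows the scan touches at all, while column filters are evaluated against every row inside that range, so row-key predicates are usually the cheaper of the two.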

Page 15: Cloudera Impala Overview (via Scott Leberknight)

(Very) Unscientific Benchmarks

Page 16: Cloudera Impala Overview (via Scott Leberknight)

9 queries, run in CDH Quickstart VM

Hardware

MacBook Pro Retina, mid 2012

16GB RAM, 4GB for VM (VMware 5)

Intel i7 2.6GHz quad-core processor

No other load on system during queries

Pseudo-cluster + Impala daemons

CDH 4.2, Impala 1.0

Page 17: Cloudera Impala Overview (via Scott Leberknight)

Benchmarks (cont'd)

Impala vs. Hive performance

(from simple projection queries to multiple joins, aggregation, multiple predicates, and order by)

"TPC-DS" sample dataset (http://www.tpc.org/tpcds/)

Page 18: Cloudera Impala Overview (via Scott Leberknight)

Query "A"

select c.c_first_name, c.c_last_name
from customer c
limit 50;

Page 19: Cloudera Impala Overview (via Scott Leberknight)

Query "B"

select
    c.c_first_name,
    c.c_last_name,
    ca.ca_city,
    ca.ca_county,
    ca.ca_state
from customer c
    join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
limit 50;

Page 20: Cloudera Impala Overview (via Scott Leberknight)

Query "C"

select
    c.c_first_name,
    c.c_last_name,
    ca.ca_city,
    ca.ca_county,
    ca.ca_state
from customer c
    join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
where lower(c.c_last_name) like 'smi%'
limit 50;

Page 21: Cloudera Impala Overview (via Scott Leberknight)

Query "D"

select distinct cd_credit_rating
from customer_demographics;

Page 22: Cloudera Impala Overview (via Scott Leberknight)

Query "E"

select
    cd_credit_rating,
    count(*)
from customer_demographics
group by cd_credit_rating;

Page 23: Cloudera Impala Overview (via Scott Leberknight)

Query "F"

select
    c.c_first_name,
    c.c_last_name,
    ca.ca_city,
    ca.ca_county,
    ca.ca_state,
    cd.cd_marital_status,
    cd.cd_education_status
from customer c
    join customer_address ca
        on c.c_current_addr_sk = ca.ca_address_sk
    join customer_demographics cd
        on c.c_current_cdemo_sk = cd.cd_demo_sk
where
    lower(c.c_last_name) like 'smi%' and
    cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 50;

Page 24: Cloudera Impala Overview (via Scott Leberknight)

Query "G"

select
    count(c.c_customer_sk)
from customer c
    join customer_address ca
        on c.c_current_addr_sk = ca.ca_address_sk
    join customer_demographics cd
        on c.c_current_cdemo_sk = cd.cd_demo_sk
where
    ca.ca_zip in ('20191', '20194') and
    cd.cd_credit_rating in ('Unknown', 'High Risk');

Page 25: Cloudera Impala Overview (via Scott Leberknight)

Query "H"

select
    c.c_first_name,
    c.c_last_name,
    ca.ca_city,
    ca.ca_county,
    ca.ca_state,
    cd.cd_marital_status,
    cd.cd_education_status
from customer c
    join customer_address ca
        on c.c_current_addr_sk = ca.ca_address_sk
    join customer_demographics cd
        on c.c_current_cdemo_sk = cd.cd_demo_sk
where
    ca.ca_zip in ('20191', '20194') and
    cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 100;

Page 26: Cloudera Impala Overview (via Scott Leberknight)

Query "TPC-DS"

select
    i_item_id,
    s_state,
    avg(ss_quantity) agg1,
    avg(ss_list_price) agg2,
    avg(ss_coupon_amt) agg3,
    avg(ss_sales_price) agg4
from store_sales
join date_dim
    on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item
    on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics
    on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join store
    on (store_sales.ss_store_sk = store.s_store_sk)
where
    cd_gender = 'M' and
    cd_marital_status = 'S' and
    cd_education_status = 'College' and
    d_year = 2002 and
    s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
    i_item_id,
    s_state
order by
    i_item_id,
    s_state
limit 100;

Page 27: Cloudera Impala Overview (via Scott Leberknight)

Query     Hive (sec)   # M/R jobs   Impala (sec)   x Hive perf.
A         13.8         1            0.25           54
B         30.0         1            0.41           73
C         33.3         1            0.42           79
D         23.2         1            0.64           36
E         21.6         1            0.62           35
F         59.1         2            1.96           30
G         78.5         3            1.56           50
H         59.6         2            1.89           32
TPC-DS    204.5        6            3.23           63

(remember, unscientific...)
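
The "x Hive perf." column appears to be the Hive time divided by the Impala time. The short Python sketch below recomputes that ratio from the timings listed above; the `speedup` helper is a name made up for this example. The recomputed values land close to the slide's column but not always exactly on it (query A works out to about 55 rather than 54), presumably because the displayed times are rounded.

```python
# Timings in seconds copied from the benchmark table: query -> (Hive, Impala)
timings = {
    "A": (13.8, 0.25),
    "B": (30.0, 0.41),
    "C": (33.3, 0.42),
    "D": (23.2, 0.64),
    "E": (21.6, 0.62),
    "F": (59.1, 1.96),
    "G": (78.5, 1.56),
    "H": (59.6, 1.89),
    "TPC-DS": (204.5, 3.23),
}

def speedup(hive_sec, impala_sec):
    """How many times faster Impala ran than Hive for one query."""
    return hive_sec / impala_sec

ratios = {query: speedup(hive, impala) for query, (hive, impala) in timings.items()}

for query, ratio in ratios.items():
    print(f"{query:7s} {ratio:6.1f}x")
```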

Page 28: Cloudera Impala Overview (via Scott Leberknight)
Page 29: Cloudera Impala Overview (via Scott Leberknight)

Architecture

Page 30: Cloudera Impala Overview (via Scott Leberknight)

Two daemons: impalad, statestored

impalad on each HDFS data node

statestored - cluster metadata

Thrift APIs, ODBC, JDBC

Page 31: Cloudera Impala Overview (via Scott Leberknight)

impalad

Query execution

Query coordination

Query planning

Page 32: Cloudera Impala Overview (via Scott Leberknight)

impalad

Query Coordinator

Query Planner

Query Executor

HDFS DataNode

HBase RegionServer

Page 33: Cloudera Impala Overview (via Scott Leberknight)

Queries performed in-memory

Intermediate data never hits disk!

Data streamed to clients

Execution engine: C++, runtime code generation, intrinsics for optimization

Page 34: Cloudera Impala Overview (via Scott Leberknight)

statestored

Cluster membership

Acts as a cluster monitor

Not a SPOF (single point of failure)

Page 35: Cloudera Impala Overview (via Scott Leberknight)

Metadata

Impala uses Hive metastore

Daemons cache metadata

REFRESH when table definition/data change

Create tables in Hive or Impala
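
Why a REFRESH is needed follows from the caching: a daemon that has cached a table definition keeps serving the old copy until it is told to reload. The Python sketch below is a toy model of that behavior, assuming a metastore represented as a plain dictionary; the `MetadataCache` class and its methods are hypothetical and are not Impala's API.

```python
class MetadataCache:
    """Toy model of a daemon caching table metadata from a metastore."""

    def __init__(self, metastore):
        self._metastore = metastore   # hypothetical dict: table name -> column list
        self._cache = {}

    def get_columns(self, table):
        # Serve from the cache when possible; the cached copy may be stale.
        if table not in self._cache:
            self._cache[table] = list(self._metastore[table])
        return self._cache[table]

    def refresh(self, table=None):
        """Drop cached entries so the next lookup rereads the metastore."""
        if table is None:
            self._cache.clear()
        else:
            self._cache.pop(table, None)

metastore = {"customer": ["c_first_name", "c_last_name"]}
cache = MetadataCache(metastore)

cache.get_columns("customer")                       # first lookup fills the cache
metastore["customer"].append("c_email_address")     # table changed elsewhere, e.g. in Hive
stale = cache.get_columns("customer")               # still the old definition
cache.refresh("customer")
fresh = cache.get_columns("customer")               # reread after the refresh
```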

Page 36: Cloudera Impala Overview (via Scott Leberknight)

Next up - how queries work...

Page 37: Cloudera Impala Overview (via Scott Leberknight)

[Diagram: Client (SQL query), Statestore (cluster monitoring), and Hive Metastore (table/database metadata) connected to three impalad nodes, each containing a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase RegionServer]

Page 38: Cloudera Impala Overview (via Scott Leberknight)

Read directly from disk

Short-circuit reads

Bypass HDFS DataNode (avoids overhead of HDFS API)

Page 39: Cloudera Impala Overview (via Scott Leberknight)

[Diagram: impalad (Query Planner, Query Coordinator, Query Executor) alongside an HBase RegionServer and HDFS DataNode, reading directly from disk on the local filesystem]

Page 40: Cloudera Impala Overview (via Scott Leberknight)
Page 41: Cloudera Impala Overview (via Scott Leberknight)

Current Limitations (as of version 1.0.1)

No join order optimization ("put larger table on left")

No custom file formats, SerDes or UDFs

Limit required when using ORDER BY

Joins limited by aggregate memory of cluster
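
The memory limit on joins and the "put larger table on left" advice are easier to see with a small hash-join sketch. The Python below assumes the engine builds an in-memory hash table from the right-hand table and streams the left-hand table past it, which is one common way to run an equi-join; it is not Impala's actual code, and the `hash_join` name, the `c_name` column, and the sample rows are invented for this example.

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    """Equi-join: build a hash table on the right input, stream the left."""
    build = defaultdict(list)
    for right in right_rows:              # right side is held fully in memory
        build[right[right_key]].append(right)
    for left in left_rows:                # left side is read one row at a time
        for right in build.get(left[left_key], []):
            yield {**left, **right}

# Hypothetical sample rows; the larger input goes on the left
customers = [
    {"c_name": "Smith", "c_current_addr_sk": 1},
    {"c_name": "Jones", "c_current_addr_sk": 2},
    {"c_name": "Adams", "c_current_addr_sk": 1},
]
addresses = [{"ca_address_sk": 1, "ca_city": "Reston"}]

joined = list(hash_join(customers, addresses,
                        "c_current_addr_sk", "ca_address_sk"))
```

Under this model only the right-hand input has to fit in memory, so without a planner that reorders joins the query author has to place the smaller table on the right by hand.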

Page 42: Cloudera Impala Overview (via Scott Leberknight)

Current Limitations (as of version 1.0.1)

No advanced data structures (arrays, maps, json, etc.)

Only basic DDL (otherwise do in Hive)

Limited file formats and compression (though probably fine for most people)

Page 43: Cloudera Impala Overview (via Scott Leberknight)

Future...

Structure types (structs, arrays, maps, json, etc.)

DDL support

Additional file formats & compression support

"Performance"

Join optimization (e.g. cost-based)

UDFs (???)

YARN integration

Fault-tolerance (???)

Page 44: Cloudera Impala Overview (via Scott Leberknight)
Page 45: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Dremel

"Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google."

- http://research.google.com/pubs/pub36632.html

Page 46: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Dremel

Impala = Dremel features circa 2010 + join support, assuming columnar data format

(but, Google doesn't stand still...)

Dremel is production, mature

Basis for Google's BigQuery

Page 47: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Hive

Hive uses Map/Reduce -> high latency

Impala is in-memory, low-latency query engine

Impala sacrifices fault tolerance for performance

Page 48: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Drill

Apache Drill

Based on Dremel

In early stages...

Page 49: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Drill

"Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an IaaS service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache."

- http://incubator.apache.org/drill/drill_overview.html

Page 50: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Stinger

"The Stinger Initiative is a collection of development threads in the Hive community that will deliver 100X performance improvements as well as SQL compatibility."

- http://hortonworks.com/stinger/

Page 51: Cloudera Impala Overview (via Scott Leberknight)

Comparing Impala to Stinger

Stinger

Improve Hive performance (e.g. optimize execution plan)

Support for analytics (e.g. OVER clause, window functions)

TEZ framework to optimize execution

Columnar file format

http://hortonworks.com/stinger/

Page 52: Cloudera Impala Overview (via Scott Leberknight)

Stinger Phase 1 performance...

(Stinger phase 1 is really just Hive 0.11)

Page 53: Cloudera Impala Overview (via Scott Leberknight)

remember, these numbers are non-scientific micro-benchmarks!

Page 54: Cloudera Impala Overview (via Scott Leberknight)

Same 9 queries (as w/ Impala), run in Hortonworks Sandbox VM

Hardware (same as w/ Impala)

MacBook Pro Retina, mid 2012

16GB RAM, 4GB for VM (VMware 5)

Intel i7 2.6GHz quad-core processor

No other load on system during queries

Hortonworks Data Platform (HDP) 1.3

Running pseudo-cluster

Page 55: Cloudera Impala Overview (via Scott Leberknight)

Query     Hive (sec)   # M/R jobs   Stinger Phase 1 (sec)   # M/R jobs   x Hive perf.
A         13.8         1            10.0                    1            1.4
B         30.0         1            15.8                    1            1.9
C         33.3         1            14.1                    1            2.4
D         23.2         1            18.7                    1            1.2
E         21.6         1            19.7                    1            1.1
F         59.1         2            34.3                    1            1.7
G         78.5         3            35.2                    1            2.2
H         59.6         2            31.5                    1            1.9
TPC-DS    204.5        6            37.2                    1            5.5

(remember, unscientific...)

Page 56: Cloudera Impala Overview (via Scott Leberknight)

Query     Stinger Phase 1 (sec)   Impala (sec)   x Stinger perf.
A         10.0                    0.25           39
B         15.8                    0.41           38
C         14.1                    0.42           33
D         18.7                    0.64           29
E         19.7                    0.62           32
F         34.3                    1.96           18
G         35.2                    1.56           23
H         31.5                    1.89           17
TPC-DS    37.2                    3.23           12

(remember, unscientific...)

Page 57: Cloudera Impala Overview (via Scott Leberknight)

Impala Review

In-memory, distributed SQL query engine

Integrates into existing HDFS

Not Map/Reduce

Focus on performance

(native code)

Competition...

Interactive data analysis

Page 58: Cloudera Impala Overview (via Scott Leberknight)

References

Google Dremel - http://research.google.com/pubs/pub36632.html

Apache Drill - http://incubator.apache.org/drill/

TPC-DS dataset - http://www.tpc.org/tpcds/

Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/ http://hortonworks.com/stinger/

Cloudera Impala resources - http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html

Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Page 59: Cloudera Impala Overview (via Scott Leberknight)

Photo Attributions

Impala - http://www.flickr.com/photos/gerardstolk/5897570970/

Measuring tape - http://www.morguefile.com/archive/display/24850

Bridge frame - http://www.morguefile.com/archive/display/9699

Balance - http://www.morguefile.com/archive/display/93433

* All others are iStockPhoto (I paid for them...)

Page 60: Cloudera Impala Overview (via Scott Leberknight)

My Info

twitter.com/sleberknight www.sleberknight.com/blog

scott dot leberknight at gmail dot com