Impala and BigQuery By David Gruzman BigDataCraft.com

Jun 03, 2018

Page 1: Impala and BigQuery (1)

8/12/2019 Impala and BigQuery (1)

http://slidepdf.com/reader/full/impala-and-bigquery-1 1/47

Impala and BigQuery

By David Gruzman

BigDataCraft.com


 Impala and BigQuery

by David Gruzman

► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.

► Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.


Today's agenda

► Overview of Dremel as a technology

► Overview of BigQuery

► A few words about Impala

► DG Mediamind use case

► Deeper insights into Impala

► Conclusions

► Q&A


Why Dremel?

► Google was the first to have MapReduce

► Google was the first to face MapReduce's main problem: latency. The problem also propagated to the engines built on top of MapReduce.

► It is logical that Google was the first to approach it by developing a real-time query capability for big data.


How Dremel is used at Google

► Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)

► An analyst can make many fast queries using Dremel

► After getting a good idea of what is needed, run a slow MapReduce job (or SQL based on MapReduce) to get precise results


Why Dremel is unique

► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.

► I mean that it is the only engine capable of producing results over terabytes of data in seconds!

► The main idea (my guess) is to harness a huge cluster of machines for a single query.


Dremel as technology

► Novel hierarchical columnar format.

► LLVM-based code generation.

► Distributed aggregation tree.

► In-situ data processing (inside the storage).


Dremel : Aggregation tree
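
The aggregation tree idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Dremel's actual implementation: each "leaf server" computes a partial COUNT(*) GROUP BY over its own shard, and intermediate nodes merge the partial results on the way up to the root. The function names (`leaf_aggregate`, `merge`, `run_tree`), the `fanout` parameter, and the sample shards are all hypothetical.

```python
from collections import Counter


def leaf_aggregate(rows):
    """Leaf server: partial COUNT(*) GROUP BY key over its local shard."""
    partial = Counter()
    for key in rows:
        partial[key] += 1
    return partial


def merge(partials):
    """Intermediate or root server: combine partial results from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run_tree(shards, fanout=2):
    """Aggregate bottom-up: leaves first, then merge in groups of `fanout`."""
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


# Hypothetical data: four shards, each held by a different leaf server.
shards = [["a", "b", "a"], ["b", "c"], ["a"], ["c", "c"]]
result = run_tree(shards)
print(dict(result))
```

Because every node only forwards a small partial result, the root never sees the raw rows. That is what lets a single query use a large number of machines at once.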


Dremel : Nested columnar format
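
A very simplified sketch of column striping for nested records is shown below. It is an assumption-laden illustration: the real format described in the Dremel paper also stores repetition and definition levels so that records can be reassembled, and this sketch leaves them out. The `stripe` function and the sample records are hypothetical.

```python
def stripe(records):
    """Split nested records into one value list per leaf field path."""
    columns = {}

    def walk(value, path):
        if isinstance(value, dict):
            for name, child in value.items():
                walk(child, path + (name,))
        elif isinstance(value, list):
            # Repeated field: every element goes to the same column.
            for item in value:
                walk(item, path)
        else:
            columns.setdefault(".".join(path), []).append(value)

    for record in records:
        walk(record, ())
    return columns


# Hypothetical nested records.
records = [
    {"id": 1, "name": {"lang": [{"code": "en"}, {"code": "fr"}]}},
    {"id": 2, "name": {"lang": [{"code": "he"}]}},
]
columns = stripe(records)

# A query over language codes reads only this one column.
print(columns["name.lang.code"])
```

The point of the layout is that a query touching one nested field scans only that field's column instead of whole records.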


BigQuery

► A service built by Google on top of the Dremel engine

► The only query engine as a service (known to me) that works with big data.

► Query time does not depend on data size


BigQuery main capabilities

► Aggregations

► Join of big table to small table.

► Join of two big tables (recently added)

► Hierarchical data format. It makes pre-aggregations cheaper.


Main limitations

► Small result size

► Intermediate results should not exceed memory size.

► No “external tables” 


Why BigQuery is not popular


So, why is BigQuery not popular?

► Data is not created in Google's cloud. It is hard and impractical to move big data. It is heavy, after all.

► Google tends to change its APIs. BigQuery has also changed over the last years. It is hard to build a business on it.

► Many companies in Internet-related businesses are wary of sharing data with Google.

► It is expensive. $35 per TB can lead to bills of thousands of dollars per day.


Dremel


At the same time, it is technically good

► I got references from a company doing serious testing

► Martin Fowler's company also tested it and gave very good feedback.


Question to all of you

Why did your organization decide not to use Google's BigQuery?


Where we can find Impala


What is Impala

► Massively parallel processing (MPP) database engine, developed by Cloudera.

► Integrated into the Hadoop stack on the same level as MapReduce, and not above it (as Hive and Pig are)

[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]


Why Impala

► Data has gravity

► Today a lot of data lives in HDFS

► It is not practical to move big data

► It is practical to bring the engine to the data

► At the same time, MapReduce is not a must

Impala processes data in the Hadoop cluster without using MapReduce


MapReduce bypass

► Several other modern database engines have also realized the opportunity to bypass MapReduce and work directly with HDFS.

► They take various approaches.


MapReduce Bypass

► Existing MPP databases, like Greenplum, store their external tables in HDFS


MapReduce bypass

► JethroData stores data in its own format on HDFS and also works with it without the MR layer.

► They have a proprietary format which enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.


Use Case from DG

I think it will be a typical case in the future

► DG is using Hadoop and Hive

► Evaluating Impala to do some of the work more efficiently.

► After their case presentation we will come back to discuss deeper insights into Impala


Again: Impala has a different place than Pig and Hive

[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]


Impala architecture


Impala – Dremel traces

► LLVM code generation

► It is really fast

► C++ as implementation language (not Java...)

► Simple query engine. It actually does only the things that can be done in memory.

► Broadcast join algorithm is implemented
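
A broadcast join can be sketched as follows. This is a minimal single-process illustration, not Impala's code: the small table is turned into a hash table once (in a cluster, a full copy would be sent to every node), and each partition of the big table is then joined locally against it. The function name, table names, and columns are hypothetical.

```python
def broadcast_join(big_partitions, small_table, key):
    """Inner join: each big-table partition against a full copy of the small table."""
    # Build the hash table once; in a cluster this copy is broadcast to every node.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:  # each partition would live on its own node
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical data: a small dimension table and a partitioned fact table.
campaigns = [
    {"campaign_id": 1, "name": "spring"},
    {"campaign_id": 2, "name": "summer"},
]
click_partitions = [
    [{"campaign_id": 1, "clicks": 10}, {"campaign_id": 3, "clicks": 4}],
    [{"campaign_id": 2, "clicks": 7}],
]
joined = broadcast_join(click_partitions, campaigns, "campaign_id")
print(joined)
```

The big table never moves between nodes, which is why this works well for big-to-small joins and only while the small table fits in memory.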


LLVM code generation

► Assume you want to write custom code for a specific query. It will be super efficient

► Code generation automates this process for each query

► We actually need to super-optimize the inner loop doing filtering (WHERE) and GROUP BY.

► LLVM enables us to compile into native code in a fraction of a second

► LLVM enables us to use new CPU capabilities like SSE in a portable way.
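
The idea of per-query code generation can be shown with a small analogy in Python. This sketch does not use LLVM and does not produce native code: it generates Python source for a loop specialized to one query (column names and the filter constant are baked into the code), then compiles it at runtime. The function `generate_query` and the column names are hypothetical.

```python
def generate_query(filter_col, threshold, group_col, sum_col):
    """Emit source for a loop specialized to one query, then compile it."""
    source = f"""
def run(rows):
    totals = {{}}
    for row in rows:
        if row[{filter_col!r}] > {threshold!r}:
            key = row[{group_col!r}]
            totals[key] = totals.get(key, 0) + row[{sum_col!r}]
    return totals
"""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["run"]


# Roughly: SELECT flag, SUM(price) FROM rows WHERE qty > 3 GROUP BY flag
run = generate_query("qty", 3, "flag", "price")

rows = [
    {"flag": "A", "qty": 5, "price": 10.0},
    {"flag": "B", "qty": 1, "price": 7.0},
    {"flag": "A", "qty": 9, "price": 2.5},
    {"flag": "B", "qty": 4, "price": 1.0},
]
totals = run(rows)
print(totals)
```

The generated loop contains no interpretation of a query plan at run time: the filter and the grouping are fixed in the code itself. A real engine gets the large speedup by emitting native code for that inner loop.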


Why is code generation interesting?

► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.

► I had cases where the use of such technology was game changing


Impala – Hive Traces

► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.

► Impala shares the metastore with Hive, which enables very simple adoption

► Internally Impala has a well-defined way to add new formats


Impala vs MPP

► It usually takes many years to create an MPP database.

► There are serious simplifications:

► The data is read only

► There is actually no DBMS, only a query engine.

► No serious resource management, only measurement (all over the code).


Impala – Hive killer?

► Not so fast.

► Hive does things Impala cannot do yet, like joins between several big tables.

► Hive has convenient Java UDFs, while Impala does not

► Impala does not have intra-query fault tolerance (a failure during a query fails the whole query).

► At the same time, MapReduce is not a good framework for a database engine


Impala – Data Formats

► There are scanners for the following types:

► RCFile

► Parquet (columnar format based on the Dremel paper)

► CSV

► Avro

► SequenceFile


Impala – future

► Will get closer to other MPP engines

► Support more formats

► More advanced scheduling and resource management


Basic benchmark

► TPC-H, Q1, SF=10

► 4 EC2 large instances

► 4 seconds, while Hive takes about 1 minute.

► This number means a GROUP BY speed of about 235 MB/sec per core.


Impala price per GB

► 1 large instance costs $0.24 per hour

► The cluster costs $0.96 per hour.

► Cost of 1 second: $0.96 / 3600

► Such a cluster processes 1.75 GB per second

► So the cost of processing 1 TB is about $0.15

► It is roughly 230 times cheaper than BigQuery at $35 per TB
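
The arithmetic on this slide can be reproduced with a few lines of Python. All inputs come from the slides except the decimal terabyte (1000 GB), which is an assumption; the variable names are illustrative only.

```python
INSTANCE_PRICE_PER_HOUR = 0.24   # USD per large instance, from the slide
INSTANCES = 4                    # cluster size from the benchmark slide
CLUSTER_THROUGHPUT_GB_S = 1.75   # GB processed per second, from the slide
BIGQUERY_PRICE_PER_TB = 35.0     # USD, from the BigQuery slide
GB_PER_TB = 1000                 # assumption: decimal terabyte

cluster_per_hour = INSTANCE_PRICE_PER_HOUR * INSTANCES
cluster_per_second = cluster_per_hour / 3600
seconds_per_tb = GB_PER_TB / CLUSTER_THROUGHPUT_GB_S
cost_per_tb = cluster_per_second * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / cost_per_tb

print(f"cluster cost:   ${cluster_per_hour:.2f} per hour")
print(f"time for 1 TB:  {seconds_per_tb:.0f} seconds")
print(f"cost for 1 TB:  ${cost_per_tb:.2f}")
print(f"BigQuery / Impala price ratio: {ratio:.0f}x")
```

With these inputs the cost per TB comes out at about $0.15 and the price ratio at a little over 200. The estimate covers only instance hours for the query itself and ignores storage, idle time, and administration.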


What about clouds?


Impala in cloud is not elastic

► To be elastic we need to create a cluster when we need it.

► Even if we accept per-hour resolution, storage will be a problem

► S3 will not give us hundreds of MB per second per instance

► Data stored in the local file system is transient


Impala - conclusions

► It is the first time I can remember that we can get our hands on a free MPP database.

► There is no risk in trying it side by side with Hive

► It is possible to offload part of the work to Impala and do the rest with Hive

► It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager


Materials used

► Benchmarks

http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323

https://amplab.cs.berkeley.edu/benchmark/

► Architecture

http://www.slideshare.net/scottleber/impala-19176906

https://cloud.google.com/files/BigQueryTechnicalWP.pdf


Material used - comparisons

► To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive

► To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica

► To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel


Thank you!!!

► Special thanks to

► Faina Kamenetsky, who helped set up the clusters on Amazon.


BigDataCraft.com

► We are a boutique consulting company

► Our services are:

► On paper POC

► On hardware POC

► Architecture / Design reviews

► Custom integrations and bug fixing


Impala - Flow