Performance Evaluation of Cloudera Impala 1.0
May 1, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA (@sudabon)
Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/
Nov 18, 2014
Cloudera Impala GA was released!!
• Support for a subset of ANSI-92 SQL
  - CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries
• Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
• Support for a variety of data formats:
  - Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed)
  - Text (uncompressed or LZO-compressed)
  - Parquet (Snappy or uncompressed)
• Support for all CDH4 64-bit packages:
  - RHEL 6.2/5.7, Ubuntu, Debian, SLES
• Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
• Kerberos authentication and MR/Impala resource isolation
• etc.
Our System Environment
• Installed using Cloudera Manager Free Edition 4.5.2
• Master: 3 servers
  - Active NameNode
  - Stand-by NameNode
  - JobTracker, statestored
• Slave: 11 servers
  - Each runs DataNode, TaskTracker, and impalad
• All servers are connected with 1 Gbps Ethernet through an L2 switch
Our "wimpy" Server Specification
• CPU
  - Intel Core 2 Duo 2.13 GHz with Hyper-Threading
• Memory
  - 4 GB
• Disk
  - One 7,200 rpm SATA mechanical hard disk drive
• OS
  - CentOS 6.2
Benchmark
• Use CDH4.2.1 + Impala version 1.0
• Use hivebench from the open-source benchmark tool "HiBench"
  - https://github.com/hibench
• Modified the datasets to 1/10 scale
  - The default configuration generates a table with 1 billion rows
• Modified the query
  - Deleted "INSERT INTO TABLE …" to evaluate read-only performance
• Combine several storage formats with several compression methods
  - TextFile, SequenceFile, RCFile, ParquetFile
  - No compression, Gzip, Snappy
• Compare configurations by job (query) latency
  - Average job latency over 5 measurements
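The measurement procedure on this slide (average job latency over 5 measurements) can be sketched in Python as follows. This is only an illustrative harness, not the tooling used for these slides: `average_latency` and `run_query` are hypothetical names, and `run_query` stands for whatever actually submits the query and waits for the result.

```python
import time
from statistics import mean


def average_latency(run_query, runs=5):
    """Call run_query() `runs` times; return (mean latency, per-run latencies) in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical callable: submits the query and blocks until it finishes
        latencies.append(time.perf_counter() - start)
    return mean(latencies), latencies


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a cluster.
    avg, samples = average_latency(lambda: time.sleep(0.01))
    print(f"avg latency: {avg:.3f} sec over {len(samples)} runs")
```

Keeping the per-run samples alongside the mean makes it easy to spot an outlier run, such as a first run that reads from a cold disk cache.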
Modified Datasets
• Uservisits table
  - 100 million rows
  - 16,895 MB as TextFile
  - Table definition
    - sourceIP string
    - destURL string
    - visitDate string
    - adRevenue double
    - userAgent string
    - countryCode string
    - languageCode string
    - searchWord string
    - duration int
• Rankings table
  - 12 million rows
  - 744 MB as TextFile
  - Table definition
    - pageURL string
    - pageRank int
    - avgDuration int
Modified Query
SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM rankings_t R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits_t UV
    WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
       AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
Benchmark Result (Hive)
Cited from "Performance evaluation of Cloudera impala 0.6 beta..."

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      235.843
SequenceFile     Gzip          227.883
SequenceFile     Snappy        213.616
RCFile           Gzip          234.289
RCFile           Snappy        197.894
Benchmark Result (Impala)

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      36.61
SequenceFile     Gzip          29.736
SequenceFile     Snappy        24.024
RCFile           Gzip          26.083
RCFile           Snappy        19.586
ParquetFile      Snappy        16.2
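To put the Hive and Impala results side by side, the following Python sketch computes the Hive-to-Impala latency ratio for each configuration measured on both engines. The numbers are copied from the two result slides; pairing each value with a format and compression label in the order they appear on the charts is an assumption, and `speedups` is a hypothetical helper name.

```python
# Average job latencies in seconds, copied from the Hive and Impala result slides.
# Assumption: values pair with the format/compression labels in chart order.
HIVE = {
    ("TextFile", "No Comp."): 235.843,
    ("SequenceFile", "Gzip"): 227.883,
    ("SequenceFile", "Snappy"): 213.616,
    ("RCFile", "Gzip"): 234.289,
    ("RCFile", "Snappy"): 197.894,
}
IMPALA = {
    ("TextFile", "No Comp."): 36.61,
    ("SequenceFile", "Gzip"): 29.736,
    ("SequenceFile", "Snappy"): 24.024,
    ("RCFile", "Gzip"): 26.083,
    ("RCFile", "Snappy"): 19.586,
    ("ParquetFile", "Snappy"): 16.2,  # no Hive counterpart on the slides
}


def speedups(hive, impala):
    """Return the Hive/Impala latency ratio for every configuration present in both."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


if __name__ == "__main__":
    for (fmt, comp), ratio in speedups(HIVE, IMPALA).items():
        print(f"{fmt:<13} {comp:<9} {ratio:5.1f}x")
```

Under that pairing assumption, Impala comes out roughly 6 to 10 times faster than Hive on the configurations both engines were measured with. The Parquet result has no Hive counterpart, so it is left out of the ratios.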
Additional Experiments
• Exchange the order of the JOINed tables as shown below
SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits_ps UV
    WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
       AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
JOIN rankings_ps R
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
• Result
  - Parquet compressed with Snappy: 34.374 sec
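The effect of the join order can be quantified with a short Python sketch. It assumes that the 16.2 sec Parquet + Snappy figure from the Impala result slide is the baseline for the same data with the original table order; `slowdown` is a hypothetical helper name.

```python
# Latencies in seconds for Parquet + Snappy, copied from the slides.
ORIGINAL_ORDER_SEC = 16.2    # rankings JOIN (filtered uservisits); assumed baseline
SWAPPED_ORDER_SEC = 34.374   # (filtered uservisits) JOIN rankings


def slowdown(baseline_sec, other_sec):
    """Return how many times slower `other_sec` is than `baseline_sec`."""
    return other_sec / baseline_sec


if __name__ == "__main__":
    factor = slowdown(ORIGINAL_ORDER_SEC, SWAPPED_ORDER_SEC)
    print(f"swapped join order is {factor:.2f}x slower")
```

Under that assumption, exchanging the two sides of the JOIN roughly doubles the latency of the same query on the same data.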
Conclusion
• Parquet + Snappy is the fastest
• Specifically,
  - ParquetFile compressed with Snappy: 16.2 sec
• Need to take care with the order of JOINed tables
• Hopes for future extensions
  - UDF support
  - Window functions
  - etc.
Let's try it out in your environment!!
Thanks!