Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/
Evaluation of Cloudera Impala 1.1
Aug 7, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA (@sudabon)
Jan 15, 2015
Cloudera Impala 1.1 was released!
• Sentry support
  – Fine-grained authorization
  – Role-based authorization
• Support for views
• Performance improvements
  – Parquet columnar performance
  – More efficient metadata refresh for larger installations
• Additional SQL
  – SQL-89 joins (in addition to existing SQL-92)
  – LOAD function
  – REFRESH command for JDBC/ODBC
• Improved HBase support
  – Binary types
  – Caching configuration
• Fixed many bugs
Check compatibility of “VIEW”
• Hive ⇒ Impala
  – In the Impala shell, can we read data from a “VIEW” that was created with a Hive command?
• Impala ⇒ Hive
  – In the Hive shell, can we read data from a “VIEW” that was created with an Impala command?
• Result
  – The two kinds of “VIEW” are compatible with each other
Check performance (Hive on Cluster1)
Avg. Job Latency [sec]
TextFile      No Comp.  222.039
SequenceFile  Gzip      244.67
SequenceFile  Snappy    239.182
RCFile        Gzip      228.801
RCFile        Snappy    230.327
Note: This result is not valid as a performance evaluation because some data may have been read remotely. See the slide “Check performance (Hive on Cluster2)”.
Check performance (Impala on Cluster1)
Avg. Job Latency [sec]
TextFile      No Comp.  23.518
SequenceFile  Gzip      32.155
SequenceFile  Snappy    28.617
RCFile        Gzip      20.774
RCFile        Snappy    12.654
ParquetFile   Snappy    13.146
Note: This result is not valid as a performance evaluation because some data may have been read remotely. See the slide “Check performance (Impala on Cluster2)”.
Check performance (Hive on Cluster2)
Avg. Job Latency [sec]
TextFile      No Comp.  272.176
SequenceFile  Gzip      249.531
SequenceFile  Snappy    245.009
RCFile        Gzip      230.034
RCFile        Snappy    216.802
Check performance (Impala on Cluster2)
Avg. Job Latency [sec]
TextFile      No Comp.  32.528
SequenceFile  Gzip      28.73
SequenceFile  Snappy    21.173
RCFile        Gzip      24.794
RCFile        Snappy    14.308
ParquetFile   Snappy    19.814
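The Cluster2 results for Hive and Impala can be compared per storage format as a simple latency ratio. The sketch below is only illustrative: the dictionary and function names are hypothetical, and it assumes that each latency value in the charts belongs to the format/compression label at the same position.

```python
# Average job latency [sec] on Cluster2, copied from the Hive and Impala charts.
# ASSUMPTION: each value is paired with the label at the same position in the chart.
hive_cluster2 = {
    ("TextFile", "No Comp."): 272.176,
    ("SequenceFile", "Gzip"): 249.531,
    ("SequenceFile", "Snappy"): 245.009,
    ("RCFile", "Gzip"): 230.034,
    ("RCFile", "Snappy"): 216.802,
}
impala_cluster2 = {
    ("TextFile", "No Comp."): 32.528,
    ("SequenceFile", "Gzip"): 28.73,
    ("SequenceFile", "Snappy"): 21.173,
    ("RCFile", "Gzip"): 24.794,
    ("RCFile", "Snappy"): 14.308,
    ("ParquetFile", "Snappy"): 19.814,
}


def speedups(hive, impala):
    """Return Hive latency / Impala latency for every combination measured on both."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


if __name__ == "__main__":
    for (fmt, comp), ratio in speedups(hive_cluster2, impala_cluster2).items():
        print(f"{fmt:<12} {comp:<8} {ratio:5.1f}x")
```

Under that pairing assumption, Impala comes out roughly 8x to 15x faster than Hive on Cluster2. ParquetFile is skipped because it was measured on Impala only.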
Check fixed bug
• IMPALA-357
  – Insert into Parquet exceeds mem-limit
• Problem
  – Even when mem_limit is set, memory consumption is not limited when creating a partitioned ParquetFile table.
  – Eventually, impalad crashes due to memory shortage.
• Result
  – The CREATE command failed due to the memory limit, so mem_limit is now enforced.
Summary
• Thanks to the dev team, Impala is also going from “Good to Great”
• Both “VIEW” and “Parquet” are ready to use
• Performance
  – RCFile+Snappy is the fastest on both Cluster1 and Cluster2
  – With a larger table, Parquet+Snappy may be the fastest
• Hopes for future extensions
  – Support for structure types
  – Support for UDF/UDTF, etc.
Appendix. Benchmark Details
Our System Environment (Cluster1)
• Installed using Cloudera Manager Free Edition 4.6.0
• All servers are connected with 1Gbps Ethernet through an L2 switch
• Master: 3 servers
  – Active NameNode
  – Stand-by NameNode
  – JobTracker, statestored
• Slave: 14 servers
  – 10 servers: DataNode, TaskTracker, Impalad
  – 4 servers: DataNode only
Our System Environment (Cluster2)
• Installed using Cloudera Manager Free Edition 4.6.0
• All servers are connected with 1Gbps Ethernet through an L2 switch
• Master: 3 servers
  – Active NameNode
  – Stand-by NameNode
  – JobTracker, statestored
• Slave: 10 servers
  – 10 servers: DataNode, TaskTracker, Impalad
  – The 4 DataNode-only servers are decommissioned
Our Server Specification
• CPU
  – Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
  – 8 GB: NameNodes only
  – 4 GB: Others
• Disk
  – 7,200 rpm SATA mechanical hard disk drive × 1
• OS
  – CentOS 6.3
Benchmark
• Use CDH4.3.0 + Impala 1.1
• Use hivebench in the open-source benchmark tool “HiBench”
  – https://github.com/hibench
• Modified datasets to 1/10 scale
  – Default configuration generates a table with 1 billion rows
• Modified query statement
  – Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combine several storage formats with several compression methods
  – TextFile, SequenceFile, RCFile, ParquetFile
  – No compression, Gzip, Snappy
• Compare by query job latency
• Average job latency over 5 measurements
• Benchmark on both Cluster1 and Cluster2
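The reported metric is the average job latency over 5 measurements. A minimal sketch of such a measurement loop follows; `average_latency` and `run_query` are hypothetical names, and how the query is actually submitted to Hive or Impala is left to the caller, since the slides do not describe the harness.

```python
import statistics
import time


def average_latency(run_query, repetitions=5):
    """Call run_query() `repetitions` times and return the mean wall-clock latency in seconds."""
    latencies = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # hypothetical callable that submits the query and waits for the result
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


if __name__ == "__main__":
    # Stand-in workload; a real run would submit the benchmark query instead.
    print(f"{average_latency(lambda: time.sleep(0.01)):.3f} sec")
```

Timing the whole call from the client side matches the "job latency" wording, but it also includes client and connection overhead.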
Modified Datasets
• Uservisits table
  – 100 million rows
  – 16,895 MB as TextFile
  – Table definition
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int
• Rankings table
  – 12 million rows
  – 744 MB as TextFile
  – Table definition
    • pageURL string
    • pageRank int
    • avgDuration int
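As a rough sanity check on the dataset sizes, the average TextFile row width can be derived from the row counts and file sizes above. This is only a sketch: `bytes_per_row` is a hypothetical helper, and it assumes "MB" means 1024 × 1024 bytes, which the slides do not state.

```python
def bytes_per_row(size_mb, rows, bytes_per_mb=1024 * 1024):
    """Average row width in bytes. ASSUMPTION: 1 MB = 1024 * 1024 bytes."""
    return size_mb * bytes_per_mb / rows


# Figures taken from the Modified Datasets slide.
uservisits_row = bytes_per_row(16_895, 100_000_000)
rankings_row = bytes_per_row(744, 12_000_000)

if __name__ == "__main__":
    print(f"uservisits: {uservisits_row:.0f} bytes/row")
    print(f"rankings:   {rankings_row:.0f} bytes/row")
```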
Modified Query
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings_t R
JOIN [BROADCAST] (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_t UV
  WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
    AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;
Thanks!
I want to use TPC in the next evaluation…