Performance Evaluation of Cloudera Impala 1.0. May 1, 2013. CELLANT Corp. R&D Strategy Division. Yukinori SUDA (@sudabon). Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

Performance Evaluation of Cloudera Impala GA

Nov 18, 2014

Yukinori Suda

Transcript
Page 1: Performance Evaluation of Cloudera Impala GA

Performance Evaluation of Cloudera Impala 1.0

May 1, 2013
CELLANT Corp. R&D Strategy Division

Yukinori SUDA (@sudabon)

Page 2: Performance Evaluation of Cloudera Impala GA

Cloudera Impala GA was released!!

• Support for a subset of ANSI-92 SQL
  – CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries
• Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
• Support for a variety of data formats:
  – Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed)
  – Text (uncompressed or LZO-compressed)
  – Parquet (Snappy or uncompressed)
• Support for all CDH4 64-bit packages:
  – RHEL 6.2/5.7, Ubuntu, Debian, SLES
• Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
• Kerberos authentication and MR/Impala resource isolation
• etc.

Page 3: Performance Evaluation of Cloudera Impala GA

Our System Environment

• Installed using Cloudera Manager Free Edition 4.5.2
• All servers are connected with 1 Gbps Ethernet through an L2 switch
• Master: 3 servers
  – Active NameNode
  – Stand-by NameNode
  – JobTracker, statestored
• Slave: 11 servers
  – Each runs DataNode, TaskTracker, and impalad

Page 4: Performance Evaluation of Cloudera Impala GA

Our “wimpy” Server Specification

• CPU
  – Intel Core 2 Duo 2.13 GHz with Hyper-Threading
• Memory
  – 4 GB
• Disk
  – 7,200 rpm SATA mechanical hard disk drive × 1
• OS
  – CentOS 6.2

Page 5: Performance Evaluation of Cloudera Impala GA

Benchmark

• Use CDH4.2.1 + Impala version 1.0
• Use hivebench from the open-source benchmark tool “HiBench”
  – https://github.com/hibench
• Modified datasets to 1/10 scale
  – Default configuration generates a table with 1 billion rows
• Modified the query statement
  – Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combined several storage formats with several compression methods
  – TextFile, SequenceFile, RCFile, ParquetFile
  – No compression, Gzip, Snappy
• Comparison by query job latency
• Average job latency over 5 measurements
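The "average job latency over 5 measurements" step can be sketched in Python as below. This is only an illustration, not the harness used for these slides: `measure_latencies`, `average_latency`, and the `run_query` callable are hypothetical names, and how a query is actually submitted to Hive or Impala is left to the caller.

```python
import time
from statistics import mean
from typing import Callable, List


def measure_latencies(run_query: Callable[[], None], runs: int = 5) -> List[float]:
    """Call `run_query` `runs` times and return each wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical callable; must block until the query has finished
        latencies.append(time.perf_counter() - start)
    return latencies


def average_latency(run_query: Callable[[], None], runs: int = 5) -> float:
    """Average job latency over `runs` measurements (5 in these slides)."""
    return mean(measure_latencies(run_query, runs))
```

Each run is timed on its own before averaging, so the individual measurements stay available if one of them turns out to be an outlier.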

Page 6: Performance Evaluation of Cloudera Impala GA

Modified Datasets

• Uservisits table
  – 100 million rows
  – 16,895 MB as TextFile
  – Table Definitions
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int

• Rankings table
  – 12 million rows
  – 744 MB as TextFile
  – Table Definitions
    • pageURL string
    • pageRank int
    • avgDuration int

Page 7: Performance Evaluation of Cloudera Impala GA

Modified Query

SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM rankings_t R
JOIN (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_t UV
  WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
    AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
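The WHERE clause of the query keeps visits dated from 1999-01-01 through 2000-01-01, both ends included. A small Python sketch of the same predicate follows. It assumes `datediff(a, b)` returns `a` minus `b` in days, as in Hive, and that `visitDate` has already been parsed into a date; the function names are hypothetical.

```python
from datetime import date


def datediff(end: date, start: date) -> int:
    """Days from `start` to `end`, mirroring the argument order used in the query."""
    return (end - start).days


def in_query_window(visit_date: date) -> bool:
    """True when a row would pass the WHERE clause of the modified query."""
    return (datediff(visit_date, date(1999, 1, 1)) >= 0
            and datediff(visit_date, date(2000, 1, 1)) <= 0)
```

Under these assumptions a visit on 2000-01-01 still passes the filter, while one on 2000-01-02 does not.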

Page 8: Performance Evaluation of Cloudera Impala GA

Benchmark Result (Hive)
cited from “Performance evaluation of Cloudera impala 0.6 beta...”

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      235.843
SequenceFile     Gzip          227.883
SequenceFile     Snappy        213.616
RCFile           Gzip          234.289
RCFile           Snappy        197.894

Page 9: Performance Evaluation of Cloudera Impala GA

Benchmark Result (Impala)

Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      36.61
SequenceFile     Gzip          29.736
SequenceFile     Snappy        24.024
RCFile           Gzip          26.083
RCFile           Snappy        19.586
ParquetFile      Snappy        16.2
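To compare the Hive and Impala results directly, the latency ratio per configuration can be computed as in the Python sketch below. The numbers are copied from the two result slides. The pairing of each value with a storage format and compression method is an assumption based on the order of the chart labels, and `speedups` is a hypothetical helper name.

```python
# Average job latency in seconds, copied from the two result slides.
# Assumption: each value belongs to the (format, compression) pair found
# at the same position in the chart labels.
HIVE_SEC = {
    ("TextFile", "No Comp."): 235.843,
    ("SequenceFile", "Gzip"): 227.883,
    ("SequenceFile", "Snappy"): 213.616,
    ("RCFile", "Gzip"): 234.289,
    ("RCFile", "Snappy"): 197.894,
}
IMPALA_SEC = {
    ("TextFile", "No Comp."): 36.61,
    ("SequenceFile", "Gzip"): 29.736,
    ("SequenceFile", "Snappy"): 24.024,
    ("RCFile", "Gzip"): 26.083,
    ("RCFile", "Snappy"): 19.586,
    ("ParquetFile", "Snappy"): 16.2,
}


def speedups(hive, impala):
    """Return Hive/Impala latency ratios for configurations measured on both."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


if __name__ == "__main__":
    for (fmt, comp), ratio in speedups(HIVE_SEC, IMPALA_SEC).items():
        print(f"{fmt:<13}{comp:<9}{ratio:5.1f}x")
```

With these figures the ratios fall between roughly 6x and 10x. ParquetFile is left out of the comparison because the Hive slide reports no Parquet measurement.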

Page 10: Performance Evaluation of Cloudera Impala GA

Additional Experiments

• Exchange the order of the JOINed tables as below

SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_ps UV
  WHERE (datediff(UV.visitDate, '1999-01-01') >= 0
    AND datediff(UV.visitDate, '2000-01-01') <= 0)
) NUV
JOIN rankings_ps R
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;

• Result
  – Parquet compressed with Snappy: 34.374 sec
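The cost of the swapped join order can be put as a simple ratio, sketched in Python below. It assumes the 16.2 sec Impala result for Parquet with Snappy was measured with the original join order on the same tables; the constant and function names are hypothetical.

```python
# Latencies in seconds for Parquet + Snappy, as reported in the slides.
ORIGINAL_ORDER_SEC = 16.2    # rankings table on the left side of the JOIN (assumed)
SWAPPED_ORDER_SEC = 34.374   # uservisits subquery on the left side of the JOIN


def slowdown(swapped: float, original: float) -> float:
    """How many times slower the swapped join order is than the original one."""
    return swapped / original


if __name__ == "__main__":
    print(f"{slowdown(SWAPPED_ORDER_SEC, ORIGINAL_ORDER_SEC):.2f}x slower")
```

Under that assumption, swapping the tables roughly doubles the latency.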

Page 11: Performance Evaluation of Cloudera Impala GA

Conclusion

• Parquet + Snappy is the fastest
• Specifically,
  – ParquetFile compressed with Snappy: 16.2 sec
• Need to take care with the order of JOINed tables
• Hopes for future extensions
  – UDF support
  – Window functions
  – etc.

Page 12: Performance Evaluation of Cloudera Impala GA

Let's try it out on your environment!!
Thanks!