Performance Evaluation of Enterprise Big Data Platforms with HiBench Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle and Roberto V. Zicari Frankfurt Big Data Lab Goethe University Frankfurt am Main, Germany http://www.bigdata.uni-frankfurt.de/ 9th IEEE International Conference On Big Data Science and Engineering 2015 August 20 – 22, Helsinki, Finland

BDSE 2015 Evaluation of Big Data Platforms with HiBench

Feb 12, 2017


t_ivanov
Transcript
Page 1: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Performance Evaluation of Enterprise Big Data Platforms with HiBench

Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle and Roberto V. Zicari

Frankfurt Big Data Lab

Goethe University Frankfurt am Main, Germany

http://www.bigdata.uni-frankfurt.de/

9th IEEE International Conference On Big Data Science and Engineering 2015

August 20 – 22, Helsinki, Finland

Page 2: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Agenda

Research Objectives

Experimental Methodology

Hardware & Software

Enterprise Big Data Platforms

HiBench

Evaluation

WordCount (CPU-bound)

Enhanced DFSIO (I/O-bound)

HiveBench (mixed)

Lessons Learned

Ongoing Work


Page 3: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Research Objectives

• Question: What type of Big Data applications (CPU-bound, I/O-bound and mixed) are more suitable for each platform, depending on the data size?

• Our Approach: Evaluate and compare the Big Data systems using a widely used Big Data benchmark.

• Suitable Big Data benchmark: Intel HiBench benchmark suite

– Different from the Yahoo! Cloud Serving Benchmark (OLTP workload)


Page 4: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Experimental Methodology

[Figure: benchmarking process in four phases]

Phase 1: Setup

Phase 2: Workload Prepare

Phase 3: Workload Execution

Phase 4: Evaluation

Diagram labels: HiBench (WordCount, Enhanced DFSIO, HiveBench); Setup Workload Parameters; Hardware (8 Node Cluster) + Software; Hardware & Software Setup; DataStax Enterprise (Cassandra)/Cloudera Hadoop Distribution; Data Generation; Workload Data; Results (Time & Throughput); Result Analysis; 3 Test Runs per Workload

Legend: Next Step; Data Output/Input; Workload Parameters

[1] T. Ivanov, R. Niemann, S. Izberovic, M. Rosselli, K. Tolle, and R. V. Zicari, “Benchmarking DataStax Enterprise/Cassandra with HiBench,” arXiv preprint arXiv:1411.4044, 2014.

Page 5: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Hardware & Software

Hardware - Fujitsu BX 620 S3 blade center

• 8 blade nodes each with:

  CPU              2x Dual-core AMD Opteron 870 (2.0 GHz)
  Main memory      16 GB DDR2 registered
  Mass memory      2x Seagate ST3146854SS, 146 GB
  Network adapter  Broadcom NetXtreme BCM5704S, 1 GBit/s transfer speed

• Around 246GB effective disk space per node

Software

• Operating system - Ubuntu Server 12.04 LTS 64 bit

• Java runtime environment - Oracle JRE 1.7.0.60-b19

Page 6: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Experimental Methodology

[Figure: four-phase benchmarking process, same diagram as on Page 4]

Page 7: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Enterprise Big Data Platforms

DataStax Enterprise (DSE) 4.0.2

• Apache Cassandra 2.0.6 with Peer-to-Peer architecture

• Cassandra File System (CFS) offering an interface compatible with the Hadoop Distributed File System (HDFS)

• Modified Apache MapReduce version 1.0.4

• OpsCenter tool for configuration and management

Cloudera Hadoop Distribution (CDH) 5.0.2

• Apache Hadoop 2.3.0 with Master/Slave architecture consisting of

– MapReduce (YARN)

– Hadoop Distributed File System (HDFS)

• Cloudera Manager for configuration and management

Both platforms offer

• Hadoop Ecosystem tools: Hive, Pig, Oozie, Sqoop, Spark and more

• Optimal default configuration


Page 8: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Experimental Methodology

[Figure: four-phase benchmarking process, same diagram as on Page 4]

• Replication factor of 3 for both systems

Page 9: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Intel HiBench

• Benchmark suite for Hadoop (developed by Intel in 2010) (Huang et al. [2])

• 4 categories, 10 workloads & 3 types

• Metrics: Time (Sec) & Throughput (Bytes/Sec)


Category          No  Workload                 Tools          Type
Micro Benchmarks  1   Sort                     MapReduce      I/O Bound
Micro Benchmarks  2   WordCount                MapReduce      CPU Bound
Micro Benchmarks  3   TeraSort                 MapReduce      Mixed
Micro Benchmarks  4   Enhanced TestDFSIO       MapReduce      I/O Bound
Web Search        5   Nutch Indexing           Nutch, Lucene  Mixed
Web Search        6   Page Rank                Pegasus        Mixed
Machine Learning  7   Bayesian Classification  Mahout         Mixed
Machine Learning  8   K-means Clustering       Mahout         Mixed
Analytical Query  9   Join                     Hive           Mixed
Analytical Query  10  Aggregation              Hive           Mixed

[2] S. Huang et al., “The HiBench benchmark suite: Characterization of the MapReduce-based data analysis,” in Data Engineering Workshops (ICDEW), 2010.

Page 10: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Experimental Methodology

[Figure: four-phase benchmarking process, same diagram as on Page 4]

• All workloads are executed for 3 data sizes: 240GB, 340GB and 440GB

• Leave approximately 25% of total storage for temporary use

Page 11: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Experimental Methodology

[Figure: four-phase benchmarking process, same diagram as on Page 4]

Page 12: BDSE 2015 Evaluation of Big Data Platforms with HiBench

WordCount (CPU-bound)

• Configuration: 4 map/ 2 reduce tasks, 240GB, 340GB, 440GB input data sizes

• CDH achieves on average 17% (11 min.) to 18% (22 min.) faster execution times and around 20% to 22% higher throughput than DSE

• Possible reasons:

– Improved YARN (MapReduce 2.0) performance compared to MapReduce 1.0.4

– Better integration between YARN and HDFS than between MapReduce and CFS

[Chart: WordCount Time (Sec), lower is better]

Data Size  DSE   CDH
240 GB     4068  3392
340 GB     5785  4727
440 GB     7471  6127
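The percentages quoted on this slide can be re-derived from the charted times. The sketch below is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the throughput figure assumes that throughput is input size divided by execution time, with the same input size on both platforms so that the size cancels out.

```python
# Average WordCount execution times in seconds, as shown on this slide.
WORDCOUNT_TIMES = {
    240: {"DSE": 4068, "CDH": 3392},
    340: {"DSE": 5785, "CDH": 4727},
    440: {"DSE": 7471, "CDH": 6127},
}


def time_saving_pct(t_slow, t_fast):
    """Share of the slower run's time that the faster run saves, in percent."""
    return (t_slow - t_fast) / t_slow * 100


def throughput_gain_pct(t_slow, t_fast):
    """Throughput gain of the faster run, in percent.

    Assumption: throughput = data size / time, with equal data size on both
    platforms, so the gain depends only on the ratio of the two times.
    """
    return (t_slow / t_fast - 1) * 100


def summarize(times):
    """Return (size_gb, minutes_saved, time_saving_pct, throughput_gain_pct) rows."""
    rows = []
    for size_gb in sorted(times):
        dse, cdh = times[size_gb]["DSE"], times[size_gb]["CDH"]
        rows.append((size_gb,
                     (dse - cdh) / 60,
                     time_saving_pct(dse, cdh),
                     throughput_gain_pct(dse, cdh)))
    return rows


for size_gb, minutes, saving, gain in summarize(WORDCOUNT_TIMES):
    print(f"{size_gb} GB: CDH saves {minutes:.1f} min "
          f"({saving:.1f}% less time, {gain:.1f}% more throughput)")
```

Under these assumptions, the 240GB and 440GB rows round to the 17% (11 min.) and 18% (22 min.) time savings and the 20% and 22% throughput gains stated in the bullet above.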

Page 13: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Enhanced DFSIO (I/O-bound) - Read

• Configuration: 400MB file size, 240GB, 340GB, 440GB input data sizes

• CDH is between 14% (2 min.) and 32% (11 min.) faster in reading, with 15% to 47% higher read throughput compared to DSE

• Possible reason: different file system design

– CFS in DSE uses 64MB blocks which are further split into 2MB sub-blocks

– HDFS in CDH uses 128MB blocks

[Chart: Enhanced DFSIO Read Time (Sec), lower is better]

Data Size  DSE   CDH
240 GB     916   791
340 GB     1406  1086
440 GB     2051  1387

Page 14: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Enhanced DFSIO (I/O-bound) - Write

• Configuration: 400MB file size, 240GB, 340GB, 440GB input data sizes

• DSE is between 81% (13 min.) and 54% (19 min.) faster in writing, with 45% to 35% higher write throughput than CDH

• Possible reason: different file system design

– Cassandra morphs all writes into sequential writes using an in-memory structure

– HDFS writes all of its 3 file replicas before a file is considered as written

[Chart: Enhanced DFSIO Write Time (Sec), lower is better]

Data Size  DSE   CDH
240 GB     974   1760
340 GB     1477  2490
440 GB     2110  3248

Page 15: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Enhanced DFSIO (I/O-bound) – Read vs. Write

• Read/Write Δ (%) = ((T_write * 100) / T_read) - 100

• DSE differences between reading and writing times are very small, starting with around 6% (240GB) and gradually decreasing to 3% (440GB)

• CDH reading times (reading throughput) are at least 2.2 times faster than the writing times, with an increasing tendency

• Both tendencies are confirmed in related studies:

– In [3] for Cassandra

– In [4] and [5] for Hadoop

[3] Dede et al., “An Evaluation of Cassandra for Hadoop,” in Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on, 2013.

[4] N. Wakou, “Dell Apache Hadoop Performance Analysis,” 2013. http://en.community.dell.com/techcenter/extras/m/white_papers/20437989

[5] Islam et al., “A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters,” in Specifying Big Data Benchmarks, Springer Berlin Heidelberg, 2014.


Data Size (GB)  DSE Read/Write Δ (%)  CDH Read/Write Δ (%)
240             6.37                  122.59
340             5.10                  129.35
440             2.89                  134.23
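As a cross-check, the Read/Write Δ formula on this slide can be applied to the read and write times charted on Pages 13 and 14. The following is a minimal sketch with hypothetical names. Because the charted times are rounded to whole seconds, the results are only expected to match the table to within roughly 0.1 percentage points.

```python
def read_write_delta(t_write, t_read):
    """Read/Write delta in percent: ((T_write * 100) / T_read) - 100."""
    return (t_write * 100) / t_read - 100


# (read seconds, write seconds) per data size in GB, taken from Pages 13 and 14.
DFSIO_TIMES = {
    "DSE": {240: (916, 974), 340: (1406, 1477), 440: (2051, 2110)},
    "CDH": {240: (791, 1760), 340: (1086, 2490), 440: (1387, 3248)},
}

# Values reported in the table on this slide, for comparison.
REPORTED = {
    "DSE": {240: 6.37, 340: 5.10, 440: 2.89},
    "CDH": {240: 122.59, 340: 129.35, 440: 134.23},
}

for platform, by_size in DFSIO_TIMES.items():
    for size_gb, (t_read, t_write) in sorted(by_size.items()):
        delta = read_write_delta(t_write, t_read)
        print(f"{platform} {size_gb} GB: computed {delta:.2f}%, "
              f"reported {REPORTED[platform][size_gb]:.2f}%")
```

A delta above 100% means that writing takes more than twice as long as reading, which is the "at least 2.2 times" relation stated for CDH.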

Page 16: BDSE 2015 Evaluation of Big Data Platforms with HiBench

HiveBench – Join (mixed workload)

• HiveBench-Join executed with 240GB, 340GB, 440GB table data sizes

• DSE is around 6% faster than CDH (in both time and throughput) for 240GB

• DSE is 4% slower than CDH for 340GB and 7% slower for 440GB

• Possible reason:

– HiveBench-Join is a mostly CPU-intensive workload that reads the entire tables (CDH is better in CPU-intensive and read operations)

[Chart: HiveBench-Join Time (Sec), lower is better]

Data Size  DSE   CDH
240 GB     1486  1580
340 GB     2107  2027
440 GB     2646  2467

Page 17: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Lessons Learned


Based on our experimental results*:

Workload Type                           DSE  CDH   CDH/DSE Δ (%)
CPU-intensive (WordCount)               -    +     -18%
Read-intensive (Enhanced DFSIO-read)    -    +     -32%
Write-intensive (Enhanced DFSIO-write)  +    -     +81%
Read to write throughput difference     low  high
Mixed workload (HiveBench-join)         -    +     -7%

* Valid for our hardware configuration and tested data sizes of 240GB, 340GB and 440GB
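The slides do not define the CDH/DSE Δ column. One reading that reproduces all four values is the CDH time relative to the DSE time, taken at the data size with the largest gap for each workload. The sketch below works through that inferred formula. Both the formula and the choice of data sizes are assumptions, and the function name is hypothetical.

```python
def cdh_dse_delta(t_cdh, t_dse):
    """Assumed definition: CDH time relative to DSE time, in percent.

    Negative values mean CDH finished faster than DSE.
    """
    return (t_cdh - t_dse) / t_dse * 100


# (workload, data size in GB, DSE seconds, CDH seconds), taken from the result
# charts on Pages 12, 13, 14 and 16; the data size per row is an assumption.
SUMMARY_INPUTS = [
    ("CPU-intensive (WordCount)", 440, 7471, 6127),
    ("Read-intensive (Enhanced DFSIO-read)", 440, 2051, 1387),
    ("Write-intensive (Enhanced DFSIO-write)", 240, 974, 1760),
    ("Mixed workload (HiveBench-join)", 440, 2646, 2467),
]

for workload, size_gb, t_dse, t_cdh in SUMMARY_INPUTS:
    delta = cdh_dse_delta(t_cdh, t_dse)
    print(f"{workload} at {size_gb} GB: {delta:+.0f}%")
```

Under this reading the four rows round to -18%, -32%, +81% and -7%, matching the table.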

Page 18: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Ongoing Work

• Evaluating the availability and fault-tolerance of DataStax Enterprise/Cassandra.

• Analyzing the performance of MapReduce and Spark using BigBench.

• Comparing different file formats (rcfile, orc, parquet, avro, etc.) on Hive, Hive-on-Tez and Impala using TPC-H and BigBench.

• Evaluating large-scale graph processing engines using appropriate graph benchmarks.


Page 19: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Acknowledgments

• Frankfurt Big Data Lab, Chair for Databases and Information Systems (DBIS) at Goethe University Frankfurt

• Institute for Information Systems (IISYS) at Hof University of Applied Sciences

• Accenture Germany

Page 20: BDSE 2015 Evaluation of Big Data Platforms with HiBench

Contact

Todor Ivanov

[email protected]

Goethe University Frankfurt am Main, Germany

http://www.bigdata.uni-frankfurt.de/
