Page 1
Performance Evaluation of Enterprise Big Data
Platforms with HiBench
Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle and
Roberto V. Zicari
Frankfurt Big Data Lab
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
9th IEEE International Conference On Big Data Science and Engineering 2015
August 20 – 22, Helsinki, Finland
Page 2
Agenda
Research Objectives
Experimental Methodology
Hardware & Software
Enterprise Big Data Platforms
HiBench
Evaluation
WordCount (CPU-bound)
Enhanced DFSIO (I/O-bound)
HiveBench (mixed)
Lessons Learned
Ongoing Work
Page 3
Research Objectives
• Question: Which types of Big Data applications (CPU-bound, I/O-bound and
mixed) are better suited to each platform, depending on the data size?
• Our approach: Evaluate and compare the Big Data systems using a widely
used Big Data benchmark.
• Suitable Big Data benchmark
• Different from the Yahoo! Cloud Serving Benchmark (OLTP workload)
• Chosen: Intel HiBench benchmark suite
Page 4
Experimental Methodology
[Methodology diagram] Four phases, each feeding its output into the next:
• Phase 1 (Setup): hardware (8-node cluster) and software setup with DataStax Enterprise (Cassandra) or the Cloudera Hadoop Distribution
• Phase 2 (Workload Prepare): data generation, controlled by the workload parameters
• Phase 3 (Workload Execution): HiBench WordCount, Enhanced DFSIO and HiveBench, with 3 test runs per workload
• Phase 4 (Evaluation): results (time & throughput) and result analysis
[1] T. Ivanov, R. Niemann, S. Izberovic, M. Rosselli, K. Tolle, and R. V. Zicari, "Benchmarking DataStax Enterprise/Cassandra with HiBench," arXiv preprint arXiv:1411.4044, 2014.
Page 5
Hardware & Software
Hardware - Fujitsu BX620 S3 blade center
• 8 blade nodes, each with:
  CPU: 2x dual-core AMD Opteron 870 (2.0 GHz)
  Main memory: 16 GB DDR2, registered
  Mass storage: 2x Seagate ST3146854SS, 146 GB each
  Network adapter: Broadcom NetXtreme BCM5704S, 1 GBit/s transfer speed
• Around 246 GB effective disk space per node
Software
• Operating system - Ubuntu Server 12.04 LTS, 64-bit
• Java runtime environment - Oracle JRE 1.7.0_60-b19
Page 6
Experimental Methodology
[Methodology diagram repeated from Page 4]
Page 7
Enterprise Big Data Platforms
DataStax Enterprise (DSE) 4.0.2
• Apache Cassandra 2.0.6 with Peer-to-Peer architecture
• Cassandra File System (CFS), offering an HDFS-compatible file system interface
• Modified Apache MapReduce version 1.0.4
• OpsCenter tool for configuration and management
Cloudera Hadoop Distribution (CDH) 5.0.2
• Apache Hadoop 2.3.0 with Master/Slave architecture consisting of
– MapReduce (YARN)
– Hadoop Distributed File System (HDFS)
• Cloudera Manager for configuration and management
Both platforms offer
• Hadoop Ecosystem tools: Hive, Pig, Oozie, Sqoop, Spark and more
• Optimized default configurations
Page 8
Experimental Methodology
[Methodology diagram repeated from Page 4]
• Replication factor of 3 for both systems
Page 9
Intel HiBench
• Benchmark suite for Hadoop, developed by Intel in 2010 (Huang et al. [2])
• 4 categories, 10 workloads & 3 types
• Metrics: Time (Sec) & Throughput (Bytes/Sec)
Category            No  Workload                  Tools          Type
Micro Benchmarks     1  Sort                      MapReduce      I/O-bound
                     2  WordCount                 MapReduce      CPU-bound
                     3  TeraSort                  MapReduce      Mixed
                     4  Enhanced TestDFSIO        MapReduce      I/O-bound
Web Search           5  Nutch Indexing            Nutch, Lucene  Mixed
                     6  PageRank                  Pegasus        Mixed
Machine Learning     7  Bayesian Classification   Mahout         Mixed
                     8  K-means Clustering        Mahout         Mixed
Analytical Query     9  Join                      Hive           Mixed
                    10  Aggregation               Hive           Mixed
[2] S. Huang et al., "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Data Engineering Workshops (ICDEW), 2010.
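For illustration, the throughput metric is simply input size over wall-clock execution time; a minimal sketch (the helper name and the decimal-GB assumption are mine, not HiBench's):

```python
def throughput_bytes_per_sec(input_bytes: int, duration_sec: float) -> float:
    """HiBench-style throughput: total input size over wall-clock execution time."""
    return input_bytes / duration_sec

# Example with a later result: WordCount over 240 GB finishing in 3392 seconds
size_bytes = 240 * 10**9
print(f"{throughput_bytes_per_sec(size_bytes, 3392) / 10**6:.1f} MB/s")  # ~70.8 MB/s
```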
Page 10
Experimental Methodology
[Methodology diagram repeated from Page 4]
• All workloads are executed for 3 data sizes: 240 GB, 340 GB and 440 GB
• Approximately 25% of total storage is left free for temporary use (see the capacity sketch below)
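A back-of-the-envelope capacity check (my own arithmetic based on the hardware slide, assuming decimal GB) shows why 440 GB is a sensible upper bound under 3-way replication with the ~25% reserve:

```python
nodes = 8
effective_gb_per_node = 246          # effective disk space per node (hardware slide)
replication_factor = 3               # used on both DSE and CDH
temp_reserve = 0.25                  # fraction of storage kept free for temporary use

total_gb = nodes * effective_gb_per_node            # 1968 GB raw cluster capacity
usable_gb = total_gb * (1 - temp_reserve)           # 1476 GB after the reserve
max_input_gb = usable_gb / replication_factor       # 492 GB of unreplicated input
print(max_input_gb)                                 # 492.0 -> 440 GB fits comfortably
```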
Page 11
Experimental Methodology
[Methodology diagram repeated from Page 4]
Page 12
WordCount (CPU-bound)
• Configuration: 4 map / 2 reduce tasks; 240 GB, 340 GB and 440 GB input data sizes
• CDH achieves on average 17% (11 min., 240 GB) to 18% (22 min., 440 GB) faster execution
times and around 20% to 22% higher throughput than DSE
• Possible reasons:
– Improved YARN (MapReduce 2.0) performance compared to MapReduce 1.0.4
– Better integration between YARN and HDFS than between MapReduce and CFS
Execution time in seconds (lower is better):
Data size   DSE    CDH
240 GB      4068   3392
340 GB      5785   4727
440 GB      7471   6127
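The quoted deltas can be re-derived from the charted times; a small verification sketch (the exact percentage conventions are my reading of the numbers, not stated on the slide):

```python
# Execution times in seconds from the WordCount chart
dse = {240: 4068, 340: 5785, 440: 7471}
cdh = {240: 3392, 340: 4727, 440: 6127}

for size in (240, 340, 440):
    faster_pct = (dse[size] - cdh[size]) / dse[size] * 100   # CDH speedup vs. DSE time
    saved_min = (dse[size] - cdh[size]) / 60                 # absolute saving in minutes
    thr_gain = (dse[size] / cdh[size] - 1) * 100             # throughput ~ 1/time
    print(f"{size} GB: {faster_pct:.0f}% faster ({saved_min:.0f} min), "
          f"{thr_gain:.0f}% higher throughput")
# 240 GB: 17% faster (11 min), 20% higher throughput
# 440 GB: 18% faster (22 min), 22% higher throughput
```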
Page 13
Enhanced DFSIO (I/O-bound) - Read
• Configuration: 400 MB file size; 240 GB, 340 GB and 440 GB input data sizes
• CDH is between 14% (2 min., 240 GB) and 32% (11 min., 440 GB) faster in reading, with 15% to
47% higher read throughput compared to DSE
• Possible reason: different file system design
– CFS in DSE uses 64MB blocks which are further split in 2MB sub-blocks
– HDFS in CDH uses 128MB blocks
Read time in seconds (lower is better):
Data size   DSE    CDH
240 GB      916    791
340 GB      1406   1086
440 GB      2051   1387
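To make the block-size argument concrete, here is a back-of-the-envelope count (my own arithmetic, using only the block sizes named above) of the units each file system must track for a single 400 MB benchmark file:

```python
import math

file_mb = 400                                  # Enhanced DFSIO file size

hdfs_blocks = math.ceil(file_mb / 128)         # CDH/HDFS: 128 MB blocks -> 4
cfs_blocks = math.ceil(file_mb / 64)           # DSE/CFS: 64 MB blocks   -> 7
cfs_sub_blocks = math.ceil(file_mb / 2)        # ... 2 MB sub-blocks     -> 200

print(hdfs_blocks, cfs_blocks, cfs_sub_blocks)  # 4 7 200
# CFS ends up managing roughly 50x more units per file than HDFS, which
# plausibly adds per-unit overhead on large sequential reads.
```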
Page 14
Enhanced DFSIO (I/O-bound) - Write
• Configuration: 400 MB file size; 240 GB, 340 GB and 440 GB input data sizes
• DSE is between 81% (13 min., 240 GB) and 54% (19 min., 440 GB) faster in writing, with 45%
down to 35% higher write throughput than CDH
• Possible reason: different file system design
– Cassandra morphs all writes into sequential writes using an in-memory structure
– HDFS writes all of its 3 file replicas before a file is considered as written
Write time in seconds (lower is better):
Data size   DSE    CDH
240 GB      974    1760
340 GB      1477   2490
440 GB      2110   3248
Page 15
Enhanced DFSIO (I/O-bound) – Read vs. Write
• Read/Write Δ (%) = (T_write * 100 / T_read) - 100
• For DSE, the differences between read and write times are very small, starting at around
6% (240 GB) and gradually decreasing to 3% (440 GB)
• For CDH, read times are at least 2.2 times shorter than write times (i.e. read throughput is
at least 2.2x higher than write throughput), and the gap widens with data size
• Both tendencies are confirmed in related studies:
– In [3] for Cassandra
– In [4] and [5] for Hadoop
[3] Dede et al., "An Evaluation of Cassandra for Hadoop," in Proc. IEEE Sixth International Conference on Cloud Computing (CLOUD), 2013.
[4] N. Wakou, "Dell Apache Hadoop Performance Analysis," 2013. http://en.community.dell.com/techcenter/extras/m/white_papers/20437989
[5] Islam et al., "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters," in Specifying Big Data Benchmarks, Springer Berlin Heidelberg, 2014.
Data Size (GB) DSE Read/Write Δ (%) CDH Read/Write Δ (%)
240 6.37 122.59
340 5.10 129.35
440 2.89 134.23
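The Δ column follows mechanically from the read and write times on the two preceding slides; a short re-computation (small deviations from the table stem from the charted times being rounded to whole seconds):

```python
# Read/write times in seconds, taken from the Enhanced DFSIO charts
read_s  = {"DSE": {240: 916,  340: 1406, 440: 2051},
           "CDH": {240: 791,  340: 1086, 440: 1387}}
write_s = {"DSE": {240: 974,  340: 1477, 440: 2110},
           "CDH": {240: 1760, 340: 2490, 440: 3248}}

for system in ("DSE", "CDH"):
    for size in (240, 340, 440):
        delta = write_s[system][size] * 100 / read_s[system][size] - 100
        print(f"{system} {size} GB: {delta:.2f}%")
# DSE stays in a narrow 3-6% band; CDH grows from ~122% to ~134%.
```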
Page 16
HiveBench – Join (mixed workload)
• HiveBench-Join executed with 240 GB, 340 GB and 440 GB table data sizes
• DSE is around 6% faster (both in time and throughput) than CDH for 240 GB
• DSE is 4% slower than CDH for 340 GB and 7% slower for 440 GB
• Possible reason:
– HiveBench-Join is a mostly CPU-intensive workload that reads the entire tables (CDH is
better at CPU-intensive and read operations)
Execution time in seconds (lower is better):
Data size   DSE    CDH
240 GB      1486   1580
340 GB      2107   2027
440 GB      2646   2467
Page 17
Lessons Learned
Based on our experimental results*:
* Valid for our hardware configuration and the tested data sizes of 240 GB, 340 GB and 440 GB
Workload Type                              DSE   CDH   CDH/DSE Δ (%)
CPU-intensive (WordCount)                  -     +     -18%
Read-intensive (Enhanced DFSIO-read)       -     +     -32%
Write-intensive (Enhanced DFSIO-write)     +     -     +81%
Read-to-write throughput difference        low   high
Mixed workload (HiveBench-join)            -     +     -7%
Page 18
Ongoing Work
• Evaluating the availability and fault-tolerance of DataStax Enterprise/Cassandra.
• Analyzing the performance of MapReduce and Spark using BigBench.
• Comparing different file formats (RCFile, ORC, Parquet, Avro, etc.) on Hive, Hive-on-Tez and
Impala using TPC-H and BigBench.
• Evaluating large-scale graph processing engines using appropriate graph benchmarks.
Page 19
Acknowledgments
• Frankfurt Big Data Lab, Chair for Databases
and Information Systems (DBIS) at Goethe
University Frankfurt
• Institute for Information Systems (IISYS) at Hof
University of Applied Sciences
• Accenture Germany
Page 20
Contact
Todor Ivanov
[email protected]
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/