Page 1
Performance Evaluation of Enterprise Big Data
Platforms with HiBench
Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle and
Roberto V. Zicari
Frankfurt Big Data Lab
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
9th IEEE International Conference On Big Data Science and Engineering 2015
August 20 – 22, Helsinki, Finland
Page 2
Agenda
Research Objectives
Experimental Methodology
Hardware & Software
Enterprise Big Data Platforms
HiBench
Evaluation
WordCount (CPU-bound)
Enhanced DFSIO (I/O-bound)
HiveBench (mixed)
Lessons Learned
Ongoing Work
Page 3
Research Objectives
• Question: Which types of Big Data applications (CPU-bound, I/O-bound and
mixed) are better suited to each platform, depending on the data size?
• Our approach: Evaluate and compare the Big Data systems using a widely
used Big Data benchmark.
• Suitable Big Data benchmark
• Different from the Yahoo! Cloud Serving Benchmark (OLTP workload)
• Chosen: Intel HiBench benchmark suite
Page 4
Experimental Methodology
[Methodology diagram] Four phases, each feeding its output into the next:
• Phase 1 (Setup): hardware (8-node cluster) and software setup with DataStax Enterprise (Cassandra) or the Cloudera Hadoop Distribution
• Phase 2 (Workload Prepare): data generation, controlled by the workload parameters
• Phase 3 (Workload Execution): HiBench WordCount, Enhanced DFSIO and HiveBench, with 3 test runs per workload
• Phase 4 (Evaluation): results (time & throughput) and result analysis
[1] T. Ivanov, R. Niemann, S. Izberovic, M. Rosselli, K. Tolle, and R. V. Zicari, "Benchmarking DataStax Enterprise/Cassandra with HiBench," arXiv preprint arXiv:1411.4044, 2014.
Page 5
Hardware & Software
Hardware - Fujitsu BX620 S3 blade center
• 8 blade nodes, each with:
  CPU: 2x dual-core AMD Opteron 870 (2.0 GHz)
  Main memory: 16 GB DDR2, registered
  Mass storage: 2x Seagate ST3146854SS, 146 GB each
  Network adapter: Broadcom NetXtreme BCM5704S, 1 GBit/s transfer speed
• Around 246 GB effective disk space per node
Software
• Operating system - Ubuntu Server 12.04 LTS, 64-bit
• Java runtime environment - Oracle JRE 1.7.0_60-b19
Page 6
Experimental Methodology
[Methodology diagram repeated from Page 4]
Page 7
Enterprise Big Data Platforms
DataStax Enterprise (DSE) 4.0.2
• Apache Cassandra 2.0.6 with Peer-to-Peer architecture
• Cassandra File System (CFS), offering an HDFS-compatible file system interface
• Modified Apache MapReduce version 1.0.4
• OpsCenter tool for configuration and management
Cloudera Hadoop Distribution (CDH) 5.0.2
• Apache Hadoop 2.3.0 with Master/Slave architecture consisting of
– MapReduce (YARN)
– Hadoop Distributed File System (HDFS)
• Cloudera Manager for configuration and management
Both platforms offer
• Hadoop Ecosystem tools: Hive, Pig, Oozie, Sqoop, Spark and more
• Optimized default configurations
Page 8
Experimental Methodology
[Methodology diagram repeated from Page 4]
• Replication factor of 3 for both systems
Page 9
Intel HiBench
• Benchmark suite for Hadoop, developed by Intel in 2010 (Huang et al. [2])
• 4 categories, 10 workloads & 3 types
• Metrics: Time (Sec) & Throughput (Bytes/Sec)
Category            No  Workload                  Tools          Type
Micro Benchmarks     1  Sort                      MapReduce      I/O-bound
                     2  WordCount                 MapReduce      CPU-bound
                     3  TeraSort                  MapReduce      Mixed
                     4  Enhanced TestDFSIO        MapReduce      I/O-bound
Web Search           5  Nutch Indexing            Nutch, Lucene  Mixed
                     6  PageRank                  Pegasus        Mixed
Machine Learning     7  Bayesian Classification   Mahout         Mixed
                     8  K-means Clustering        Mahout         Mixed
Analytical Query     9  Join                      Hive           Mixed
                    10  Aggregation               Hive           Mixed
[2] S. Huang et al., "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Data Engineering Workshops (ICDEW), 2010.
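For illustration, the throughput metric is simply input size over wall-clock execution time; a minimal sketch (the helper name and the decimal-GB assumption are mine, not HiBench's):

```python
def throughput_bytes_per_sec(input_bytes: int, duration_sec: float) -> float:
    """HiBench-style throughput: total input size over wall-clock execution time."""
    return input_bytes / duration_sec

# Example with a later result: WordCount over 240 GB finishing in 3392 seconds
size_bytes = 240 * 10**9
print(f"{throughput_bytes_per_sec(size_bytes, 3392) / 10**6:.1f} MB/s")  # ~70.8 MB/s
```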
Page 10
Experimental Methodology
[Methodology diagram repeated from Page 4]
• All workloads are executed for 3 data sizes: 240 GB, 340 GB and 440 GB
• Approximately 25% of total storage is left free for temporary use (see the capacity sketch below)
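A back-of-the-envelope capacity check (my own arithmetic based on the hardware slide, assuming decimal GB) shows why 440 GB is a sensible upper bound under 3-way replication with the ~25% reserve:

```python
nodes = 8
effective_gb_per_node = 246          # effective disk space per node (hardware slide)
replication_factor = 3               # used on both DSE and CDH
temp_reserve = 0.25                  # fraction of storage kept free for temporary use

total_gb = nodes * effective_gb_per_node            # 1968 GB raw cluster capacity
usable_gb = total_gb * (1 - temp_reserve)           # 1476 GB after the reserve
max_input_gb = usable_gb / replication_factor       # 492 GB of unreplicated input
print(max_input_gb)                                 # 492.0 -> 440 GB fits comfortably
```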
Page 11
Experimental Methodology
[Methodology diagram repeated from Page 4]
Page 12
WordCount (CPU-bound)
• Configuration: 4 map / 2 reduce tasks; 240 GB, 340 GB and 440 GB input data sizes
• CDH achieves on average 17% (11 min., 240 GB) to 18% (22 min., 440 GB) faster execution
times and around 20% to 22% higher throughput than DSE
• Possible reasons:
– Improved YARN (MapReduce 2.0) performance compared to MapReduce 1.0.4
– Better integration between YARN and HDFS than between MapReduce and CFS
Execution time in seconds (lower is better):
Data size   DSE    CDH
240 GB      4068   3392
340 GB      5785   4727
440 GB      7471   6127
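The quoted deltas can be re-derived from the charted times; a small verification sketch (the exact percentage conventions are my reading of the numbers, not stated on the slide):

```python
# Execution times in seconds from the WordCount chart
dse = {240: 4068, 340: 5785, 440: 7471}
cdh = {240: 3392, 340: 4727, 440: 6127}

for size in (240, 340, 440):
    faster_pct = (dse[size] - cdh[size]) / dse[size] * 100   # CDH speedup vs. DSE time
    saved_min = (dse[size] - cdh[size]) / 60                 # absolute saving in minutes
    thr_gain = (dse[size] / cdh[size] - 1) * 100             # throughput ~ 1/time
    print(f"{size} GB: {faster_pct:.0f}% faster ({saved_min:.0f} min), "
          f"{thr_gain:.0f}% higher throughput")
# 240 GB: 17% faster (11 min), 20% higher throughput
# 440 GB: 18% faster (22 min), 22% higher throughput
```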
Page 13
Enhanced DFSIO (I/O-bound) - Read
• Configuration: 400 MB file size; 240 GB, 340 GB and 440 GB input data sizes
• CDH is between 14% (2 min., 240 GB) and 32% (11 min., 440 GB) faster in reading, with 15% to
47% higher read throughput compared to DSE
• Possible reason: different file system design
– CFS in DSE uses 64MB blocks which are further split in 2MB sub-blocks
– HDFS in CDH uses 128MB blocks
Read time in seconds (lower is better):
Data size   DSE    CDH
240 GB      916    791
340 GB      1406   1086
440 GB      2051   1387
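To make the block-size argument concrete, here is a back-of-the-envelope count (my own arithmetic, using only the block sizes named above) of the units each file system must track for a single 400 MB benchmark file:

```python
import math

file_mb = 400                                  # Enhanced DFSIO file size

hdfs_blocks = math.ceil(file_mb / 128)         # CDH/HDFS: 128 MB blocks -> 4
cfs_blocks = math.ceil(file_mb / 64)           # DSE/CFS: 64 MB blocks   -> 7
cfs_sub_blocks = math.ceil(file_mb / 2)        # ... 2 MB sub-blocks     -> 200

print(hdfs_blocks, cfs_blocks, cfs_sub_blocks)  # 4 7 200
# CFS ends up managing roughly 50x more units per file than HDFS, which
# plausibly adds per-unit overhead on large sequential reads.
```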
Page 14
Enhanced DFSIO (I/O-bound) - Write
• Configuration: 400 MB file size; 240 GB, 340 GB and 440 GB input data sizes
• DSE is between 81% (13 min., 240 GB) and 54% (19 min., 440 GB) faster in writing, with 45%
down to 35% higher write throughput than CDH
• Possible reason: different file system design
– Cassandra morphs all writes into sequential writes using an in-memory structure
– HDFS writes all of its 3 file replicas before a file is considered as written
Write time in seconds (lower is better):
Data size   DSE    CDH
240 GB      974    1760
340 GB      1477   2490
440 GB      2110   3248
Page 15
Enhanced DFSIO (I/O-bound) – Read vs. Write
• Read/Write Δ (%) = (T_write * 100 / T_read) - 100
• For DSE, the differences between read and write times are very small, starting at around
6% (240 GB) and gradually decreasing to 3% (440 GB)
• For CDH, read times are at least 2.2 times shorter than write times (i.e. read throughput is
at least 2.2x higher than write throughput), and the gap widens with data size
• Both tendencies are confirmed in related studies:
– In [3] for Cassandra
– In [4] and [5] for Hadoop
[3] Dede et al., "An Evaluation of Cassandra for Hadoop," in Proc. IEEE Sixth International Conference on Cloud Computing (CLOUD), 2013.
[4] N. Wakou, "Dell Apache Hadoop Performance Analysis," 2013. http://en.community.dell.com/techcenter/extras/m/white_papers/20437989
[5] Islam et al., "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters," in Specifying Big Data Benchmarks, Springer Berlin Heidelberg, 2014.
Data Size (GB) DSE Read/Write Δ (%) CDH Read/Write Δ (%)
240 6.37 122.59
340 5.10 129.35
440 2.89 134.23
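The Δ column follows mechanically from the read and write times on the two preceding slides; a short re-computation (small deviations from the table stem from the charted times being rounded to whole seconds):

```python
# Read/write times in seconds, taken from the Enhanced DFSIO charts
read_s  = {"DSE": {240: 916,  340: 1406, 440: 2051},
           "CDH": {240: 791,  340: 1086, 440: 1387}}
write_s = {"DSE": {240: 974,  340: 1477, 440: 2110},
           "CDH": {240: 1760, 340: 2490, 440: 3248}}

for system in ("DSE", "CDH"):
    for size in (240, 340, 440):
        delta = write_s[system][size] * 100 / read_s[system][size] - 100
        print(f"{system} {size} GB: {delta:.2f}%")
# DSE stays in a narrow 3-6% band; CDH grows from ~122% to ~134%.
```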
Page 16
HiveBench – Join (mixed workload)
• HiveBench-Join executed with 240 GB, 340 GB and 440 GB table data sizes
• DSE is around 6% faster (both in time and throughput) than CDH for 240 GB
• DSE is 4% slower than CDH for 340 GB and 7% slower for 440 GB
• Possible reason:
– HiveBench-Join is a mostly CPU-intensive workload that reads the entire tables (CDH is
better at CPU-intensive and read operations)
Execution time in seconds (lower is better):
Data size   DSE    CDH
240 GB      1486   1580
340 GB      2107   2027
440 GB      2646   2467
Page 17
Lessons Learned
Based on our experimental results*:
* Valid for our hardware configuration and the tested data sizes of 240 GB, 340 GB and 440 GB
Workload Type                              DSE   CDH   CDH/DSE Δ (%)
CPU-intensive (WordCount)                  -     +     -18%
Read-intensive (Enhanced DFSIO-read)       -     +     -32%
Write-intensive (Enhanced DFSIO-write)     +     -     +81%
Read-to-write throughput difference        low   high
Mixed workload (HiveBench-join)            -     +     -7%
Page 18
Ongoing Work
• Evaluating the availability and fault-tolerance of DataStax Enterprise/Cassandra.
• Analyzing the performance of MapReduce and Spark using BigBench.
• Comparing different file formats (RCFile, ORC, Parquet, Avro, etc.) on Hive, Hive-on-Tez and
Impala using TPC-H and BigBench.
• Evaluating large-scale graph processing engines using appropriate graph benchmarks.
Page 19
Acknowledgments
• Frankfurt Big Data Lab, Chair for Databases
and Information Systems (DBIS) at Goethe
University Frankfurt
• Institute for Information Systems (IISYS) at Hof
University of Applied Sciences
• Accenture Germany
Page 20
Contact
Todor Ivanov
[email protected]
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/