Performance Evaluation of Spark SQL using BigBench Todor Ivanov and Max-Georg Beer Frankfurt Big Data Lab Goethe University Frankfurt am Main, Germany http://www.bigdata.uni-frankfurt.de/ 6th Workshop on Big Data Benchmarking 2015 June 16 th – 17 th , Toronto, Canada
Performance Evaluation of Spark SQL using BigBench
Todor Ivanov and Max-Georg Beer
Frankfurt Big Data Lab
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
6th Workshop on Big Data Benchmarking 2015
June 16th – 17th, Toronto, Canada
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
Motivation
• "Towards A Complete BigBench Implementation" by Tilmann Rabl @ WBDB 2014
– end-to-end, application-level, analytical big data benchmark
– technology agnostic
– based on TPC-DS
– consists of 30 queries
• Implementation for the Hadoop Ecosystem
– https://github.com/intel-hadoop/Big-Bench
Query Type                        Queries              Number of Queries
Java MapReduce with HiveQL        1, 2                 2
Python Streaming MR with HiveQL   3, 4, 8, 29, 30      5
Mahout (Java MR) with HiveQL      5, 20, 25, 26, 28    5
OpenNLP (Java MR) with HiveQL     10, 18, 19, 27       4
Lessons Learned
Our BigBench on MapReduce experiments showed:
• The OpenNLP queries (Q19, Q10) scale best with the increase of the data size.
• Q27 (OpenNLP) is not suitable for scalability comparison.
• A subset of the Python Streaming (MR) queries (Q4, Q30, Q3) shows the worst scaling behavior.
Comparing Hive and Spark SQL we observed:
• A group of Spark SQL queries (Q7, Q16, Q21, Q22, Q23 and Q24) does not scale properly with the increase of the data size. A possible reason is join optimization issues.
• For the stable HiveQL queries (Q6, Q9, Q11, Q12, Q13, Q14, Q15 and Q17), Spark SQL performs between 1.5x and 6.3x faster than Hive.
Our Experience with BigBench
• Validating the Spark SQL query results
– Empty query results
– Non-deterministic end results (OpenNLP and Mahout)
– No reference results are available
• Q18 is memory bound with around 90% utilization and high CPU usage of 56%.
Q18 (OpenNLP), Scale Factor: 1TB
Input Data Size / Number of Tables: 71GB / 3 Tables
Average Runtime: 28 minutes
Avg. CPU Utilization %: 55.99 (User %); 2.04 (System %); 0.31 (IOwait %)
Avg. Memory Utilization %: 90.22 %
[Figure: CPU Utilization % over Time (sec); series: IOwait %, User %, System %]
[Figure: Memory Utilization % over Time (sec)]
BigBench on Spark SQL – worst scalability
• Test the group of 14 pure HiveQL queries.
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group A: Q24, Q21, Q16 and Q7 achieve the worst data scalability behavior.
• Possible reason for Group A behavior is reported in SPARK-2211 (Join Optimization).
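The normalization used in these charts can be sketched as follows. This is only an illustration: the runtimes are hypothetical placeholders, not measured results, the function names are made up for this example, and 1TB is assumed to equal 1000GB so that the linear reference for 1TB is 10.

```python
def normalize_times(runtimes, baseline_sf=100):
    """Divide every runtime by the runtime measured at the baseline scale factor."""
    baseline = runtimes[baseline_sf]
    return {sf: t / baseline for sf, t in runtimes.items()}


def linear_reference(scale_factors, baseline_sf=100):
    """Normalized time a query would show if it scaled linearly with data size."""
    return {sf: sf / baseline_sf for sf in scale_factors}


# Hypothetical runtimes in minutes, keyed by scale factor in GB
# (assumption: 1TB is treated as 1000GB). Not taken from the slides.
example_runtimes = {100: 2.0, 300: 7.0, 600: 15.0, 1000: 30.0}

normalized = normalize_times(example_runtimes)
linear = linear_reference(example_runtimes)

for sf in sorted(example_runtimes):
    # A normalized time above the linear reference means worse-than-linear scaling.
    print(sf, normalized[sf], linear[sf], normalized[sf] > linear[sf])
```

A query whose normalized time stays at or below the linear reference scales well with the data size; one that rises far above it, as Group A does, scales poorly.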
[Figure: Normalized BigBench + Spark SQL Times with respect to baseline 100GB SF; y-axis: Normalized Time; series: 300GB, 600GB, 1TB, Linear 300GB, Linear 600GB, Linear 1TB]
BigBench on Spark SQL – best scalability
• Test the group of 14 pure HiveQL queries.
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group B: Q15, Q11, Q9 and Q14 achieve the best data scalability behavior.
[Figure: Normalized BigBench + Spark SQL Times with respect to baseline 100GB SF; y-axis: Normalized Time; series: 300GB, 600GB, 1TB, Linear 300GB, Linear 600GB, Linear 1TB]
Hive & Spark SQL Comparison (1)
• Calculate the Hive to Spark SQL ratio (%): ((HiveTime * 100) / SparkTime) - 100
• Group 1: Q7, Q16, Q21, Q22, Q23 and Q24 drastically increase their Spark SQL execution time for the larger data sets.
• Complex join issues are described in SPARK-2211 (https://issues.apache.org/jira/browse/SPARK-2211).
• Q7 is only 13% slower on Hive compared to Spark SQL.
• In Q7, Spark SQL spends around 21% of the CPU time (IOwait) waiting for outstanding disk I/O requests and efficiently utilizes only around 17% of the CPU.
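As a worked illustration of the Hive to Spark SQL ratio defined on this slide, the sketch below applies the formula to hypothetical execution times. The function name and the input times are assumptions made for this example and are not measurements from the experiments.

```python
def hive_to_spark_ratio(hive_time, spark_time):
    """Hive to Spark SQL ratio in percent: ((HiveTime * 100) / SparkTime) - 100.

    A positive value means Hive took that many percent longer than Spark SQL;
    a negative value means Hive was faster.
    """
    return (hive_time * 100) / spark_time - 100


# Hypothetical times (any unit, as long as both values use the same one).
print(hive_to_spark_ratio(113, 100))  # Hive 13% slower than Spark SQL
print(hive_to_spark_ratio(150, 100))  # Hive time is 1.5x the Spark SQL time
print(hive_to_spark_ratio(630, 100))  # Hive time is 6.3x the Spark SQL time
print(hive_to_spark_ratio(80, 100))   # Hive faster, so the ratio is negative
```

Under this definition, a Hive time that is k times the Spark SQL time corresponds to a ratio of (k - 1) * 100 percent.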