Evaluating Hive and Spark SQL with BigBench
Technical Report No. 2015-2
January 11, 2016
Todor Ivanov and Max-Georg Beer
Frankfurt Big Data Lab
Chair for Databases and Information Systems
Institute for Informatics and Mathematics
Goethe University Frankfurt
Robert-Mayer-Str. 10,
60325, Bockenheim
Frankfurt am Main, Germany
www.bigdata.uni-frankfurt.de
Copyright © 2015, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Table of Contents
1. Introduction
2. Background
3. BigBench
4. BigBench on Spark
   4.1. Workarounds
   4.2. Porting Issues
5. Experimental Setup
   5.1. Hardware
   5.2. Software
   5.3. Cluster Configuration
6. Experimental Results
   6.1. BigBench on MapReduce
   6.2. BigBench on Spark SQL
   6.3. Query Validation Reference
7. Resource Utilization Analysis
   7.1. BigBench Query 4 (Python Streaming)
   7.2. BigBench Query 5 (Mahout)
   7.3. BigBench Query 18 (OpenNLP)
   7.4. BigBench Query 27 (OpenNLP)
   7.5. BigBench Query 7 (HiveQL + Spark SQL)
   7.6. BigBench Query 9 (HiveQL + Spark SQL)
   7.7. BigBench Query 24 (HiveQL + Spark SQL)
8. Lessons Learned
Acknowledgements
References
Appendix
1. Introduction
The objective of this work was to utilize BigBench [1] as a Big Data benchmark and evaluate and
compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established
engine for processing data on Hadoop. Spark is a popular alternative engine that promises faster
processing times than the established MapReduce engine. BigBench was chosen for this
comparison because it is the first end-to-end analytics Big Data benchmark and it is currently
under public review as TPCx-BB [4]. One of our goals was to evaluate the benchmark by
performing various scalability tests and validate that it is able to stress test the processing
engines. First, we analyzed the steps necessary to execute the available MapReduce
implementation of BigBench [1] on Spark. Then, all 30 BigBench queries were executed on
MapReduce/Hive with different scale factors in order to see how the performance changes as the
data size increases. Next, the group of HiveQL queries was executed on Spark SQL and
compared with their respective Hive runtimes.
This report gives a detailed overview of how to set up an experimental Hadoop cluster and
execute BigBench on both Hive and Spark SQL. It provides the absolute times of all
experiments performed for the different scale factors, as well as query results that can be used to
validate correct benchmark execution. Additionally, multiple issues encountered and solved
during our work are described together with their workarounds. An evaluation of the resource
utilization (CPU, memory, disk and network usage) of a subset of representative BigBench queries
is presented to illustrate the behavior of the different query groups on both processing engines.
Last but not least, it is important to mention that large parts of this report are taken from the
master thesis of Max-Georg Beer, entitled “Evaluation of BigBench on Apache Spark Compared
to MapReduce” [5].
The rest of the report is structured as follows: Section 2 provides a brief description of the
technologies involved in our study. A brief summary of the BigBench benchmark is presented in
Section 3. Section 4 describes the steps needed to execute BigBench on
Spark. An overview of the hardware and software setup used for the experiments is given in
Section 5. The performed experiments together with the evaluation of the results are presented in
Section 6. Section 7 depicts a comparison between the cluster resource utilization during the
execution of representative BigBench queries. Finally, Section 8 concludes with lessons learned.
2. Background
Big Data has emerged as a new term not only in IT, but also in numerous other industries such as
healthcare, manufacturing, transportation, retail and public sector administration [6][7] where it
quickly became relevant. There is still no single definition which adequately describes all Big
Data aspects [8], but the “V” characteristics (Volume, Variety, Velocity, Veracity and more) are
among the widely used ones. Precisely these new Big Data characteristics challenge the capabilities
of the traditional data management and analytical systems [8][9]. These challenges also motivate
researchers and industry to develop new types of systems such as Hadoop and NoSQL
databases [10].
Apache Hadoop [11] is a software framework for the distributed storage and processing of large data
sets across computer clusters using the map and reduce programming model. The architecture
allows scaling up from a single server to thousands of machines. At the same time Hadoop
delivers high-availability by detecting and handling failures at the application layer. The use of
data replication guarantees data reliability and fast access. The core Hadoop components are
the Hadoop Distributed File System (HDFS) [12][13] and the MapReduce framework [2].
HDFS has a master/slave architecture with a NameNode as a master and multiple DataNodes as
slaves. The NameNode is responsible for storing and managing all file structures, metadata,
transactional operations and logs of the file system. The DataNodes store the actual data in the
form of files. Each file is split into blocks of a preconfigured size. Every block is copied and
stored on multiple DataNodes. The number of block copies depends on the Replication Factor.
MapReduce is a software framework that provides general programming interfaces for writing
applications that process vast amounts of data in parallel, using a distributed file system, running
on the cluster nodes. The MapReduce unit of work is called a job and consists of input data and a
MapReduce program. Each job is divided into map and reduce tasks. The map task takes a split,
which is a part of the input data, and processes it according to the user-defined map function from
the MapReduce program. The reduce task gathers the output data of the map tasks and merges
them according to the user-defined reduce function. The number of reducers is specified by the
user and does not depend on the number of input splits or map tasks. Parallel application
execution is achieved by running map tasks on each node to process the local data and then
sending the results to reduce tasks, which produce the final output.
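As a minimal illustration of this model, a hypothetical word-count flavored pair of user-defined functions could look as follows in Python (not part of BigBench, purely for illustration):

# map function: applied to each line of a local input split,
# emits intermediate (key, value) pairs
def map_fn(line):
    for word in line.split():
        yield word, 1

# reduce function: merges all intermediate values gathered for one key
# into the final output record
def reduce_fn(word, counts):
    return word, sum(counts)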
Hadoop implements the MapReduce (version 1) model by using two types of processes –
JobTracker and TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks
to the TaskTrackers on every cluster node. The TaskTracker runs tasks assigned by the
JobTracker.
Multiple other applications, collectively known as the Hadoop ecosystem, were developed on top
of the Hadoop core components to make them easier to use and applicable to a variety of industries.
Examples of such applications are Hive [14], Pig [15], Mahout [16], HBase [17], Sqoop [18] and
many more.
YARN (Yet Another Resource Negotiator) [19] is the next generation Apache Hadoop platform,
which introduces a new architecture by decoupling the programming model from the resource
management infrastructure and delegating many scheduling-related functions to per-application
components. This new design [19] offers some improvements over the older platform:
- Scalability
- Multi-tenancy
- Serviceability
- Locality awareness
- High cluster utilization
- Reliability/Availability
- Secure and auditable operation
- Support for programming model diversity
- Flexible resource model
- Backward compatibility
The major difference is that the functionality of the JobTracker is split into two new daemons –
ResourceManager (RM) and ApplicationMaster (AM). The RM is a global service, managing all
the resources and jobs in the platform. It consists of a scheduler and the ApplicationManager.
The scheduler is responsible for allocation of resources to the various running applications based
on their resource requirements. The ApplicationManager is responsible for accepting job
submissions and negotiating resources from the scheduler. Additionally, there is a NodeManager
(NM) agent that runs on each worker. It is responsible for allocating and monitoring node
resource (CPU, memory, disk and network) usage and for reporting back to the ResourceManager
(scheduler). An instance of the ApplicationMaster runs per application and negotiates the
appropriate resource containers from the scheduler. It is important to mention that
the new MapReduce 2.0 maintains API compatibility with the older stable versions of Hadoop
and therefore, MapReduce jobs can run unchanged.
Hive [14] is a data warehouse infrastructure built on top of Hadoop. Hive was originally
developed by Facebook and supports the analysis of large data sets stored on HDFS through queries in
a SQL-like declarative query language. This language is called HiveQL and is based on
SQL, but does not strictly follow the SQL-92 standard. For example, HiveQL's User-Defined
Functions (UDFs) allow filtering data with custom Java or Python scripts. Plugging in custom
scripts makes it possible to implement statements that HiveQL does not natively support.
When a HiveQL statement is submitted through the Hive command-line interface, the compiler of
Hive translates the statement into jobs that are submitted to the MapReduce engine [14]. This
allows users to analyze large data sets without actually having to apply the MapReduce
programming model themselves. The MapReduce programming model is very low-level and
requires developers to write custom programs, whereas Hive can be used by analysts with SQL
skills.
Before data stored on HDFS can be analyzed with Hive, Hive's Metastore has to be created. The
Metastore is the central repository for Hive's metadata and stores all information about the
available databases, tables, table columns, column types, etc. The Metastore is kept in a
traditional RDBMS such as MySQL.
When a table is created with HiveQL, the user can define the format of the file stored on
HDFS that contains the actual data of the table [21]. Besides the default text file format,
compressed columnar formats such as ORC and Parquet are available. The applied file format
affects the performance of Hive.
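For illustration, a minimal (hypothetical) HiveQL table definition that selects the ORC format instead of the default text format:

CREATE TABLE product_reviews_orc (
    pr_item_sk BIGINT,
    pr_review_content STRING
)
STORED AS ORC;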
Apache Spark [22] is a processing engine that promises to perform much faster than Hadoop's
MapReduce engine. This performance advantage is achieved in part by Spark's heavy reliance on
in-memory computing, whereas MapReduce is strongly disk-based.
Spark was originally created in 2009 by the AMPLab at UC Berkeley and was developed to run
independently of Hadoop. It is a generic framework that works with a wide variety of distributed
storage systems, including Hadoop.
The Spark project consists of several components [22]. The Spark Core is the general execution
engine that provides APIs for programming languages like Java, Scala and Python and enables an
easy development of Spark programs. All the other Spark components are built on top of the
Spark Core. These components are Spark SQL for analyzing structured data, Spark Streaming for
analyzing streaming data, the machine learning framework MLlib and the graph processing
framework GraphX.
Spark SQL [23] integrates relational processing into Spark and allows users to intermix relational
and procedural processing techniques. Besides the general support for structured data processing,
Spark SQL supports SQL-like statements. These statements can be executed through a command-
line interface similar to Hive's command-line interface. Moreover, Spark SQL is largely
compatible with Hive: it can run unmodified HiveQL queries and use the Hive Metastore [24]. In summary,
Spark SQL relates to Spark in the same way as Hive relates to MapReduce: an interface to
execute SQL-like statements on the respective processing engine.
The general programming model of Spark Core and therefore the fundamentals for all the other
Spark components can be summarized as follows [3]. To write a program running on Spark, the
developer has to write the so called driver program that implements the program flow and
launches various operations in parallel.
Spark provides two main abstractions: Resilient Distributed Datasets (RDDs) and parallel
operations. An RDD is a read-only, partitioned collection of elements.
The separate partitions of the RDD are distributed across a set of machines and can be stored in a
persistent storage as well as in-memory. Persisting and caching the RDD in memory allows very
efficient operations.
Besides allowing Spark's driver program to run its operations on the RDD in parallel on various
machines, an RDD can automatically recover from machine failures.
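A minimal PySpark sketch of this programming model (application name and HDFS path are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# build an RDD from a file on HDFS; its partitions are distributed across the cluster
lines = sc.textFile("hdfs:///user/bigbench/sample.txt")

# cache the filtered RDD in memory so that repeated actions avoid re-reading from disk
nonempty = lines.filter(lambda l: len(l) > 0).cache()

print(nonempty.count())  # first action computes and caches the partitions
print(nonempty.count())  # second action is served from memory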
Cloudera Hadoop Distribution (CDH) [25] is a 100% Apache-licensed open source Hadoop
distribution offered by Cloudera. It includes the core Apache Hadoop elements - Hadoop
Distributed File System (HDFS) and MapReduce (YARN), as well as several additional projects
from the Apache Hadoop Ecosystem. All components are tightly integrated to enable ease of use
and are managed by a central application, Cloudera Manager [26].
3. BigBench
BigBench [26][27] is a proposal for an end-to-end analytics benchmark suite for Big Data
systems. To fit the needs of a Big Data benchmark and to allow the performance comparison of
different Big Data systems, BigBench focuses on the three Big Data characteristics volume,
variety and velocity. It provides a specification describing a data model and workloads of a non-
system-specific end-to-end analytics benchmark. Additionally, a data generator is available to
create data for the data model.
Since the BigBench specification is general and technology agnostic, it has to be implemented
specifically for each Big Data system. The initial implementation of BigBench was made for the
Teradata Aster platform [29]. Written in Aster's SQL-MR syntax, it served, in addition to a
description in the English language, as an initial specification of BigBench's workloads.
Meanwhile, BigBench has been implemented for Hadoop [1], using the MapReduce engine and other
components like Hive, Mahout and OpenNLP from the Hadoop Ecosystem.
To summarize, BigBench covers the data model, depicted in Figure 1, the data generator and the
specification of the workloads. Figure 1 shows how BigBench implements the variety property of
Big Data. This is done by categorizing the data model into three parts: structured, semi-
structured and unstructured data. A fictional product retailer is used as the underlying business
model [27]. The business model and a large portion of the data model's structured part are derived
from the TPC-DS benchmark [30]. The structured part was extended with a table for the prices of
the retailer's competitors, the semi-structured part was added in the form of a table with website
logs, and the unstructured part in the form of a table containing product reviews.
Figure 1: BigBench Schema [31]
The data generator is based on an extension of PDGF [32] and allows generating data in
accordance with BigBench's data model, including the structured, semi-structured and
unstructured parts. The data generator can scale the amount of data based on a scale factor. Due
to its parallel processing, the data generator runs efficiently even for large scale factors. In this way,
the Big Data volume property is implemented in BigBench. Additionally, the velocity property of
Big Data is implemented by a periodic refresh scheme that constantly adds new data to the
different tables of the data model.
The workloads are a major part of BigBench. They are represented by 30 queries, which are
defined as questions about BigBench's underlying business model. Ten of these
queries are taken from the TPC-DS benchmark's workload. The other 20 queries were defined
based on the five major areas of Big Data analytics identified in the McKinsey report on Big Data
use cases and opportunities [6]. These areas are marketing, merchandising, operations, supply
chain and new business models. Besides these business areas, it was ensured that the queries also
cover the following three technical dimensions:
a) The three different data types (structured, semi-structured and unstructured type)
b) The two paradigms of processing (declarative and procedural MR)
c) Different algorithms of analytic processing (classifications, clustering, regression etc.)
A list of the BigBench queries grouped by the technologies their implementation is based on can
be found in Table 1.
Query Type | Queries | Number of Queries
Pure HiveQL | Q6, Q7, Q9, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q21, Q22, Q23, Q24 | 14
Java MapReduce with HiveQL | Q1, Q2 | 2
Python Streaming MR with HiveQL | Q3, Q4, Q8, Q29, Q30 | 5
Mahout (Java MR) with HiveQL | Q5, Q20, Q25, Q26, Q28 | 5
OpenNLP (Java MR) with HiveQL | Q10, Q18, Q19, Q27 | 4
Table 1: BigBench Queries
The combination of factoring in relevant business areas as well as technical dimensions within
the scope of the Big Data characteristics makes BigBench a Big Data analytics benchmark
suite. Besides the objective of becoming an industry standard as TPCx-BB [4], BigBench will be
extended to incorporate additional use cases in the future [28].
4. BigBench on Spark
A major focus of this work is to evaluate and run BigBench on Spark. Because Spark SQL
supports HiveQL, the queries of the type “Pure HiveQL” were successfully ported to Spark and
executed. However, to provide a comprehensive evaluation, the remaining BigBench queries are
also considered in this chapter.
The validation references described in the subsection Query Validation Reference significantly
supported the evaluation. With their help, verifying successful query executions was
straightforward. The first section of this chapter presents workarounds that had to be applied at the
beginning of our research. At that time, Spark SQL was at an earlier stage and did not support
some of the syntactical expressions. During the project, many issues were solved by developers
of the Spark project and the described workarounds became obsolete. Below, the final outcomes
of the evaluation of running BigBench on Spark are described and all necessary porting tasks are
listed.
4.1. Workarounds
Since the start of our research, further development on Spark has solved several issues. However,
before these improvements were available, workarounds for those issues had to be
developed. In the following, two major problems are examined to give an example of our
work and an impression of the state of the Spark SQL component at the time. Each issue is
presented as follows: first, the issue itself is described; then, the temporary workaround is
explained; finally, a reference to the corresponding ticket in the official Spark issue tracker
is given.
Variables substitution
The Hive variable substitution mechanism allows using variables within queries. The so-called
hiveconf variables can be set by passing them with the hiveconf parameter to the client
program or by setting them directly with the set command in the query. Furthermore, values of
ordinary environment variables can be accessed within queries. Depending on whether it is a
hiveconf variable or an environment variable, the variable can be retrieved by using the syntax
${hiveconf:variable_name} or ${env:variable_name} [33].
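For illustration, a minimal (hypothetical) example showing both a hiveconf variable passed to the client program and one set directly in the query:

hive --hiveconf QUERY_LIMIT=10 -e 'SELECT * FROM product_reviews LIMIT ${hiveconf:QUERY_LIMIT};'

set QUERY_LIMIT=10;
SELECT * FROM product_reviews LIMIT ${hiveconf:QUERY_LIMIT};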
The available BigBench implementation for MapReduce uses this mechanism intensively.
Initially, Spark SQL did not support this mechanism. Because the mechanism was used
intensively, and to avoid big changes to the BigBench implementation, the variable
substitution concept was retained. The approach of the workaround was to retrieve and substitute
the variables before the queries were passed to the Spark SQL client program. By doing so, no
variables were within the queries and the actual variable substitution mechanism was obsolete.
The procedure implemented in the script-based solution, which was executed before the query
was passed to the Spark SQL client program, can be described as follows (a short sketch is given after the list):
1) Searching for the variable syntax ${hiveconf:variable_name} and ${env:variable_name}
in the query.
2) Retrieving these variables to obtain their values.
3) Replacing each variable with its received value.
4) Passing the query with replaced variables to the Spark SQL client program.
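A minimal shell sketch of this procedure, assuming bash (query file and variable names are hypothetical):

# read the query, substitute a hiveconf and an environment variable, submit to Spark SQL
QUERY=$(cat q30.sql)
QUERY=${QUERY//'${hiveconf:TEMP_TABLE}'/q30_temp}
QUERY=${QUERY//'${env:BIG_BENCH_USER}'/$USER}
$SPARK_ROOT/bin/spark-sql -e "$QUERY"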
This workaround became obsolete with the resolution of the issue [SPARK-5202] "HiveContext
doesn't support the Variables Substitution"1 in the Spark project.
User-Defined Functions (UDFs) with multiple fields as output
When a UDF's output has multiple fields, it was not possible to assign an alias to each individual
field. The following example shows the desired, but unsupported expression.
SELECT extract_sentiment(pr_item_sk, pr_review_content)
  AS (pr_item_sk, review_sentence, sentiment, sentiment_word)
FROM product_reviews;
Because this expression was not syntactically accepted by Spark SQL, the following
workaround was used to solve the issue.
SELECT `result._c0` AS pr_item_sk, `result._c1` AS review_sentence, `result._c2` AS sentiment,
`result._c3` AS sentiment_word
FROM (
SELECT extract_sentiment(pr.pr_item_sk,pr.pr_review_content) AS return
FROM product_reviews pr
) result;
This workaround became obsolete with the resolution of the issue [SPARK-5237] "UDTF don't
work with multi-alias of multi-columns as output on Spark SQL"2 in the Spark project.
4.2. Porting Issues
This section documents the final outcomes of running the BigBench queries on Spark. Table 2
gives an overview of all the different porting tasks that have been identified together with the
affected queries attached to each task.
Issue | Affected Queries
External scripts in Spark SQL | Q1, Q2, Q3, Q4, Q8, Q10, Q18, Q19, Q27, Q29, Q30
Different expression of null values | Q3, Q8, Q29, Q30
Scripts implemented for MapReduce | Q1, Q2
External libraries | Q5, Q10, Q18, Q19, Q20, Q25, Q26, Q27, Q28
Query specific settings | Q3, Q4, Q7, Q8, Q16, Q21, Q22, Q23, Q24, Q29, Q30
Type definition for return values | Q1, Q2, Q3, Q4, Q8
Table 2: Porting tasks and queries that are affected by them
Subsequently, all different porting tasks are explained in more detail.
1 https://issues.apache.org/jira/browse/SPARK-5202
2 https://issues.apache.org/jira/browse/SPARK-5237
External scripts in Spark SQL
Calling external scripts within queries executed with Spark SQL requires passing the
respective script file paths to the Spark SQL client program. This ensures that these scripts are
distributed to all of the Spark executors. This is relevant for all queries containing user-defined
functions (UDFs) or custom reduce scripts. Depending on whether these are represented as Java
programs (JAR files) or Python scripts (PY files), the parameter to be used differs.
To make Python scripts available on the executors, the files parameter should be used. This
places the scripts in the working directory of each executor. Affected by this issue are the
BigBench queries Q3, Q4, Q8, Q29 and Q30 (see Table 1). The usage of the files parameter is shown by
the following generalized command. The $SPARK_ROOT variable represents the path to the
root of the local Spark repository.
$SPARK_ROOT/bin/spark-sql --files $PY_FILE_PATH
To make Java programs available, the jars parameter should be used. Besides distributing the
files to the Spark executors, this ensures that the programs are included in the Java classpath
on each executor. Affected by this issue are the BigBench queries Q1, Q2, Q10, Q18, Q19 and Q27.
Using the jars parameter is shown by the following generalized command.
$SPARK_ROOT/bin/spark-sql --jars $JAR_FILE_PATH
Different expression of null values
It became apparent that in Hive and Spark SQL, specific calculations lead to different results.
Examples for such different calculation results can be found in Table 3.
Query | Hive Result | Spark SQL Result
SELECT CAST(1 as double) / CAST(0 as double) FROM table; | NULL | Infinity
SELECT CAST(-1 as double) / CAST(0 as double) FROM table; | NULL | -Infinity
SELECT CAST(0 as double) / CAST(0 as double) FROM table; | NULL | NaN
Table 3: Hive and Spark SQL differences
Furthermore, Hive and Spark SQL express null values differently in the context of external
scripts. This difference impacts the row counts of several BigBench query result tables, since
conditions that check whether a row field is equal or unequal to null lead to different results.
Null values are passed to external scripts as \N in Hive and as null in Spark SQL.
Subsequently, a generalized Python code example illustrates the required adjustments to ensure
correct query execution when using Spark SQL.
When executing with Hive, the following condition is valid to check if a row field is unequal to
null.
if rowField != '\N' :
# do something
When executing with Spark SQL, the condition must be adjusted as follows.
if rowField != 'null' :
# do something
Affected by this issue are external scripts of the BigBench queries Q3, Q8, Q29 and Q30.
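One way to keep such scripts engine-agnostic is to accept both representations. A short sketch, assuming the field arrives as a plain string:

# tokens used for null values by Hive ('\N') and Spark SQL ('null')
NULL_TOKENS = ('\\N', 'null')

if rowField not in NULL_TOKENS:
    # do something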
Scripts implemented for MapReduce
External scripts that are specifically implemented for the MapReduce framework are not usable
when running BigBench on Spark. Those scripts have to be rewritten for the Spark framework.
This task requires understanding the respective MapReduce code and transforming it into code
compatible with Spark, which demands sound knowledge of both technologies. The affected
BigBench queries are Q1 and Q2.
External libraries
The implementation of BigBench for MapReduce utilizes a small number of external libraries. It
uses Apache OpenNLP for processing natural language text and Apache Mahout for performing
machine learning tasks. These libraries, which are implemented to run on MapReduce, have to be
replaced. In the case of Apache Mahout, this means waiting for a release that runs on Spark or
choosing a different machine learning library that already runs on Spark, such as MLlib [34].
This issue affects all queries utilizing the functionality of libraries such as Apache OpenNLP
(Java MR) and Mahout (Java MR) (see Table 1).
Query specific settings
Contrary to Hive, Spark SQL does not dynamically determine some of the settings during query
execution. The need for manually defining settings for specific queries and scale factors became
obvious in the case of queries with exhaustive join operations and queries with streaming
functionality. The higher the scale factor, the more relevant these settings were for query
runtime.
Open tickets in the Spark issue tracker like [SPARK-2211] "Join Optimization"3 and
[SPARK-5791] "show poor performance when multiple table do join operation"4 document the missing join
optimization functionality in Spark, which causes the need to tweak settings specifically for
individual queries. The official Spark documentation [35] describes that the number of partitions
is not determined dynamically. It became apparent that setting this value properly was especially
relevant for queries with streaming functionality. Ryza [36] gives a formula that roughly
estimates this value. However, even with the formula, determining this setting is not a
simple task.
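Our reading of the heuristic in [36], as a rough Python sketch (the shuffle write size is a hypothetical example value; memory values in MB):

executor_memory = 9 * 1024   # --executor-memory from the final configuration
executor_cores  = 3          # --executor-cores from the final configuration
memory_fraction = 0.2        # spark.shuffle.memoryFraction (default)
safety_fraction = 0.8        # spark.shuffle.safetyFraction (default)

# memory available to a single task for holding shuffle data
mem_per_task = executor_memory * memory_fraction * safety_fraction / executor_cores

shuffle_write = 20000  # assumed in-memory size of a stage's total shuffle write

# choose enough partitions so that each task's share fits into memory
num_partitions = int(shuffle_write / mem_per_task) + 1
print(num_partitions)  # 41 in this example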
Due to the complexity and the fact that the configuration of such specific settings has to be
individually processed for each query and scale factor, it does not seem to be a practical
approach. With further development, Spark will probably improve its ability to determine such
settings dynamically and to optimize queries.
Affected by this issue concerning determination of query specific settings are the BigBench
queries Q7, Q16, Q21, Q22, Q23, Q24 with exhaustive join operations and the BigBench queries
Q3, Q4, Q8, Q29, Q30 with streaming functionality.
3 https://issues.apache.org/jira/browse/SPARK-2211
4 https://issues.apache.org/jira/browse/SPARK-5791
Type definition for return values
HiveQL supports an operation to integrate custom reduce scripts into the query data stream.
Records output by these scripts have a certain number of fields, which by default are of type
string. However, it is possible to cast each field to a specified data type, and typecasting the
fields of reduce script outputs is used in several BigBench queries. In the case of Spark SQL,
this typecasting did not work properly and caused wrong query execution, while Hive handles
typecasting the return values correctly. Removing the typecast definition solves the issue.
Affected are the BigBench queries Q1, Q2, Q3, Q4 and Q8, as shown in the following example:
SELECT result_field_one, result_field_two
FROM (
  FROM (
    SELECT
      wcs_user_sk AS user,
      wcs_click_date_sk AS lastviewed_date
    FROM source_table
  ) my_return_table
  REDUCE
    my_return_table.user,
    my_return_table.lastviewed_date
  USING 'python reduce_script.py'
  AS (result_field_one BIGINT, result_field_two BIGINT)
) reduced;
When executing on Spark SQL, typecasting return values should be prevented.
SELECT result_field_one, result_field_two
FROM (
  FROM (
    SELECT
      wcs_user_sk AS user,
      wcs_click_date_sk AS lastviewed_date
    FROM source_table
  ) my_return_table
  REDUCE
    my_return_table.user,
    my_return_table.lastviewed_date
  USING 'python reduce_script.py'
  AS (result_field_one, result_field_two)
) reduced;
5. Experimental Setup
This section presents the hardware and software setup of the cluster as well as the exact
configuration of the Hadoop and BigBench components as used in our experiments.
5.1. Hardware
The experiments were performed on a cluster consisting of 4 nodes connected directly through a
1 Gbit Netgear switch, as shown in Figure 2. All 4 nodes are Dell PowerEdge T420 servers. The
master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs each with 6 cores, 32GB of
RAM and 1TB (SATA, 3.5 in, 7.2K RPM, 64MB Cache) hard drive. The worker nodes are
equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB
(SATA, 3.5 in, 7.2K RPM, 64MB Cache) hard drives. More detailed specification of the node
servers is provided in the Appendix (Table 19 and Table 20).
Figure 2: Cluster Setup (master node and three worker nodes connected via a 1 Gbit switch)
Setup Description | Summary
Total Nodes | 4 x Dell PowerEdge T420
Total Processors/Cores/Threads | 5 CPUs / 30 Cores / 60 Threads
Total Memory | 128 GB
Total Number of Disks | 13 x 1TB, SATA, 3.5 in, 7.2K RPM, 64MB Cache
Total Storage Capacity | 13 TB
Network | 1 GBit Ethernet
Table 4: Summary of Total System Resources
Table 4 summarizes the total cluster resources that are used in the calculation of the benchmark
ratios in the next sections.
5.2. Software
This section describes the software setup of the cluster. The exact software versions that were
used are listed in Table 5. Ubuntu Server LTS was installed on all 4 nodes, allocating the entire
first disk. The number of open files per user was changed from the default value of 1024 to 65000
as suggested by the TPCx-HS benchmark and Cloudera guidelines [37]. Additionally, the OS
swappiness option was turned permanently off (vm.swappiness = 0). The remaining three disks,
on all worker nodes, were formatted as ext4 partitions and permanently mounted with options
noatime and nodiratime. Then the partitions were configured to be used by HDFS through the
Cloudera Manager. Each 1TB disk provides 916.8GB of effective HDFS space, which
means that the three workers with their three dedicated disks each (9 x 916.8GB = 8251.2GB) have
in total around 8TB of effective HDFS space.
Software | Version
Ubuntu Server 64 Bit | 14.04.1 LTS, Trusty Tahr, Linux 3.13.0-32-generic
Java (TM) SE Runtime Environment | 1.6.0_31-b04; 1.7.0_72-b14
Java HotSpot (TM) 64-Bit Server VM | 20.6-b01, mixed mode; 24.72-b04, mixed mode
OpenJDK Runtime Environment | 7u71-2.5.3-0ubuntu0.14.04.1
OpenJDK 64-Bit Server VM | 24.65-b04, mixed mode
Cloudera Hadoop Distribution | 5.2.0-1.cdh5.2.0.p0.36
BigBench | [38]
Spark | 1.4.0-SNAPSHOT (March 27th 2015)
Table 5: Software Stack of the System under Test
Cloudera CDH 5.2, with default configurations, was used for all experiments. Table 6
summarizes the software services running on each node. Due to the resource limitation (only 3
worker nodes) of our experimental setup, the cluster was configured to work with a replication
factor of 2. This means that our cluster can store at most 4TB of data on HDFS.
Server | Disk Drive | Software Services
Master Node | Disk 1 / sda1 | Operating System, Root, Swap, Cloudera Manager Services, NameNode, Secondary NameNode, Hive Metastore, HiveServer2, Oozie Server, Spark History Server, Sqoop 2 Server, YARN Job History Server, Resource Manager, Zookeeper Server
Worker Nodes 1-3 | Disk 1 / sda1 | Operating System, Root, Swap, DataNode, YARN NodeManager
Worker Nodes 1-3 | Disk 2 / sdb1 | DataNode
Worker Nodes 1-3 | Disk 3 / sdc1 | DataNode
Worker Nodes 1-3 | Disk 4 / sdd1 | DataNode
Table 6: Software Services per Node
5.3. Cluster Configuration
Besides making modifications on the BigBench implementation as described previously,
configuration parameters for the different components of the cluster have to be properly set so
that BigBench queries run stably (also with higher scale factors). Determining these configuration
parameters is not specific to running the BigBench benchmark; rather, it is part of the general
complexity of Big Data systems and is essential to their proper operation. As a basic principle
when setting the configuration parameters, we tried to follow the rule that they should not differ
from their default values unless adjusting is needed to ensure correct cluster operation. This
principle avoids special-case tuning and guarantees meaningful
benchmarking results. However, utilizing all the available cluster resources and running
BigBench with higher scale factors demonstrated the need for adjusting some of the parameters.
Furthermore, some configuration parameters of Spark were not set by default and had to be
defined accordingly. The process of determining the configuration parameters can be described as
follows and was executed for each individual BigBench query:
1) Identifying errors and abnormal runtime in BigBench query execution.
2) Figuring out which configuration parameters cause the problem.
3) Trying to find problem-solving values for configuration parameters.
4) Validating the configuration parameter values by re-executing the BigBench query:
Parameter values are determined successfully when errors are fixed and abnormal runtime is
solved.
Components of the cluster that were actually affected by adjusted configuration parameters are
YARN, Spark, MapReduce and Hive. It should be noted that changing the configuration
parameters of YARN has an impact on Spark as well as MapReduce because both processing
engines are dependent on the resource manager YARN. Hereafter, the changed configuration
parameters of the particular cluster components are documented and explained.
YARN
To adjust the configuration of the resource manager YARN in order to fit the experimental
cluster and to ensure efficient resource utilization, two configuration parameters were adjusted
initially. The amount of memory that can be allocated for YARN ResourceContainers per node
(yarn.nodemanager.resource.memory-mb = 28672) and the maximum allocation for every
YARN ResourceContainer request were set to 28 GB (yarn.scheduler.maximum-allocation-mb
= 28672). Later, following the recommendations published by Ryza [36], the amount of memory
that can be allocated for YARN ResourceContainers was changed to 31 GB per node
(yarn.nodemanager.resource.memory-mb = 31744). As described in Hortonworks' manual [39],
the maximum allocation for every YARN ResourceContainer request was set to be exactly the
same as the amount of memory that can be allocated for YARN ResourceContainers. In short,
this parameter defines the largest ResourceContainer size YARN will allow. It was also set to 31
GB (yarn.scheduler.maximum-allocation-mb = 31744). Following the recommendations
published by Ryza [36], the number of CPU cores that can be allocated for YARN
ResourceContainers was changed to 11 per node (yarn.nodemanager.resource.cpu-vcores = 11).
The final configuration gives YARN plenty of resources, but still leaves 1 GB of memory and 1
CPU core to the operating system. All of the above YARN configuration adjustments were made
in the respective yarn-site.xml configuration file.
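As a sketch, the final values correspond to entries of the following form in yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>31744</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>31744</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>11</value>
</property>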
Spark
Since the Spark version shipped with CDH 5.2.0 was not used, the Spark configuration that
comes with CDH was deactivated. Many configuration parameters can be set by passing them to
the Spark client program. Besides passing --master yarn to run Spark on YARN in client mode, the
configuration parameters --num-executors, --executor-cores and --executor-memory should be
passed with proper values. Initially, finding proper values for the above mentioned configuration
parameters was done by performing spot-check tests. The different configuration parameter
values of the performed tests and their runtime for two randomly chosen BigBench queries can be
found in Table 7. The test results prompted us to set the configuration parameters to the values
used in configuration 4 (--num-executors 12, --executor-cores 2, --executor-memory 8G).
# | num-executors | executor-memory | executor-cores | Q7 Time (min) | Q24 Time (min)
1 | 3 | 26G | 12 | 2.98 | 4.15
2 | 3 | 26G | 10 | 3.02 | 4.12
3 | 6 | 16G | 4 | 3.05 | 4.05
4 | 12 | 8G | 2 | 2.73 | 3.55
Table 7: Runtime for different Spark configurations
However, the recommendations published by Ryza [36] give a more methodical guideline
regarding the Spark configuration parameters.
The sample cluster in the guide runs 3 executors on each DataNode except the one hosting the
ApplicationMaster, which has only 2 executors. Due to the different hardware resources of our
cluster, a maximum of 3 executors fit on every DataNode. Because of the chosen --executor-cores
and --executor-memory values, every DataNode retains enough free resources for the
ApplicationMaster. Consequently, this results in a total of 9 executors (--num-executors 9).
Every DataNode in the experimental cluster has 12 virtual CPU cores. Since one core is left for
the operating system and Hadoop daemons, there are 11 virtual cores available for the executors.
Dividing the number of cores by the 3 executors per node results in 3 cores per executor (--
executor-cores 3). Therefore, 9 virtual cores per node are used for executors, 1 core is left for the
operating system and 2 spare cores are available. These two cores are the ones available for the
ApplicationMaster.
The amount of memory per executor can be determined by the following calculation:
𝑎𝑝𝑝𝑟𝑜𝑥_𝑒𝑚 =yarn. nodemanager. resource. memory − mb
num − executors=
31 744
3= 10 581
The variable approx_em stores the amount of memory which is theoretically available for each
executor. However, the Java Virtual Machine (JVM) overhead has to be considered and included
into the calculation. This can be done by subtracting the value of the property
spark.yarn.executor.memoryOverhead from the calculated approx_em value. If the property
spark.yarn.executor.memoryOverhead is not explicitly set by the user, its default value is
calculated by max (384, 0.07 * executor-memory). Listed below is the calculation done in order
to determine the memory per executor:
executor-memory = approx_em − spark.yarn.executor.memoryOverhead
                = approx_em − max(384, 0.07 × approx_em)
                = 10581 − max(384, 0.07 × 10581)
                ≈ 9840
The resulting value of 9840 MB is rounded down to 9 GB (--executor-memory 9G).
In addition to the above configurations, which have to be passed as parameters when calling the
client program, the default serializer used for object serialization was also changed
(spark.serializer = org.apache.spark.serializer.KryoSerializer). The faster Kryo serializer was
chosen over the default serializer as recommended by various sources [36], [35]. The serializer
option was adjusted in the respective spark-defaults.conf configuration file.
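Putting the final values together, a generalized spark-sql invocation and the added spark-defaults.conf entry look as follows (the $SPARK_ROOT variable as in the earlier examples):

$SPARK_ROOT/bin/spark-sql --master yarn --num-executors 9 --executor-cores 3 --executor-memory 9G

spark.serializer org.apache.spark.serializer.KryoSerializer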
MapReduce
Specifically for the BigBench queries that include Java MapReduce programs (Q1 and Q2),
configuration parameters had to be adjusted to ensure accurate execution. Execution errors were
caused by not allowing enough memory for the map and reduce tasks. Also the allowed Java heap
size of the map and reduce tasks [40] had to be increased. To find proper values for these
parameters, values were raised incrementally until errors were eliminated. This resulted in the
following adjusted parameters: mapreduce.map.java.opts.max.heap = 2GB,
mapreduce.reduce.java.opts.max.heap = 2GB, mapreduce.map.memory.mb = 3GB,
mapreduce.reduce.memory.mb = 3GB. These settings were changed in the respective mapred-
site.xml configuration file.
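In plain Hadoop terms (outside Cloudera Manager, which exposes the heap settings under the max.heap names above), the adjustments correspond roughly to the following mapred-site.xml entries:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2048m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>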
Hive
When executing BigBench's query Q9 with the default configuration, Hive encountered an out of
memory error. Initially, this issue was solved by deactivating MapJoins for this particular query.
The MapJoin feature allows loading a table in memory, so that a very fast table scan can be
performed [41]. As a consequence, performing a MapJoin requires more memory resources. In
our case this caused out of memory errors, which could be resolved by simply deactivating this
feature. Deactivating was done by setting hive.auto.convert.join = false in the file
engines/hive/queries/q09/q09.sql of the BigBench repository.
Even though deactivating MapJoins solves the problem, it entails a significant performance
decline. A better solution is to increase the heap size of the local Hadoop client JVM to prevent the
out of memory error. In our case the heap size was increased to 2 GB. This was done by adding
the parameters -Xms2147483648 and -Xmx2147483648 to the environment variable
HADOOP_CLIENT_OPTS in the hive-env.sh file.
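A minimal sketch of the corresponding hive-env.sh entry:

export HADOOP_CLIENT_OPTS="-Xms2147483648 -Xmx2147483648 $HADOOP_CLIENT_OPTS"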
Configuration validation
While determining proper parameter values, multiple validations were performed. Especially
after applying the guidelines published by Ryza [36] and after choosing the better solution for
the MapJoin issue described in the Hive section above, the values were validated against the
ones previously used. It should be noted that the previous configuration can also be seen as a
viable configuration. However, the following validation results should verify Ryza's
guidelines [36] and demonstrate the performance difference between the two configurations.
Table 8 lists the different parameters for the default, initial and final configurations as used in
our cluster configuration. Figure 3 illustrates the effect on queries’ runtime when changing the
initial cluster configuration to the final cluster configuration.
Component | Parameter | Default Configuration | Initial Configuration | Final Configuration
YARN | yarn.nodemanager.resource.memory-mb | 8GB | 28GB | 31GB
YARN | yarn.scheduler.maximum-allocation-mb | 8GB | 28GB | 31GB
YARN | yarn.nodemanager.resource.cpu-vcores | 8 | 8 | 11
Spark | master | local | yarn | yarn
Spark | num-executors | 2 | 12 | 9
Spark | executor-cores | 1 | 2 | 3
Spark | executor-memory | 1GB | 8GB | 9GB
Spark | spark.serializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.KryoSerializer
MapReduce | mapreduce.map.java.opts.max.heap | 788MB | 2GB | 2GB
MapReduce | mapreduce.reduce.java.opts.max.heap | 788MB | 2GB | 2GB
MapReduce | mapreduce.map.memory.mb | 1GB | 3GB | 3GB
MapReduce | mapreduce.reduce.memory.mb | 1GB | 3GB | 3GB
Hive | hive.auto.convert.join (Q9 only) | true | false | true
Hive | Client Java Heap Size | 256MB | 256MB | 2GB
Table 8: Initial and final configuration
Considering the differences in the runtimes of the individual queries depicted in Figure 3, no big
difference can be seen when running them on MapReduce, except for query Q9. The reason for
this is that the maximum client Java heap size was raised to 2GB. Apparently no query other
than Q9 was running into that limit, so this change did not have any impact on the other
runtimes. As mentioned in the Hive section above, not turning off MapJoins for query Q9, but
raising the maximum client Java heap size instead, significantly improved its runtime. When
running the queries on Apache Spark, 8 queries became faster whereas 4 queries became slower.
Figure 3: Differences in runtime for different cluster configurations
In summary, the initial configuration determined through testing can be considered a decent
configuration, as it showed only slightly slower runtimes compared to the final configuration.
Nevertheless, the final configuration following the best practices was chosen for the actual
benchmarking experiments. Investigating the performance of different configurations in
advance allowed us to validate the final configuration. This was sufficient for our benchmark
purposes, since our goal was not to find the optimal cluster configuration.
6. Experimental Results
This section presents the query execution results. Experiments were performed with all 30
BigBench queries on MapReduce/Hive and with the group of 14 pure HiveQL queries on Spark SQL
for four different scale factors (100GB, 300GB, 600GB and 1TB).
6.1. BigBench on MapReduce
Table 9 summarizes the absolute runtimes of the 30 BigBench queries on MapReduce/Hive for
the 100GB scale factor. There are three columns depicting the times for each run in minutes, a
column with the average execution time of the three runs and two columns with the standard
deviation in minutes and in %. Cells marked in yellow indicate queries with a standard deviation
higher than 2%.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q1 3.75 3.77 3.73 3.75 0.02 0.44
Q2 8.40 8.27 8.03 8.23 0.19 2.25
Q3 10.20 10.22 9.55 9.99 0.38 3.81
Q4 72.58 72.98 68.55 71.37 2.45 3.44
Q5 27.85 28.18 27.07 27.70 0.57 2.07
Q6 6.43 6.37 6.27 6.36 0.08 1.32
Q7 9.18 9.10 8.92 9.07 0.14 1.50
Q8 8.57 8.60 8.60 8.59 0.02 0.22
Q9 3.12 3.08 3.18 3.13 0.05 1.63
Q10 15.58 15.50 15.25 15.44 0.17 1.12
Q11 2.90 2.87 2.88 2.88 0.02 0.58
Q12 7.18 6.97 6.97 7.04 0.13 1.78
Q13 8.43 8.28 8.43 8.38 0.09 1.03
Q14 3.12 3.12 3.27 3.17 0.09 2.73
Q15 2.03 2.05 2.03 2.04 0.01 0.47
Q16 5.88 5.90 5.57 5.78 0.19 3.25
Q17 7.55 7.53 7.72 7.60 0.10 1.33
Q18 8.47 8.57 8.57 8.53 0.06 0.68
Q19 6.53 6.53 6.60 6.56 0.04 0.59
Q20 8.50 8.62 8.02 8.38 0.32 3.80
Q21 4.58 4.53 4.63 4.58 0.05 1.09
Q22 16.53 16.92 16.48 16.64 0.24 1.42
Q23 18.18 18.05 18.37 18.20 0.16 0.87
Q24 4.82 4.80 4.77 4.79 0.03 0.53
Q25 6.25 6.22 6.22 6.23 0.02 0.31
Q26 5.32 5.18 5.08 5.19 0.12 2.25
Q27 0.93 0.87 0.93 0.91 0.04 4.22
Q28 18.30 18.38 18.40 18.36 0.05 0.29
Q29 5.27 5.28 4.97 5.17 0.18 3.45
Q30 19.68 19.77 18.98 19.48 0.43 2.21
Table 9: MapReduce Executions for all BigBench Queries with SF 100 (100GB data)
Similarly, Table 10 illustrates the three runtimes of the BigBench queries on MapReduce/Hive
with the 300GB scale factor. Again reported are the average time of the three runs and the
standard deviation in minutes and in %. For this scale factor all queries show a standard deviation
within 2%, which is an indicator of stable performance.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q1 5.57 5.53 5.47 5.52 0.05 0.92
Q2 21.15 21.03 21.03 21.07 0.07 0.32
Q3 26.28 26.37 26.30 26.32 0.04 0.17
Q4 221.53 221.83 220.58 221.32 0.65 0.29
Q5 76.68 76.52 76.48 76.56 0.11 0.14
Q6 10.75 10.60 10.73 10.69 0.08 0.77
Q7 17.02 16.88 16.87 16.92 0.08 0.49
Q8 17.78 17.62 17.82 17.74 0.11 0.60
Q9 6.60 6.52 6.57 6.56 0.04 0.64
Q10 19.70 19.82 19.48 19.67 0.17 0.86
Q11 4.60 4.62 4.60 4.61 0.01 0.21
Q12 11.68 11.60 11.52 11.60 0.08 0.72
Q13 12.97 13.08 12.95 13.00 0.07 0.56
Q14 5.47 5.53 5.45 5.48 0.04 0.80
Q15 3.02 2.97 3.05 3.01 0.04 1.39
Q16 14.72 14.90 14.88 14.83 0.10 0.68
Q17 10.95 10.85 10.93 10.91 0.05 0.49
Q18 11.12 11.05 10.88 11.02 0.12 1.09
Q19 7.20 7.22 7.25 7.22 0.03 0.35
Q20 20.22 20.42 20.23 20.29 0.11 0.55
Q21 6.90 6.85 6.92 6.89 0.03 0.50
Q22 19.35 19.85 19.08 19.43 0.39 2.00
Q23 20.12 20.92 20.48 20.51 0.40 1.95
Q24 7.00 7.05 7.00 7.02 0.03 0.41
Q25 11.18 11.20 11.23 11.21 0.03 0.23
Q26 8.58 8.57 8.55 8.57 0.02 0.19
Q27 0.63 0.63 0.62 0.63 0.01 1.53
Q28 21.27 21.25 21.22 21.24 0.03 0.12
Q29 11.68 11.72 11.80 11.73 0.06 0.51
Q30 57.73 57.60 57.72 57.68 0.07 0.13
Table 10: MapReduce Executions for all BigBench Queries with SF 300 (300GB data)
Table 11 depicts the three runtimes of the BigBench queries on MapReduce/Hive with the 600GB
scale factor. Reported are the average runtime of the three runs and the standard deviation in
minutes and in %. For this scale factor all queries show a standard deviation within 2%,
except for Q22, which has a standard deviation of around 5.6%.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q1 8.10 8.10 8.13 8.11 0.02 0.24
Q2 40.35 40.02 39.97 40.11 0.21 0.52
Q3 53.60 53.52 53.23 53.45 0.19 0.36
Q4 510.25 502.00 493.67 501.97 8.29 1.65
Q5 155.18 155.52 156.35 155.68 0.60 0.39
Q6 16.65 16.78 16.75 16.73 0.07 0.41
Q7 29.43 29.47 29.63 29.51 0.11 0.36
Q8 32.45 32.47 32.47 32.46 0.01 0.03
Q9 11.45 11.58 11.47 11.50 0.07 0.63
Q10 24.07 24.53 24.28 24.29 0.23 0.96
Q11 7.47 7.48 7.42 7.46 0.03 0.47
Q12 18.83 18.62 18.55 18.67 0.15 0.79
Q13 20.22 20.20 20.27 20.23 0.03 0.17
Q14 9.00 8.98 8.98 8.99 0.01 0.11
Q15 4.45 4.48 4.47 4.47 0.02 0.37
Q16 28.83 29.27 29.28 29.13 0.26 0.88
Q17 14.60 14.63 14.57 14.60 0.03 0.23
Q18 14.48 14.32 14.53 14.44 0.11 0.79
Q19 7.55 7.67 7.52 7.58 0.08 1.04
Q20 39.27 39.30 39.40 39.32 0.07 0.18
Q21 10.22 10.25 10.20 10.22 0.03 0.25
Q22 18.72 19.80 20.93 19.82 1.11 5.59
Q23 23.10 23.08 23.47 23.22 0.22 0.93
Q24 10.27 10.30 10.33 10.30 0.03 0.32
Q25 19.88 20.02 20.07 19.99 0.09 0.47
Q26 15.05 15.00 15.20 15.08 0.10 0.69
Q27 0.98 0.98 0.97 0.98 0.01 0.98
Q28 24.77 24.73 24.82 24.77 0.04 0.17
Q29 22.78 22.73 22.82 22.78 0.04 0.18
Q30 119.38 120.27 119.93 119.86 0.45 0.37
Table 11: MapReduce Executions for all BigBench Queries with SF 600 (600GB data)
Table 12 summarizes the absolute runtimes of the 30 BigBench queries on MapReduce/Hive for
the 1000GB/1TB scale factor. There are three columns depicting the times for each run in minutes, a
column with the average execution time of the three runs and two columns with the standard
deviation in minutes and in %. Similar to the smaller scale factors, only Q22 has a standard
deviation slightly higher than 2% and is marked in yellow.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q1 10.48 10.45 10.63 10.52 0.10 0.93
Q2 68.12 66.47 66.90 67.16 0.86 1.27
Q3 89.30 91.48 90.87 90.55 1.13 1.24
Q4 927.67 918.05 940.33 928.68 11.18 1.20
Q5 272.53 268.67 264.27 268.49 4.14 1.54
Q6 25.28 25.40 25.67 25.42 0.20 0.77
Q7 46.40 46.47 46.33 46.33 0.07 0.14
Q8 53.30 53.78 53.93 53.67 0.33 0.62
Q9 17.62 17.87 17.68 17.72 0.13 0.73
Q10 22.92 22.62 22.67 22.73 0.16 0.71
Q11 11.20 11.23 11.23 11.24 0.02 0.17
Q12 29.93 29.88 30.05 29.86 0.09 0.29
Q13 30.30 30.30 30.07 30.18 0.13 0.45
Q14 13.88 13.83 13.87 13.84 0.03 0.18
Q15 6.35 6.38 6.38 6.37 0.02 0.30
Q16 48.77 48.63 48.87 48.85 0.12 0.24
Q17 18.53 18.63 18.62 18.57 0.05 0.29
Q18 27.60 27.75 27.45 27.60 0.15 0.54
Q19 8.18 8.15 8.13 8.16 0.03 0.31
Q20 64.83 64.92 64.77 64.84 0.08 0.12
Q21 14.90 14.93 14.98 14.92 0.04 0.28
Q22 29.78 31.03 30.67 29.84 0.64 2.15
Q23 25.05 25.23 25.12 25.16 0.09 0.37
Q24 14.75 14.77 14.83 14.75 0.04 0.30
Q25 31.65 31.65 31.55 31.62 0.06 0.18
Q26 22.92 22.85 23.07 22.94 0.11 0.48
Q27 0.70 0.68 0.68 0.69 0.01 1.40
Q28 28.87 28.87 29.05 28.93 0.11 0.37
Q29 37.05 37.37 37.35 37.21 0.18 0.48
Q30 199.30 203.10 200.50 200.97 1.94 0.97
Table 12: MapReduce Executions for all BigBench Queries with SF 1000 (1TB data)
6.2. BigBench on Spark SQL
This part presents the group of 14 pure HiveQL BigBench queries executed on Spark SQL with
different scale factors.
Table 13 summarizes the absolute runtimes of the 14 queries run on Spark SQL for the 100GB
scale factor. Reported are the absolute times of the three runs, the average runtime in minutes and
the standard deviation in minutes and in %. The yellow cells indicate queries with a standard
deviation higher than or equal to 2% and possibly unstable behavior.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q6 2.53 2.60 2.50 2.54 0.05 2.00
Q7 2.53 2.53 2.55 2.54 0.01 0.38
Q9 1.25 1.25 1.22 1.24 0.02 1.55
Q11 1.17 1.15 1.15 1.16 0.01 0.83
Q12 1.95 1.98 1.95 1.96 0.02 0.98
Q13 2.43 2.42 2.43 2.43 0.01 0.40
Q14 1.25 1.23 1.25 1.24 0.01 0.77
Q15 1.40 1.40 1.40 1.40 0.00 0.00
Q16 3.40 3.38 3.43 3.41 0.03 0.75
Q17 1.55 1.55 1.57 1.56 0.01 0.62
Q21 2.70 2.68 2.65 2.68 0.03 0.95
Q22 31.75 45.50 32.73 36.66 7.67 20.92
Q23 16.08 17.45 16.52 16.68 0.70 4.19
Q24 3.32 3.33 3.33 3.33 0.01 0.29
Table 13: Spark SQL Executions for the group of 14 HiveQL BigBench Queries with SF 100 (100GB data)
Analogously, Table 14 presents the execution times with the 300GB scale factor. The reported
columns are the same, and the yellow cells indicate a standard deviation higher than 2%.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q6 3.42 3.50 3.63 3.52 0.11 3.11
Q7 6.03 6.17 5.93 6.04 0.12 1.94
Q9 1.70 1.73 1.68 1.71 0.03 1.49
Q11 1.37 1.38 1.38 1.38 0.01 0.70
Q12 3.05 3.05 3.08 3.06 0.02 0.63
Q13 3.58 3.60 3.60 3.59 0.01 0.27
Q14 1.55 1.58 1.55 1.56 0.02 1.23
Q15 1.58 1.58 1.60 1.59 0.01 0.61
Q16 7.85 8.00 7.80 7.88 0.10 1.32
Q17 2.13 2.22 2.22 2.19 0.05 2.20
Q21 10.12 11.13 10.67 10.64 0.51 4.78
Q22 54.90 62.10 65.07 60.69 5.23 8.61
Q23 25.57 26.60 28.88 27.02 1.70 6.28
Q24 15.22 15.23 15.35 15.27 0.07 0.48
Table 14: Spark SQL Executions for the group of 14 HiveQL BigBench Queries with SF 300 (300GB data)
Table 15 shows the absolute Spark SQL runtimes for the 600GB scale factor. Standard deviations higher than 2% are marked in yellow.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q6 4.80 4.82 4.88 4.83 0.04 0.91
Q7 24.67 21.07 18.67 21.47 3.02 14.07
Q9 2.32 2.30 2.30 2.31 0.01 0.42
Q11 1.67 1.70 1.68 1.68 0.02 0.99
Q12 4.92 4.92 4.93 4.92 0.01 0.20
Q13 5.57 5.55 5.60 5.57 0.03 0.46
Q14 2.10 2.12 2.08 2.10 0.02 0.79
Q15 1.93 1.92 1.93 1.93 0.01 0.50
Q16 23.78 23.40 22.78 23.32 0.50 2.16
Q17 2.90 2.90 2.92 2.91 0.01 0.33
Q21 28.32 27.38 25.83 27.18 1.25 4.62
Q22 96.00 78.55 92.22 88.92 9.18 10.32
Q23 57.77 53.78 44.78 52.11 6.65 12.76
Q24 41.62 46.08 38.87 42.19 3.64 8.63
Table 15: Spark SQL Executions for the group of 14 HiveQL BigBench Queries with SF 600 (600GB data)
Finally, Table 16 summarizes the query times for the largest scale factor of 1000GB. Most of the queries show a standard deviation higher than 2%, which is marked in yellow.
Query Run1 (min) Run2 (min) Run3 (min) Average Time (min) Standard Deviation (min) Standard Deviation %
Q6 6.68 6.72 6.75 6.70 0.03 0.50
Q7 39.68 42.73 42.67 41.07 1.74 4.24
Q9 2.78 2.90 2.78 2.82 0.07 2.42
Q11 2.08 2.07 2.08 2.07 0.01 0.46
Q12 7.55 7.62 7.52 7.56 0.05 0.67
Q13 8.03 7.95 7.97 7.98 0.04 0.55
Q14 2.95 2.82 2.87 2.83 0.07 2.38
Q15 2.37 2.35 2.35 2.36 0.01 0.41
Q16 42.73 45.63 42.83 43.65 1.65 3.77
Q17 3.55 3.47 3.52 3.55 0.04 1.18
Q21 45.30 51.73 49.37 48.08 3.25 6.77
Q22 110.27 114.78 138.92 122.68 15.40 12.56
Q23 69.40 74.78 71.57 69.01 2.71 3.92
Q24 83.32 77.20 76.02 77.05 3.92 5.09
Table 16: Spark SQL Executions for the group of 14 HiveQL BigBench Queries with SF 1000 (1TB data)
6.3. Query Validation Reference
This section provides the tables with the exact values that were used in the process of porting the BigBench queries to Spark and validating their results.
Table 17 shows the row counts for all database tables of BigBench's data model for the different
scale factors 100GB, 300GB, 600GB and 1000GB.
Table Name Row Count (SF 100 / SF 300 / SF 600 / SF 1000) Sample Row
customer 990000 1714731 2424996 3130656 0 AAAAAAAAAAAAAAAA 1824793 3203 25558 14690 14690 Ms. Marisa
Harrington N 17 4 1988 UNITED ARAB EMIRATES PQByuX1WeD19
[email protected] fcKlEcS7
customer_address 495000 857366 1212498 1565328 0 AAAAAAAAAAAAAAAA 561 Cedar 12th Road I3jhw5ICEB White City Montmorency County MI 64453 United States -5.0 condo
customer_demographics 1920800 1920800 1920800 1920800 0 F U Primary 6000 Good 0 5 0
date_dim 109573 109573 109573 109573 0 AAAAAAAAAAAAAAAA 1900-01-01 0 0 0 1900 1 1 1 1 1900 0 0 Monday 1900Q1
Y N N 2448812 2458802 2472542 2420941 N N N N N
household_demographics 7200 7200 7200 7200 0 3 1001-5000 0 0
income_band 20 20 20 20 0 1 10000
inventory 883693800 1852833814 2848155453 3824032470 38220 53687 15 65
item 178200 308652 436499 563518
0 AAAAAAAAAAAAAAAA 2000-01-14 quickly even dinos beneath the frays must
have to boost boldly careful bold escapades: stealthily even forges over the dependencies
integrate always past the quiet sly decoys-- notornis solve fluffily; furious dinos doubt
with the realms: always dogged dinos among the slow pains 28.68 69.06 3898712
50RQ6LQauF0XabhPLF4tsAFIvliiMoGQv 1 Fan Shop 9 Sports & Outdoors 995
0UMxurGVvkHOSQk5 small 77DdZq5tEbYRQBkvV1 dodger Oz Unknown 18
7l8m4P6R12CMVibnv4mUkg4ybmpv0RIMoMHKWhKU9
item_marketprices 891000 1543257 2182495 2817590 0 60665 5VitFqR2CxJ 95.41 7604 92131
product_reviews 1034796 2007732 3143124 4450482
187125 2186-01-31 114344 5 93256 6994338712124158976 8520181449317677056
When tried these Jobst 15-20 mmHg pantyhose in my waist at the waist cincher is not for
you. tried tucking the net piece part of the dryer covered with wrinkles
promotion 3707 4520 5033 5411 0 AAAAAAAAAAAAAAAA 61336 94523 104776 445.17 1 able Y N N N N N N N
always bold warthogs despite the dugouts will play closely b Unknown N
reason 433 527 587 631 0 48h2I9vhvJ slyly thin dugouts on the ironically enticing real
ship_mode 20 20 20 20 0 FW7qE09M ZjZ84JKe 8CNtE5D IpPSqBCvGzN4m6G 75jAyujyTumy2CFBWAQD
store 120 208 294 379
0 AAAAAAAAAAAAAAAA 2000-08-08 71238 able 217 8891512 8AM-12AM Joshua
Watson 6 Unknown realms sublate quickly outside the epitaphs; evenly silent patterns
boost! thin patterns within the daring thin sheaves nod daringly instead of the fluffy final
soma Randy King 1 Unknown 1 Unknown 916 1st Boulevard WD Post Oak Hoke
County NC 47562 United States -5.0 0.11
store_returns 6108428 19740384 40807766 69407907 66190 80578 57566 962182 611011 2556 419286 83 152 3518518 19 700.34 42.02
742.36 79.14 103.0 413.2 267.04 20.1 187.04
store_sales 107843438 348352146 720479689 1224712024 37337 84551 145227 190483 240122 2393 453476 7 2772 3562467 14 60.5 100.43
73.31 379.68 1026.34 847.0 1406.02 37.97 266.85 759.49 797.46 -87.51
time_dim 86400 86400 86400 86400 0 AAAAAAAAAAAAAAAA 0 0 0 0 AM third night
warehouse 19 23 25 26 0 AAAAAAAAAAAAAAAA thin theodolites poach stealth 467315 738 Main Smith
Cir. X3 Bethel Caldwell County KY 52585 United States -6.0
web_clickstreams 1092877307 3530048749 7300782597 12409888280 37340 3106 NULL 168922 133 NULL
web_page 741 904 1007 1082 0 AAAAAAAAAAAAAAAA 2000-07-31 103908 107243 0 579660
http://www.A7Svq4s2L2eLJfz44PDVxeF0BuRRFhsKwBEnKjyzlcM3VebenChLAi7D
YwXi7v6Kkca3dBvMV5Y.com feedback 2339 11 4 1
web_returns 6115748 19737891 40824500 69406183 55179 42872 35361 571349 1096022 2609 225532 571349 1096022 2609 225532 478
161 1779133 13 1826.37 127.85 1954.22 97.94 286.05 1205.4 546.45 74.52 541.75
web_sales 107854751 348360527 720453868 1224631543 37791 77933 37869 25520 860026 1810864 3208 260615 860026 1810864 3208 260615
235 5 12 10 2130 7174583 16 11.34 33.23 21.93 180.8 350.88 181.44 531.68 4.95
185.97 132.92 164.91 169.86 297.83 302.78 -16.53
web_site 30 30 30 30
0 AAAAAAAAAAAAAAAA 1999-08-16 site_0 12694 77464 Unknown Robert
Stewart 1 even ruthless multipliers should have to maintain sometimes even ruthless bold
notornis doubt: closely quiet hockey players behind the fluffily daring decoys try to
maintain never along the thinly ironic t James Feliciano 3 bar 625 1st Lane EF85 Bolton
Elbert County GA 68675 United States -5.0 0.04
Table 17: Number of Rows in all BigBench Tables for the tested Scale Factors
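These reference counts were used to validate the ported queries: after generating a scale factor, each table's actual row count can be compared against the expected value on either engine. A minimal sketch of such a check, assuming the `hive` and `spark-sql` command-line clients are on the PATH (the exact invocation and output parsing are assumptions):

```python
import subprocess

# Expected row counts for SF 1000, excerpted from Table 17.
EXPECTED_SF1000 = {
    "customer": 3130656,
    "store_sales": 1224712024,
    "web_sales": 1224631543,
}

def count_rows(table, engine="spark-sql"):
    """Run SELECT COUNT(*) through the engine's CLI; assumes the count is
    the last whitespace-separated token printed on stdout."""
    cli = {"hive": ["hive", "-S", "-e"], "spark-sql": ["spark-sql", "-e"]}[engine]
    out = subprocess.check_output(cli + [f"SELECT COUNT(*) FROM {table};"], text=True)
    return int(out.strip().split()[-1])

for table, expected in EXPECTED_SF1000.items():
    actual = count_rows(table)
    status = "OK" if actual == expected else f"MISMATCH (expected {expected})"
    print(f"{table}: {actual} {status}")
```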
Table 18 shows the row counts for BigBench's query result tables for the different scale factors
100 GB, 300 GB, 600 GB and 1000 GB.
Query # Row Count (SF 100 / SF 300 / SF 600 / SF 1000) Sample Row
Q1 0 0 0 0
Q2 1288 1837 1812 1669 1415 41 1
Q3 131 426 887 1415 20 5809 1
Q4 73926146 233959972 468803001 795252823 0_1199 1
Q5 logRegResult.txt AUC = 0.50 confusion: [[0.0, 0.0], [1.0, 3129856.0]] entropy: [[-0.7, -0.7], [-0.7, -0.7]]
Q6 100 100 100 100 AAAAAAAAAAAAAAAA Marisa Harrington N UNITED ARAB EMIRATES PQByuX1WeD19
0.7015194245020148 0.6517334472176035
Q7 52 52 52 52 WY 63269
Q8 1 1 1 1 5.1591069883547675E11 5.382825071218451E10
Q9 1 1 1 1 10914483
Q10 2879890 5582973 8743044 12396422 479336 If this is some kind of works and she's really pretty but just couldn't get that excited about
something dont make it reggae lyrics). POS kind
Q11 1 1 1 1 0.000677608
Q12 1697681 10196175 30744360 68614374 37134 37142 9 2950380
Q13 100 100 100 100 AAAAAAAAAAAAAAAA Marisa Harrington 0.4387617877663627 0.8869539352739836
Q14 1 1 1 1 0.998896356
Q15 7 4 6 3 1 -3.60713321147841 216619.96230580617
Q16 1431932 3697528 6404121 9137536 AK AAAAAAAAAAAAAAMD -171.92000000000002 0.0
Q17 1 1 1 1 2.446298259939976E9 4.1096035613800263E9 59.526380669148935
Q18 1501027 2805571 4361606 9280457 ese 2044-02-07 We never really get to know what is not? NEG never
Q19 15 2 91 270 551717 Hooked myPlayStation 80GBup to mySamsung LN40A650 40-Inch 1080p 120Hz LCD HDTV
with RED Touch of Colorand the screen flickered really bad while playingCall of Duty: World at War.
NEG bad
Q20 cluster.txt VL-1426457{n=599019 c=[1946576.977, 12.584, 5.737, 3.194, 6.563] r=[591462.113, 3.598, 2.609,
1.739, 2.011]}
Q21 0 0 0 1 AAAAAAAAAAABDCIK slow quick frays should promise enticingly through the quick asymptotes;
furious theodolites beside the asymptotes kindle slowly foxes: furious somas through the slyly idle
dolphin AAAAAAAAAAAAAADU eing 27 4 82
Q22 11342 23149 0 47058 careful wa AAAAAAAAAAAAAKLL 2545 2276
Q23_1 9205 19417 29613 39727 0 356 1 444.4 1.0716206156635266 0 356 2 354.5 1.2073749163813288
Q23_2 492 1080 1589 2129 0 483 2 262.0 1.694455894415943 0 483 3 390.25 1.0126729703080375
Q24 9 10 8 8 7 NULL
Q25 cluster.txt VL-1906612{n=405237 c=[2804277.105, 1.000, 77.611, 1126397.997] r=[0:248120.802, 2:7.701,
3:126175.278]}
Q26 cluster.txt VL-2422906{n=684261 c=[0:1004083.596, 1:27.456, 2:22.124, 3:18.270, 9:32.999, 10:18.810]
r=[0:271823.023, 1:6.530, 2:5.646, 3:5.027, 9:7.426, 10:5.127]}
Q27 1 0 3 0 2412458 10653 American On an exploratory trip in "savage" lands
Q28 classifierResult.txt Correctly Classified Instances: 1060570 59.5777%
Q29 72 72 72 72 7 6 Toys & Games Tools & Home Improvement 4664408
Q30 72 72 72 72 7 6 Toys & Games Tools & Home Improvement 42658456
Table 18: Number of Rows in the Result Tables for all BigBench Queries
7. Resource Utilization Analysis
The resource utilization metrics are gathered with the aid of Intel's Performance Analysis Tool
(PAT) [42]. For each query the metrics CPU utilization, disk input/output, memory utilization
and network input/output are provided when running the query on MapReduce as well as Spark.
The measurements of the utilization metrics are depicted as graphs to show their distribution over
the query's runtime. Additionally, the average/total values of the metric measurements are shown
in a table for both MapReduce and Spark. This allows comparing the two engines.
For this experiment the queries were executed with scale factor 1000GB.
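PAT periodically samples these metrics on every node. As a rough sketch of how the per-query averages in the following tables can be derived from such samples (the CSV layout below is a simplifying assumption for illustration; PAT's native output format differs):

```python
import csv

def average_columns(path, columns):
    """Average selected numeric columns of a metrics file with one
    sample per row (hypothetical per-second CSV export)."""
    sums = dict.fromkeys(columns, 0.0)
    n = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for c in columns:
                sums[c] += float(row[c])
            n += 1
    return {c: sums[c] / n for c in columns}

# Hypothetical usage for the CPU panel of one query:
# average_columns("q4_cpu.csv", ["user_pct", "system_pct", "iowait_pct"])
```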
7.1. BigBench Query 4 (Python Streaming)
BigBench's query Q4 performs a shopping cart abandonment analysis: For users who added
products in their shopping carts but did not check out in the online store, find the average number
of pages they visited during their sessions [29]. The query is implemented in HiveQL and
executes additionally python scripts.
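As an illustrative paraphrase of that logic (not the benchmark's actual streamed script; the action names and session timeout are placeholders), the per-user sessionization and abandonment test can be sketched as:

```python
SESSION_GAP = 3600  # illustrative session timeout in seconds

def abandoned_session_page_counts(clicks):
    """clicks: time-sorted (timestamp, action) pairs of one user, where the
    action names ('add_to_cart', 'checkout') stand in for the click types
    of the data model. Yields the page count of every session that added
    items to the cart but never checked out."""
    session, last_ts = [], None
    for ts, action in clicks + [(float("inf"), "_end")]:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            actions = [a for _, a in session]
            if "add_to_cart" in actions and "checkout" not in actions:
                yield len(session)
            session = []
        if action != "_end":
            session.append((ts, action))
        last_ts = ts

# Average over all users:
# avg_pages = mean(n for u in users for n in abandoned_session_page_counts(u.clicks))
```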
Scale Factor: 1TB
Input Data size/ Number of Tables: 122GB / 4 Tables
Average Runtime (minutes): 929 minutes
Result table rows: 795 252 823
MapReduce stages: 33
Avg. CPU Utilization %
User % 48.82%
System % 3.31%
IOwait% 4.98%
Memory Utilization % 95.99%
Avg. Kbytes Transmitted per Second 7128.30
Avg. Kbytes Received per Second 7129.75
Avg. Context Switches per Second 11364.64
Avg. Kbytes Read per Second 3487.38
Avg. Kbytes Written per Second 5607.87
Avg. Read Requests per Second 47.81
Avg. Write Requests per Second 12.88
Avg. I/O Latencies in Milliseconds 115.24
Summary: The query is memory bound with 96% utilization and around 5% IOwait, which means that the CPU is waiting for outstanding disk I/O requests. It has a modest CPU utilization of around 49%, but a very high number of context switches per second and very long average I/O latencies. This makes Q4 the slowest of all 30 BigBench queries.
[Figures: Q4 resource utilization over the runtime (Time in sec): CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (milliseconds); Number of Mappers/Reducers.]
7.2. BigBench Query 5 (Mahout)
BigBench's query Q5 builds a model using logistic regression: based on existing users' online activities and demographics, predict a visitor's likelihood to be interested in a given category [29]. It is implemented in HiveQL and Mahout.
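Mahout performs the actual training; the model it fits is ordinary binary logistic regression, which can be sketched in plain Python as follows (an illustration of the technique, not Mahout's implementation; the learning rate and epoch count are arbitrary):

```python
import math

def train_logreg(rows, labels, epochs=5, lr=0.1):
    """Binary logistic regression fitted with plain stochastic gradient
    descent; rows are numeric feature vectors (e.g. encoded clickstream
    and demographics attributes), labels are 0/1 interest indicators."""
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w
```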
Scale Factor: 1TB
Input Data size/ Number of Tables: 123GB / 4 Tables
Average Runtime (minutes): 273 minutes
Result table rows: logRegResult.txt
MapReduce stages: 20
Avg. CPU Utilization %
User % 51.50%
System % 3.37%
IOwait% 3.65%
Memory Utilization % 91.85%
Avg. Kbytes Transmitted per Second 8329.02
Avg. Kbytes Received per Second 8332.22
Avg. Context Switches per Second 9859.00
Avg. Kbytes Read per Second 3438.94
Avg. Kbytes Written per Second 5568.18
Avg. Read Requests per Second 67.41
Avg. Write Requests per Second 13.12
Avg. I/O Latencies in Milliseconds 82.12
Summary: The query is memory bound with around 92% utilization and high network traffic (around 8-9 MB/sec). The Mahout execution starts after around second 15536 and is clearly observable in all of the graphs below. It takes around 18 minutes and utilizes very few resources in comparison to the HiveQL part of the query.
[Figures: Q5 resource utilization over the runtime (Time in sec): CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (milliseconds).]
7.3. BigBench Query 18 (OpenNLP)
BigBench's query Q18 identifies stores with flat or declining sales across three consecutive months and checks whether any negative reviews regarding these stores are available online [29]. It is implemented in HiveQL and uses the Apache OpenNLP machine learning library for natural language text processing.
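One natural reading of the "flat or declining" criterion is a non-positive least-squares slope over a three-month window of a store's sales; the sketch below follows that reading (the benchmark's exact criterion is defined in the query's HiveQL):

```python
def slope(ys):
    """Least-squares slope of the values ys against 0..n-1."""
    n = len(ys)
    mx, my = (n - 1) / 2.0, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def flat_or_declining(monthly_sales, window=3):
    """True if any `window` consecutive months show a non-positive sales trend."""
    return any(slope(monthly_sales[i:i + window]) <= 0
               for i in range(len(monthly_sales) - window + 1))
```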
Scale Factor: 1TB
Input Data size/ Number of Tables: 71GB / 3 Tables
Average Runtime (minutes): 28 minutes
Result table rows: 9280457
MapReduce stages: 17
Avg. CPU Utilization %
User % 55.99%
System % 2.04%
IOwait% 0.31%
Memory Utilization % 90.22%
Avg. Kbytes Transmitted per Second 2302.81
Avg. Kbytes Received per Second 2303.59
Avg. Context Switches per Second 6751.68
Avg. Kbytes Read per Second 1592.41
Avg. Kbytes Written per Second 988.08
Avg. Read Requests per Second 4.86
Avg. Write Requests per Second 4.66
Avg. I/O Latencies in Milliseconds 20.68
Summary: The query is memory bound with around 90% memory utilization and around 56% CPU usage. The time spent on I/O waits is only around 0.3%, and the average I/O latency (around 21 ms) is likewise low.
[Figure: Number of Mappers/Reducers over time (panel belonging to the preceding Q5 graphs).]
[Figures: Q18 resource utilization over the runtime (Time in sec): CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (milliseconds); Number of Mappers/Reducers.]
7.4. BigBench Query 27 (OpenNLP)
BigBench's query Q27 extracts competitor product names and model names (if any) from online
product reviews for a given product [29]. It is implemented in HiveQL and uses the Apache
OpenNLP machine learning library for natural language text processing.
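In the benchmark the extraction runs through OpenNLP models wrapped into Hive; as a very rough Python stand-in for the idea (a toy matcher against a hypothetical known-product list plus a crude model-number pattern, not the OpenNLP approach):

```python
import re

MODEL_RE = re.compile(r"\b[A-Z][A-Za-z]*-?\d[\w-]*\b")  # crude model-number pattern

def competitor_mentions(review, target_product, known_products):
    """Flag review sentences that mention a product other than the target
    and pull out model-like tokens. `known_products` is a hypothetical
    dictionary of product names; the benchmark instead relies on trained
    natural-language models."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", review):
        others = [p for p in known_products if p != target_product and p in sentence]
        if others:
            hits.append((others, MODEL_RE.findall(sentence)))
    return hits
```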
Scale Factor: 1TB
Input Data size/ Number of Tables: 2GB / 1 Table
Average Runtime (minutes): 0.7 minutes
Result table rows: dynamic / 0
MapReduce stages: 7
Avg. CPU Utilization %
User % 10.03%
System % 1.94%
IOwait% 1.29%
Memory Utilization % 27.19%
Avg. Kbytes Transmitted per Second 1547.15
Avg. Kbytes Received per Second 1547.14
Avg. Context Switches per Second 5952.83
Avg. Kbytes Read per Second 1692.01
Avg. Kbytes Written per Second 181.19
Avg. Read Requests per Second 14.25
Avg. Write Requests per Second 2.36
Avg. I/O Latencies in Milliseconds 8.89
Summary: The system is underutilized with only 10% CPU and 27% memory usage. The
network and disk utilization is also very low.
[Figures: Q27 resource utilization over the runtime (Time in sec): CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (milliseconds); Number of Mappers/Reducers.]
7.5. BigBench Query 7 (HiveQL + Spark SQL)
BigBench's query Q7 lists all the stores with at least 10 customers who bought products priced at least 20% higher than the average price of products in the same category during a given month [29]. The query is implemented in pure HiveQL and is adapted from query 6 of the TPC-DS benchmark.
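Stripped of the TPC-DS table details, the filter the query expresses can be paraphrased in Python as follows (a paraphrase of the described logic, not the HiveQL itself; the input layouts are assumptions):

```python
from collections import defaultdict
from statistics import mean

def stores_with_premium_buyers(items, sales, min_customers=10, markup=1.2):
    """items: item_id -> (category, price); sales: (store, customer, item_id)
    tuples for the month in question. Returns the stores where at least
    `min_customers` distinct customers bought items priced at least 20%
    above their category's average price."""
    by_cat = defaultdict(list)
    for category, price in items.values():
        by_cat[category].append(price)
    cat_avg = {c: mean(ps) for c, ps in by_cat.items()}

    pricey = {i for i, (c, p) in items.items() if p >= markup * cat_avg[c]}
    buyers = defaultdict(set)
    for store, customer, item_id in sales:
        if item_id in pricey:
            buyers[store].add(customer)
    return [s for s, cs in buyers.items() if len(cs) >= min_customers]
```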
Hive Spark SQL Hive/Spark SQL Ratio
Scale Factor: 1TB
Input Data size/ Number of Tables: 70GB / 5 Tables
Average Runtime (minutes): 46.33 41.07 1.13
Result table rows: 52
Stages: MapReduce 39 Spark 144 & 4474 Tasks
Avg. CPU Utilization %
User % 56.97% 16.65% 3.42
System % 3.89% 2.62% 1.48
IOwait % 0.40% 21.28% -
Memory Utilization % 94.33% 93.78% 1.01
Avg. Kbytes Transmitted per Second 11650.07 3455.03 3.37
Avg. Kbytes Received per Second 11654.28 3456.24 3.37
Avg. Context Switches per Second 10251.24 8693.44 1.18
Avg. Kbytes Read per Second 2739.21 6501.03 -
Avg. Kbytes Written per Second 7190.15 3364.60 2.14
Avg. Read Requests per Second 40.24 66.93 -
Avg. Write Requests per Second 17.13 12.20 1.40
Avg. I/O Latencies in Milliseconds 55.76 32.91 1.69
Summary: Hive (MR) is only 13% slower than Spark SQL. The Hive execution utilizes on average 57% of the CPU, whereas Spark SQL uses on average 17%. Both engines are memory bound, utilizing on average 94% of the memory. However, Spark SQL spends on average around 21% of the time waiting for outstanding disk I/O requests (IOwait), which is far higher than the 0.4% measured for Hive. Additionally, Hive reads data at on average 2.7 MB/sec and writes at 7.2 MB/sec, whereas Spark SQL reads at on average 6.5 MB/sec and writes at 3.4 MB/sec.
[Figures: Q7 resource utilization over the runtime (Time in sec), each panel shown for Hive and for Spark SQL: CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies.]
7.6. BigBench Query 9 (HiveQL + Spark SQL)
BigBench's query Q9 calculates the total sales for different types of customers (e.g., based on marital status and education status), sales price, and different combinations of state and sales profit [29]. The query is implemented in pure HiveQL and is adapted from query 48 of the TPC-DS benchmark.
Hive Spark SQL Hive/Spark SQL Ratio
Scale Factor: 1TB
Input Data size/ Number of Tables: 69GB / 5 Tables
Result table rows: 1
Stages: MapReduce 7 Spark 135 & 3065 Tasks
Average Runtime (minutes): 17.72 2.82 6.28
Avg. CPU Utilization %
User % 60.34% 27.87% 2.17
System % 3.44% 2.22% 1.55
IOwait % 0.38% 4.09% -
Memory Utilization % 78.87% 61.27% 1.29
Avg. Kbytes Transmitted per Second 7512.13 7690.59 -
Avg. Kbytes Received per Second 7514.87 7691.04 -
Avg. Context Switches per Second 19757.83 7284.11 2.71
Avg. Kbytes Read per Second 2741.72 13174.12 -
Avg. Kbytes Written per Second 4098.95 1043.45 3.93
Avg. Read Requests per Second 9.76 48.91 -
Avg. Write Requests per Second 10.84 3.62 2.99
Avg. I/O Latencies in Milliseconds 41.67 27.32 1.53
Summary: Hive is around 6 times slower than Spark SQL. The Hive execution is CPU (on average 60%) and memory (79%) bound, whereas the Spark SQL execution consumes on average 28% CPU and 61% memory. Additionally, Hive reads data at on average 2.7 MB/sec and writes at 4.1 MB/sec, whereas Spark SQL reads at on average 13 MB/sec and writes at 1 MB/sec.
[Figure: Number of Mappers/Reducers over time on Hive.]
[Figures: Q9 resource utilization over the runtime (Time in sec), each panel shown for Hive and for Spark SQL: CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (Hive).]
7.7. BigBench Query 24 (HiveQL + Spark SQL)
BigBench's query Q24 measures the effect of competitors' prices on a given product's in-store and online sales [29], i.e., it computes the cross-price elasticity of demand for that product. The query is implemented in pure HiveQL.
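Cross-price elasticity is the textbook ratio of the relative change in the product's sold quantity to the relative change in the competitor's price; the HiveQL derives this ratio from the sales and price histories. As a direct transcription of the formula:

```python
def cross_price_elasticity(q_before, q_after, p_before, p_after):
    """Relative change in the product's sold quantity divided by the
    relative change in the competitor's price (textbook definition)."""
    dq = (q_after - q_before) / q_before
    dp = (p_after - p_before) / p_before
    return dq / dp

# Demand falls 10% after the competitor cuts its price by 5%:
# cross_price_elasticity(1000, 900, 20.0, 19.0)  ->  (-0.10) / (-0.05) = 2.0
```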
Hive Spark SQL Spark SQL/Hive Ratio
Scale Factor: 1TB
Input Data size/ Number of Tables: 99GB / 4 Tables
Result table rows: 8
Stages: MapReduce 39 Spark 42 & 6996 Tasks
Average Runtime (minutes): 14.75 77.05 5.22
Avg. CPU Utilization %
User % 48.92% 17.52% -
System % 2.01% 1.61% -
IOwait % 0.48% 11.21% 23.35
Memory Utilization % 43.60% 82.84% 1.90
Avg. Kbytes Transmitted per Second 3123.24 4373.39 1.40
Avg. Kbytes Received per Second 3122.92 4374.41 1.40
Avg. Context Switches per Second 7077.10 8821.01 1.25
Avg. Kbytes Read per Second 7148.77 7810.38 1.09
Avg. Kbytes Written per Second 169.46 3762.42 22.20
Avg. Read Requests per Second 22.28 64.38 2.89
Avg. Write Requests per Second 4.71 8.29 1.76
Avg. I/O Latencies in Milliseconds 21.38 27.66 1.29
Summary: Overall, Hive (MapReduce) is around 5.2 times faster than Spark SQL, i.e., it needs roughly 81% less time. The query execution on Hive utilizes on average 49% of the CPU, whereas Spark SQL uses on average 18%. However, Spark SQL spends around 11% of the time waiting for outstanding disk I/O requests (IOwait), which is far higher than the 0.5% measured for Hive.
[Figures: I/O Latencies over time on Spark SQL (panel belonging to the preceding Q9 graphs); Number of Mappers/Reducers over time on Hive.]
Also, the Spark SQL execution is memory bound, utilizing 83% of the memory, in comparison to the 44% utilization for Hive.
[Figures: Q24 resource utilization over the runtime (Time in sec), each panel shown for Hive and for Spark SQL: CPU Utilization % (IOwait %, User %, System %); Memory Utilization %; KBytes Transmitted/Received per Second; Context Switches per Second; Disk Bandwidth (KBytes Read/Written per Second); Number of I/O Requests (Reads/Writes per Second); I/O Latencies (Hive and Spark SQL); Number of Mappers/Reducers.]
8. Lessons Learned
This report presented our first attempt to use the BigBench benchmark to evaluate the data
scaling capabilities of a Hadoop cluster on both MapReduce/Hive and Spark SQL. Furthermore,
multiple issues and fixes were presented as part of our initiative to execute BigBench on Spark.
Our experiments showed that the group of 14 pure HiveQL queries can be successfully executed
on Spark SQL.
The Spark SQL performance varied greatly with the type of query and the data size on which it was executed. On the one hand, a group of HiveQL queries (Q6, Q11, Q12, Q13, Q14, Q15 and Q17) performed best, with Q9 running around 6.3 times faster on Spark SQL than on Hive as the data size increased. On the other hand, we observed a group of queries (Q7, Q16, Q21, Q22 and Q23) that performed worst, with Q24 running around 5.2 times slower on Spark SQL than on Hive as the data size increased. The reason for this is the reported join issue [43] in the Spark SQL version under test.
In terms of resource utilization, our analysis showed that Spark SQL:
- Utilized less CPU, whereas it showed higher I/O wait than Hive.
- Read more data from disk, whereas it wrote less data than Hive.
- Utilized less memory than Hive.
- Sent less data over the network than Hive.
In the future, we plan to rerun the BigBench queries on the latest version of Spark SQL, in which the join issue should be fixed and which should offer a more stable experience. We also plan to run the remaining groups of BigBench queries using other components of the Spark framework.
Acknowledgements
This work has benefited from valuable discussions in the SPEC Research Group’s Big Data
Working Group. We would like to thank Tilmann Rabl (University of Toronto), John Poelman
(IBM), Bhaskar Gowda (Intel), Yi Yao (Intel), Marten Rosselli, Karsten Tolle, Roberto V. Zicari
and Raik Niemann of the Frankfurt Big Data Lab for their valuable feedback. We would like to
thank the Fields Institute for supporting our visit to the Sixth Workshop on Big Data
Benchmarking at the University of Toronto.
References
[1] “Big-Data-Benchmark-for-Big-Bench · GitHub.” [Online]. Available:
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench.
[2] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,”
Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster
computing with working sets,” in Proceedings of the 2nd USENIX conference on Hot topics
in cloud computing, 2010, pp. 10–10.
[4] TPC, “TPCx-BB.” [Online]. Available: http://www.tpc.org/tpcx-bb.
[5] M.-G. Beer, “Evaluation of BigBench on Apache Spark Compared to MapReduce,” Master
Thesis, 2015.
[6] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big
data: The next frontier for innovation, competition, and productivity,” McKinsey Glob. Inst.,
pp. 1–137, 2011.
[7] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R.
Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Commun. ACM, vol.
57, no. 7, pp. 86–94, 2014.
[8] H. Hu, Y. Wen, T. Chua, and X. Li, “Towards Scalable Systems for Big Data Analytics: A
Technology Tutorial,” 2014.
[9] T. Ivanov, N. Korfiatis, and R. V. Zicari, “On the inequality of the 3V’s of Big Data
Architectural Paradigms: A case for heterogeneity,” ArXiv Prepr. ArXiv13110805, 2013.
[10] R. Cattell, “Scalable SQL and NoSQL data stores,” ACM SIGMOD Rec., vol. 39, no. 4, pp.
12–27, 2011.
[11] “Apache Hadoop Project.” [Online]. Available: http://hadoop.apache.org/.
[12] D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Proj.
Website, vol. 11, p. 21, 2007.
[13] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,” in
Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp.
1–10.
[14] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R.
Murthy, “Hive - a petabyte scale data warehouse using Hadoop,” in 2010 IEEE 26th
International Conference on Data Engineering (ICDE), 2010, pp. 996–1005.
[15] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B.
Reed, S. Srinivasan, and U. Srivastava, “Building a high-level dataflow system on top of
Map-Reduce: the Pig experience,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1414–1425, 2009.
[16] “Apache Mahout: Scalable machine learning and data mining.” [Online]. Available:
http://mahout.apache.org/.
[17] L. George, HBase: the definitive guide. O’Reilly Media, Inc., 2011.
[18] K. Ting and J. J. Cecho, Apache Sqoop Cookbook. O’Reilly Media, 2013.
[19] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J.
Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E.
Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in
Proceedings of the 4th Annual Symposium on Cloud Computing, New York, NY, USA,
2013, pp. 5:1–5:16.
[20] “Apache Hive.” [Online]. Available: http://hive.apache.org/.
[21] T. White, Hadoop: the definitive guide. O’Reilly, 2012.
[22] “Apache Spark.” [Online]. Available: http://spark.apache.org/.
[23] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J.
Franklin, and A. Ghodsi, “Spark SQL: Relational Data Processing in Spark,” in Proceedings
of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.
[24] “Compatibility with apache hive.” [Online]. Available:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html#compatibility-with-
apache-hive.
[25] Cloudera, “CDH Datasheet,” 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cdh-
datasheet.html.
[26] Cloudera, “Cloudera Manager Datasheet,” 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cloudera-
manager-4-datasheet.html.
[27] A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen, “BigBench:
Towards an Industry Standard Benchmark for Big Data Analytics,” in Proceedings of the
2013 ACM SIGMOD International Conference on Management of Data, New York, NY,
USA, 2013, pp. 1197–1208.
[28] C. Baru, M. Bhandarkar, C. Curino, M. Danisch, M. Frank, B. Gowda, H.-A. Jacobsen, H.
Jie, D. Kumar, R. Nambiar, M. Poess, F. Raab, T. Rabl, N. Ravi, K. Sachs, S. Sen, L. Yi,
and C. Youn, “Discussion of BigBench: A Proposed Industry Standard Performance
Benchmark for Big Data,” in Performance Characterization and Benchmarking. Traditional
to Big Data, R. Nambiar and M. Poess, Eds. Springer, 2014, pp. 44–63.
[29] T. Rabl, A. Ghazal, M. Hu, A. Crolotte, F. Raab, M. Poess, and H.-A. Jacobsen, “BigBench
Specification V0.1,” in Specifying Big Data Benchmarks, T. Rabl, M. Poess, C. Baru, and
H.-A. Jacobsen, Eds. Springer Berlin Heidelberg, 2014, pp. 164–201.
[30] TPC, “TPC-DS.” [Online]. Available: http://www.tpc.org/tpcds/.
[31] B. Chowdhury, T. Rabl, P. Saadatpanah, J. Du, and H.-A. Jacobsen, “A BigBench
Implementation in the Hadoop Ecosystem,” in Advancing Big Data Benchmarks, T. Rabl,
N. Raghunath, M. Poess, M. Bhandarkar, H.-A. Jacobsen, and C. Baru, Eds. Springer
International Publishing, 2013, pp. 3–18.
[32] T. Rabl, M. Frank, H. M. Sergieh, and H. Kosch, “A data generator for cloud-scale
benchmarking,” in Performance Evaluation, Measurement and Characterization of
Complex Systems, Springer, 2011, pp. 41–56.
[33] “LanguageManual VariableSubstitution - Apache Hive.” [Online]. Available:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution/.
[34] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai,
M. Amde, and S. Owen, “MLlib: Machine Learning in Apache Spark,” ArXiv Prepr.
ArXiv150506807, 2015.
[35] “Spark SQL Programming Guide.” [Online]. Available:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html##unsupported-hive-
functionality.
[36] S. Ryza, “How-to: Tune Your Apache Spark Jobs (Part 2) | Cloudera Engineering Blog,”
30-Mar-2015.
[37] Cloudera, “Configuration Parameters: What can you just ignore?” [Online]. Available:
http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/.
[38] Frankfurt Big Data Lab, “Big-Bench-Setup · GitHub.” [Online]. Available:
https://github.com/BigData-Lab-Frankfurt/Big-Bench-Setup.
[39] Hortonworks, “Hortonworks Data Platform - Installing HDP Manually.” [Online].
Available: http://hortonworks.com/wp-
content/uploads/2014/01/bk_installing_hdp_for_windows-20140120.pdf.
[40] Hortonworks, “How to Configure YARN and MapReduce 2 in Hortonworks Data Platform
2.0.” [Online]. Available: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-
hdp-2-0/.
[41] “LanguageManual JoinOptimization - Apache Hive.” [Online]. Available:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization/.
[42] Intel, “PAT Tool · GitHub.” [Online]. Available: https://github.com/intel-hadoop/PAT.
[43] Yi Zhou, “[SPARK-5791] [Spark SQL] show poor performance when multiple table do join
operation.” [Online]. Available: https://issues.apache.org/jira/browse/SPARK-5791.
Appendix
System Information Description
Manufacturer: Dell Inc.
Product Name: PowerEdge T420
BIOS: 1.5.1 Release Date: 03/08/2013
Memory
Total Memory: 32 GB
DIMMs: 10
Configured Clock Speed: 1333 MHz Part Number:
M393B5273CH0-YH9
Size: 4096 MB
CPU
Model Name: Intel(R) Xeon(R) CPU
E5-2420 0 @ 1.90GHz
Architecture: x86_64
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
CPU MHz: 1200.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
NIC
Settings for em1: Speed: 1000Mb/s
Ethernet controller: Broadcom Corporation NetXtreme
BCM5720 Gigabit Ethernet PCIe
Storage
Storage Controller:
LSI Logic / Symbios Logic
MegaRAID SAS 2008 [Falcon]
(rev 03)
08:00.0 RAID bus controller
Drive / Name Formatted Size Model
Disk 1/ sda1 931.5 GB
Western Digital,
WD1003FBYX RE4-1TB,
SATA3, 3.5 in, 7200RPM,
64MB Cache
Table 19: Master Node
System Information Description
Manufacturer: Dell Inc.
Product Name: PowerEdge T420
BIOS: 2.1.2 Release Date: 01/20/2014
Memory
Total Memory: 32 GB
DIMMs: 4
Configured Clock Speed: 1600 MHz Part Number:
M393B2G70DB0-YK0
Size: 16384 MB
CPU
Model Name: Intel(R) Xeon(R) CPU E5-2420 v2 @
2.20GHz
Architecture: x86_64
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
CPU MHz: 2200.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
NIC
Settings for em1: Speed: 1000Mb/s
Ethernet controller: Broadcom Corporation NetXtreme
BCM5720 Gigabit Ethernet PCIe
Storage
Storage Controller:
Intel Corporation C600/X79 series
chipset SATA RAID Controller (rev
05)
00:1f.2 RAID bus
controller
Drive / Name Formatted Size Model
Disk 1/ sda1 931.5 GB Dell- 1TB, SATA3, 3.5 in,
7200RPM, 64MB Cache
Disk 2/ sdb1 931.5 GB WD Blue Desktop
WD10EZEX - 1TB,
SATA3, 3.5 in, 7200RPM,
64MB Cache
Disk 3/ sdc1 931.5 GB
Disk 4/ sdd1 931.5 GB
Table 20: Data Node