Journal of Computational Science 26 (2018) 432–452
Contents lists available at ScienceDirect
Journal of Computational Science
journal homepage: www.elsevier.com/locate/jocs

Exploiting coarse-grained reused-based opportunities in Big Data multi-query optimization

Radhya Sahal a,∗, Mohamed H. Khafagy b, Fatma A. Omara a
a Faculty of Computers and Information, Cairo University, Egypt
b Faculty of Computers and Information, Fayoum University, Egypt

Article history: Received 1 September 2016; Received in revised form 20 March 2017; Accepted 25 May 2017; Available online 30 May 2017

Keywords: Big Data; Multi-query optimization; Coarse-grained; Sharing opportunity; Reused-based

Abstract: Multi-query optimization in Big Data has become a promising research direction due to the popularity of massive data analytical systems (e.g., MapReduce and Flink). The multi-query is translated into jobs, which are routinely submitted with similar tasks to the underlying Big Data analytical systems. These similar tasks impose a considerable computation overhead. Therefore, some existing techniques have been proposed for exploiting shared tasks in Big Data multi-query optimization (e.g., MRShare and Relaxed MRShare). These techniques are heavily tailored to relaxed optimizing factors of fine-grained reused-based opportunities. Regarding Big Data multi-query optimization, the existing fine-grained techniques are only concerned with equal tuple sizes and uniform data distribution. These assumptions are not applicable to real-world distributed applications, which depend on coarse-grained reused-based opportunities, such as non-equal tuple sizes and non-uniform data distribution. These two issues deserve more attention in Big Data multi-query optimization in order to minimize the data read from, or written back to, Big Data infrastructures (e.g., Hadoop). In this paper, the Multi-query Optimization using Tuple Size and Histogram (MOTH) system is proposed to consider the granularity of the reused-based opportunities. The proposed MOTH system exploits the coarse-grained fully and partially reused-based opportunities among queries, considering non-equal tuple sizes and non-uniform data distribution, to avoid repeated computations. Within the proposed MOTH system, a combined technique is introduced for estimating the coarse-grained reused-based opportunities horizontally and vertically. The horizontal estimation of non-equal tuple sizes is done by extracting metadata at column level, while the vertical estimation of non-uniform data distribution relies on pre-computed histograms at row level. In addition, the MOTH system estimates the coarse-grained reused-based opportunities while considering slow storage (i.e., limited physical resources or fewer allocated virtualized resources) to produce accurate estimations of the reused result costs. Then, a cost-based heuristic algorithm is introduced to select the best reused-based opportunity and generate an efficient multi-query execution plan. Because the partial reused-based opportunities are considered, extra computations are needed to retrieve the non-derived results, so a partial reused-based optimizer has been tailored and added to the proposed MOTH system to reformulate the generated multi-query plan and improve the shared partial queries.
According to the experimental results of the proposed MOTH system using the TPC-H benchmark, it is found that the multi-query execution time is reduced by considering the granularity of the reused results.
© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Big Data has spread rapidly in many domains, such as information systems. At the same time, distributed computing is growing every day with the increase of workstation power and data set sizes. Therefore, the development and implementation of distributed systems for Big Data applications are considered a challenge [1–3]. One of the well-known frameworks that have emerged for Big Data processing is MapReduce, which was first introduced by Google in 2004 [4]. The main concept of MapReduce is to abstract away the details of a large cluster of machines to facilitate computation on large datasets. Recently, much research work in academia and industry has proposed new data processing systems, such as Flink and Spark, to improve the performance of data analysis applications, including query optimization techniques [5–7].

∗ Corresponding author.
E-mail addresses: [email protected], [email protected] (R. Sahal), [email protected] (M.H. Khafagy), [email protected] (F.A. Omara).

https://doi.org/10.1016/j.jocs.2017.05.023
1877-7503/© 2017 Elsevier B.V. All rights reserved.
Multi-query optimization (MQO) is an essential keyword of query processing in database systems. MQO describes how to produce answers to a set of queries with common tasks [8]. More specifically, each query has a set of alternative execution plans, and each plan has a set of tasks where some tasks are common among several queries. Thereby, MQO aims to define an appropriate execution plan for each query and minimize the total execution time by performing the common tasks only once. MQO is an NP-Hard problem, and many algorithms have been developed to solve it. Currently, the MQO problem re-emerges in Big Data analysis systems, especially when the datasets to be processed are getting very large. Therefore, optimizing analytical queries becomes an important issue to overcome computation overheads. In particular, when multi-query applications run in a cloud computing environment, the pay-as-you-go cost model adds additional urgency to optimized processing [9,10].

The MapReduce-based systems for analytical tasks (i.e., query processing) have some performance limitations. Redundant and wasteful processing of a shared data set is an example of these limitations for multiple shared queries. These limitations have been studied in [11–13]. In many multi-query applications, identifying and exploiting the multiple shared queries which share common sub-expressions (CSEs) has evolved to improve query performance. Over about two decades, MQO has been extensively studied and demonstrated to be an effective technique in both RDBMSs and MapReduce to identify and exploit the CSEs among queries and improve the overall query evaluation [14]. On the other hand, sharing similar work reduces the computation time, which can reduce the monetary charges incurred while utilizing the Big Data processing infrastructure [10,15–17].

In this work, MQO is studied from the angle of avoiding redundant computation in Big Data. Substantially, finding an optimal execution plan for multi-query becomes a challenge when processing a non-uniform data distribution. Therefore, histograms are crucial for understanding the statistical properties of the shared data which will be reused to optimize multi-query. On the other hand, exploring the size of coarse-grained shared tuples using metadata is a valuable way to discover additional data sharing. Therefore, the coarse granularity of the reused-based opportunities (i.e., fully and partial), regarding the non-uniform data distribution as well as the non-equal size of shared tuples, can improve the overall Big Data multi-query performance. Today, fast storage technology (e.g., Solid State Drives (SSDs)) is used in Hadoop clusters. In the case of a large-scale Hadoop cluster, this storage technology is considered costly relative to traditional Hard Disk Drives (HDDs) [18]. Therefore, the work in this paper considers HDD storage even though it is considered slow storage. Furthermore, the shared multi-query needs to reuse results in order to reduce the rate of I/O operations and avoid redundant I/O accesses on the Hadoop Distributed File System (HDFS) over the HDD storage. In this way, the limitation of the HDD storage is overcome.

Therefore, the Multi-query Optimization using Tuple Size and Histogram (MOTH) system has been introduced to exploit the coarse-grained opportunities of reused results among multiple shared queries, including non-equal tuple sizes and non-uniform data distribution, over slow storage (i.e., HDDs). The proposed MOTH system has the ability to optimize the multi-query execution plan using the reused-based cost model to estimate the coarse-grained reused results among the shared queries. The reused-based cost model uses different criteria: i) sharing opportunities among multi-query, ii) the coarse-grained size of the selected attributes (i.e., a set of non-equal attributes forming a tuple) for the input multi-query, iii) pre-computed histograms of the non-uniform data distribution, and iv) parameters of the distributed environment (i.e., I/O speed of Hadoop storage). In particular, the cost model uses the metadata of the attributes within tuples to estimate horizontally the reused-based opportunities of shared multi-query with respect to the coarse-grained size of the shared tuples. Also, it uses the pre-computed histograms to estimate vertically the reused-based opportunities of shared multi-query with respect to the non-uniform data distribution of the shared tuples. Substantially, the MOTH system can greatly improve Big Data multi-query processing and storing by reducing the intermediate results, which improves the overall performance of multi-query over slow storage.

Fig. 1. High-level abstraction of the MOTH system.

Fig. 1 illustrates the high-level abstraction of the proposed MOTH system. According to Fig. 1, the proposed MOTH system classifies the input multi-query in terms of sharing types (i.e., fully, partial, non-sharing), and then optimizes these shared queries using the statistics-based technique (i.e., metadata and histogram). The partial queries need additional re-optimization to find the non-derived results that complete the desired outputs. The details of the proposed MOTH system and its cost model will be given in the next sections. Moreover, the terms sharing and reusing, the terms queries and jobs, and the terms storage and HDFS will be used interchangeably throughout the paper.

Fig. 2. The levels of reused-based opportunities estimations for multi-query optimization: MRShare (MQO based on reusing equal queries, no overlapping), Relaxed MRShare (MQO based on reusing fine-grained overlapping with uniform distribution), and the MOTH system (MQO based on reusing coarse-grained overlapping with tuple sizes and non-uniform distribution (histogram)).

Besides the coarse-grained tuple sizes, non-uniform data distribution, and I/O speed in the cluster environment, the proposed MOTH system considers the equality and overlapping among multi-query to augment the sharing, as in the MRShare and Relaxed MRShare systems (see Fig. 2) [11,13]. Therefore, two reused-based estimation techniques based on the Relaxed MRShare and the MOTH system, i.e., the Relaxed Reused-based Technique (RRT) and the Granular Reused-based Technique (GRT), will be presented and compared

with respect to the Naive Technique (NT). Conventionally, the NT executes the input multi-query independently, which causes extra cost for loading the same input file multiple times. The RRT technique identifies the reused-based opportunities with respect to the number of fine-grained DISTINCT values of the shared attributes. The GRT technique identifies the coarse-grained reused-based opportunities vertically, with respect to the number of tuples at row level, and horizontally, with respect to the attribute sizes at column level, among the shared multi-query. More specifically, the RRT technique considers a uniform data distribution and fine-grained tuple sizes, which guarantees using all the reused-based opportunities (i.e., shared data tuples). However, the GRT technique considers the non-uniform data distribution and coarse-grained tuple sizes to judge the granularity of the reused-based opportunities and uses the minimal reused-based results among the shared multi-query. The proposed MOTH system can generate cheaper multi-query execution plans based on the tuple sizes, non-uniform data distribution, I/O speed of storage, and reused-based opportunities among the shared multi-query. Ultimately, an infographic of the reused-based opportunities techniques in Big Data multi-query optimization is depicted in Fig. 3.

Fig. 3. The reused-based opportunities techniques in Big Data multi-query optimization.

The rest of this paper is organized as follows. Related work is described in Section 2. The preliminary background on metadata and histograms, the notations, and the types of reused-based opportunities are introduced in Section 3. The coarse-grained reused-based opportunity problem is described in Section 4. The proposed reused-based estimation techniques are discussed in Section 5. The details of the proposed MOTH system are presented in Section 6. The MOTH reused-based multi-query optimizer modules are introduced in Section 7. A case study of the proposed MOTH system is illustrated in Section 8. The performance evaluation of the MOTH system is discussed in Section 9. Finally, conclusions and future work are presented in Section 10.

2. Related work

Several research efforts in the domain of Hadoop-based systems have been made to optimize Big Data analysis, especially multi-query optimization. That research work can be broadly classified into two categories: concurrent multi-query optimization and non-concurrent multi-query optimization. Concurrent multi-query optimization is similar to the multi-query optimization of relational databases [19]. It tries to find shared parts among multiple queries, including scans, computation, shuffling, and so forth, to maximize the sharing benefit [11,13]. Non-concurrent multi-query optimization resembles materialized view techniques. It materializes the intermediate and final computation results and uses them to answer the queries [12].

MRShare is a concurrent sharing framework which assumes that I/O cost is dominant [11]. Hence, it considers the sharing opportunities of scans, map outputs, and map functions. The Relaxed MRShare, however, relaxes and generalizes the overlapped queries to increase the sharing opportunities within a single job [13]. In Relaxed MRShare, the shared map input scan and map output have been studied, and algorithms have been introduced to select an evaluation plan for a batch of jobs in the MapReduce context. Also, additional optimization techniques (i.e., GGT and MT) have been introduced in the Relaxed MRShare. Unfortunately, both MRShare and Relaxed MRShare suffer from ignoring non-uniform data distribution as well as coarse-grained tuple sizes. Moreover, a comparative study of the MRShare and Relaxed MRShare techniques using predicate-based filters on MapReduce is introduced in [20]. The comparative study confirmed that the Relaxed MRShare technique significantly improves query execution time relative to the MRShare technique for shared data regarding predicate-based filters in MapReduce.

The ReStore system is one of the non-concurrent sharing systems, built on top of Pig to optimize query evaluation using materialized results [12]. The ReStore system uses a heuristic algorithm to select the suitable materialized results, even for the complete or partial map and/or reduce output of each job. The materialized output produced by the ReStore system might not be reused if the query workloads are not repeated, which causes storage overheads.

More recent work in [15,16] considers reusing results that are stored by exploiting MapReduce intermediate results, kept for failure-resilience reasons, as materialized views, where semantic UDF models (i.e., User-Defined Functions) based on Hive have been used to enable effective reuse of views so that subsequent queries can be evaluated faster. On the other side, a multi-query optimization framework, SharedHive, is proposed to transform a set of correlated HiveQL queries into new optimized query sets with respect to sharing scan and computation tasks [17]. Reused-based optimization has also been addressed for Pig scripts by proposing PigReuse, which identifies and reuses common sub-expressions occurring in Pig Latin scripts. Then, it selects the best ones to be merged based on a cost-based search process, implemented with the help of a linear program solver [21].

Hereafter, the concept of data granularity has been addressed to improve query processing from different perspectives. For instance, hierarchical data granularity, in analogy with the storage hierarchy (i.e., cache memory, primary memory, secondary memory, and disk), has been studied to minimize the query access time on multi-dimensional data warehouses [22]. On the other hand, the EXORD system has been proposed to exploit the granularity of data correlations (i.e., soft and hard) to improve Big Data query optimization [10]. In this paper, the granularity of reused-based data opportunities (i.e., fine and coarse) in Big Data multi-query optimization is addressed. As a proof of concept, the proposed MOTH system is built on top of Hadoop, which is considered the most popular Big Data platform, and it uses Hive as a mature high-level querying language on Hadoop. It takes advantage of common subqueries for structured data under non-uniform distribution, coarse-grained tuple sizes, and the I/O speed of storage in a distributed environment. It detects the common expressions in shared queries and merges them into one optimized global plan, such that the common parts are executed only once.

3. Preliminaries of the MOTH system

In this section, the metadata and histogram preliminaries, which are the main concepts in this work, are reviewed. Then, the used notations are introduced. Furthermore, the reused-based opportunity types are described.

3.1. Metadata

Simply, metadata is a new word based on an old concept. It is data about data, or any summary of the contents which describes other data. In a traditional database, metadata refers to the schema, which stores many different features of the database (or multiple databases) being described [23]. In the case of Big Data, not only is a big infrastructure needed to process large volumes of data, but metadata must also be defined for such data in different formats and types. In the context of sharing opportunities, exploiting the common tasks on shared multi-query against potential data sources is useful to avoid redundancy of computation. Metadata is valuable to discover data sharing and to exploit sharing correctly to enhance the processing and analysis of large data sources. In other words, metadata for Big Data allows data analysts to investigate and find the sharable parts of computations. Moreover, metadata can enrich and enable large-scale sharing, especially when the data grows rapidly and dramatically.

3.2. Histogram

A histogram summarizes the data distribution and is highly used in database systems for approximate query answering within query optimization [24]. Although histograms can be used to keep track of the data distribution, they have been given less attention in Big Data analysis. Thereby, the work in this paper focuses on pre-computed histograms, which are used to generate an optimized multi-query plan before executing the user-level queries. In particular, the pre-computed histograms are used to improve the accuracy of the estimation of the reused-based opportunities among multiple shared queries against the data stored inside HDFS.

Two general types of histograms are used to analyze data distributions: the Equal-Width histogram (i.e., frequency histogram) and the Equal-Height histogram. The Equal-Width histogram divides data into a fixed number of equal-width ranges and calculates the corresponding height of each range as the number of values falling into that range. The Equal-Width histogram is useful when the variation of the data distribution is small. When such variation is large, an Equal-Height histogram is used instead. The Equal-Height histogram ranges data according to the number of values each bucket contains, which provides an informative view of data distributions [25].

Generally, three reasons motivate us to build histograms and maintain them in HDFS for the reused-based opportunity estimations. First, the histograms are built only once, especially when the data are archived and permanently stored [23]. Second, regarding the single-query level, the performance of a query is increased by reusing the optimal computed results among the candidate preceding queries, as estimated by the histogram. Third, the histograms improve the accuracy and efficiency of the reused-based opportunity estimation, which enhances the overall performance of the multi-query plan.
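As an illustration of how such a pre-computed histogram can drive the vertical estimation, the following Python sketch (the helper names are illustrative assumptions, not the paper's implementation) builds an Equal-Width histogram for one attribute and estimates how many tuples fall inside a filter range:

def build_equal_width_histogram(values, num_buckets):
    # Pre-compute bucket edges and per-bucket counts for one attribute.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts

def estimate_tuples(edges, counts, low, high):
    # Estimate |T| for a filter [low, high], assuming uniformity inside each bucket.
    total = 0.0
    for i, count in enumerate(counts):
        b_lo, b_hi = edges[i], edges[i + 1]
        overlap = max(0.0, min(high, b_hi) - max(low, b_lo))
        if b_hi > b_lo:
            total += count * overlap / (b_hi - b_lo)
    return int(total)

For instance, estimate_tuples(edges, counts, 0.02, 0.04) would approximate the number of lineitem tuples whose l_discount lies in [0.02, 0.04] without scanning the table; later sketches in this paper reuse this helper.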

3.3. Notations

According to the proposed MOTH system, the input queries are specified in a high-level query language such as HiveQL and then translated into MapReduce jobs [26]. Each input query is modeled as (R, A, F), where R is the relation name (i.e., table name), A is the set of attributes, and F is a filter applied to retrieve data; the input queries contain sharing opportunities among the query schemas. Consequently, we denote by GQ the Query Graph, which represents a set of input queries such that each node in the graph represents a single query. We denote the ith node of GQ as Qi, i ∈ [1, n], where n is the number of input queries. Gi is a sub-graph of GQ containing the ith node and all of its child nodes. An outgoing edge from Qi to Qj represents the reused result between the two queries. CostQj is used to denote the cost of the reused result of Qj, as estimated by the reused-based estimator. Fig. 4 depicts an example of a Query Graph weighted by the reused result sizes.

Fig. 4. The weighted query graph using the reused result sizes in GB.
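To make the (R, A, F) model concrete, a minimal Python sketch (the class and field names are illustrative assumptions, not the paper's code) can represent a query and a weighted query graph:

from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    relation: str          # R: the table name, e.g. "lineitem"
    attributes: frozenset  # A: the set of selected attributes
    filter_range: tuple    # F: a (low, high) range filter on one attribute, for simplicity

# Example: Q2 and Q5 from the case study in Section 8.
q2 = Query("lineitem", frozenset({"l_discount", "l_orderkey"}), (0.03, 0.06))
q5 = Query("lineitem", frozenset({"l_discount"}), (0.03, 0.04))

# Weighted query graph: edge (i, j) carries the estimated cost of Qi reusing Qj's result,
# e.g. the reused result size in GB, as in Fig. 4.
edges = {("Q5", "Q2"): 0.27}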

3.4. Types of reused-based opportunities

In this section, different non-trivial reused-based opportunities are identified. These opportunities are defined as fully, partial, and non-reused opportunities with respect to the query model (R, A, F). For simplicity, two queries, Qi and Qj, are used to represent the types of opportunities: fully, partial, and no reused-based opportunity (see Fig. 5).

3.4.1. Fully reused-based opportunity

For multi-query optimization performance, different queries might share similar work, which means that several queries can scan and filter the same, or part of the same, database file. So, there is an opportunity for some queries to reuse the full results of other queries. This opportunity is possible when they share the relations, selected attributes, and filtered data. To clarify this, suppose that Qi can reuse some or all attributes and some or all retrieved rows of Qj's result; this means that Qi has a fully reused-based opportunity on Qj, which can be defined as follows:

Definition 1. Fully Reused-based Opportunity

∀Qi, Qj ∈ GQ: (Ri = Rj ∧ Ai ⊆ Aj ∧ Fi ⊆ Fj) → FRO(Qi, Qj), Qi ∩ Qj = Qi

Namely, given FRO(Qi, Qj), Qi will fully reuse Qj's result.

Fig. 5. Types of reused-based opportunities for a pair of queries Qi and Qj.

3.4.2. Partial reused-based opportunity

It is the same as the fully reused-based opportunity, except that the queries scan and filter a part (not the whole) of the same input file. To simplify this, suppose that Qi can reuse some or all attributes and some retrieved rows of Qj's result; this means that Qi has a partial reused-based opportunity on Qj, which can be defined as:

Definition 2. Partial Reused-based Opportunity

∀Qi, Qj ∈ GQ: (Ri = Rj ∧ Ai ⊆ Aj ∧ Fi ∩ Fj ≠ ∅ ∧ Fi ⊄ Fj) → PRO(Qi, Qj), Qi = (Qi ∩ Qj) ∪ (Qi \ Qj)

Namely, given PRO(Qi, Qj), Qi will partially reuse Qj's result.

According to the partial reused-based opportunity definition, if Qi is a partial SubQuery of Qj, then the non-derived results for Qi can be defined as:

Definition 3. Non-Derived Results

SNDRi = Qi \ (Qi ∩ Qj)

Let the multi-query Q = [Q1, Q2, …, Qn] consist of partial subqueries of the same query; the non-derived reused result, NDR, is defined as:

NDR = ⋃_{i=1}^{n} SNDRi

3.4.3. No reused-based opportunity

For multi-query optimization performance, it may be impossible to find shared parts among the multiple queries. For example, the multiple queries may scan different relations. Also, they cannot reuse each other's results, even when they use the same relations, if they select different attributes or different tuples. To simplify this, suppose that Qi cannot reuse any attribute or any retrieved row of Qj's result; this means that Qi has no reused-based opportunity on Qj, which can be defined as:

Definition 4. No Reused-based Opportunity

∀Qi, Qj ∈ GQ: (Ri ≠ Rj ∨ (Ri = Rj ∧ (Ai ⊄ Aj ∨ Fi ∩ Fj = ∅))) → NRO(Qi, Qj), Qi ∩ Qj = ∅

Namely, given NRO(Qi, Qj), Qi will not reuse Qj's result.
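Definitions 1–4 can be read as a classification procedure over pairs of queries. A hedged Python sketch, reusing the illustrative Query class from Section 3.3 and assuming interval filters (the function names are mine, not the paper's):

def overlap(f1, f2):
    # Intersection of two closed interval filters, or None if they are disjoint.
    lo, hi = max(f1[0], f2[0]), min(f1[1], f2[1])
    return (lo, hi) if lo <= hi else None

def classify_opportunity(qi, qj):
    # How can qi reuse qj's result? Returns 'FRO', 'PRO', or 'NRO'.
    if qi.relation != qj.relation or not qi.attributes <= qj.attributes:
        return "NRO"   # Definition 4: different relation or non-covered attributes
    if overlap(qi.filter_range, qj.filter_range) is None:
        return "NRO"   # Definition 4: disjoint filters
    if qj.filter_range[0] <= qi.filter_range[0] and qi.filter_range[1] <= qj.filter_range[1]:
        return "FRO"   # Definition 1: Fi lies entirely inside Fj
    return "PRO"       # Definition 2: overlapping filters with a non-derived remainder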

4. Coarse-grained reused-based opportunity problem

In this section, the coarse-grained reused-based opportunity for the multi-query problem is introduced and formulated in two folds: coarse-grained tuple sizes and non-uniform data distribution. Also, the partial reused-based opportunity problem is highlighted.

4.1. Coarse-grained tuples size and non-uniform data distribution

Although exploiting the reused-based opportunities (i.e., fine-grained overlapping) is useful to optimize multi-query, it is not enough to select the smallest reused results, which could reflect negatively on Big Data query processing. Thus, other criteria are needed to explore the coarse granularity of the reused-based opportunities and estimate them to improve query processing, especially in a slow-storage Big Data computing environment. Fundamentally, in relational databases, the vertical rows are called tuples, and the horizontal columns are called attributes, where a set of attributes forms a tuple. In the context of this work, the granularities of the number of tuples and their sizes are fairly large, so the granularity of the reused-based opportunities will be considered. Therefore, the histogram can estimate the tuple numbers vertically, and the metadata can maintain the size of tuples horizontally. According to the example in Fig. 6, four input queries (Q1, Q2, Q3 and Q4) over the relation R(a, …, z) have coarse-grained attribute sizes and coarse-grained classes of data (i.e., C1, C2, C3, …, C10) which are scattered non-uniformly around the (a) attribute. Furthermore, these queries highlight the coarse-grained reused-based opportunity challenges regarding non-equal tuple sizes and non-uniform data distribution (see Fig. 6).

According to the query model (R, A, F), Q1, Q2, Q3 and Q4 are modeled as (R, A1, F1), (R, A2, F2), (R, A3, F3), and (R, A4, F4) respectively. It is noted that Q1, Q2 and Q3 are independent regarding their different attributes, as A1 ≠ A2 ≠ A3. However, Q4 overlaps with the other three queries (see Table 1). It can reuse their results because they are considered predecessors of Q4, such that Q1 ≺ Q4, Q2 ≺ Q4, or Q3 ≺ Q4. Fig. 6 depicts the data shared with respect to Q4 (i.e., the first four classes of tuples within the a, b, and c attributes) in the Q1, Q2 and Q3 results, namely R1, R2 and R3 respectively. Moreover, Table 1 presents the estimations of these shared data, which indicate the essential influence of the vertical estimation (i.e., the number of tuples per class) and the horizontal estimation (i.e., the tuple size, calculated by summing the attribute sizes) on choosing the appropriate shared results for the overlapped queries. For example, although Q1 has the largest number of items (i.e., 35 items), these items may have the lowest estimated cost (i.e., the cost of reading the whole file which includes the 35 items) relative to Q2 and Q3, with 18 and 20 items respectively.

Substantially, the estimated sharing between any pair of queries is error-prone in terms of the coarse-grained attribute sizes (i.e., non-equal shared tuple sizes) and the non-uniform data distribution. Consequently, the combination of the coarse-grained size of tuples and the number of tuples can be estimated to obtain more accurate reused-based opportunities. Furthermore, the optimized reused result sizes will accelerate data retrieval on slow Big Data storage and reduce redundant computations. In addition, the partial reused-based opportunity between two shared queries requires a part of the result, which incurs querying both overlapped queries. Therefore, the required results, called non-derived results, need to be optimized rather than retrieved
from the whole input file multiple times, especially for multiple partial shared queries.

Fig. 6. The gray shadowed items are shared in R1, R2, and R3, with coarse-grained tuple sizes and tuple numbers.

Table 1
Estimated shared results of Q1, Q2, and Q3 on Q4.

Query  Sharing/Overlapping  Number of Sharing/Overlapping  Reused Result  Horizontal Estimation (Attributes)  Vertical Estimation (Tuples No.)  Horizontal-Vertical Estimation
Q1     A4 ⊂ A1, F4 ⊂ F1     3 attributes, 4 classes        R1             7 attributes                        5 classes                         35 items
Q2     A4 ⊂ A2, F4 ⊂ F2     3 attributes, 4 classes        R2             3 attributes                        6 classes                         18 items
Q3     A4 ⊂ A3, F4 ⊂ F3     3 attributes, 4 classes        R3             4 attributes                        5 classes                         20 items

Hereafter, the work in this paper addresses comprehensively the fully and partial coarse-grained reused-based opportunities for multi-query in terms of non-equal tuple sizes, non-uniform data distribution, and the I/O speed of Big Data storage in a distributed environment (i.e., in the Big Data context, the data is stored on the main file system, e.g., HDFS).

4.2. Coarse-grained reused-based opportunity problem formulation

According to the definitions in the preliminaries, suppose a set of n input queries Q = [Q1, Q2, …, Qn], where the reused-based opportunity for each input query Qi is defined as RO(Qi). The goal is to select the minimum-cost reused result with respect to the tuple size and the number of tuples for each Qi, defined as the reused opportunity weight:

RO(Qi) = min( ROW( t(ROij), TszQj, |TQj| ) ), ∀i, j ∈ [1, 2, …, n], i ≠ j    (1)

where ROW() denotes the weight of the reused-based opportunity to retrieve the result of Qi by reusing the result of Qj; t(ROij) denotes the type of the reused-based opportunity between Qi and Qj (i.e., Fully/Partial); and TszQj and |TQj| denote the size of the reused tuples and the number of reused tuples of Qj, respectively. The output is derived from the multi-query execution plans (i.e., non-concurrent and concurrent), which can be described as two lists:

Non_Con_Query_List = [Q′1, Q′2, …, Q′m]

Con_Query_List = [Q′1, Q′2, …, Q′l]

5. The proposed reused-based estimation techniques

The work in this paper is based on the Relaxed MRShare technique, with consideration of non-uniform data distribution as well as coarse-grained tuple sizes of the reused-based opportunities among shared multi-query [13]. Fig. 7 depicts the abstract level of the reused-based opportunity between two shared queries (i.e., Qj reuses the result of Qi) using two reused-based estimation techniques: the Relaxed Reused-based Technique (RRT) and our proposed Granular Reused-based Technique (GRT), to optimize multi-query.

Fig. 7. The high-level view of the reused-based opportunity between two shared queries.

5.1. Relaxed reused-based technique

The Relaxed Reused-based Technique (RRT) is concerned with fine-grained overlapping under a uniform data distribution [13]. The uniform data distribution is one of the statistical data distribution types, in which every possible distinct class of data has the same number of elements. In the work of this paper, the discrete distribution is applied. In SQL terms, the distinct classes of data are called DISTINCT values. Each DISTINCT value has an equal probability of occurrence in the data source (i.e., table), and it can be retrieved by summary or total queries (i.e., aggregation queries such as COUNT and the GROUP BY clause). Consequently, the RRT estimates the reused-based opportunities among shared queries using the shared DISTINCT values in the overlapped reused results. Hence, the smaller the number of reused DISTINCT values selected, the better the decisions that will be made for multi-query optimization.

5.2. Granular reused-based technique

Our developed Granular Reused-based Technique (GRT) is concerned with coarse-grained overlapping under non-equal tuple sizes and a non-uniform data distribution, using metadata and histograms respectively to identify the sharing among multi-query. Fundamentally, the data distribution in real systems is not evenly balanced throughout an object, so histogram statistics should be created [24]. The more statistics are gathered, the better the decisions the cost-based optimizer will make. For this reason, the reused-based opportunities of intermediate results in multi-query optimization need to be evaluated with respect to the data distribution. Therefore, the reused-based opportunity estimator applies pre-computed histograms to estimate the selectivity of predicates [25,27]. Although the histogram identifies the reused-based opportunity with respect to the non-uniform distribution of shared data, the metadata is considered more valuable to discover additional granular data sharing among multi-query. In particular, the extracted information regarding the size of attributes (i.e., tuple sizes) for different reused results can refine the data sharing by selecting the smallest reused results accurately. Therefore, the data size of the overlapped reused results is estimated horizontally with respect to the tuple sizes and vertically with respect to the number of tuples. More details will be clarified later in the illustrative case study of the three reused-based estimation techniques: NT, RRT, and GRT.

6. The proposed MOTH system

Again, the proposed MOTH system optimizes Big Data multi-query by exploiting the coarse-grained reused-based opportunities of results being loaded from slow storage. It consists of two parts: exploiting the coarse-grained reused-based opportunities of the tuple sizes, and of the non-uniform data distribution, among the shared multi-query to improve performance with respect to the I/O speed of Big Data storage. More specifically, the cost of reused results can be minimized by considering additional estimation criteria such as metadata- and histogram-based estimation. A high-level overview of the proposed MOTH system and its modules is depicted in Fig. 8.

Fig. 8. MOTH system architecture.

The Query Parser Module parses each input query based on the SQL regular expression and then outputs a set of SQL tokens, such as the relation name, attribute names, SQL clauses, and so on, represented as (R, A, F). The Sharing Classifier Module receives the extracted tokens and then divides the sharing opportunities of the input queries regarding R, A, and F into two pre-defined lists: the shared and non-shared query lists. The pseudo code of this module is represented in an algorithm called MOTH Sharing Classifier (see Appendix A, Algorithm 1). The MOTH Reused-based Multi-Query Optimizer Module investigates the sharing opportunities to produce an optimized plan. This module consists of three sub-modules: the Reused-based Estimator, the Reused-based Enumerator, and the Partial Reused-based sub-modules. The Reused-based Estimator is considered the main module in the proposed MOTH system because it estimates the coarse-grained reused-based opportunities in the shared multi-query. Then, the Reused-based Enumerator selects the parent query(s) from the multiple shared queries with the minimum reused results, using a search algorithm (e.g., greedy or dynamic programming) to generate a multi-query execution plan. After that, the Partial Reused-based module re-optimizes the generated plan and produces it as a tree. The produced tree is defined as follows:

Definition 5. Reused-based Multi-Query Tree

- ∀Qi: level(Qi) = 0 → Qi is namely the Root Query, which is the ancestor of all successor queries.
- ∀(Qi, Qj): Qj ≺ Qi ∧ level(Qj) = level(Qi) − 1 → Qj is namely the Parent Query of Qi.
- ∀(Qi, Qj): Qi ≻ Qj ∧ level(Qi) = level(Qj) + 1 → Qi is namely a SubQuery of Qj.
- ∀Qi: ∑_{j=1}^{n} SubQueryj(Qi) = 0 → Qi is namely a Leaf Query, which has no SubQuery as a successor.

Finally, the Query Rewriter Module divides the generated tree into sub-trees (i.e., a multi-query plan). Then, it refines and rewrites this plan to produce two final plans: the Non-Concurrent and the Concurrent Multi-Query Plan. These multi-query plans are submitted back

to the Hadoop cluster to be executed, and the results are then written back to HDFS. The pseudo code of this module is represented in an algorithm called MOTH Query Rewriter (see Appendix A, Algorithm 4).

7. MOTH reused-based multi-query optimizer modules

The three sub-modules of the proposed MOTH reused-based multi-query optimizer module, namely the Reused-based Estimator, the Reused-based Enumerator, and the Partial Reused-based sub-modules, are discussed in detail in the following subsections.

7.1. Reused-based estimator module

The Reused-based Estimator module is implemented as a cost model. The function of this cost model is to estimate the cost of the reused-based opportunities for the multi-query Q = [Q1, Q2, …, Qn] using the MOTH reused-based techniques (i.e., RRT and GRT). Two types of parameters, system parameters and multi-query parameters, are passed to the cost model. The system parameters calibrate the MOTH cost model according to the Hadoop cluster specification [27]. These parameters include two types of costs: I/O costs (i.e., HDFS I/O, local disk I/O, and network I/O) and CPU cost. The multi-query parameters (i.e., metadata, histograms, and the reused-based opportunities among shared queries) are defined according to the input multi-query. Since the HDFS counters are measured in bytes, such as HDFS_BYTES_READ and HDFS_BYTES_WRITTEN, all costs are estimated in units of nanoseconds to process a byte that needs to be read from, or written back to, HDFS [28,29]. For simplicity, the parameters in the MOTH reused-based cost model are categorized into three groups: system parameters, multi-query parameters, and calculated parameters (see Table 2).

Table 2
The description of the parameters in the MOTH reused-based cost model.

Parameter Group                Parameter Notation   Description
System Parameters Group        Clr                  Cost ratio of local disk read
                               Clw                  Cost ratio of local disk write
                               CRHDFS               Cost ratio of HDFS read
                               CWHDFS               Cost ratio of HDFS write
                               CNET                 Cost ratio of network I/O
                               CCPU                 Cost ratio of CPU computation
Multi-Query Parameters Group   |TQi|                Number of tuples in Qi
                               |TIF|                Number of tuples in the input file
                               TszQi                The size of a tuple (in bytes) in Qi
                               PQ                   ParentQuery
                               SQ                   SubQuery
                               NSQPQi               Number of SubQuery(s) of PQi
                               FRO                  Fully Reused-based Opportunity
                               PRO                  Partial Reused-based Opportunity
Calculated Parameters Group    CostRead             Total cost of read
                               CostWrite            Total cost of write
                               CostFilter           Total cost of filter
                               CostQi               Query cost
                               CostPQi              ParentQuery cost

Suppose a SELECT query Qi (i.e., a no-grouping query on a single table) consists of reading the data from HDFS and writing the result back to HDFS. Since shared data on plain multi-query is studied throughout this work, no shuffling through the network is needed. Therefore, the total cost is defined as the summation of the read and write costs, described as follows [11,13,27,29]:

Cost = CostRead + CostWrite    (2)

Again, the input query is modeled as (R, A, F), where R is the relation name (i.e., table name), A is the set of attributes, and F is the applied filter. The filter F is used to prune unqualified tuples. For

nal Science 26 (2018) 432–452 439

simplicity, only one filter is assumed, so the total cost is computedas follows [27]:

CostFilter = CCPU |T | (3)

Cost = CostRead + CostFilter + CostWrite (4)

Substantially, the optimized reused-based opportunity RO(Qi) for a given Qi is modeled as:

RO(Qi) = min( ROW( t(ROij), TszQj, |TQj| ) ), t(ROij) ∈ {FRO, PRO}, ∀i, j ∈ [1, 2, …, n], i ≠ j    (5)

where ROW() denotes the weight of the reused-based opportunity to retrieve the result of Qi by reusing the result of Qj, and t(ROij) denotes the type of the reused-based opportunity between Qi and Qj (i.e., Fully/Partial). More specifically, the weighted reused-based opportunities of all potential shared pair queries (Qi, Qj) are estimated using the two reused-based techniques: RRT for the existing fine-grained reused-based opportunities, and GRT for the coarse-grained reused-based opportunities. The details of the RRT technique are described in [13,30]. The GRT technique is introduced in this work to estimate the cost of Qi based on the tuple size and the number of tuples, which are extracted from the metadata and the pre-computed histogram respectively; it can be described as follows:

|TQi| = GetTuplesNumberFromHistogram(Qi)    (6)

TszQi = GetTupleSizeFromMetadata(Qi)    (7)

CostQi = CRHDFS · |TQi| · TszQi + CCPU · |TQi|    (8)

The pseudo code of the Update Reused Opportunity Weight algorithm, which updates the reused-based opportunity weights of the fully and partial sharing according to Eqs. (6)–(8), is listed in Appendix A, Algorithm 2.
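As a sketch of Eqs. (6)–(8), the Python fragment below estimates the GRT cost of a query. The cost ratios are illustrative placeholders rather than the paper's calibrated values, estimate_tuples is the histogram helper sketched in Section 3.2, and column_widths stands in for the column-level metadata:

C_RHDFS = 1.5  # assumed cost ratio of HDFS read (ns per byte); illustrative only
C_CPU = 0.1    # assumed cost ratio of CPU computation (ns per tuple); illustrative only

def grt_cost(query, histogram, column_widths):
    # Eq. (6): |TQi| from the pre-computed histogram (vertical estimation).
    edges, counts = histogram
    tuples_no = estimate_tuples(edges, counts, *query.filter_range)
    # Eq. (7): TszQi by summing the selected attributes' byte widths (horizontal estimation).
    tuple_size = sum(column_widths[a] for a in query.attributes)
    # Eq. (8): HDFS read cost plus CPU filtering cost.
    return C_RHDFS * tuples_no * tuple_size + C_CPU * tuples_no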

After calculating the costs of the fully and partial sharing, the input queries can be classified into a set of ParentQuery(s) and SubQuery(s) as follows:

PQ = [PQ1, PQ2, …, PQi, …, PQm], 1 ≤ m < n    (9)

SQ = [SQ1, SQ2, …, SQj, …, SQl], 1 ≤ l < n    (10)

Since each SubQuery can reuse different ParentQuery(s') results, the Reused-based Estimator module estimates the candidate ParentQuery(s), and then the Reused-based Enumerator module finds the appropriate ParentQuery for each SubQuery to minimize the overall cost of the input multi-query, according to the following formula:

∀SQj: PQi = arg min(cost(PQ))    (11)

On the other hand, each ParentQuery result can be reused by a number of SubQuery(s), which is represented as follows:

∀PQi: NSQ = [NSQPQ1, NSQPQ2, …, NSQPQm], 1 ≤ m < n    (12)

Ultimately, the reused cost of a ParentQuery and its SubQuery(s) is calculated using Eq. (8) and presented as follows:

CostPQi = CostQi · NSQPQi    (13)

Therefore, the total cost of read is defined as follows:

CostRead = CRHDFS · |TIF| + CCPU · |TIF| + ∑_{i=1}^{m} CostPQi    (14)

where IF is the input file, which is read by the RootQuery.

Regarding the write cost, only the necessary attributes are concerned, as they constitute the desired results. Meanwhile, the write cost of the queries' results is not included in the comparison because it is common to all the proposed reused-based estimation techniques. The write cost is represented as follows [13]:

CostWrite = ∑_{i=1}^{n} CWHDFS · |TQi| · TszQi    (15)
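Combining Eqs. (13)–(15), a hedged sketch of the plan-level cost, built on the illustrative constants and the grt_cost helper above (C_WHDFS is likewise an assumed placeholder):

C_WHDFS = 2.0  # assumed cost ratio of HDFS write (ns per byte); illustrative only

def plan_cost(queries, parents, n_subqueries, input_file_tuples, histogram, column_widths):
    # Eq. (14): one scan of the input file by the RootQuery ...
    cost_read = (C_RHDFS + C_CPU) * input_file_tuples
    # ... plus each parent result re-read by the SubQuery(s) that reuse it (Eq. 13).
    for pq in parents:
        cost_read += grt_cost(pq, histogram, column_widths) * n_subqueries[pq]
    # Eq. (15): every query writes its necessary attributes back to HDFS.
    cost_write = 0.0
    edges, counts = histogram
    for q in queries:
        tuples_no = estimate_tuples(edges, counts, *q.filter_range)
        tuple_size = sum(column_widths[a] for a in q.attributes)
        cost_write += C_WHDFS * tuples_no * tuple_size
    return cost_read + cost_write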

7.2. Reused-based enumerator module

The output of the reused-based cost model is passed to the Reused-based Enumerator sub-module to generate different multi-query execution plans and then select the cheapest multi-query execution plan using a greedy search algorithm. Although the selected multi-query execution plan incurs a number of multi-queries that could be executed simultaneously, concurrent execution is not always optimal, especially over slow storage. Therefore, the MOTH reused-based query optimizer can be adapted to assess the feasibility of alternative multi-query execution plans according to the storage speed.

In a bit more detail, the proposed MOTH system applies a two-phase reused-based optimizer. The first phase generates the multi-query execution plans according to the estimation techniques (i.e., RRT and GRT). The second phase chooses the final multi-query execution plan with the lowest total cost, which is sent to the Partial Reused-based module to re-optimize the partial queries. Ultimately, the Query Rewriter module refines and rewrites the final multi-query plan and then produces two final plans: the non-concurrent and the concurrent multi-query plans. The pseudo code of the MOTH Reused-based Optimizer Module is presented in Appendix A, Algorithm 3.
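A minimal greedy enumeration in the spirit of Eq. (11), built from the classify_opportunity and grt_cost sketches above (an assumption-laden illustration, not the paper's Algorithm 3):

def greedy_enumerate(queries, histogram, column_widths):
    # For each query, greedily pick the cheapest parent whose result it can reuse (Eq. 11).
    plan = {}  # maps a SubQuery to its chosen ParentQuery (None = scan the input file)
    for qi in queries:
        candidates = [qj for qj in queries
                      if qj is not qi and classify_opportunity(qi, qj) in ("FRO", "PRO")]
        plan[qi] = min(candidates,
                       key=lambda qj: grt_cost(qj, histogram, column_widths),
                       default=None)
    return plan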

7.3. Partial reused-based module

As far as we know, partial multi-query sharing over Big Data with respect to coarse-grained tuple sizes and non-uniform data distribution has not been studied before. Therefore, the Partial Reused-based module in the proposed MOTH system is tailored to deal with this issue (i.e., partial multi-query). After the Reused-based Enumerator generates the multi-query plan, the Partial Reused-based module is used to retrieve the non-derived results of the partial queries, which are not included in their parents, to complete their final results. The steps of this module are listed as follows (a sketch of step 3 appears after the list):

1. Identify the partial queries from the generated multi-query plan.
2. Find each sub non-derived query for each partial query in the multi-query execution plan.
3. Merge all the sub non-derived queries of all partial queries into one global query to retrieve the non-derived results at once. To clarify this step, consider the input multi-query Q = [Q1, Q2, …, Qn]; the partial shared queries are identified (see Definition 2). Moreover, the non-derived queries used to retrieve the non-derived results are called Sub Non-Derived Queries, denoted by [SNDQ1, SNDQ2, …, SNDQn] (see Definition 3). Therefore, all SNDQs are merged and unified into one query, called the Non-Derived Query (NDQ), without overlapping with the RootQuery, as follows:

NDQ = ⋃_{i=1}^{n} SNDQi, RootQuery ∩ NDQ = ∅    (16)

4. Apply the two proposed techniques, Partial Union-based and Partial Branching-based, to find the partial results for each partial query using two execution plans. The first plan, called the non-derived plan, concentrates on finding the non-derived results, while the second plan, called the update multi-query plan, concentrates on updating and rewriting the generated multi-query plan. More details about these techniques are illustrated below.

Fig. 9. (a) Non-derived query plan; (b) updated multi-query plan for PUT with one partial query considered.
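As an illustration of step 3 and Eq. (16), the following sketch merges the per-query non-derived ranges into one global NDQ (interval filters as in Section 3.3; the helper names are assumptions):

def non_derived_ranges(qi, parent):
    # Definition 3: the part of qi's filter range not covered by its partial parent.
    lo, hi = qi.filter_range
    p_lo, p_hi = parent.filter_range
    pieces = []
    if lo < p_lo:
        pieces.append((lo, min(hi, p_lo)))
    if hi > p_hi:
        pieces.append((max(lo, p_hi), hi))
    return pieces

def merge_ndq(partial_pairs):
    # Eq. (16): union all SNDQ ranges into one Non-Derived Query, coalescing overlaps.
    ranges = sorted(r for qi, parent in partial_pairs for r in non_derived_ranges(qi, parent))
    merged = []
    for lo, hi in ranges:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged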

7.3.1. Partial UNION-based technique

The key idea of the Partial UNION-based Technique (PUT) is to unify the retrieved partial results of the partial queries with their non-derived results using the non-derived plan, and to compose the final results of the input multi-query using the update multi-query plan. The two plans of PUT are described as follows:

7.3.1.1. PUT non-derived plan. This plan executes the Non-Derived Query over the input file to retrieve all the non-derived results at once for all partial queries in the generated multi-query plan (see Fig. 9a).

7.3.1.2. PUT update multi-query plan. The function of this plan is to reformulate the generated multi-query plan by unifying the Partial Parent Result (PPR) and the Non-derived Result (NDR) of each Updated Union-based Partial Query (UUPQ), described as follows (see Fig. 9b):

UUPQ = PPR ∪ NDR    (17)
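In relational terms, Eq. (17) is a duplicate-free UNION of two row sets. A trivial sketch, modeling each result as a Python set of row tuples (an assumption for illustration; in MOTH this composition would be a HiveQL UNION over files in HDFS):

def compose_uupq(partial_parent_result, non_derived_result):
    # Eq. (17): the updated partial query's final rows are PPR UNION NDR.
    return set(partial_parent_result) | set(non_derived_result)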

7.3.2. Partial branching-based technique

Although the PUT technique exploits the partial reused-based opportunity among the shared multi-query, it incurs time overheads because of the UNION operations [29]. Therefore, the Partial Branching-based Technique (PBT) is considered a modified version of the PUT technique, using branching operations instead of union operations. The key idea of the PBT technique is to branch queries using the non-derived plan, and then to branch and update the partial queries using the update multi-query plan to retrieve all desired results. The two plans of PBT are discussed as follows:

7.3.2.1. PBT non-derived plan. The PBT non-derived plan executes the Non-Derived Query over the input file to retrieve the Non-derived Result. Then, it branches the Non-derived Result (i.e., stored in one file) into sub non-derived results (i.e., SNDR1, SNDR2, …, SNDRn) by executing a set of sub non-derived queries (i.e., SNDQ1, SNDQ2, …, SNDQn) concurrently over the Non-derived Result (see Fig. 10). Furthermore, the sub non-derived results are directly assigned to their partial queries to save the loading time of the large Non-derived Result file. Therefore, the drawback of the PUT technique is overcome.
Page 10: Journal of Computational Science - Cairo University Journal of Computational Science 26 (2018) 432–452 Contents lists available at ScienceDirect Journal of Computational Science

R. Sahal et al. / Journal of Computatio

7iqRt

1

2

table (i.e., lineitem) with different tuple sizes [31]. For simplicity,

Fig. 10. Non-derived plan for PBT.

.3.2.2. PBT update multi-query plan. After the Non-derived Results branched into Sub Non-derived Result(s), the result of each partialuery, Q, is included into; Sub Non-derived Result and Partial Parentesult which are retrieved from its partial parent query. There arehree alternative branching types could be produced:

) Leaf Query, where the partial Qi is a Leaf Query (see Definition 5),so no rewriting will be done for this query (see Fig. 11.a).

) Single Branching, where the partial query, Qi, has SubQuery(s),then the result of Qi has branched into Partial Parent Result andSub Non-derived Result (see Fig. 11.b). More specifically, if, theSubQuery can fully reuse Partial Parent Result, the SubQuery isrewritten to be executed over Partial Parent Result instead of

Qi. Otherwise, the SubQuery is rewritten to be executed over SubNon-derived Result instead of Qi. This type of branching is definedas:

Fig. 11. The types of BPT update multi-query plan (a) Lea

nal Science 26 (2018) 432–452 441

Definition 6. Single Branching

Qi, SQ j, FRO(

PPQ i, SQ j

)→ PPQi is Perent Query of SQj

Qi, SQ j, FRO(

SNDQ i, SQ j

)→ SNDQi is Perent Query ofSQj

3) Double Branching, where the partial query, Qi, has SubQuery(s),then the result of Qi has branched into Partial Parent Resultand Sub Non-derived Result. Moreover, the SubQuery overlapsbetween Partial Parent Result and Sub Non-derived Result whichmeans that it can reuse partially both Partial Parent Result andSub Non-derived Result. Therefore, the SubQuery(s) can also bebranched into other SubQuery(s); the first SubQuery is executedover Partial Parent Result, which denoted by SQ-PPj and the sec-ond SubQuery is executed over Sub Non-derived Result whichdenoted by SQ-NDj . Consequently, this branching type has twobranching steps to update multi-query plan; the first branchingstep at partial query level and the second branching step at subquery level (see Fig. 11c). This type of branching is defined as:

Definition 7. Double Branching

∀Qi, SQj: PRO(PPQi, SQj) → PPQi is the Parent Query of SQ-PPj
∀Qi, SQj: PRO(SNDQi, SQj) → SNDQi is the Parent Query of SQ-NDj
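The FRO and PRO predicates used in Definitions 6 and 7 can be sketched as attribute and range checks over the paper's (R, A, F) query model; the record type below and its field names are hypothetical stand-ins for the parsed query representation.

import java.util.Set;

// Sketch of the fully/partially reused-opportunity checks over the
// (R, A, F) model: a query is a relation, an attribute set, and a filter
// range [lo, hi] on the selection attribute.
public final class ReuseChecks {
    record Query(String relation, Set<String> attributes, double lo, double hi) {}

    // FRO(parent, sub): sub can be answered entirely from parent's result,
    // i.e., same relation, attributes covered, and filter range contained.
    static boolean fro(Query parent, Query sub) {
        return parent.relation().equals(sub.relation())
                && parent.attributes().containsAll(sub.attributes())
                && parent.lo() <= sub.lo() && sub.hi() <= parent.hi();
    }

    // PRO(parent, sub): the filter ranges overlap without full containment,
    // so only part of sub's result is derivable from parent.
    static boolean pro(Query parent, Query sub) {
        boolean overlaps = sub.lo() <= parent.hi() && parent.lo() <= sub.hi();
        return parent.relation().equals(sub.relation())
                && parent.attributes().containsAll(sub.attributes())
                && overlaps && !fro(parent, sub);
    }
}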

The pseudo code of the Partial Reused-based module is presented in Algorithm 5 (see Appendix A).

8. Case study

In this section, the scenario of executing the phases of the proposed MOTH system (generating the multi-query execution plans and choosing the final multi-query execution plan) is presented. Six input queries are considered, which run over the TPC-H benchmark table (i.e., lineitem) with different tuple sizes [31]. For simplicity, the lineitem table consists of 100 million tuples which are distributed uniformly regarding the l_discount attribute values (i.e., 10 million tuples for each of 10 DISTINCT values). Some modifications are applied on the lineitem table in order to achieve non-uniform data distribution by skewing values through the l_discount attribute. The samples of input queries are listed below:



Table 3. The detailed description of the input multi-query based on the (R, A, F) model and estimation results using different criteria.

Query ID | Attributes | Filter | Distinct Value Number | Estimated Tuples (Uniform Data Distribution) | Estimated Tuples (Non-Uniform Data Distribution, Histogram) | Tuple Size in Bytes (Metadata) | Data Size in GB
Q1 | l_discount, l_orderkey, l_shipdate, l_commitdate, l_receiptdate | [0.01–0.06] | 6 | 60 million | 83465000 | 31 | 2.4
Q2 | l_discount, l_orderkey | [0.03–0.06] | 4 | 40 million | 27249000 | 11 | 0.27
Q3 | l_discount, l_orderkey, l_shipdate, l_commitdate, l_receiptdate | [0.02–0.04] | 3 | 30 million | 61652000 | 31 | 1.8
Q4 | l_discount, l_orderkey | [0.02–0.04] | 3 | 30 million | 61652000 | 11 | 0.6
Q5 | l_discount | [0.03–0.04] | 2 | 20 million | 7910000 | 5 | 0.04
Q6 | l_discount | [0.02–0.03] | 2 | 20 million | 54189000 | 5 | 0.25


8.1. Generating multi-query plans phase

This phase is executed using the following steps to select the proper ParentQuery(s).

First Step: The Query Parser module tokenizes the input multi-query with respect to the pre-defined query model (R, A, F). Table 3 presents the output of the Query Parser module: the estimated number of DISTINCT values regarding the l_discount attribute in the lineitem table, the estimated tuples number based on uniform distribution, the estimated tuples number based on the pre-computed histogram, the estimated tuple size of each query based on the stored metadata of the lineitem table, and the estimated data size which is calculated as a combination of tuples number and tuple size.
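The last column of Table 3 can be reproduced by combining the vertical and horizontal estimates; the sketch below (method names are illustrative) checks the Q1 entry.

// Combined estimation as in Table 3: tuples number from the pre-computed
// histogram (vertical, row-level) times tuple size from the column
// metadata (horizontal, column-level) gives the estimated data size.
public final class SizeEstimator {
    static double estimateGB(long histogramTuples, int tupleSizeBytes) {
        return histogramTuples * (double) tupleSizeBytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Q1: 83465000 tuples x 31 bytes, which is about 2.4 GB as in Table 3.
        System.out.printf("Q1 estimated size = %.2f GB%n", estimateGB(83_465_000L, 31));
    }
}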

Second Step: The queries are classified using the Sharing Classifier module into two lists: shared and non-shared queries.

Table 4. Estimated sharing opportunities using RRT for each pair (Qi, Qj).

Query No | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Distinct Values of Qi
Q1 | – | N,0 | P,3 | N,0 | N,0 | N,0 | 6
Q2 | F,4 | – | P,2 | P,2 | N,0 | N,0 | 4
Q3 | F,3 | N,0 | – | N,0 | N,0 | N,0 | 3
Q4 | F,3 | P,2 | F,3 | – | N,0 | N,0 | 3
Q5 | F,2 | F,2 | F,2 | F,2 | – | P,1 | 2
Q6 | F,2 | P,1 | F,2 | F,2 | P,1 | – | 2

Table 5. Estimated reused-based opportunities using RRT for each pair (Qi, Qj); m denotes million, and the best ParentQuery of Qi is shown in the last column.

Query No | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Best Parent of Qi
Q1 | – | N,0 | P,30m | N,0 | N,0 | N,0 | lineitem
Q2 | F,60m | – | P,30m | P,30m | N,0 | N,0 | Q1
Q3 | F,60m | N,0 | – | N,0 | N,0 | N,0 | Q1
Q4 | F,60m | P,40m | F,30m | – | N,0 | N,0 | Q3
Q5 | F,60m | F,40m | F,30m | F,30m | – | P,20m | Q3
Q6 | F,60m | P,40m | F,30m | F,30m | P,20m | – | Q3

Third Step: This step concerns the function of the proposed MOTH Reused-based Multi-Query Optimizer module. The Reused-based Estimator and Reused-based Enumerator sub-modules interact together to estimate the reused-based opportunities among the shared queries; then two multi-query execution plans, RRT and GRT, are generated. The GRT plan is produced in two versions, H-GRT and HM-GRT, based on the histogram and the histogram plus metadata respectively. Generally, each sharing opportunity is presented as a pair of the reused-based opportunity type (i.e., Fully, Partial and No reuse, denoted by F, P, N respectively) and the estimated sharing (i.e., the number of shared DISTINCT values for the RRT plan, the number of shared tuples for the H-GRT plan, or the data size for the HM-GRT plan). These three plans are discussed as follows.

8.1.1. RRT plan

The RRT plan is generated based on the number of overlapped DISTINCT values among the input multi-query [13]. The sharing opportunities for each pair (Qi, Qj) with respect to the number of shared DISTINCT values among the shared multi-query are presented in Table 4. For instance, the sharing opportunity of the pair of input queries (Q2, Q1) is (F, 4), which indicates that Q2 is fully shared with Q1 by 4 DISTINCT values. For each Qi in Table 5, the estimated reused-based opportunities using the Reused-based Estimator sub-module and the selected best ParentQuery(s) using the Reused-based Enumerator sub-module are presented. For example, the estimated reused-based opportunity of (Q4, Q1) is (F, 60m), which indicates that Q4 fully reuses Q1 and reads 60 million tuples (i.e., the tuples number of the Q1 result), while the estimated reused-based opportunity of (Q4, Q3) is (F, 30m). So, the Reused-based Enumerator module selects Q3 as the ParentQuery of Q4 because Q3 has the smaller number of tuples (see Table 5). The RRT multi-query plan is depicted in Fig. 12a.
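The enumeration step that Tables 5, 7 and 8 illustrate amounts to a minimum-cost choice among the fully reusable candidates; a sketch follows, with a hypothetical Candidate record standing in for the estimator's output.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Reused-based Enumerator sketch: among the ParentQuery candidates that a
// query can fully reuse, pick the one with the smallest estimated reused
// result (DISTINCT values for RRT, tuples for H-GRT, data size for HM-GRT).
public final class ParentSelector {
    record Candidate(String parentQuery, double estimatedCost) {}

    static Optional<Candidate> bestParent(List<Candidate> fullyReusable) {
        return fullyReusable.stream()
                            .min(Comparator.comparingDouble(Candidate::estimatedCost));
    }

    public static void main(String[] args) {
        // Q4 under RRT (Table 5): Q1 offers 60m tuples, Q3 offers 30m;
        // the enumerator therefore selects Q3 as the ParentQuery.
        System.out.println(bestParent(List.of(
                new Candidate("Q1", 60_000_000),
                new Candidate("Q3", 30_000_000))).orElseThrow().parentQuery());
    }
}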



Table 6. Estimated sharing opportunities using shared tuples for non-uniform data distribution.

Query No | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Tuples No of Qi
Q1 | – | N,0 | P,61652000 | N,0 | N,0 | N,0 | 83465000
Q2 | F,27249000 | – | P,7910000 | P,7910000 | N,0 | N,0 | 27249000
Q3 | F,61652000 | N,0 | – | N,0 | N,0 | N,0 | 61652000
Q4 | F,61652000 | P,7910000 | F,61652000 | – | N,0 | N,0 | 61652000
Q5 | F,7910000 | F,7910000 | F,7910000 | F,7910000 | – | P,447000 | 7910000
Q6 | F,54189000 | P,447000 | F,54189000 | F,54189000 | P,447000 | – | 54189000

Table 7. Estimated reused-based opportunities using the histogram; the best ParentQuery of Qi is shown in the last column.

Query No | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Best Parent of Qi
Q1 | – | N,0 | P,61652000 | N,0 | N,0 | N,0 | lineitem
Q2 | F,83465000 | – | P,61652000 | P,61652000 | N,0 | N,0 | Q1
Q3 | F,83465000 | N,0 | – | N,0 | N,0 | N,0 | Q1
Q4 | F,83465000 | P,27249000 | F,61652000 | – | N,0 | N,0 | Q3
Q5 | F,83465000 | F,27249000 | F,61652000 | F,61652000 | – | P,54189000 | Q2
Q6 | F,83465000 | P,27249000 | F,61652000 | F,61652000 | P,7910000 | – | Q3

Table 8. Estimated reused-based opportunities using data size based on the histogram and metadata; each value is in GB (e.g., 2.4 means 2.4 GB), and the best ParentQuery of Qi is shown in the last column.

Query No | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Best Parent of Qi
Q1 | – | N,0 | P,1.8 | N,0 | N,0 | N,0 | lineitem
Q2 | F,2.4 | – | P,1.8 | P,0.6 | N,0 | N,0 | Q1
Q3 | F,2.4 | N,0 | – | N,0 | N,0 | N,0 | Q1
Q4 | F,2.4 | P,0.27 | F,1.8 | – | N,0 | N,0 | Q3
Q5 | F,2.4 | F,0.27 | F,1.8 | F,0.6 | – | P,0.25 | Q2
Q6 | F,2.4 | P,0.27 | F,1.8 | F,0.6 | P,0.04 | – | Q4


8.1.2. GRT plan

The principle of the GRT plan is the same as the RRT plan, but it investigates the sharing opportunities concerning the coarse-grained tuple size and the non-uniform data distribution. The GRT plan has been developed in two versions: the first version estimates the sharing opportunities using the histogram (i.e., the H-GRT plan), while the second version uses the histogram metadata (i.e., the HM-GRT plan).

Therefore, the H-GRT plan is generated based on the number of overlapped DISTINCT values among the input multi-query with non-uniform data distribution consideration. The vertical estimated sharing opportunities (i.e., the number of shared tuples) for each pair (Qi, Qj) using the pre-computed histogram are presented in Table 6. For instance, the sharing opportunity of the pair of input queries (Q2, Q1) is (F, 27249000), which indicates that Q2 is fully shared with Q1 by 27249000 tuples. For each Qi in Table 7, the estimated reused-based opportunities using the Reused-based Estimator sub-module and the selected best ParentQuery(s) using the Reused-based Enumerator sub-module are presented. For example, the estimated sharing opportunities value for (Q5, Q1), (Q5, Q2), (Q5, Q3), and (Q5, Q4) is (F, 7910000), where the shared tuples are 7910000. By using the histogram to estimate the fully reused-based opportunities of (Q5, Q1), (Q5, Q2), (Q5, Q3), and (Q5, Q4), the estimated tuples numbers are 83465000, 27249000, 61652000 and 61652000 respectively. So, the Reused-based Enumerator module selects Q2 as the ParentQuery of Q5 because Q2 has the smallest number of tuples (see Table 7, Fig. 12b). According to Fig. 12b, the H-GRT plan has been improved relative to the RRT plan by 55% using smaller reused results regarding Q5 (i.e., selecting Q2 with 27249000 tuples instead of Q3 with 61652000 tuples as in the RRT plan).

Fig. 12. The MOTH reused-based multi-query execution plans.


By considering the non-uniform data distribution and the coarse-grained size of query attributes besides the overlapped DISTINCT values, the HM-GRT plan is generated. The estimated vertical and horizontal sharing opportunities (i.e., the data size, which includes the number of tuples and the tuple size) based on the pre-computed histogram and the metadata respectively for each pair (Qi, Qj) are presented in Table 8. For instance, the sharing opportunity of the pair of input queries (Q2, Q1) is (F, 2.4), which indicates that Q2 fully reuses the result of Q1 with 2.4 GB size. The estimated reused-based opportunities of (Q6, Q3) and (Q6, Q4) are fully reused 61652000 tuples as estimated by the histogram (see Table 7). According to the combination of the number of tuples and the coarse-grained tuple size estimated using the histogram and metadata, the fully reused-based opportunity of (Q6, Q3) is 1.8 GB and that of (Q6, Q4) is 0.6 GB. So, the Reused-based Enumerator module selects Q4 as the ParentQuery of Q6 because Q4 has the smallest data size (see Table 8, Fig. 12c). According to Fig. 12c, the HM-GRT plan has been improved relative to the H-GRT plan by 67% using the smallest reused result data size regarding Q6 (i.e., selecting Q4 with 0.6 GB instead of Q3 with 1.8 GB as in the H-GRT plan).
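Both improvement figures follow from the reused-result estimates in Tables 7 and 8:

1 − 27249000/61652000 ≈ 0.56, i.e., roughly the reported 55% smaller reused result for Q5 under H-GRT, and
1 − 0.6/1.8 ≈ 0.67, i.e., the 67% smaller reused result for Q6 under HM-GRT.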

Fig. 13. Multi-query execution plans; gray nodes: non-concurrent queries, white nodes: concurrent queries; each edge label represents the size of the reused result in GB. (a) NT, (b) RRT, (c) H-GRT, (d) HM-GRT plan.

8.2. Choosing final multi-query plan phase

After generating the three plans of the proposed MOTH system (i.e., RRT, H-GRT and HM-GRT), the MOTH multi-query optimizer module selects the plan with the minimum cost with respect to the histogram, metadata and shared opportunities over slow storage (see Fig. 12). Therefore, the HM-GRT multi-query plan is picked as the optimized multi-query execution plan. Then, the MOTH system sends the selected plan to the Query Rewriter module to refine it and output two plans, a concurrent multi-query plan and a non-concurrent multi-query plan, which are finally submitted to the Hadoop cluster for execution.

For simplicity, only the fully shared queries are considered in this case study. Therefore, the Partial Reused-based sub-module is not involved.

8.3. Discussions

The proposed reused-based estimation techniques of the proposed MOTH system (RRT, H-GRT and HM-GRT) are compared with respect to the NT technique, where queries are submitted sequentially to the Hadoop cluster. In particular, the discussion can be broadly classified into three perspectives: data reduction, concurrency, and throughput.


Table 9. The summary of estimated data size of the multi-query execution plans: NT, RRT, H-GRT, and HM-GRT.

Plan Name | NT | RRT | H-GRT | HM-GRT
Total data size in GB | 69.6 | 21.8 | 20.07 | 19.07
Data reduction relative to NT | – | 69% | 71% | 73%

8.3.1. Data reduction

The results of this case study are presented in Fig. 13 and Table 9. According to Fig. 13, the total data sizes which are read from HDFS are 69.6 GB, 21.8 GB, 20.07 GB and 19.07 GB for the NT, RRT, H-GRT and HM-GRT plans respectively. The data reductions of the RRT, H-GRT and HM-GRT plans relative to the NT plan are 69%, 71%, and 73% respectively (see Table 9). Therefore, the HM-GRT plan outperforms the other plans due to the consideration of coarse-grained opportunities and non-uniform data distribution among the multi-query on slow storage.
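These percentages follow directly from the totals in Table 9:

data reduction of RRT = 1 − 21.8/69.6 ≈ 0.69 (69%)
data reduction of H-GRT = 1 − 20.07/69.6 ≈ 0.71 (71%)
data reduction of HM-GRT = 1 − 19.07/69.6 ≈ 0.73 (73%)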

8.3.2. Concurrency

According to the results in Fig. 13, it is found that the concurrency degree of the RRT plan is higher than that of the GRT plans (i.e., H-GRT and HM-GRT) of the proposed MOTH system. Regarding slow storage, increasing the concurrency degree may elongate the read latency, which affects the Big Data multi-query performance. Therefore, the GRT plans are considered proper for slow storage to avoid degradation of concurrent queries performance (i.e., resource contention).

8.3.3. Throughput

Fundamentally, the throughput measures the number of processed queries in a period, which can be expressed as queries per second or minute [32]. Conventionally, by increasing the concurrency degree, the throughput is increased. In the Hadoop environment, increasing the concurrent queries would increase throughput, but it could potentially cause long response times due to the competition for accessing shared resources, especially slow storage. Consequently, multi-query optimization needs balancing, as well as adaptability, according to the multi-query characteristics and the Hadoop environment specification, such as storage speed and network bandwidth. Therefore, the proposed MOTH system can adapt itself to choose between the RRT and GRT plans based on the storage type (i.e., fast or slow) to increase throughput.

9. Experimental evaluation

In this section, the experimental evaluation of the proposed MOTH system is presented, starting by describing the experiment setup.

9.1. Experiment setup

The experiments have been performed using Hadoop version 2.6.0 and Flink version 0.9.1 on a cluster of 10 nodes configured with 4 GB of RAM, 2 cores, and 200 GB disk, running over Ubuntu Linux 14.04.4 LTS. MapReduce Hive version 0.14.0 is installed on the Hadoop Namenode and Datanodes. In addition, the Hadoop configuration parameter, io.mapred.task.io.sort.mb, was tuned between 300 MB and 2 GB based on the number of tuples. Also, the MOTH system is implemented in Java.

Table 10. MOTH system parameters.

System Parameter | Cost Value
local disk reads | 1
local disk writes | 1
HDFS read | 1.2
HDFS write | 2
Network I/O | 1.2

9.2. MOTH cost model parameters

According to the MOTH reused-based cost model, some simple read and write benchmark jobs were run over the Hadoop cluster to test the I/O performance. The considered system parameters are listed in Table 10, while the multi-query parameters are defined according to the sharing among the input queries.
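As an illustration of how these benchmarked weights can enter the reused-based cost model, the sketch below scales a plan's estimated I/O volumes by the relative costs of Table 10; the weighted sum shown is an assumption for illustration, not the paper's exact cost expression.

// Weighted I/O cost sketch using the measured relative costs of Table 10.
public final class IoCostModel {
    static final double LOCAL_READ = 1.0, LOCAL_WRITE = 1.0;
    static final double HDFS_READ = 1.2, HDFS_WRITE = 2.0, NETWORK_IO = 1.2;

    // Estimated plan cost for given data volumes (in GB) per I/O class.
    static double planCost(double hdfsReadGB, double hdfsWriteGB,
                           double shuffleGB, double localReadGB, double localWriteGB) {
        return HDFS_READ * hdfsReadGB + HDFS_WRITE * hdfsWriteGB
                + NETWORK_IO * shuffleGB
                + LOCAL_READ * localReadGB + LOCAL_WRITE * localWriteGB;
    }
}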

9.3. Datasets and queries

The structured data used through the experiments is generated using the TPC-H Benchmark with different sizes [31]. The lineitem table in TPC-H was used with some modifications of the l_discount attribute values (i.e., a range attribute is used as the selection attribute which stores disjoint DISTINCT values). These modifications have been done to achieve the non-uniform data distribution needed to compute the histogram. On the other hand, the used synthetic queries are represented as follows:

SELECT a1, a2, ..., an FROM lineitem WHERE a ≤ l_discount ≤ b
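A small generator for this template is sketched below; the method is hypothetical, and the printed instance corresponds to Q2 of the case study (HiveQL's BETWEEN is used for the closed range).

// Instantiates the synthetic range-query template over lineitem.
public final class QueryGenerator {
    static String rangeQuery(String attributes, double a, double b) {
        return "SELECT " + attributes
                + " FROM lineitem WHERE l_discount BETWEEN " + a + " AND " + b;
    }

    public static void main(String[] args) {
        // e.g., Q2: two attributes selected over the range [0.03, 0.06].
        System.out.println(rangeQuery("l_discount, l_orderkey", 0.03, 0.06));
    }
}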

In addition, two types of input multi-query are used, Light Multi-Query (LMQ) with a small tuple size and Heavy Multi-Query (HMQ) with a large tuple size, with different tuples numbers (see Table 11). These two types of multi-query are used to evaluate the efficiency of the proposed MOTH system in selecting the proper multi-query execution plan.

Table 11. Types of multi-query samples.

Notation | Multi-Query Sample | Tuple Size | Tuples Number
LMQ | Light Multi-Query | 92.53 Bytes | 100 million, 250 million, 500 million, 1 billion
HMQ | Heavy Multi-Query | 142.22 Bytes | 100 million, 250 million, 500 million, 1 billion

9.4. Histogram and metadata building

The column-level statistics of the lineitem table are gathered to build an Equal-Width histogram that provides sufficient estimation for the proposed MOTH system [23]. The selected column, l_discount, reflects the non-uniform data distribution, which divides the data into classes according to the uniqueness of its values. Aggregated queries are used to calculate the frequencies of the DISTINCT values. Hence, the computed histogram is considered as a table whose schema is described as (table name, column name, distinct value, frequency, relative frequency). The relative frequency of each DISTINCT value is computed by dividing its frequency by the total number of tuples. Finally, the computed histograms are written back to HDFS [27].
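A minimal sketch of this construction is shown below; the in-memory map is a hypothetical stand-in for the histogram table that MOTH writes back to HDFS.

import java.util.LinkedHashMap;
import java.util.Map;

// Builds the per-value rows of the histogram table described above:
// for each DISTINCT l_discount value, store its frequency and its
// relative frequency (frequency divided by the total tuple count).
public final class HistogramBuilder {
    static Map<Double, double[]> build(Map<Double, Long> distinctFrequencies,
                                       long totalTuples) {
        Map<Double, double[]> histogram = new LinkedHashMap<>();
        distinctFrequencies.forEach((value, freq) ->
                histogram.put(value, new double[]{freq, (double) freq / totalTuples}));
        return histogram;
    }
}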

To simplify the histogram presentation, the lineitem table with 100000 tuples is considered to represent the skew of the data. Substantially, the computed histogram groups the data into 11 classes with respect to the number of tuples per each class of l_discount values. Meanwhile, the Cumulative Frequency Distribution, defined as the running total of the data frequencies, is used to represent the uniform and non-uniform data distribution histograms [25]. Figs. 14 and 15 depict the uniform and non-uniform data distributions and their Cumulative Frequency Distributions for the original and modified versions of the lineitem table with respect to l_discount.


Fig. 14. The frequency of uniform and non-uniform data distribution in l_discount within the original and modified lineitem tables.

The metadata is employed to improve the productivity of data sharing. For instance, the Hive Metastore stores all information about tables: their partitions, locations, schemas, columns and types, etc. [33]. Some information is extracted from the Hive Metastore to estimate the coarse-grained reused-based opportunities accurately through the proposed MOTH system.

Fig. 15. The cumulative frequency of uniform and non-uniform data distribution in l_discount for the original and modified lineitem tables.

Fig. 16. The MOTH system multi-query execution time of light and heavy multi-query for different tuples numbers running on MapReduce and Flink.

9.5. Performance evaluation

The performance of the proposed MOTH system with respect to query execution time is evaluated using two reused-based opportunity types: fully and partially. For the fully reused-based opportunity, the generated NT, RRT, H-GRT, and HM-GRT plans are executed, while the generated NT, PUT and PBT plans are executed for the partial reused-based opportunity. Also, a small-scale environment with slow storage is used to clarify the effect of data reduction on Big Data multi-query performance.


Fig. 17. Multi-query execution time improvement wrt NT for LMQ on MapReduce and Flink using the MOTH system: (a) MapReduce, (b) Flink.

9.5.1. Fully reused-based opportunity

The fully reused-based queries are evaluated by implementing the three generated multi-query plans of the proposed MOTH system (RRT, H-GRT, HM-GRT) over Hadoop-like infrastructures (i.e., MapReduce and Flink), using two samples of queries (i.e., Light Multi-Query and Heavy Multi-Query) and different numbers of tuples (i.e., 100 million, 250 million, 500 million and 1 billion). For simplification, these samples are named MapReduce Light Multi-Query, MapReduce Heavy Multi-Query, Flink Light Multi-Query, and Flink Heavy Multi-Query, denoted as MapReduce-LMQ, MapReduce-HMQ, Flink-LMQ, and Flink-HMQ respectively.

The implementation results with respect to the execution time are presented in Fig. 16(a)–(d). Mostly, the multi-query execution time of the NT plan increases with the tuples number faster than the other generated multi-query plans of the proposed MOTH system because of the time needed to scan the large input file multiple times. For example, given 20 input queries, if the output multi-query plan of the proposed MOTH system has two root queries, it needs to scan the input file only twice rather than 20 times as in the NT plan. In addition, by increasing the number of tuples, the HM-GRT plan outperforms the NT, RRT, and H-GRT plans whether HMQ or LMQ is executed over both MapReduce and Flink.

Fig. 18. Multi-query execution time improvement wrt NT for HMQ on MapReduce and Flink using the MOTH system: (a) MapReduce, (b) Flink.


For more investigation, the execution time improvements of the RRT, H-GRT, and HM-GRT plans relative to the NT plan for LMQ and HMQ on MapReduce and Flink are presented in Figs. 17(a), (b) and 18(a), (b) respectively. According to the results of Fig. 17(a), (b), it is found that the HM-GRT plan has been improved relative to the other plans of the proposed MOTH system by 43%, 52%, 61%, and 62% for 100 million, 250 million, 500 million and 1 billion tuples respectively for LMQ on MapReduce, and by 47%, 52%, 64% and 66% respectively on Flink, while the improvement of the HM-GRT plan for HMQ on MapReduce and Flink is depicted in Fig. 18(a), (b).

To investigate the performance of the proposed MOTH system plans considering different Hadoop-like infrastructures, Fig. 19(a), (b) presents the experimental results for both LMQ and HMQ over MapReduce and Flink. According to Fig. 19, it is noticed that HMQ affects the multi-query execution time relative to LMQ over both MapReduce and Flink because of the large tuple size. In addition, the proposed MOTH system gains higher performance on Flink than on MapReduce.


Fig. 19. Average of multi-query execution time improvement wrt multi-query sample type on MapReduce and Flink using MOTH system.


Fig. 20. Multi-query execution time of different partial multi-query ratios: 10%, 25% and 50%.

We can conclude that the multi-query execution times of the RRT, H-GRT, and HM-GRT plans for LMQ have been improved relative to HMQ, using the same number of tuples, on MapReduce as well as on Flink. This is because HMQ needs a larger number of I/O operations on slow storage than LMQ for fetching the same number of tuples. In addition, the HM-GRT plan has superiority over the other plans for both LMQ and HMQ on MapReduce and Flink. The reason behind this is that the HM-GRT plan incorporates the tuple size granularities and the non-uniform data distribution for handling overlapping in the classical multi-query optimization problem running on slow storage.

9.5.2. Partial reused-based opportunity

The partial reused-based queries are evaluated by executing the two multi-query plans, the non-derived plan and the update multi-query plan, using our proposed MOTH system, considering different ratios of partial queries (i.e., 10%, 25% and 50%). At the beginning, the HM-GRT plan is applied to detect and exploit the reused-based opportunities among the multi-query. Then, three plans (NT, PUT, and PBT) are executed.

The experimental results for the multi-query execution time of the three plans of partial queries over 100 million tuples running on Hive MapReduce are depicted in Fig. 20. Conventionally, the multi-query execution time of the NT plan is the largest with respect to PUT and PBT because of not exploiting the coarse-grained partial sharing among the input multi-query.

Fig. 21. Multi-query execution time improvement for PUT and PBT wrt NT for different partial multi-query ratios.


According to the experimental results, it is found that the proposed PUT and PBT techniques outperform the NT technique by 22% and 27% on average respectively. Furthermore, as the ratio of partial queries increases, the improvement of the multi-query execution time using PUT and PBT decreases (see Fig. 21). The reason behind this is that the non-derived results size is increased by increasing the ratio of partial queries. Intuitively, by increasing the partial queries ratio, the proposed PUT and PBT plans behave like the NT plan.

Because the PBT plan executes branched queries instead of union queries, as in the PUT plan, the number of I/O operations is decreased. Therefore, the PBT plan outperforms the PUT plan by 4%, 7%, and 9% for the 10%, 25% and 50% ratios of partial queries respectively (see Fig. 22).

Generally, the conducted experiments indicate that exploiting partial reused-based opportunities for large-scale distributed data can reduce the cost of I/O operations, and thus optimize Big Data multi-query. Ultimately, the consideration of the combination of multiple factors, such as the non-uniform data distribution, the coarse granularity of tuple size, the partial queries ratio, and the I/O speed of the Hadoop storage, can gain significant improvement for partial queries using the proposed MOTH system.

Fig. 22. Multi-query execution time improvement for PBT wrt PUT for different partial multi-query ratios.


10. Conclusions

One of the key issues for optimizing query processing in Big Data analytical systems is to reduce the cost of data processing (i.e., read, write) over the Big Data infrastructure. According to this work, the proposed MOTH system is considered the first system to exploit the coarse granularity of fully and partially reused-based opportunities in Big Data multi-query optimization. The main principle of the proposed MOTH system is to investigate the coarse-grained reused-based opportunities horizontally (i.e., non-equal tuples size) and vertically (i.e., non-uniform data distribution of the tuples number) using metadata and a histogram with slow storage consideration. According to the MOTH system, three multi-query plans are generated to study the fully reused-based opportunities (i.e., RRT, H-GRT and HM-GRT), and two multi-query plans are generated to study the partial reused-based opportunities (i.e., PUT and PBT). A comparative study has been done to implement and evaluate the performance of the MOTH generated plans with respect to the Naive plan over MapReduce and Flink using the TPC-H benchmark. Regarding the fully reused-based opportunities, the experimental results show that the RRT, H-GRT and HM-GRT plans outperform the NT plan on average by 40%, 45% and 50% respectively on MapReduce, and by 49%, 52%, and 55% respectively on Flink. In the case of partial reused-based opportunities, the PUT and PBT plans outperform the NT plan on average by 22% and 27% respectively on MapReduce.


Appendix A.

Algorithm 1: MOTH Sharing Classifier

Input: Q = [Q1, Q2, ..., Qn] (parsed queries)
Output: SharedQ = [SQ1, SQ2, ...], NonSharedQ = [NQ1, NQ2, ...]

// Step 1: initialization
1. SharedQ = { };
2. NonSharedQ = { };
3. shareMatrix = new String[n][n];
// Step 2: classify parsed queries
4. foreach Qi in Q do
5.   foreach Qj in Q do
       // only skip the same index such as (Q1, Q1); the same queries at different indexes are compared
6.     if (i == j) then continue;
7.     if (FRO(Qi, Qj)) then
8.       SharedQ = SharedQ ∪ Qi; shareMatrix[i][j] = "F";
9.     end if
10.    if (PRO(Qi, Qj)) then
11.      SharedQ = SharedQ ∪ Qi; shareMatrix[i][j] = "P";
12.    else
13.      shareMatrix[i][j] = "N";
14.    end if
15.  end for
16. end for
// Step 3: find non-shared queries (a query is non-shared if its whole row is "N")
17. foreach i from 1 to n do
18.   share = 0;
19.   foreach j from 1 to n do
20.     if (shareMatrix[i][j] != "N") then share++;
21.   end for
22.   if (share == 0) then
23.     NonSharedQ = NonSharedQ ∪ Qi;
24.   end if
25. end for
26. end

Algorithm 2: Update Reused Opportunity Weight

Input: DAG = [DAG1, DAG2, ..., DAGn]
Output: UpdatedDAG = [DAG1, DAG2, ..., DAGn]

/* update each DAG with respect to the reused type and the length of the reused results in terms of the histogram and metadata */
1. foreach DAGi in DAG do
2.   foreach nodei in DAGi do
3.     foreach edgej of nodei do
4.       reusedType = SharingType(nodei, edgej);
5.       distinctNo = DistinctValues(edgej);
6.       tuplesNo = EstimateTuples(edgej, Histogram);
7.       tupleSize = EstimateSize(edgej, Metadata);
8.       edgej.weight = Weight(reusedType, distinctNo, tuplesNo, tupleSize);
9.     end for
10.  end for
11. end for
12. end

Algorithm 3: MOTH Reused-based Multi-Query Optimizer

Input: SharedQ = [SQ1, SQ2, ...], NonSharedQ = [NQ1, NQ2, ...]
Output: OptimizedPlan = [Q1, Q2, ..., Qn]

/* generate optimized queries represented as a special DAG called a rooted tree, ordered on sharing type and data distribution */
1. double totalCost = 0;
// Step 1: calculate the estimated cost of the non-shared queries
2. foreach NQi in NonSharedQ do
3.   totalCost += Cost(NQi);
4. end for
// Step 2: calculate the estimated cost of the shared queries
5. fullyList = { };
6. partialList = { };
7. foreach SQi in SharedQ do
8.   foreach SQj in SharedQ do
       /* get all nodes which have fully/partial reuse with SQi; these nodes are candidate parents of SQi */
9.     fullyList = FullyReusedNodes(SQi, SQj);
10.    partialList = PartialReusedNodes(SQi, SQj);
       /* if there are queries in the fully/partial node lists, SQi has different candidate parent queries, and the following steps find its optimal parent */
11.    if (fullyList != empty) then
12.      parent = MinCostParent(fullyList);
13.      totalCost += Cost(SQi, parent);
14.      OptimizedPlan = OptimizedPlan ∪ SQi;
15.      UpdateLevel(SQi);
         /* this function updates the level of SQi with respect to its parent query with the minimum cost, and updates its sub queries if they were inserted early */
16.    end if
17.    if (fullyList == empty AND partialList != empty) then
18.      parent = MinCostParent(partialList);
19.      totalCost += Cost(SQi, parent);
20.      OptimizedPlan = OptimizedPlan ∪ SQi;
21.      UpdateLevel(SQi);
22.    end if
23.    if (fullyList == empty AND partialList == empty) then
24.      totalCost += Cost(SQi);
25.      SQi.level = 0;
26.      UpdateLevel(SQi);
27.      OptimizedPlan = OptimizedPlan ∪ SQi;
28.    end if
29.  end for
30. end for
31. foreach NQi in NonSharedQ do
32.   OptimizedPlan = OptimizedPlan ∪ NQi;
33. end for
34. end

References

[1] R. Akerkar, Big Data Computing, CRC Press, 2013.
[2] A. Gkoulalas-Divanis, A. Labbi, Large-Scale Data Analytics, Springer, 2014.
[3] C.P. Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: a survey on Big Data, Inf. Sci. 275 (2014) 314–347.
[4] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[5] S. Babu, H. Herodotou, Massively parallel databases and MapReduce systems, Found. Trends Databases 5 (2013) 1–104.
[6] N. Spangenberg, M. Roth, B. Franczyk, Evaluating new approaches of big data analytics frameworks, Bus. Inf. Syst. (2015) 28–37.
[7] C. Eiras-Franco, V. Bolón-Canedo, S. Ramos, J. González-Domínguez, A. Alonso-Betanzos, J. Tourino, Multithreaded and Spark parallelization of feature selection filters, J. Comput. Sci. 17 (Part 3) (2016) 609–619.
[8] T.K. Sellis, Multiple-query optimization, ACM Trans. Database Syst. (TODS) 13 (1988) 23–52.
[9] R. Sahal, M.H. Khafagy, F.A. Omara, A survey on SLA management for cloud computing and cloud-hosted big data analytic applications, Int. J. Database Theory Appl. 9 (2016) 107–118.
[10] H. Liu, D. Xiao, P. Didwania, M.Y. Eltabakh, Exploiting soft and hard correlations in big data query optimization, Proc. VLDB Endow. 9 (2016).
[11] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, N. Koudas, MRShare: sharing across multiple queries in MapReduce, Proc. VLDB Endow. 3 (2010) 494–505.
[12] I. Elghandour, A. Aboulnaga, ReStore: reusing results of MapReduce jobs, Proc. VLDB Endow. 5 (2012) 586–597.
[13] G. Wang, C.-Y. Chan, Multi-query optimization in MapReduce framework, Proc. VLDB Endow. 7 (2013) 145–156.
[14] J. Park, A. Segev, Using common subexpressions to optimize multiple queries, in: Proceedings of the Fourth International Conference on Data Engineering, 1988, pp. 311–319.
[15] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, M.J. Carey, Opportunistic physical design for big data analytics, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014, pp. 851–862.
[16] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, M.J. Carey, MISO: souping up big data query processing with a multistore system, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014, pp. 1591–1602.
[17] T. Dokeroglu, S. Ozal, M.A. Bayir, M.S. Cinar, A. Cosar, Improving the performance of Hadoop Hive by sharing scan and computation tasks, J. Cloud Comput. 3 (2014) 1–11.
[18] K. Kambatla, Y. Chen, The truth about MapReduce performance on SSDs, in: 28th Large Installation System Administration Conference (LISA14), 2014.
[19] C. Olston, B. Reed, A. Silberstein, U. Srivastava, Automatic optimization of parallel dataflow programs, in: USENIX 2008 Annual Technical Conference, 2008, pp. 267–273.
[20] R. Sahal, M.H. Khafagy, F.A. Omara, Comparative study of multi-query optimization techniques using shared predicate-based for big data, Int. J. Grid Distrib. Comput. 9 (2016) 229–240.
[21] J. Camacho-Rodríguez, D. Colazzo, M. Herschel, I. Manolescu, S.R. Chowdhury, Reuse-based optimization for Pig Latin, in: BDA'2014: 30e Journées Bases de Données Avancées, 2014.
[22] S. Sen, S. Roy, A. Sarkar, N. Chaki, N.C. Debnath, Dynamic discovery of query path on the lattice of cuboids using hierarchical data granularity and storage hierarchy, J. Comput. Sci. 5 (2014) 675–683.
[23] A. Gruenheid, E. Omiecinski, L. Mark, Query optimization using column statistics in Hive, in: Proceedings of the 15th Symposium on International Database Engineering & Applications, 2011, pp. 97–105.
[24] Y. Ioannidis, The history of histograms (abridged), in: Proceedings of the 29th International Conference on Very Large Data Bases, 2003, pp. 19–30.
[25] V. Poosala, P.J. Haas, Y.E. Ioannidis, E.J. Shekita, Improved histograms for selectivity estimation of range predicates, in: ACM SIGMOD Record, 1996, pp. 294–305.
[26] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. 2 (2009) 1626–1629.
[27] S. Wu, F. Li, S. Mehrotra, B.C. Ooi, Query optimization for massively parallel data processing, in: Proceedings of the 2nd ACM Symposium on Cloud Computing, 2011, p. 12.
[28] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., 2012.
[29] Cost-based optimization in Hive, Apache Hive, Apache Software Foundation, 2015. Available: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive (accessed 21-10-2015).
[30] W. Guoping, Optimization Techniques for Complex Multi-query Applications, 2014.
[31] Transaction Processing Performance Council, TPC-H benchmark specification, 2008. Available: http://www.tcp.org/hspec.htm.
[32] F.A. Omara, S.M. Khattab, R. Sahal, Optimum resource allocation of database in cloud computing, Egypt. Inf. J. 15 (2014) 1–12.
[33] Design, Apache Hive, Apache Software Foundation, 2015. Available: https://cwiki.apache.org/confluence/display/Hive/Design (accessed 20-10-2015).

Prof. Fatma A. Omara is a Professor in the Computer Science Department, Faculty of Computers and Information, Cairo University. She has published over 120 research papers in prestigious international journals and conference proceedings. She has served as Chairman and member of Steering Committees and Program Committees of several National and International Conferences, and has supervised over 55 Ph.D. and M.Sc. students. Prof. Omara is a member of the IEEE and the IEEE Computer Society. Her interests are Parallel and Distributed Computing, Distributed Operating Systems, High Performance Computing, Grid, Cloud Computing, and Big Data.
Dr. Mohamed H. Khafagy is an assistant professor in the Department of Computer Science, Faculty of Computers and Information, Fayoum University. He obtained his PhD in computer science from the Department of Computer Science, Helwan University in 2009. He is Head of the Egyptian Big Data Research Group and was Head of the E-Learning Centre of Fayoum University from 2012 to 2015. His research interests are Big Data, Cloud Computing, Database Replication, Distributed Systems and Data Mining.

Radhya Sahal is a teacher assistant in the Computer Science Department, College of Computers and Engineering, Hodeidah University, Yemen. She holds a Bachelor's degree in Computer Science from the Faculty of Computer Science & Engineering, Hodeidah University, Yemen, and received her M.Sc. in computer science in 2013. Currently, she is a PhD student in the Faculty of Computers and Information, Cairo University, Egypt. Radhya has publications in the area of Cloud Computing and database management. Her interests are High Performance Computing, Cloud Computing, Database Management, Big Data, and Artificial Intelligence.