Parallelizing Multiple Group-by Queries using MapReduce: optimization and cost estimation
Jie Pan§ · Frédéric Magoulès§ · Yann Le Biannic† · Christophe Favart†
§ Ecole Centrale Paris · † SAP Research
Telecommunication Systems 2013
B99705024 林劭軒 · B99705021 李奕德 · R00725051 郗昀彥
Outline
• MapReduce and Optimized MapReduce
• Cost Estimation
• Experiments and Evaluation
MapReduce
[Figure, built up over several slides: MapReduce data flow between the Master Node and the Worker Nodes. Data is divided into fragments Di, each handled by a Map on a worker; each Map produces an intermediate result Ii; a Reducer merges the intermediate results into the Result.]
serialize :: structured objects → byte stream
de-serialize :: byte stream → structured objects
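The data flow above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the record layout (`region`, `year`, `sales`), the SUM aggregate, and the function names `map_fragment` and `reduce_results` are all assumptions, and `pickle` stands in for whatever serialization the real system uses.

```python
import pickle
from collections import defaultdict

def map_fragment(serialized_fragment, predicate, key, value):
    """Worker side: de-serialize a fragment, filter it, emit serialized (key, value) pairs."""
    records = pickle.loads(serialized_fragment)
    selected = [(r[key], r[value]) for r in records if predicate(r)]
    return pickle.dumps(selected)

def reduce_results(serialized_results):
    """Reducer side: de-serialize every intermediate result and SUM per group."""
    totals = defaultdict(int)
    for blob in serialized_results:
        for group, amount in pickle.loads(blob):
            totals[group] += amount
    return dict(totals)

# Hypothetical data set, split into two fragments (the Di of the figure).
data = [
    {"region": "north", "year": 2012, "sales": 10},
    {"region": "south", "year": 2012, "sales": 7},
    {"region": "north", "year": 2013, "sales": 5},
    {"region": "south", "year": 2012, "sales": 3},
]
fragments = [pickle.dumps(data[i:i + 2]) for i in range(0, len(data), 2)]

# Each Map returns a serialized intermediate result (the Ii of the figure).
intermediate = [
    map_fragment(f, lambda r: r["year"] == 2012, "region", "sales")
    for f in fragments
]
result = reduce_results(intermediate)
print(result)
```

Every selected record travels back to the reducer as its own pair, which is why the volume of intermediate data matters in the slides that follow.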
Motivation
• Data Analysis (Business Intelligence)
• Task with Predicates
• High Selectivity => High Communication Cost
• Selectivity = #Data Satisfying Predicates / #Data
• Goal: Reduce the Volume of Intermediate Data
[Figure: fragments Di and intermediate results Ii exchanged between the Master Node and the Worker Nodes.]
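As a small worked example, the selectivity of a predicate can be computed as below. The sketch assumes selectivity means the fraction of records that satisfy the predicate, which is the reading consistent with "High Selectivity => High Communication Cost"; the function name and sample records are hypothetical.

```python
def selectivity(records, predicate):
    """Fraction of records satisfying the predicate (assumed definition)."""
    matching = sum(1 for r in records if predicate(r))
    return matching / len(records)

# Hypothetical records: 8 rows, 2 of which satisfy the predicate.
rows = [{"year": 2012}] * 2 + [{"year": 2013}] * 6
s = selectivity(rows, lambda r: r["year"] == 2012)
print(s)
```

Under this reading, a selectivity close to 1 means nearly every record survives the filter and has to be shipped as intermediate data.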
MapCombineReduce (1/2)
[Figure: the MapReduce data flow (Data, fragments Di, Map, intermediate results Ii, Master Node, Worker Nodes) with an additional "signal" label.]
MapCombineReduce (2/2)
[Figure, built up over several slides: MapCombineReduce data flow between the Master Node and the Worker Nodes. Maps process the fragments Di; Combiners take the intermediate results Ii and output Ai; a Reducer merges the Ai into the Result.]
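A minimal sketch of the combine step, assuming a SUM aggregate and one combiner per worker node; the record fields and the function names are hypothetical and not taken from the paper. It counts how many (group, value) pairs would reach the reducer with and without combiners.

```python
from collections import defaultdict

def map_fragment(records, predicate):
    """Map: filter one fragment and emit (group, value) pairs."""
    return [(r["region"], r["sales"]) for r in records if predicate(r)]

def combine(map_outputs):
    """Combiner: pre-aggregate the outputs of several mappers on one node."""
    partial = defaultdict(int)
    for pairs in map_outputs:
        for group, amount in pairs:
            partial[group] += amount
    return dict(partial)

def reduce_partials(partials):
    """Reducer: merge the combiners' partial aggregates."""
    totals = defaultdict(int)
    for partial in partials:
        for group, amount in partial.items():
            totals[group] += amount
    return dict(totals)

# Two hypothetical worker nodes, each holding two fragments.
node_fragments = [
    [[{"region": "north", "sales": 10}, {"region": "north", "sales": 4}],
     [{"region": "south", "sales": 7}, {"region": "north", "sales": 1}]],
    [[{"region": "south", "sales": 3}, {"region": "south", "sales": 2}],
     [{"region": "north", "sales": 5}, {"region": "south", "sales": 6}]],
]

def keep_all(record):
    return True  # worst case: every record satisfies the predicate

map_outputs = [[map_fragment(f, keep_all) for f in frags] for frags in node_fragments]
partials = [combine(outs) for outs in map_outputs]

pairs_without_combiner = sum(len(o) for outs in map_outputs for o in outs)
pairs_with_combiner = sum(len(p) for p in partials)
result = reduce_partials(partials)
print(pairs_without_combiner, pairs_with_combiner, result)
```

In this toy input the combiners halve the number of pairs sent to the reducer; the saving grows with the number of records per group on each node.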
Cost Estimation
Notations – general
Cost
min ∑ Cst + Cw + Ccl + Ccmm
Initial Build (1/4)
Creating a mapping
Serialize Data
For all mappers
Network Factor
Mapper’s Data Transfer Cost
Result Transfer Cost
Initial Build (2/4)
De-serialize Data
Serialize Result
Fragment
Load to Memory
Filter Cost
Initial Build (3/4)
De-serialize All Results
Selected Data
Aggregation Cost
Initial Build (4/4)
• sizem = 0
• Cmpg * nbm is constant
Optimized Build (1/6)
Nodes to be Combined
Size of Combiner’s Object
Does Not Change
Optimized Build (2/6)
Does Not Change
Does Not Serialize Result
Optimized Build (3/6)
Serialize Intermediate Result
Optimized Build (4/6)
De-serialize Intermediate Result
Optimized Build (5/6)
• Network Factor * (Start to Map + Worker to Combiner + Reduce Phase)
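The communication term on this slide can be written as a one-line function. This is a sketch under assumptions: the three transfer volumes and the network factor are taken as plain numbers in consistent units, and the parameter names and sample values are illustrative rather than the paper's notation or measurements.

```python
def communication_cost(network_factor, start_to_map, worker_to_combiner, reduce_phase):
    """Network factor times the total volume transferred across the three stages."""
    return network_factor * (start_to_map + worker_to_combiner + reduce_phase)

# Hypothetical volumes in bytes and a hypothetical per-byte network factor.
cost = communication_cost(
    network_factor=0.5,
    start_to_map=100,
    worker_to_combiner=40,
    reduce_phase=20,
)
print(cost)
```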