Towards Automatic Optimization of MapReduce Programs (Position Paper)
Shivnath Babu, Duke University
Dec 31, 2015
Roadmap
• Call to action to improve automatic optimization techniques in MapReduce frameworks
• Challenges & promising directions
[Figure: the Hadoop stack — Pig, Hive, JAQL, … on top of Hadoop and HDFS]
Lifecycle of a MapReduce Job
• Write a map function and a reduce function, then run the program as a MapReduce job
• Over time, the job's input splits are processed by waves of map tasks (Map Wave 1, Map Wave 2, …), whose output feeds waves of reduce tasks (Reduce Wave 1, Reduce Wave 2, …)
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
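Hadoop itself is implemented in Java, but the map/shuffle/reduce lifecycle above can be sketched in a few lines of pure Python. This is a toy word count, not Hadoop code; names like `run_mapreduce` and `num_reduce_tasks` are illustrative only:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_reduce_tasks):
    # Map phase: each input split is processed by one map task.
    partitions = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for split in splits:
        for key, value in map_fn(split):
            # Shuffle: hash-partition map output across reduce tasks.
            partitions[hash(key) % num_reduce_tasks][key].append(value)
    # Reduce phase: each reduce task consumes one partition.
    results = {}
    for part in partitions:
        for key, values in part.items():
            results[key] = reduce_fn(key, values)
    return results

# Toy word count: map emits (word, 1); reduce sums the counts.
def word_map(split):
    for word in split.split():
        yield word, 1

def word_reduce(key, values):
    return sum(values)

counts = run_mapreduce(["a b a", "b c"], word_map, word_reduce, num_reduce_tasks=2)
# counts == {"a": 2, "b": 2, "c": 1}
```

Note that `num_reduce_tasks` is an explicit input here: in real Hadoop the corresponding choices are exactly the configuration parameters the next slide discusses.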
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
• Are defaults or rules-of-thumb good enough?
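These parameters are key-value settings where user-supplied values shadow shipped defaults. A minimal Python sketch of that layering (the parameter names and default values are real Hadoop 1.x ones; the dictionaries themselves are illustrative, standing in for mapred-default.xml and -D overrides):

```python
# A few of the 190+ Hadoop (1.x) parameters, with their shipped defaults.
defaults = {
    "mapred.reduce.tasks": 1,
    "io.sort.factor": 10,
    "io.sort.mb": 100,
    "io.sort.record.percent": 0.05,
}

# Settings a user might supply, e.g. via -D flags or mapred-site.xml.
user_overrides = {
    "mapred.reduce.tasks": 28,
    "io.sort.factor": 500,
}

# Effective configuration: user-supplied values shadow the defaults;
# everything left unset silently keeps its default.
effective = {**defaults, **user_overrides}
```

The silent fallback to defaults is what makes the question above bite: most of the 190+ knobs are never touched.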
Experiments on EC2 and local clusters
[Figures: running time (in seconds and in minutes) under different parameter settings]
• Performance at default and rule-of-thumb settings can be poor
• Cross-parameter interactions are significant
Illustrative Result: 50GB Terasort on a 17-node cluster (64 concurrent map slots + 32 concurrent reduce slots)
Parameter settings compared (running times shown in the figure):
  mapred.reduce.tasks   io.sort.factor   io.sort.record.percent
  10                    10               0.15
  10                    500              0.15
  28                    10               0.15   <- based on a popular rule-of-thumb
  300                   10               0.15
  300                   500              0.15
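The rule-of-thumb setting mentioned here pegs mapred.reduce.tasks to a fixed fraction of the cluster's concurrent reduce slots. The 0.9 factor below is an assumption (quoted factors vary by source), chosen because with 32 reduce slots it reproduces the 28 in the table:

```python
reduce_slots = 32              # concurrent reduce slots on the 17-node cluster
rule_of_thumb_fraction = 0.9   # assumed factor; commonly quoted values vary

reduce_tasks = int(rule_of_thumb_fraction * reduce_slots)
# reduce_tasks == 28
```

Such rules fix one parameter in isolation, which is exactly why the cross-parameter interactions noted above can leave them far from the best setting.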
Problem Space
Current approaches:
• Predominantly manual
• Post-mortem analysis
Dimensions of the problem space:
• Job configuration parameters
• Declarative HiveQL/Pig operations
• Multi-job workflows
• Performance objectives
• Cost in pay-as-you-go environments
• Energy considerations
[Figure: complexity grows with the space of execution choices]
Is this where we want to be?
Goal: a good plan, i.e., a good setting of parameters
Can DB Query Optimization Technology Help?
In a database system, a query passes through an optimizer (which enumerates plans, costs them, and searches for the best) before the execution engine runs it and returns results. In Hadoop, a MapReduce job goes straight to execution: job in, results out, with no optimizer in between.
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
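Transplanted to job configuration, the optimizer's enumerate/cost/search loop might look like this toy Python sketch. The cost model here is entirely made up for illustration; building a trustworthy one is hard precisely because, as noted above, MapReduce jobs lack the declarative semantics and data statistics a DB optimizer relies on:

```python
from itertools import product

# Enumerate: a small grid over two (real) Hadoop parameter names.
space = {
    "mapred.reduce.tasks": [10, 28, 300],
    "io.sort.factor": [10, 100, 500],
}

def cost(config):
    # Hypothetical cost model (seconds); the constants are invented.
    reducers = config["mapred.reduce.tasks"]
    sort_factor = config["io.sort.factor"]
    shuffle = 5000 / reducers + 2 * reducers       # parallelism vs. per-task overhead
    spill = 300 / sort_factor + 0.1 * sort_factor  # fewer merge passes vs. memory cost
    return shuffle + spill

# Search: exhaustive over the (tiny) grid, keeping the cheapest plan.
names = list(space)
best = min(
    (dict(zip(names, values)) for values in product(*space.values())),
    key=cost,
)
# best == {"mapred.reduce.tasks": 28, "io.sort.factor": 100}
```

Even this toy makes the scaling problem visible: with 190+ parameters, exhaustive enumeration is hopeless, so the search step itself needs real innovation.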
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years
• Or innovate!
– Exploit design & usage properties of MapReduce frameworks
Spectrum of Query Optimizers
• Conventional optimizers: rule-based, or driven by cost models + statistics about the data
• Learning optimizers: learn from execution & adapt
• Tuning optimizers: proactively try different plans
AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks
Insight: Predictability(RBO) >> Predictability(CBO)
Exploit usage & design properties of MapReduce frameworks:
• High ratio of repeated jobs to new jobs
• Schema can be learned (e.g., from Pig scripts)
• Common sort-partition-merge skeleton
• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
• Fine-grained and pluggable scheduler
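The high ratio of repeated jobs is what makes learning and tuning optimizers attractive: record the runtime of each configuration actually executed, and reuse the best-seen one when the same job recurs. A minimal Python sketch (the `LearningOptimizer` class, job signatures, and runtimes are all illustrative, not an existing system):

```python
class LearningOptimizer:
    """Remembers observed runtimes per (job, config) and reuses the best."""

    def __init__(self):
        self.history = {}  # job_signature -> {config: observed runtime in seconds}

    def record(self, job_signature, config, runtime_secs):
        # Learn from execution: store what actually happened on the cluster.
        self.history.setdefault(job_signature, {})[config] = runtime_secs

    def suggest(self, job_signature, default_config):
        # Repeated job: return the best configuration observed so far.
        # New job: fall back to the default configuration.
        runs = self.history.get(job_signature)
        if not runs:
            return default_config
        return min(runs, key=runs.get)

opt = LearningOptimizer()
# Two observed runs of the same (hypothetical) recurring terasort job.
opt.record("terasort-50GB", ("mapred.reduce.tasks=28",), 1200)
opt.record("terasort-50GB", ("mapred.reduce.tasks=300",), 2100)

choice = opt.suggest("terasort-50GB", default_config=("mapred.reduce.tasks=1",))
# choice == ("mapred.reduce.tasks=28",)
```

A tuning optimizer would extend `suggest` to deliberately try untested configurations on some runs, trading a little runtime now for better settings later.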
Summary
• Call to action to improve automatic optimization techniques in MapReduce frameworks
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– Rich history to learn from
– MapReduce execution creates unique opportunities/challenges