Towards Automatic Optimization of MapReduce Programs (Position Paper)


Towards Automatic Optimization of MapReduce Programs

(Position Paper)

Shivnath Babu

Duke University


Roadmap

• Call to action to improve automatic optimization techniques in MapReduce frameworks

• Challenges & promising directions

[Figure: software stack: Pig, Hive, JAQL, … on top of Hadoop, which runs on HDFS]

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job
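To make the map function / reduce function pairing concrete, here is a minimal word-count sketch in Python. The names and the local runner are illustrative (not part of the Hadoop API); the runner only mimics what the framework does: run maps, sort intermediate pairs by key, then run reduces per key.

```python
from itertools import groupby

def map_fn(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum the counts emitted for one word.
    yield (key, sum(values))

def run_local(lines):
    # Simulate the MapReduce runtime: map, shuffle/sort by key, reduce.
    intermediate = sorted(kv for line in lines for kv in map_fn(line))
    return [out
            for key, group in groupby(intermediate, key=lambda kv: kv[0])
            for out in reduce_fn(key, (v for _, v in group))]

# run_local(["a b a", "b c"]) -> [("a", 2), ("b", 2), ("c", 1)]
```

In a real job, `run_local` is replaced by the framework, which also decides how many map and reduce tasks to run and how to schedule them.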


[Figure: job execution over time: input splits are processed by map tasks in waves (Map Wave 1, Map Wave 2), followed by reduce waves (Reduce Wave 1, Reduce Wave 2)]

How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?

Job Configuration Parameters

• 190+ parameters in Hadoop

• Set manually or defaults are used

• Are defaults or rules-of-thumb good enough?
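As one small example of how these numbers come about, the map-task count is derived from the input split size. The sketch below is modeled on Hadoop's FileInputFormat logic (split size = block size clamped between the configured min and max, one map task per split); the function names and constants here are illustrative, and details such as split slop are ignored.

```python
def compute_split_size(block_size, min_split, max_split):
    # Roughly how Hadoop's FileInputFormat picks a split size:
    # clamp the HDFS block size between the configured min and max.
    return max(min_split, min(max_split, block_size))

def num_map_tasks(file_size, split_size):
    # One map task per input split (ignoring split-slop details).
    return -(-file_size // split_size)  # ceiling division

# A 1 GB file with the classic 64 MB block size -> 16 map tasks.
MB = 1024 * 1024
splits = num_map_tasks(1024 * MB, compute_split_size(64 * MB, 1, 2**63))
```

The reduce-task count, in contrast, comes straight from a configuration parameter (`mapred.reduce.tasks`), which is exactly the kind of knob the defaults-vs-tuning question applies to.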

Experiments

On EC2 and local clusters

[Charts: running times (in seconds and minutes) for several workloads under different configuration settings]

• Performance at default and rule-of-thumb settings can be poor

• Cross-parameter interactions are significant

Illustrative Result: 50GB Terasort (17-node cluster, 64+32 concurrent map+reduce slots)

mapred.reduce.tasks | io.sort.factor | io.sort.record.percent
10                  | 10             | 0.15
10                  | 500            | 0.15
28                  | 10             | 0.15   (based on popular rule-of-thumb)
300                 | 10             | 0.15
300                 | 500            | 0.15

(The running time for each setting was compared in a bar chart.)
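Interactions like the ones in this table are found by actually trying combinations. A hypothetical brute-force sweep over the same three parameters might look like this, where `run_job` stands in for launching the Terasort job with a candidate setting and measuring its running time (the cost function below is invented purely to exhibit a cross-parameter interaction):

```python
from itertools import product

def best_setting(run_job):
    # run_job(settings) -> running time in seconds; a stand-in for
    # actually launching the job with these parameters.
    grid = {
        "mapred.reduce.tasks": [10, 28, 300],
        "io.sort.factor": [10, 500],
        "io.sort.record.percent": [0.15],
    }
    names = list(grid)
    candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
    return min(candidates, key=run_job)

def fake_cost(s):
    # Made-up cost surface: io.sort.factor only helps when there
    # are many reduce tasks, so neither knob can be tuned alone.
    return (s["mapred.reduce.tasks"] / s["io.sort.factor"]
            + 100 / s["mapred.reduce.tasks"])
```

Exhaustive sweeps are exactly what does not scale to 190+ parameters, which motivates smarter search.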

Problem Space

Current approaches:

• Predominantly manual

• Post-mortem analysis

[Figure: the space of execution choices grows with complexity: job configuration parameters, declarative HiveQL/Pig operations, multi-job workflows; objectives span performance, cost in pay-as-you-go environments, and energy considerations]

Is this where we want to be?

Good plan = good setting of parameters

Can DB Query Optimization Technology Help?

But:

– MapReduce jobs are not declarative

– No schema about the data

– Impact of concurrent jobs & scheduling?

– Space of parameters is huge

Optimizer:

• Enumerate

• Cost

• Search

[Diagram: Query → Database Execution Engine → Results; analogously, MapReduce job → Hadoop → Results]
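The enumerate/cost/search steps above can be sketched generically. Everything here is illustrative: `enumerate_plans` might emit candidate configurations or physical plans, and `cost` stands in for a cost model (which, per the caveats below, is hard to build without declarative jobs or schemas):

```python
def optimize(job, enumerate_plans, cost):
    # Classic optimizer loop: enumerate candidate plans, cost each
    # one, and search (here: exhaustively) for the cheapest.
    best_plan, best_cost = None, float("inf")
    for plan in enumerate_plans(job):
        c = cost(plan)
        if c < best_cost:
            best_plan, best_cost = plan, c
    return best_plan

# Toy usage: plans are reducer-count settings, cost is a toy model.
plan = optimize("sort-job",
                lambda job: [{"reducers": r} for r in (1, 10, 100)],
                lambda p: abs(p["reducers"] - 10))
# -> {"reducers": 10}
```

In a database, all three steps lean on declarative queries and data statistics, which is precisely what MapReduce jobs lack.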

Can we:

– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years

• Or innovate!

– Exploit design & usage properties of MapReduce frameworks

Spectrum of Query Optimizers

Conventional Optimizers: rule-based → cost models + statistics about data

Cost models + statistics about data

AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks

Insight: Predictability(RBO) >> Predictability(CBO)
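One way to read the predictability claim: a rule-based optimizer applies the same deterministic rewrites to a given job every time, independent of (possibly stale or wrong) data statistics. A toy sketch, with the rules and the job representation invented for illustration:

```python
def rule_based_optimize(job, rules):
    # Apply each rewrite rule in a fixed order; the output depends
    # only on the job itself, never on data statistics.
    for rule in rules:
        job = rule(job)
    return job

# Toy rules over a job represented as a list of operator names.
drop_noop = lambda ops: [op for op in ops if op != "noop"]
combine_early = lambda ops: (["map", "combine"] + ops[1:]
                             if ops and ops[0] == "map" else ops)

optimized = rule_based_optimize(["map", "noop", "reduce"],
                                [drop_noop, combine_early])
# -> ["map", "combine", "reduce"]
```

A cost-based optimizer, by contrast, can flip between plans when its statistics change, which makes its behavior harder to predict and debug.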


Learning Optimizers (learn from execution & adapt)

Tuning Optimizers (proactively try different plans)


Exploit usage & design properties of MapReduce frameworks:

• High ratio of repeated jobs to new jobs

• Schema can be learned (e.g., Pig scripts)

• Common sort-partition-merge skeleton

• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)

• Fine-grained and pluggable scheduler
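The "high ratio of repeated jobs" property suggests a simple learning strategy: remember the best-known configuration per job signature and reuse it on the next run. A minimal sketch (class and method names invented for illustration):

```python
class LearningOptimizer:
    # Caches the best configuration observed for each repeated job,
    # exploiting the high ratio of repeated to new jobs.
    def __init__(self, default_conf):
        self.default_conf = default_conf
        self.best = {}  # job signature -> (running time, conf)

    def choose_conf(self, signature):
        # Reuse the fastest known conf; fall back to the default
        # for jobs we have never seen.
        entry = self.best.get(signature)
        return entry[1] if entry else self.default_conf

    def record_run(self, signature, conf, running_time):
        # Learn from every execution: keep the fastest conf seen.
        entry = self.best.get(signature)
        if entry is None or running_time < entry[0]:
            self.best[signature] = (running_time, conf)

opt = LearningOptimizer({"mapred.reduce.tasks": 28})
opt.record_run("daily-sort", {"mapred.reduce.tasks": 300}, 120.0)
opt.record_run("daily-sort", {"mapred.reduce.tasks": 28}, 400.0)
# choose_conf("daily-sort") now returns the 300-reducer setting
```

A tuning optimizer would go one step further and deliberately vary `conf` across runs to explore the space, rather than only recording what happened.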

TuningOptimizers(proactively

try different plans)

Summary

• Call to action to improve automatic optimization techniques in MapReduce frameworks

– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.

– Rich history to learn from

– MapReduce execution creates unique opportunities/challenges
