Towards Automatic Optimization of MapReduce Programs (Position Paper)
Shivnath Babu, Duke University
Dec 31, 2015
Roadmap
• Call to action to improve automatic optimization techniques in MapReduce frameworks
• Challenges & promising directions
[Figure: the Hadoop stack — Pig, Hive, JAQL, … on top of Hadoop and HDFS]
Lifecycle of a MapReduce Job
• Write a map function and a reduce function, then run the program as a MapReduce job
• Over time, the job's input splits are processed by waves of map tasks (Map Wave 1, Map Wave 2, …), whose output feeds waves of reduce tasks (Reduce Wave 1, Reduce Wave 2, …)
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
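Hadoop itself is implemented in Java, but the map/shuffle/reduce lifecycle above can be sketched in a few lines of pure Python. This is a toy word count, not Hadoop code; names like `run_mapreduce` and `num_reduce_tasks` are illustrative only:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_reduce_tasks):
    # Map phase: each input split is processed by one map task.
    partitions = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for split in splits:
        for key, value in map_fn(split):
            # Shuffle: hash-partition map output across reduce tasks.
            partitions[hash(key) % num_reduce_tasks][key].append(value)
    # Reduce phase: each reduce task consumes one partition.
    results = {}
    for part in partitions:
        for key, values in part.items():
            results[key] = reduce_fn(key, values)
    return results

# Toy word count: map emits (word, 1); reduce sums the counts.
def word_map(split):
    for word in split.split():
        yield word, 1

def word_reduce(key, values):
    return sum(values)

counts = run_mapreduce(["a b a", "b c"], word_map, word_reduce, num_reduce_tasks=2)
# counts == {"a": 2, "b": 2, "c": 1}
```

Note that `num_reduce_tasks` is an explicit input here: in real Hadoop the corresponding choices are exactly the configuration parameters the next slide discusses.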
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
• Are defaults or rules-of-thumb good enough?
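These parameters are key-value settings where user-supplied values shadow shipped defaults. A minimal Python sketch of that layering (the parameter names and default values are real Hadoop 1.x ones; the dictionaries themselves are illustrative, standing in for mapred-default.xml and -D overrides):

```python
# A few of the 190+ Hadoop (1.x) parameters, with their shipped defaults.
defaults = {
    "mapred.reduce.tasks": 1,
    "io.sort.factor": 10,
    "io.sort.mb": 100,
    "io.sort.record.percent": 0.05,
}

# Settings a user might supply, e.g. via -D flags or mapred-site.xml.
user_overrides = {
    "mapred.reduce.tasks": 28,
    "io.sort.factor": 500,
}

# Effective configuration: user-supplied values shadow the defaults;
# everything left unset silently keeps its default.
effective = {**defaults, **user_overrides}
```

The silent fallback to defaults is what makes the question above bite: most of the 190+ knobs are never touched.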
Experiments on EC2 and local clusters
[Figures: running time (in seconds and in minutes) under different parameter settings]
• Performance at default and rule-of-thumb settings can be poor
• Cross-parameter interactions are significant
Illustrative Result: 50GB Terasort on a 17-node cluster (64 concurrent map slots + 32 concurrent reduce slots)
Parameter settings compared (running times shown in the figure):
  mapred.reduce.tasks   io.sort.factor   io.sort.record.percent
  10                    10               0.15
  10                    500              0.15
  28                    10               0.15   <- based on a popular rule-of-thumb
  300                   10               0.15
  300                   500              0.15
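The rule-of-thumb setting mentioned here pegs mapred.reduce.tasks to a fixed fraction of the cluster's concurrent reduce slots. The 0.9 factor below is an assumption (quoted factors vary by source), chosen because with 32 reduce slots it reproduces the 28 in the table:

```python
reduce_slots = 32              # concurrent reduce slots on the 17-node cluster
rule_of_thumb_fraction = 0.9   # assumed factor; commonly quoted values vary

reduce_tasks = int(rule_of_thumb_fraction * reduce_slots)
# reduce_tasks == 28
```

Such rules fix one parameter in isolation, which is exactly why the cross-parameter interactions noted above can leave them far from the best setting.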
Problem Space
Current approaches:
• Predominantly manual
• Post-mortem analysis
Dimensions of the problem space:
• Job configuration parameters
• Declarative HiveQL/Pig operations
• Multi-job workflows
• Performance objectives
• Cost in pay-as-you-go environments
• Energy considerations
[Figure: complexity grows with the space of execution choices]
Is this where we want to be?
Goal: a good plan, i.e., a good setting of parameters
Can DB Query Optimization Technology Help?
In a database system, a query passes through an optimizer (which enumerates plans, costs them, and searches for the best) before the execution engine runs it and returns results. In Hadoop, a MapReduce job goes straight to execution: job in, results out, with no optimizer in between.
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
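Transplanted to job configuration, the optimizer's enumerate/cost/search loop might look like this toy Python sketch. The cost model here is entirely made up for illustration; building a trustworthy one is hard precisely because, as noted above, MapReduce jobs lack the declarative semantics and data statistics a DB optimizer relies on:

```python
from itertools import product

# Enumerate: a small grid over two (real) Hadoop parameter names.
space = {
    "mapred.reduce.tasks": [10, 28, 300],
    "io.sort.factor": [10, 100, 500],
}

def cost(config):
    # Hypothetical cost model (seconds); the constants are invented.
    reducers = config["mapred.reduce.tasks"]
    sort_factor = config["io.sort.factor"]
    shuffle = 5000 / reducers + 2 * reducers       # parallelism vs. per-task overhead
    spill = 300 / sort_factor + 0.1 * sort_factor  # fewer merge passes vs. memory cost
    return shuffle + spill

# Search: exhaustive over the (tiny) grid, keeping the cheapest plan.
names = list(space)
best = min(
    (dict(zip(names, values)) for values in product(*space.values())),
    key=cost,
)
# best == {"mapred.reduce.tasks": 28, "io.sort.factor": 100}
```

Even this toy makes the scaling problem visible: with 190+ parameters, exhaustive enumeration is hopeless, so the search step itself needs real innovation.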
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years
• Or innovate!
– Exploit design & usage properties of MapReduce frameworks
Spectrum of Query Optimizers
• Conventional optimizers: rule-based, or driven by cost models + statistics about the data
• Learning optimizers: learn from execution & adapt
• Tuning optimizers: proactively try different plans
AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks
Insight: Predictability(RBO) >> Predictability(CBO)
Exploit usage & design properties of MapReduce frameworks:
• High ratio of repeated jobs to new jobs
• Schema can be learned (e.g., from Pig scripts)
• Common sort-partition-merge skeleton
• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
• Fine-grained and pluggable scheduler
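The high ratio of repeated jobs is what makes learning and tuning optimizers attractive: record the runtime of each configuration actually executed, and reuse the best-seen one when the same job recurs. A minimal Python sketch (the `LearningOptimizer` class, job signatures, and runtimes are all illustrative, not an existing system):

```python
class LearningOptimizer:
    """Remembers observed runtimes per (job, config) and reuses the best."""

    def __init__(self):
        self.history = {}  # job_signature -> {config: observed runtime in seconds}

    def record(self, job_signature, config, runtime_secs):
        # Learn from execution: store what actually happened on the cluster.
        self.history.setdefault(job_signature, {})[config] = runtime_secs

    def suggest(self, job_signature, default_config):
        # Repeated job: return the best configuration observed so far.
        # New job: fall back to the default configuration.
        runs = self.history.get(job_signature)
        if not runs:
            return default_config
        return min(runs, key=runs.get)

opt = LearningOptimizer()
# Two observed runs of the same (hypothetical) recurring terasort job.
opt.record("terasort-50GB", ("mapred.reduce.tasks=28",), 1200)
opt.record("terasort-50GB", ("mapred.reduce.tasks=300",), 2100)

choice = opt.suggest("terasort-50GB", default_config=("mapred.reduce.tasks=1",))
# choice == ("mapred.reduce.tasks=28",)
```

A tuning optimizer would extend `suggest` to deliberately try untested configurations on some runs, trading a little runtime now for better settings later.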
Summary
• Call to action to improve automatic optimization techniques in MapReduce frameworks
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– Rich history to learn from
– MapReduce execution creates unique opportunities/challenges