University of Minnesota Running MapReduce in Non-Traditional Environments Abhishek Chandra Associate Professor Department of Computer Science and Engineering University of Minnesota http://www.cs.umn.edu/~chandra 1

Dec 29, 2015


Letitia Allen
Running MapReduce in Non-Traditional Environments

Abhishek Chandra
Associate Professor
Department of Computer Science and Engineering
University of Minnesota
http://www.cs.umn.edu/~chandra

Talk Outline

- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks

Big Data

- Data-rich enterprises and communities
  - Both user-facing services and batch data processing
  - Commercial, social, scientific
  - E.g.: Google, Facebook, Yahoo, LHC, ...
- Data analysis is key!
- Need massive scalability and parallelism
  - PBs of data, millions of files, 1000s of nodes, millions of users
- Need to do this cost-effectively and reliably
  - Use commodity hardware where failure is the norm
  - Share resources among multiple projects

Big Data and MapReduce

- Simple data-parallel programming model and framework
  - Designed for scalability and fault-tolerance
  - Can express several data analysis algorithms
- Widely used
  - Pioneered by Google: processes several petabytes of data per day
  - Popularized by the open-source Hadoop project: used at Yahoo!, Facebook, Amazon, …

MapReduce Design Goals

- Scalability
  - 1000s of machines, 10,000s of disks
  - TBs-PBs of data
- Cost-efficiency
  - Hardware: Commodity machines and network
  - Administration: Automatic fault-tolerance, easy setup
  - Programming: Easy to use and write applications

Image Source: http://www.ibm.com

MapReduce Applications (Industry)

- Google:
  - Index construction for Google Search
  - Article clustering for Google News
- Yahoo!:
  - “Web map” powering Yahoo! Search
  - Spam detection for Yahoo! Mail
- Facebook:
  - Ad optimization
  - Spam detection
- ...

MapReduce Applications (Research)

- Wide interest in academia/research:
  - High Energy Physics (Indiana)
  - Astronomical image analysis (Washington)
  - Bioinformatics (Maryland)
  - Analyzing Wikipedia conflicts (PARC)
  - Natural language processing (CMU)
  - Particle physics (Nebraska)
  - Ocean climate simulation (Washington)
  - ...

Talk Outline

- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks

MapReduce Computation

[Figure: Input Data -> Data Push -> Map -> Reduce -> Output Data]

MapReduce Programming Model

- Data: Sequence of key-value records
- Map function: converts input key-value pairs to intermediate key-value pairs
    (K_in, V_in) → list(K_inter, V_inter)
- Reduce function: converts intermediate key-value pairs to output key-value pairs
    (K_inter, list(V_inter)) → list(K_out, V_out)

Example: Word Count

    def mapper(file, text):
        # Emit (word, 1) for every word in the input text
        for word in text.split():
            yield (word, 1)

    def reducer(word, counts):
        # Sum all the counts emitted for this word
        yield (word, sum(counts))

Word Count Example

Stages: Input -> Map -> Shuffle & Sort -> Reduce -> Output

Input splits and map outputs:
  "the quick brown fox"   -> Map 1 -> (the, 1), (brown, 1), (fox, 1), (quick, 1)
  "the fox ate the mouse" -> Map 2 -> (the, 1), (fox, 1), (the, 1), (ate, 1), (mouse, 1)
  "how now brown cow"     -> Map 3 -> (how, 1), (now, 1), (brown, 1), (cow, 1)

Reduce outputs (after shuffle & sort):
  Reduce 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
  Reduce 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
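The word-count data flow can be reproduced in a single process. The sketch below is illustrative only: `run_mapreduce` and the split names are hypothetical helpers, not part of Hadoop or of the talk, and the shuffle is simulated with an in-memory dictionary rather than a network transfer.

```python
from collections import defaultdict

def mapper(_file, text):
    """Map: emit (word, 1) for each word in one input split."""
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce: sum all counts collected for one word."""
    yield (word, sum(counts))

def run_mapreduce(splits, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, shuffle & sort, reduce."""
    # Shuffle: group every intermediate value under its key.
    groups = defaultdict(list)
    for name, text in splits.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    # Sort the keys, then reduce each group.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            output[out_key] = out_value
    return output

# The three input splits from the slide (split names are made up).
splits = {
    "split1": "the quick brown fox",
    "split2": "the fox ate the mouse",
    "split3": "how now brown cow",
}
counts = run_mapreduce(splits, mapper, reducer)
print(counts)
```

A real framework runs the map and reduce calls on different machines; only the grouping-by-key contract shown here carries over.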

MapReduce Workflow

[Figure: Input Data -> Data Push -> Map -> Reduce -> Output Data]

MapReduce Stages

- Push: Input split into large chunks and placed on local disks of cluster nodes
- Map: Chunks are served to “mapper” tasks
  - Prefer mapper that has data locally
  - Mappers save outputs to local disk before serving them to reducers
- Reduce: “Reducers” execute reduce tasks when map phase is complete
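The "prefer mapper that has data locally" rule can be illustrated with a small greedy assignment. This is a simplified sketch, not Hadoop's actual scheduler: `assign_chunks`, the chunk names, and the node names are all hypothetical, and each node is assumed to run one map task per round.

```python
def assign_chunks(chunk_locations, free_nodes):
    """Greedily assign chunks to free nodes, preferring nodes that hold the chunk.

    chunk_locations maps a chunk id to the set of nodes storing a copy.
    Returns {chunk: (node, is_local)}.
    """
    free = list(free_nodes)
    assignment = {}
    for chunk, holders in chunk_locations.items():
        if not free:
            break  # No capacity left; remaining chunks wait for a later round.
        local = [node for node in free if node in holders]
        node = local[0] if local else free[0]
        assignment[chunk] = (node, node in holders)
        free.remove(node)
    return assignment

chunk_locations = {"c1": {"n1"}, "c2": {"n1", "n2"}, "c3": {"n4"}}
plan = assign_chunks(chunk_locations, ["n1", "n2", "n3"])
print(plan)
```

Here `c3` has no free node holding its data, so it falls back to a remote node and its input would have to cross the network.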

Partitioning/Shuffling

- Goal: Divide intermediate key space across reducers
  - k reduce tasks => k partitions (simple hash fn)
  - E.g.: k=3, keys: {1,…,6} => partitions: {1,2}, {3,4}, {5,6}
- Shuffle: Send intermediate key-values to the relevant reducers
  - All-to-all communication: since all mappers typically have all intermediate keys
- Combine: Local aggregation function for repeated keys produced by same map
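A minimal sketch of partitioning and combining follows. The function names are hypothetical. `range_partition` is one simple function that reproduces the slide's k=3 example; `hash_partition` is the more generic hash-modulo form, whose grouping will generally differ from that example.

```python
from collections import Counter

def range_partition(key, k, num_keys):
    """Map integer keys 1..num_keys to k contiguous partitions (0-based index)."""
    keys_per_partition = -(-num_keys // k)  # ceiling division
    return (key - 1) // keys_per_partition

def hash_partition(key, k):
    """Generic alternative: derive the partition index from the key's hash."""
    return hash(key) % k

def combine(pairs):
    """Combiner: locally sum the values of repeated keys from one map task."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Reproduce the slide's example: k=3 reducers, keys 1..6.
partitions = {}
for key in range(1, 7):
    partitions.setdefault(range_partition(key, 3, 6), []).append(key)
print(partitions)

# A mapper that saw "the" twice sends one (the, 2) pair instead of two pairs.
combined = combine([("the", 1), ("fox", 1), ("the", 1)])
print(combined)
```

Combining before the shuffle reduces the data each mapper sends to the reducers, which matters most when the links between them are slow.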

Fault Tolerance

- Task re-execution: Retry task(s) on another node
  - On task or node failure
  - OK for a map because it has no dependencies
  - OK for reduce because map outputs are on disk
- Speculative execution: Launch copy of task on another node
  - To handle stragglers (slow tasks)
  - Use result from first task to finish
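Both mechanisms can be sketched with threads standing in for cluster nodes. Everything here is a toy illustration: `NodeFailure`, `run_with_retry`, `run_speculatively`, and the node names are hypothetical, and a real framework would also kill the losing copy instead of letting it run to completion.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

class NodeFailure(Exception):
    """Raised by a simulated node that fails while running a task."""

def run_with_retry(task, nodes):
    """Task re-execution: try the task on each node until one succeeds."""
    for node in nodes:
        try:
            return node, task(node)
        except NodeFailure:
            continue  # Retry on the next node.
    raise RuntimeError("task failed on every node")

def run_speculatively(task, nodes):
    """Speculative execution: run copies in parallel, keep the first result."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(task, node) for node in nodes]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The pool still waits for the straggler on exit; its result is ignored.
        return next(iter(done)).result()

def flaky_task(node):
    if node == "n1":
        raise NodeFailure(node)
    return f"result from {node}"

def uneven_task(node):
    time.sleep(0.5 if node == "slow" else 0.0)  # "slow" simulates a straggler
    return f"result from {node}"

retried = run_with_retry(flaky_task, ["n1", "n2"])
speculative = run_speculatively(uneven_task, ["slow", "fast"])
print(retried, speculative)
```

Re-running a task is safe in this sketch because the tasks have no side effects, which mirrors the slide's point that map tasks have no dependencies.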

Hadoop

- Open-source Apache project
  - Software framework for distributed data processing
  - Primary project: MapReduce implementation
  - Other projects on top of MapReduce
- Implemented in Java
- Primary data analysis platform at Yahoo!
  - 40,000+ machines running Hadoop

Hadoop: Primary Components

- HDFS: Distributed File System
  - Combines cluster’s local storage into a single namespace
  - All data is replicated to multiple machines
  - Provides locality information to clients
- MapReduce: Batch computation framework
  - Tasks re-executed on failure
  - Optimizes for data locality of input

Talk Outline

- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks

Traditional MapReduce Environments

- Assumptions:
  - Tightly-coupled clusters
  - Dedicated compute nodes
  - Data is centrally available/pre-placed

Image Source: http://www.ibm.com

But… Data may be distributed

- Data originates in a geographically distributed manner
  - Scientific instruments, sensors. E.g.: oceanic, atmospheric data
  - Public/social data. E.g.: User blogs, traffic data
  - Commercial data. E.g.: Warehouse, ecommerce data
  - Monitoring data. E.g.: CDN user access logs
  - Mobile data. E.g.: phone pics, sensors
- May want to combine multiple data sources
  - E.g.: CDC+Google Maps

Computation may be distributed

- Distributed data centers/clouds
  - E.g.: Amazon EC2 regions, Akamai CDN servers
- Computational Grids
  - E.g.: FutureGrid
- Volunteer computing platforms
  - E.g.: BOINC

Highly-Distributed Environments

Question: How to execute MapReduce in such non-traditional environments?

Research Overview

- Step 1: Understanding tradeoffs
  - Compare different deployment architectures for MapReduce execution
- Step 2: Optimizing MapReduce execution
  - Data placement/task scheduling based on system and application characteristics

Step 1: Understanding Tradeoffs

Goal: Understand what deployment architectures would work best

[Figure: MapReduce workflow with two Input Data sources: Input Data -> Data Push -> Map -> Reduce -> Output Data]

Architecture 1: Local MapReduce

[Figure labels: Data Source (US), Data Source (EU), Data Center (US), Data Center (EU), one MapReduce Job, Final Result, Data Push (Fast), Data Push (Slow)]

Architecture 2: Global MapReduce

[Figure labels: Data Source (US), Data Source (EU), Data Center (US), Data Center (EU), one MapReduce Job, Final Result, two Data Push (Fast) links, two Data Push (Slow) links]

Architecture 3: Distributed MapReduce

[Figure labels: Data Source (US), Data Source (EU), Data Center (US), Data Center (EU), two MapReduce Jobs, Combine Results, Final Result, two Data Push (Fast) links]

Experimental Results: PlanetLab

[Figure: Two bar charts, WordCount (Random) and WordCount (Text). X-axis: Push US, Push EU, Map, Reduce, Result Combine, Total. Y-axis: Time in seconds. Series: Local MR, Global MR, Distributed MR. Annotations: "Result Combine cost dominant", "Data Push cost dominant".]

Performance depends on network, application characteristics

PlanetLab: 4/1 US, 4/1 EU compute/data nodes, Hadoop 0.20.1

Experimental Results: Amazon EC2

[Figure: Two bar charts, WordCount (Random) and WordCount (Text). X-axis: Push US, Push EU, Map, Reduce, Result Combine, Total. Y-axis: Time in seconds. Series: Local MR, Distributed MR.]

Amazon EC2: 6 US, 3 EU small instances, 1 data node each

Performance depends on network, application characteristics

Lessons Learnt

- Make MapReduce topology-aware
  - Data placement and task scheduling should consider network locality
- Application-specific data aggregation critical
  - High aggregation => Avoid initial data push cost
  - Low aggregation => Avoid shuffle cost
- Make globally optimal decisions
  - “Good” local decisions can adversely impact E2E performance

Step 2: Optimizing MapReduce Execution

- Framework for modeling MapReduce execution
- Optimizer to determine an optimal execution plan (data placement and task scheduling)
  - Topology-aware: Uses information about network and node characteristics
  - Application-aware: Uses data aggregation characteristics
  - Global optimization: Performs end-to-end, multi-phase optimization
- Implemented in Hadoop 1.0.1

MapReduce Execution Model

- Parameters
  - D_i: Size of data supplied at each data source
  - B_ij: Link bandwidth from node i to node j
  - C_i: Mapper/Reducer compute rates
  - α: Ratio of size of intermediate data to input data
- Execution Plan
  - Each source: where to push data
  - All mappers: where to shuffle data

MapReduce Execution Model: Constraints

- x_ij: fraction of node i’s data pushed/shuffled to node j
- Each data source (mapper) must push (shuffle) all of its data
- One-reducer-per-key: y_k denotes fraction reduced at reducer k
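The slide's equations did not survive in this transcript. One plausible way to write the stated constraints is sketched below. This is an assumption drawn only from the bullet text, not the author's exact formulation, and the last line additionally assumes that every mapper holds a similar mix of intermediate keys, so that each mapper sends the same fraction to a given reducer.

```latex
\begin{align}
  0 \le x_{ij} \le 1, \quad & \sum_{j} x_{ij} = 1
      && \text{for every data source (mapper) } i \\
  0 \le y_{k} \le 1, \quad & \sum_{k} y_{k} = 1 \\
  & x_{jk} = y_{k}
      && \text{for every mapper } j \text{ and reducer } k \text{ (shuffle phase)}
\end{align}
```

The first line says that all of a node's data is sent somewhere, and the second that the reducers together cover the whole intermediate key space.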

MapReduce Execution Optimization

- Objective: Minimize makespan subject to model constraints
- Use model parameters to compute execution time:
  - Push/shuffle time based on link bandwidths, size of data communicated over each link
  - Map/reduce time based on compute rates, size of data computed at each node
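As a rough illustration of how the model parameters could yield an execution time, the sketch below estimates the makespan of a given plan. It rests on assumptions not stated in the slides: the four phases are separated by global barriers, each link and node works independently, and a phase lasts as long as its slowest link or node. The function and argument names are hypothetical; `B_push` and `B_shuffle` simply split the slide's B_ij by phase.

```python
def makespan(D, B_push, C_map, B_shuffle, C_reduce, alpha, x, y):
    """Estimate the makespan of a plan, assuming barriers between phases.

    D[i]: input size at source i; B_push[i][j]: bandwidth source i -> mapper j;
    C_map[j], C_reduce[k]: compute rates; B_shuffle[j][k]: bandwidth mapper j ->
    reducer k; alpha: intermediate/input size ratio; x[i][j]: fraction of
    source i's data pushed to mapper j; y[k]: fraction reduced at reducer k.
    """
    sources, mappers, reducers = range(len(D)), range(len(C_map)), range(len(y))
    # Push: the slowest source-to-mapper transfer.
    push = max(D[i] * x[i][j] / B_push[i][j] for i in sources for j in mappers)
    # Map: the slowest mapper over the data it received.
    map_in = [sum(D[i] * x[i][j] for i in sources) for j in mappers]
    map_time = max(map_in[j] / C_map[j] for j in mappers)
    # Shuffle: each mapper sends fraction y[k] of its output to reducer k.
    shuffle = max(alpha * map_in[j] * y[k] / B_shuffle[j][k]
                  for j in mappers for k in reducers)
    # Reduce: the slowest reducer over its share of the intermediate data.
    reduce_in = [alpha * sum(map_in) * y[k] for k in reducers]
    reduce_time = max(reduce_in[k] / C_reduce[k] for k in reducers)
    return push + map_time + shuffle + reduce_time

# Two sites, each with one source, one mapper, and one reducer (made-up numbers).
D = [100, 100]               # MB of input per source
B = [[100, 10], [10, 100]]   # MB/s: fast local links, slow cross-site links
C = [50, 50]                 # MB/s compute rate per node
y = [0.5, 0.5]
uniform_plan = [[0.5, 0.5], [0.5, 0.5]]
local_plan = [[1.0, 0.0], [0.0, 1.0]]
t_uniform = makespan(D, B, C, B, C, 0.1, uniform_plan, y)
t_local = makespan(D, B, C, B, C, 0.1, local_plan, y)
print(t_uniform, t_local)
```

With these numbers and α=0.1, the plan that keeps each source's data local finishes sooner than the uniform plan, because it avoids the slow cross-site push. An optimizer would search over `x` and `y` instead of comparing two hand-picked plans.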

Benefit of Optimization

PlanetLab measurements: 4 US, 2 Europe, 2 Asia nodes; 1 data source each

[Figure: Two stacked bar charts, α=0.1 (Data Aggregation) and α=10 (Data Expansion). X-axis: Optimization Algorithm (Uniform, Myopic, Optimized). Y-axis: Makespan (s). Stacked phases: Push, Map, Shuffle, Reduce.]

Model-driven optimization achieves minimum makespan under different scenarios

Comparison to Hadoop

[Figure: Three stacked bar charts, Word Count, Sessionization, and Full Inverted Index. X-axis: Execution Plan (Uniform, Hadoop, Optimized). Y-axis: Makespan (s). Stacked phases: Push, Map, Reduce.]

Emulated PlanetLab, Hadoop 1.0.1 (Modified for model-based execution plans)

Concluding Remarks

- MapReduce: Large-scale distributed data processing
  - Scalable: large no. of machines and data
  - Cheap: lower hardware, programming, admin costs
  - Well-suited for several data analysis applications
- Rich area for research
  - Resource management, algorithms, programming models
  - Our focus: Optimization in highly-distributed environments
- Acknowledgments:
  - Students, esp. Ben Heintz
  - Jon Weissman (UMN), Ramesh Sitaraman (UMASS)

Thank You!

http://www.cs.umn.edu/~chandra