Presented by Shen Li

pbg.cs.illinois.edu/.../lectures/18-MapReduce-Shen.pdf

Aug 03, 2020


Paper List

• [1] MapReduce: Simplified Data Processing on Large Clusters, J. Dean and S. Ghemawat, OSDI'04.

• [2] MapReduce Online, T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, NSDI'10.

• [3] Energy Efficiency of Map Reduce, Y. Chen and Tracy Xiaoxiao Wang, Tech. Rep. UC Berkeley, 2008.

• [4] Statistical Workloads for Energy Efficient MapReduce, Y. Chen, A. Ganapathi, A. Fox, R. Katz, and D. Patterson, Tech. Rep. UC Berkeley, 2010.

• [5] On the Energy (In)efficiency of Hadoop Clusters, J. Leverich and C. Kozyrakis, HotPower'09.

• [6] GreenHDFS: Towards An Energy-Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster, R. Kaushik and M. Bhandarkar, HotPower'10.

• [7] Energy Management for MapReduce Clusters, W. Lang and J. Patel, VLDB'10.


Outline

• MapReduce Overview

• MapReduce Acceleration

• Energy issues of MapReduce


Map and Reduce in Functional Programming

• Map: applies a given function element-wise to a list of elements and returns a list of results.

• A Simple Example:

– L = (1, 2, 3, 4, 5);

– f : Multiply an element by two;

– Map(f, L) returns (2, 4, 6, 8, 10).
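The example above can be tried directly with Python's built-in `map`. This is only an illustrative sketch; the function name `double` is chosen here and does not come from the slides.

```python
def double(x):
    """The function f from the example: multiply an element by two."""
    return x * 2

L = [1, 2, 3, 4, 5]
# map applies f to each element independently, so the calls could run in parallel.
mapped = list(map(double, L))
print(mapped)
```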


• Reduce: takes a combining function and a list of elements of some data structure, then combines the elements using the function in some systematic way.

• A Simple Example:

– L = (1, 2, 3, 4, 5);

– f : add two elements;

– Reduce(f, L) applies f to L recursively and returns 15.
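The same example can be written with `functools.reduce` from the Python standard library. The sketch assumes a left-to-right combination order, which is what `reduce` uses; the name `add` is chosen here for illustration.

```python
from functools import reduce

def add(a, b):
    """The function f from the example: add two elements."""
    return a + b

L = [1, 2, 3, 4, 5]
# Evaluated as ((((1 + 2) + 3) + 4) + 5).
total = reduce(add, L)
print(total)
```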


Motivation: Large Scale Data Processing

• Want to process lots of data

• Want to parallelize across hundreds/thousands of CPUs

• Want to make it easy


Basic Idea

• Split input file into pieces. (Generate a list)

• Instead of applying a function to the original data, apply it to each piece (or split). (Map a function to a list)

• Collect all the output from each piece to calculate the final result. (Reduce a list based on a function)

• Users can customize map and reduce functions.
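The steps above can be sketched in a single process with a word count. This is a minimal illustration, not the implementation from [1]: the helper names `split_input`, `map_piece`, and `reduce_counts` are hypothetical, and the pieces are processed sequentially where a real system would distribute them.

```python
from collections import Counter
from functools import reduce

def split_input(text, num_pieces):
    """Split the input into roughly equal pieces of whole lines."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // num_pieces))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_piece(piece):
    """User-defined map: count the words inside one piece."""
    return Counter(word for line in piece for word in line.split())

def reduce_counts(left, right):
    """User-defined reduce: merge two partial counts."""
    return left + right

text = "a b a\nb c\na c c\nb"
pieces = split_input(text, num_pieces=2)
partial = [map_piece(p) for p in pieces]  # each call is independent of the others
result = reduce(reduce_counts, partial, Counter())
print(dict(result))
```

Swapping in different `map_piece` and `reduce_counts` functions changes the computation without touching the splitting or collection logic.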


Implementation

• Master node:

– Split input data into pieces;

– Assign tasks to map workers;

– Inform reduce workers where the outputs of map phase are.

• Worker node:

– Perform map or reduce tasks;

– Mappers run map function on splits;

– Reducers run reduce function on partitions.


• Job

– Defined on input file;

– To finish one job, the master node invokes multiple map tasks and reduce tasks.

• Task

– Defined on a split;

– Assigned by master node to worker node.
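The relationship between a job and its tasks can be sketched with two small data classes. Everything here is an assumption made for illustration: the class and field names are hypothetical, and the round-robin assignment is a simplification rather than the scheduling policy described in [1].

```python
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class Task:
    kind: str          # "map" or "reduce"
    split_id: int      # which split (or partition) the task works on
    worker: str = ""   # filled in when the master assigns the task

@dataclass
class Job:
    input_file: str
    num_splits: int
    num_reducers: int
    tasks: list = field(default_factory=list)

def plan_job(job, workers):
    """Master side: one map task per split, one reduce task per partition."""
    job.tasks = [Task("map", i) for i in range(job.num_splits)]
    job.tasks += [Task("reduce", r) for r in range(job.num_reducers)]
    # Round-robin assignment is a simplification made for this sketch.
    for task, worker in zip(job.tasks, cycle(workers)):
        task.worker = worker
    return job

job = plan_job(Job("input.txt", num_splits=4, num_reducers=2), ["w1", "w2", "w3"])
for t in job.tasks:
    print(t.kind, t.split_id, t.worker)
```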


Cited from [1]


Acceleration

• Problem: Too much data to pass from map workers to reduce workers

– Solution: Map nodes apply combiner functions to their local output.

Cited from [2]
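A combiner can be illustrated with word count: the map node sums its own (word, 1) pairs before anything is sent to a reducer. This is a single-process sketch with hypothetical function names; it only shows how local aggregation shrinks the intermediate data.

```python
from collections import Counter

def map_words(split):
    """Map: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: sum the counts per key on the map node before sending."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())

split = "to be or not to be"
raw = map_words(split)
local = combine(raw)
print(len(raw), "pairs without combiner;", len(local), "pairs with combiner")
```

The approach works here because addition is associative and commutative, so summing early on the map side does not change the final counts.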


• Problem: No reduce can start until map is complete.

– Solution: The master redundantly executes “slow-moving” map tasks and uses the result of the first copy to finish.

– Solution: Intermediate data is pipelined between mappers and reducers, so reducers begin processing data as soon as mappers produce it.
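The pipelining idea can be illustrated with Python generators: the mapper hands over each record as soon as it is produced, and the reducer updates a running result per record. This is a single-process sketch with hypothetical names, not the system from [2]; it works for word count because a sum can be updated incrementally, which is not true of every reduce function.

```python
from collections import Counter

def mapper(lines, log):
    """Map task that hands each record downstream as soon as it is produced."""
    for line in lines:
        for word in line.split():
            log.append(("map", word))
            yield word, 1

def reducer(records, log):
    """Reduce task that updates its running result per incoming record."""
    counts = Counter()
    for word, count in records:
        log.append(("reduce", word))
        counts[word] += count
    return counts

log = []
counts = reducer(mapper(["a b", "a"], log), log)
print(counts)
print(log[:4])  # map and reduce steps interleave instead of running in two phases
```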


Evaluation

Cited from [2]


Knobs

• Cluster Configuration [3][4]

• Machine On/Off [5][6][7]

• DVFS

• Temperature

• Workload [4][6]


Cluster Configuration

• General cost metric [3]

– Ratio of all costs to all benefits

– Minimize the metric


• Their model [3]

E: Energy; t: Delay; M: Number of machines; W: Workload; R: Replication


Arguments

• vs

– Prioritizes latency too heavily.

– A system could achieve better performance just by decreasing the workload.


• vs

– Prioritizes the number of machines too heavily.

– A system could achieve better performance just by using fewer machines.


Cluster Configuration[3]


Result

Map Compute


Cluster Configuration[4]

• 1. Launch jobs as they arrive vs. queue up jobs and launch them in batches;

• 2. For batched execution, launch all jobs in the queue at the same time vs. in a staggered fashion;

• 3. Use the standard HDFS block size of 64 MB vs. larger block sizes;

• 4. Assign the default 4 task trackers per node vs. more task trackers per node.


Result


Machine On/Off

• Main Challenge

– DFS: Files are stored in a distributed file system, and each machine holds a subset of the whole data. Turning off a set of machines can make some data unavailable.

– SLA: The cluster should still satisfy its Service Level Agreement after some machines are turned off.


• Covering Set (CS) [5] “On the Energy (In)efficiency of Hadoop Clusters”

• Hot Zone/Cold Zone (HC) [6] “GreenHDFS: Towards An Energy-Conserving, Storage-Efficient”

• All-In Strategy (AIS) [7] “Energy Management for MapReduce Clusters”


Covering Set[5]

• At least one replica of each data block must be stored in a subset of nodes referred to as the covering set.

• Large numbers of nodes outside the covering set can then be gracefully turned off without affecting the availability of data.

“On the Energy (In)efficiency of Hadoop Clusters”
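The covering-set condition can be checked with a few lines of set arithmetic. The block placement, node names, and the function `is_covering_set` below are hypothetical and only illustrate the invariant; they are not taken from [5].

```python
def is_covering_set(block_replicas, candidate_nodes):
    """True if every block keeps at least one replica on a candidate node."""
    return all(replicas & candidate_nodes for replicas in block_replicas.values())

# Hypothetical placement: block id -> set of nodes holding a replica.
block_replicas = {
    "b1": {"n1", "n2", "n3"},
    "b2": {"n1", "n4", "n5"},
    "b3": {"n2", "n4", "n6"},
}
all_nodes = {"n1", "n2", "n3", "n4", "n5", "n6"}
covering = {"n1", "n4"}

# Nodes outside a valid covering set can be powered off without losing data.
can_power_off = sorted(all_nodes - covering) if is_covering_set(block_replicas, covering) else []
print("nodes that can be turned off:", can_power_off)
```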


Hot Zone/Cold Zone[6]

• Classify data by its “temperature”

– File age, defined as the time since the last access to the file, is used as the measure of the file's temperature.

– Hot files are stored on high-performance hot zone servers.

– Cold files are stored on cold zone servers with large storage space.

“GreenHDFS: Towards An Energy-Conserving, Storage-Efficient”
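The temperature rule can be sketched as a simple threshold on the time since last access. The 30-day threshold, the file paths, and the function `zone_for` are assumptions made for this illustration and are not values from [6].

```python
from datetime import datetime, timedelta

COLD_AFTER = timedelta(days=30)  # hypothetical threshold, not taken from [6]

def zone_for(last_access, now, cold_after=COLD_AFTER):
    """Return 'cold' when the file has not been accessed for cold_after, else 'hot'."""
    return "cold" if now - last_access >= cold_after else "hot"

now = datetime(2010, 10, 1)
# Hypothetical files with their last access times.
last_access = {
    "/logs/2010-09-30.log": datetime(2010, 9, 30),
    "/logs/2010-06-01.log": datetime(2010, 6, 1),
}
placement = {path: zone_for(ts, now) for path, ts in last_access.items()}
print(placement)
```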


• 26% energy savings within a three-month trace simulation.


All-In Strategy[7]

• In cases where there is a consistent low-utilization period, AIS batches the MR jobs in a queue.

• It periodically powers up the entire system, runs the entire batch of jobs as fast as possible, and powers off again.

“Energy Management for MapReduce Clusters”
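The queue-then-burst behaviour can be sketched with a toy model in which the whole cluster is either off or on. The class `AllInCluster` and its methods are hypothetical; job execution is reduced to moving a job from the queue to a completed list, and the periodic trigger is left out.

```python
from collections import deque

class AllInCluster:
    """Toy model: the whole cluster is either off (queueing jobs) or on (running them)."""

    def __init__(self):
        self.powered_on = False
        self.queue = deque()
        self.completed = []

    def submit(self, job):
        # During the low-utilization period jobs only wait in the queue.
        self.queue.append(job)

    def run_batch(self):
        """Power up every node, drain the queue, then power everything off again."""
        self.powered_on = True
        while self.queue:
            self.completed.append(self.queue.popleft())  # stands in for running the job
        self.powered_on = False

cluster = AllInCluster()
for job in ["job-a", "job-b", "job-c"]:
    cluster.submit(job)
cluster.run_batch()  # would be triggered periodically in a real deployment
print(cluster.completed, cluster.powered_on)
```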


Workload Distribution

• Explicitly

– Modify task assignment algorithm.

• Implicitly

– Change frequency;

– Modify data storage strategy.


Hot Zone/Cold Zone[6]

• HC stores “hot” data on running and powerful machines.

• MapReduce will try to run a task locally first.

• Most tasks will run on hot servers.
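The effect of locality-first scheduling can be sketched as follows: a task goes to an idle node that already stores its input block when one exists. The function `assign_task` and the node names are hypothetical; a real scheduler also weighs rack locality and load, which this sketch ignores.

```python
def assign_task(block_locations, idle_nodes):
    """Prefer an idle node that already stores the input block, else any idle node."""
    local = [n for n in idle_nodes if n in block_locations]
    if local:
        return local[0], "local"
    return idle_nodes[0], "remote"

# Hypothetical layout: hot data lives on the hot zone servers h1 and h2.
block_locations = {"h1", "h2"}
first = assign_task(block_locations, ["c1", "h2", "h1"])  # a hot server is idle
second = assign_task(block_locations, ["c1"])             # no local node is idle
print(first, second)
```

Because hot data sits on hot zone servers, the local branch is taken for most tasks, which is how the storage policy shifts the workload implicitly.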


Workload Generation

• Motivation

– Companies' competitive concerns

– Better evaluation


Trace Analysis


Trace Statistics to Synthetic Workloads

• Simple non-parametric statistics are the average and standard deviation.

– Problem: Distributions are irregular, skewed, and asymmetric. Hence, averages and standard deviations are insufficient.

• Solution: use percentiles.

– The authors choose percentiles based on a Gaussian model;

– Five-number summary: 1st, 25th, 50th, 75th, and 99th percentiles;

– Seven-number summary, and so on.
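A five-number summary at the percentiles listed above can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sample data and the helper `five_number_summary` are hypothetical, and the `inclusive` interpolation method is a choice made for this sketch rather than the procedure from [4].

```python
import statistics

def five_number_summary(values):
    """Return the 1st, 25th, 50th, 75th and 99th percentiles of values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return [cuts[p - 1] for p in (1, 25, 50, 75, 99)]

# Hypothetical skewed sample, for example job input sizes in MB.
sizes = [1, 1, 2, 2, 3, 3, 4, 5, 8, 500]
print("mean:", statistics.mean(sizes))
print("median:", statistics.median(sizes))
print("five-number summary:", five_number_summary(sizes))
```

In this sample a single large value pulls the mean far above the median, which is the kind of skew that makes percentiles a better description of the distribution.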


Thank you!