
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

May 10, 2015


Xiao Qin

An increasing number of popular applications have become data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.

The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Data locality is not taken into account when launching speculative map tasks, because most map tasks are assumed to be data-local. Unfortunately, neither the homogeneity nor the data-locality assumption holds in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored on each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across the nodes of a heterogeneous Hadoop cluster before a data-intensive application runs.
Transcript
Page 1: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

1

Xiao Qin

Department of Computer Science and Software Engineering

Auburn University
http://www.eng.auburn.edu/~xqin

[email protected]

Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)

Page 2: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

2

Motivation

• Large-Scale Data Processing
  – Want to use 1000s of CPUs
  – But don’t want hassle of managing things

• MapReduce provides
  – Automatic parallelization & distribution
  – Fault tolerance
  – I/O scheduling
  – Monitoring & status updates

Page 3: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

3

Map/Reduce

• Programming model from Lisp (and other functional languages)
• Many problems can be phrased this way
• Easy to distribute across nodes
• Nice retry/failure semantics

Page 4: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

4

Distributed Grep

(Diagram: very big data → split data (four splits) → grep on each split → matches → cat → all matches)

Page 5: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

5

Distributed Word Count

(Diagram: very big data → split data (four splits) → count on each split → merge → merged count)

Page 6: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

6

Map Reduce

• Map:
  – Accepts input key/value pair
  – Emits intermediate key/value pair

• Reduce:
  – Accepts intermediate key/value* pair
  – Emits output key/value pair

(Diagram: very big data → MAP → partitioning function → REDUCE → result)

Page 7: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

7

Map in Lisp (Scheme)

• (map f list [list2 list3 …])

• (map square ‘(1 2 3 4))
  – (1 4 9 16)

• (reduce + ‘(1 4 9 16))
  – (+ 16 (+ 9 (+ 4 1)))
  – 30

• (reduce + (map square (map - l1 l2)))

square is a unary operator; + is a binary operator.
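The same calls can be sketched in Python with the built-in map and functools.reduce (a rough equivalent of the Scheme expressions above; the vectors l1 and l2 are example values chosen here for illustration):

```python
from functools import reduce
from operator import add, sub

# (map square '(1 2 3 4)) -> (1 4 9 16)
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]

# (reduce + '(1 4 9 16)) -> (+ 16 (+ 9 (+ 4 1))) -> 30
print(reduce(add, squares))  # 30

# (reduce + (map square (map - l1 l2))): squared distance between two vectors
l1, l2 = [4, 6, 8], [1, 2, 3]
print(reduce(add, map(lambda x: x * x, map(sub, l1, l2))))  # 9 + 16 + 25 = 50
```

Note that Python's map, like Scheme's, accepts several lists when the function takes several arguments, which is what makes the (map - l1 l2) form work.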

Page 8: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

8

Map/Reduce a la Google

• map(key, val) is run on each item in the set
  – emits new-key / new-val pairs

• reduce(key, vals) is run for each unique key emitted by map()
  – emits final output

Page 9: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

9

Count words in docs

– Input consists of (url, contents) pairs

– map(key=url, val=contents):
  • For each word w in contents, emit (w, “1”)

– reduce(key=word, values=uniq_counts):
  • Sum all “1”s in values list
  • Emit result “(word, sum)”
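A minimal in-memory sketch of this word-count job (the harness names map_fn, reduce_fn, and run_mapreduce are made up for illustration; this is not the Hadoop API):

```python
from collections import defaultdict

def map_fn(url, contents):
    # For each word w in contents, emit (w, 1)
    for w in contents.split():
        yield (w, 1)

def reduce_fn(word, values):
    # Sum all counts emitted for this word
    return (word, sum(values))

def run_mapreduce(inputs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for url, contents in inputs:
        for k, v in map_fn(url, contents):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
print(run_mapreduce(docs))
# {'see': 2, 'bob': 1, 'throw': 1, 'spot': 1, 'run': 1}
```

The defaultdict grouping stands in for the shuffle phase that a real MapReduce runtime performs between the map and reduce stages.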

Page 10: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

10

Count, Illustrated

map(key=url, val=contents):
  For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
  Sum all “1”s in values list
  Emit result “(word, sum)”

Input documents: “see bob throw”, “see spot run”
Intermediate pairs: (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1)
Final output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)

Page 11: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

11

Grep

– Input consists of (url+offset, single line)

– map(key=url+offset, val=line):
  • If the line matches the regexp, emit (line, “1”)

– reduce(key=line, values=uniq_counts):
  • Don’t do anything; just emit line
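Sketched the same way (grep_map and grep_reduce are hypothetical helper names; the shuffle is skipped because the reduce is an identity):

```python
import re

def grep_map(key, line, pattern=r"spot"):
    # If the line matches the regexp, emit (line, "1")
    if re.search(pattern, line):
        yield (line, 1)

def grep_reduce(line, values):
    # Identity reduce: just emit the matching line
    return line

lines = ["see bob throw", "see spot run"]
matches = [grep_reduce(k, [v])
           for off, line in enumerate(lines)
           for k, v in grep_map(off, line)]
print(matches)  # ['see spot run']
```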

Page 12: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

12

Reverse Web-Link Graph

• Map
  – For each URL linking to target, …
  – Output <target, source> pairs

• Reduce
  – Concatenate list of all source URLs
  – Output <target, list(source)> pairs
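A sketch of the reversal (map_links and reduce_links are hypothetical names; the sources are sorted only to make the output deterministic):

```python
from collections import defaultdict

def map_links(source, targets):
    # For each URL linking to a target, output (target, source)
    for target in targets:
        yield (target, source)

def reduce_links(target, sources):
    # Concatenate the list of all source URLs (sorted for determinism)
    return (target, sorted(sources))

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
grouped = defaultdict(list)
for src, tgts in pages.items():
    for tgt, s in map_links(src, tgts):
        grouped[tgt].append(s)
result = dict(reduce_links(t, s) for t, s in grouped.items())
print(result)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```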

Page 13: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

13

Model is Widely Applicable
MapReduce Programs in Google Source Tree

Example uses:
• distributed grep · distributed sort · web link-graph reversal
• term-vector per host · web access log stats · inverted index construction
• document clustering · machine learning · statistical machine translation
• …

Page 14: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

14

Implementation Overview

Typical cluster:
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data (SOSP ’03)
• Job scheduling system: jobs made up of tasks; scheduler assigns tasks to machines

Implementation is a C++ library linked into user programs

Page 15: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

15

Execution

Execution

• How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition the space of output map keys, and run reduce() in parallel

• If map() or reduce() fails, reexecute!

Page 16: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

16

Job Processing

(Diagram: a JobTracker dispatches a “grep” job to TaskTracker 0 through TaskTracker 5.)

1. Client submits “grep” job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to tasktrackers
3. After map(), tasktrackers exchange map-output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
5. reduce() output may go to NDFS

Page 17: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

17

Execution

Page 18: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

18

Parallel Execution

Page 19: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

19

Task Granularity & Pipelining

• Fine-granularity tasks: map tasks >> machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing

• Often use 200,000 map & 5,000 reduce tasks, running on 2,000 machines

Page 20: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

20

MapReduce outside Google

• Hadoop (Java)
  – Emulates MapReduce and GFS
• The architecture of Hadoop MapReduce and DFS is master/slave:

            Master       Slave
MapReduce   jobtracker   tasktracker
DFS         namenode     datanode

Page 21: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters

Download Software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc

This HDFS-HC tool was described in our paper - Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.

Page 22: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

22

Hadoop Overview


(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2004)

Page 23: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

23

One-time setup

• Set hadoop-site.xml and slaves
• Initiate the namenode
• Run Hadoop MapReduce and DFS
• Upload your data to DFS
• Run your process…
• Download your data from DFS

Page 24: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

24

Hadoop Distributed File System


(http://lucene.apache.org/hadoop)

Page 25: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

25

Motivational Example

(Figure: per-node timelines in minutes. Node A (fast) processes 1 task/min; Node B (slow) is 2x slower; Node C (slowest) is 3x slower.)

Page 26: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

26

The Native Strategy

(Figure: timelines in minutes for Node A, Node B, and Node C under the native data placement, with phases Loading, Transferring, and Processing; task counts of 3, 2, and 6 appear in the figure.)

Page 27: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

27

Our Solution: Reducing Data Transfer Time

(Figure: timelines in minutes for Node A’, Node B’, and Node C’ after rebalancing, with phases Loading, Transferring, and Processing; placing data in proportion to node speed shrinks the transfer time.)

Page 28: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

28

Preliminary Results

Impact of data placement on performance of grep

Page 29: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

29

Challenges

• Does the computing ratio depend on the application?
• Initial data distribution
• Data skew problem
  – New data arrival
  – Data deletion
  – New node joining
  – Data updating

Page 30: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

30

Measure Computing Ratios

• Computing ratio
• Fast machines process large data sets

(Figure: timelines for Node A, Node B, and Node C. Node A runs 1 task/min; Node B is 2x slower; Node C is 3x slower.)

Page 31: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

31

Steps to Measure Computing Ratios

Node     Response time (s)   Ratio   # of File Fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest

1. Run the application on each node with the same-size data; collect the response time of each node individually
2. Set the ratio of the shortest response time to 1, and set the ratios of the other nodes accordingly
3. Calculate the least common multiple of these ratios
4. Count the portion of each node
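The four steps can be coded directly from the table (a sketch assuming the ratios come out as integers, as in this example; fragment_portions is a made-up name, and math.lcm needs Python 3.9+):

```python
import math

def fragment_portions(response_times):
    # Step 2: ratio of each node relative to the fastest node
    fastest = min(response_times.values())
    ratios = {n: t / fastest for n, t in response_times.items()}
    # Step 3: least common multiple of the (integer) ratios
    lcm = math.lcm(*(int(r) for r in ratios.values()))
    # Step 4: each node's portion of file fragments
    return {n: lcm // int(r) for n, r in ratios.items()}

times = {"A": 10, "B": 20, "C": 30}  # seconds, from the table (step 1)
print(fragment_portions(times))  # {'A': 6, 'B': 3, 'C': 2}
```

Non-integer ratios such as the 3.3 measured later for Grep would need rounding before the LCM step; the slides do not specify how that case is handled.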

Page 32: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

32

Initial Data Distribution

• Input files split into 64MB blocks
• Round-robin data distribution algorithm

(Figure: the namenode places the blocks of File1, blocks 1 through 9 and a, b, c, on datanodes A, B, and C in portion 3:2:1.)
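The round-robin pass over the blocks can be sketched as follows (round_robin_place is a hypothetical helper showing a plain unweighted rotation; the 3:2:1 portion shown in the figure comes from the ratio-based placement, not from this loop alone):

```python
from itertools import cycle

def round_robin_place(blocks, nodes):
    # Assign each 64MB block to the next node in a fixed rotation
    placement = {n: [] for n in nodes}
    rotation = cycle(nodes)
    for block in blocks:
        placement[next(rotation)].append(block)
    return placement

blocks = list("123456789abc")  # the 12 blocks of File1
placement = round_robin_place(blocks, ["A", "B", "C"])
print(placement)
# {'A': ['1', '4', '7', 'a'], 'B': ['2', '5', '8', 'b'], 'C': ['3', '6', '9', 'c']}
```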

Page 33: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

33

Data Redistribution

1. Get the network topology, the ratio, and the utilization
2. Build and sort two lists: an under-utilized node list L1 and an over-utilized node list L2
3. Select the source and destination nodes from the lists
4. Transfer data
5. Repeat steps 3 and 4 until both lists are empty
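Steps 2 through 5 can be sketched as a two-list draining loop (redistribute is a hypothetical name; the real module moves HDFS blocks over the network rather than updating an in-memory dict):

```python
def redistribute(utilization, target):
    # Step 2: build and sort the over- and under-utilized node lists
    over = sorted((n for n in utilization if utilization[n] > target[n]),
                  key=lambda n: target[n] - utilization[n])
    under = sorted((n for n in utilization if utilization[n] < target[n]),
                   key=lambda n: target[n] - utilization[n], reverse=True)
    moves = []
    # Steps 3-5: pair a source with a destination, transfer, repeat until empty
    while over and under:
        src, dst = over[0], under[0]
        amount = min(utilization[src] - target[src],
                     target[dst] - utilization[dst])
        utilization[src] -= amount
        utilization[dst] += amount
        moves.append((src, dst, amount))
        if utilization[src] == target[src]:
            over.pop(0)
        if utilization[dst] == target[dst]:
            under.pop(0)
    return moves

# Nodes A:B:C should hold data in portion 3:2:1, as in the figure
cur = {"A": 4, "B": 4, "C": 4}   # blocks currently on each node
tgt = {"A": 6, "B": 4, "C": 2}   # blocks each node should hold
print(redistribute(cur, tgt))    # [('C', 'A', 2)]
```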

(Figure: the namenode moves blocks among datanodes A, B, and C, draining the over-utilized nodes in list L2 into the under-utilized nodes in list L1 until the stored portions reach 3:2:1.)

Page 34: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

34

Sharing Files among Multiple Applications

• The computing ratio depends on the data-intensive application
  – Redistribution
  – Redundancy

Page 35: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

35

Experimental Environment

Five nodes in a heterogeneous Hadoop cluster:

Node     CPU Model          CPU (Hz)    L1 Cache (KB)
Node A   Intel Core 2 Duo   2×1G = 2G   204
Node B   Intel Celeron      2.8G        256
Node C   Intel Pentium 3    1.2G        256
Node D   Intel Pentium 3    1.2G        256
Node E   Intel Pentium 3    1.2G        256

Page 36: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

36

Grep and WordCount

• Grep is a tool that searches for a regular expression in a text file

• WordCount is a program that counts the words in a text file

Page 37: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

37

Computing Ratios for Two Applications

Computing ratios of the five nodes with respect to the Grep and WordCount applications:

Computing Node   Ratio for Grep   Ratio for WordCount
Node A           1                1
Node B           2                2
Node C           3.3              5
Node D           3.3              5
Node E           3.3              5

Page 38: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

38

Response Time of Grep and WordCount on Each Node

(Figure: the response times depend on the application but are independent of the data size.)

Page 39: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

39

Six Data Placement Decisions


Page 40: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

40

Impact of data placement on performance of Grep


Page 41: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

41

Impact of data placement on performance of WordCount


Page 42: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

42

Conclusion

• Identified the performance degradation caused by heterogeneity

• Designed and implemented a data placement mechanism in HDFS

Page 43: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

43

Future Work

• Data redundancy issue

• Dynamic data distribution mechanism

• Prefetching


Page 44: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

44

Fellowship Program
Samuel Ginn College of Engineering at Auburn University

• Dean's Fellowship: $32,000 per year plus tuition fellowship
• College Fellowship: $24,000 per year plus tuition fellowship
• Departmental Fellowship: $20,000 per year plus tuition fellowship
• Tuition Fellowships: provide a full tuition waiver for a student with a 25 percent or greater full-time-equivalent (FTE) assignment. Both graduate research assistants (GRAs) and graduate teaching assistants (GTAs) are eligible.

Page 45: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

45

http://www.eng.auburn.edu/programs/grad-school/fellowship-program/

Page 46: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

46

http://www.eng.auburn.edu

Page 47: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

47

http://www.eng.auburn.edu

Page 48: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

48

http://www.eng.auburn.edu

Page 49: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

Download the presentation slides: http://www.slideshare.net/xqin74

Google: slideshare Xiao Qin

Page 50: HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

50

Questions