HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
22/5/26
Xiao Qin
Department of Computer Science and Software Engineering, Auburn University
http://www.eng.auburn.edu/~xqin
[email protected]
Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, and Sanjay Ghemawat (Google, Inc.)
An increasing number of popular applications are data-intensive in nature. Over the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge data sets on large clusters. We first outline the motivations for the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, neither the homogeneity nor the data-locality assumption is satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored on each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across the nodes of a heterogeneous Hadoop cluster before a data-intensive application is run.
Transcript
1
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
23/4/11 1
Xiao Qin
Department of Computer Science and Software Engineering
Model is Widely Applicable: MapReduce Programs in Google Source Tree
14
Typical cluster:
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data (SOSP '03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
Implementation Overview
15
Execution
• How is this distributed?
1. Partition input key/value pairs into chunks, run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted values for each unique emitted key
3. Now partition the space of output map keys, and run reduce() in parallel
• If a map() or reduce() task fails, re-execute it!
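The three steps above can be sketched as a single-process Python toy. The function names (map_fn, reduce_fn, mapreduce) are illustrative, not part of any Hadoop API; a real cluster runs the map and reduce phases across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_, line):
    # Emit (word, 1) for every word -- the classic word-count mapper.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all values consolidated under one key.
    yield word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    # 1. Run map() over every input record (in parallel on a real cluster).
    mapped = chain.from_iterable(map_fn(k, v) for k, v in records)
    # 2. Consolidate all emitted values for each unique emitted key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # 3. Run reduce() over each key group (also parallel in practice).
    return dict(chain.from_iterable(
        reduce_fn(k, vs) for k, vs in sorted(groups.items())))

result = mapreduce(enumerate(["the cat", "the dog"]), map_fn, reduce_fn)
print(result)  # {'cat': 1, 'dog': 1, 'the': 2}
```

Failure handling then reduces to re-running map_fn or reduce_fn on the affected chunk, since both are deterministic functions of their inputs.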
16
Job Processing
JobTracker
TaskTracker 0, TaskTracker 1, TaskTracker 2, TaskTracker 3, TaskTracker 4, TaskTracker 5
1. Client submits a "grep" job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to tasktrackers
3. After map(), tasktrackers exchange map output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
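To make step 4 concrete: splitting the reduce() keyspace into m chunks is typically done by hashing, so every tasktracker routes a given key to the same reducer. Hadoop's default HashPartitioner follows the same hash(key) mod m idea; the hash function below is purely illustrative.

```python
# Hedged sketch of partitioning a reduce keyspace into m chunks.
def partition(key: str, m: int) -> int:
    # Stable polynomial hash so the assignment is reproducible
    # across processes (Python's built-in hash() is salted per run).
    h = sum(ord(c) * 31 ** i for i, c in enumerate(key))
    return h % m

keys = ["apple", "grep", "hadoop", "reduce"]
buckets = {k: partition(k, 6) for k in keys}
```

Because the assignment depends only on the key, no coordination is needed: each mapper independently computes which of the m reducers owns each key.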
This HDFS-HC tool was described in our paper "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters" by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137-150, 2004)
23
One-time setup
• Set hadoop-site.xml and slaves
• Initialize the namenode
• Run Hadoop MapReduce and DFS
• Upload your data to DFS
• Run your process…
• Download your data from DFS
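Assuming the Hadoop 0.x-era layout this deck describes (hadoop-site.xml was later split into core-site.xml and friends), the steps above roughly correspond to commands like the following; exact script and file names vary by version, and my-job.jar/MyJob are placeholders.

```shell
# One-time setup sketch for a Hadoop 0.x cluster (illustrative paths).
vi conf/hadoop-site.xml conf/slaves      # configure the cluster and worker list
bin/hadoop namenode -format              # initialize the namenode
bin/start-all.sh                         # start the MapReduce and DFS daemons
bin/hadoop dfs -put input/ input         # upload your data to DFS
bin/hadoop jar my-job.jar MyJob input output   # run your process
bin/hadoop dfs -get output/ results/     # download your results from DFS
```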
24
Hadoop Distributed File System
(http://lucene.apache.org/hadoop)
25
Motivational Example
[Figure: Node A (fast) processes 1 task/min, Node B (slow) is 2x slower, and Node C (slowest) is 3x slower; x-axis shows time in minutes]
26
The Native Strategy
[Figure: timeline in minutes of the loading, transferring, and processing phases on Nodes A, B, and C (task counts of 3, 2, and 6 shown) under the native placement strategy]
27
Our Solution: Reducing Data Transfer Time
[Figure: timeline in minutes of the loading, transferring, and processing phases on Nodes A', B', and C' (task counts of 3, 2, and 6 shown); transfer time shrinks after rebalancing]
28
Preliminary Results
Impact of data placement on performance of grep
29
Challenges
• Does the computing ratio depend on the application?
• Initial data distribution
• Data skew problem
  – New data arrival
  – Data deletion
  – New node joining
  – Data updating
30
Measure Computing Ratios
• Computing ratio
• Fast machines process large data sets
[Figure: Nodes A, B, and C processing over time; Node A runs 1 task/min, Node B is 2x slower, and Node C is 3x slower]
31
Steps to Measure Computing Ratios
Node     Response Time (s)   Ratio   # of File Fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest
1. Run the application on each node with the same-size data set, and collect each node's response time individually
2. Set the ratio of the shortest response time to 1, and set the other nodes' ratios accordingly
3. Calculate the least common multiple of these ratios
4. Compute each node's portion of file fragments
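The four steps can be sketched in Python using the example response times from the table above (10 s, 20 s, 30 s); the node names are the slide's.

```python
from math import gcd
from functools import reduce

# Step 1: measured response times per node, in seconds.
response = {"A": 10, "B": 20, "C": 30}

# Step 2: ratio of the shortest response time is 1.
fastest = min(response.values())
ratios = {node: t / fastest for node, t in response.items()}   # A: 1.0, B: 2.0, C: 3.0

# Step 3: least common multiple of the (integer) ratios.
lcm = reduce(lambda a, b: a * b // gcd(a, b), (int(r) for r in ratios.values()))

# Step 4: portion of file fragments per node -- faster nodes get more.
fragments = {node: lcm // int(r) for node, r in ratios.items()}
print(fragments)  # {'A': 6, 'B': 3, 'C': 2}
```

The resulting fragment counts (6, 3, 2) match the table's "# of File Fragments" column.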
32
Initial Data Distribution
• Input files are split into 64 MB blocks
• Round-robin data distribution algorithm
[Figure: the namenode distributes the blocks of File1 (1-9, a-c) across datanodes A, B, and C in portion 3:2:1]
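A minimal sketch of ratio-weighted round-robin block placement, assuming the 3:2:1 portion from the slide; node names and block IDs are illustrative.

```python
from itertools import cycle

def place_blocks(blocks, portions):
    # Expand each node by its portion, then deal blocks round-robin,
    # so node A receives 3 blocks for every 1 that node C receives.
    slots = [node for node, p in portions for _ in range(p)]
    placement = {node: [] for node, _ in portions}
    for block, node in zip(blocks, cycle(slots)):
        placement[node].append(block)
    return placement

blocks = [f"blk_{i}" for i in range(12)]   # e.g. twelve 64 MB blocks
placement = place_blocks(blocks, [("A", 3), ("B", 2), ("C", 1)])
print({n: len(bs) for n, bs in placement.items()})  # {'A': 6, 'B': 4, 'C': 2}
```

With 12 blocks and portion 3:2:1, the fastest node ends up holding half the data, which is the point of the scheme: data is placed where it can be processed without remote transfers.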
33
Data Redistribution
1. Get the network topology, the computing ratio, and the disk utilization of each node
2. Build and sort two lists: an under-utilized node list L1 and an over-utilized node list L2
3. Select a source node from L2 and a destination node from L1
4. Transfer data from the source to the destination
5. Repeat steps 3 and 4 until both lists are empty
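A hedged sketch of the redistribution loop above, tracking only block counts per node; the real HDFS-HC tool moves actual blocks and considers network topology, which this toy omits.

```python
def redistribute(blocks_per_node, targets):
    # Move one block at a time from the most over-utilized node (L2)
    # to the most under-utilized node (L1) until both lists are empty.
    moves = []
    while True:
        over = sorted((n for n in blocks_per_node
                       if blocks_per_node[n] > targets[n]),
                      key=lambda n: targets[n] - blocks_per_node[n])   # list L2
        under = sorted((n for n in blocks_per_node
                        if blocks_per_node[n] < targets[n]),
                       key=lambda n: blocks_per_node[n] - targets[n])  # list L1
        if not over or not under:
            break
        src, dst = over[0], under[0]
        blocks_per_node[src] -= 1      # step 4: transfer one block
        blocks_per_node[dst] += 1
        moves.append((src, dst))
    return moves

state = {"A": 4, "B": 4, "C": 4}               # uniform initial placement
moves = redistribute(state, {"A": 6, "B": 4, "C": 2})
print(state)   # {'A': 6, 'B': 4, 'C': 2}
```

Starting from a uniform layout, two block transfers from C to A reach the 6:4:2 target, after which both lists are empty and the loop stops.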
[Figure: the namenode moves blocks between datanodes A, B, and C, from the over-utilized list L2 to the under-utilized list L1, until the data reaches portion 3:2:1]
34
Sharing Files among Multiple Applications
• The computing ratio depends on the data-intensive application
  – Redistribution
  – Redundancy
35
Experimental Environment
Five nodes in a heterogeneous Hadoop cluster
Node     CPU Model          CPU (GHz)   L1 Cache (KB)
Node A   Intel Core 2 Duo   2 x 1.0     204
Node B   Intel Celeron      2.8         256
Node C   Intel Pentium 3    1.2         256
Node D   Intel Pentium 3    1.2         256
Node E   Intel Pentium 3    1.2         256
36
Grep and WordCount
• Grep is a tool that searches for a regular expression in a text file
• WordCount is a program that counts the words in a text file
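Toy single-machine versions of the two benchmarks, just to make their behavior concrete; these are not the Hadoop implementations used in the experiments.

```python
import re

def grep(lines, pattern):
    # Return every line matching the regular expression.
    return [ln for ln in lines if re.search(pattern, ln)]

def wordcount(lines):
    # Count occurrences of every whitespace-separated word.
    counts = {}
    for word in " ".join(lines).split():
        counts[word] = counts.get(word, 0) + 1
    return counts

lines = ["hadoop cluster", "data placement", "hadoop hdfs"]
print(grep(lines, r"hadoop"))        # ['hadoop cluster', 'hadoop hdfs']
print(wordcount(lines)["hadoop"])    # 2
```

Both are embarrassingly parallel over input splits, which is why they serve as standard MapReduce benchmarks.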
37
Computing ratio for two applications
Computing ratios of the five nodes for the Grep and WordCount applications
Computing Node   Ratio for Grep   Ratio for WordCount
Node A           1                1
Node B           2                2
Node C           3.3              5
Node D           3.3              5
Node E           3.3              5
38
Response Time of Grep and WordCount on Each Node
• The computing ratio is application dependent
• The computing ratio is independent of data size
39
Six Data Placement Decisions
40
Impact of data placement on performance of Grep
41
Impact of data placement on performance of WordCount
42
Conclusion
• Identified the performance degradation caused by heterogeneity
• Designed and implemented a data placement mechanism in HDFS
43
Future Work
• Data redundancy issue
• Dynamic data distribution mechanism
• Prefetching
44
Fellowship Program, Samuel Ginn College of Engineering at Auburn University
• Dean's Fellowship: $32,000 per year plus tuition fellowship
• College Fellowship: $24,000 per year plus tuition fellowship
• Departmental Fellowship: $20,000 per year plus tuition fellowship
• Tuition Fellowships: provide a full tuition waiver for a student with a 25 percent or greater full-time-equivalent (FTE) assignment. Both graduate research assistants (GRAs) and graduate teaching assistants (GTAs) are eligible.