MapReduceppt

Jun 04, 2018
Transcript
  • 8/13/2019 MapReduceppt

    1/38

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Presented by: Venkataramana Chunduru

2/38

    AGENDA

    GFS

    MAP REDUCE

    HADOOP

3/38

    Motivation

Input data is large: the whole Web, billions of pages.

There are lots of machines; use them efficiently.

Google needed a good distributed file system.

Why not use an existing file system? Google's problems are different from anyone else's. GFS is designed for Google's apps and workloads, and Google's apps are designed for GFS.

4/38

    NFS Disadvantages

Network congestion: heavy disk activity on the NFS server adversely affects NFS performance.

When a client attempts to mount an unavailable export, the client system hangs, although this can be mitigated with specific mount options.

If the server hosting the exported file system becomes unavailable for any reason, no one can access the resource.

NFS has security problems because its design assumes a trusted network.

5/38

    GFS Assumptions

High component failure rates: inexpensive commodity components fail all the time.

Modest number of huge files: just a few million, each 100 MB or larger; multi-GB files are typical.

Files are write-once and mostly appended to, perhaps concurrently.

Large streaming reads.

6/38

    GFS Design Decisions

Files are stored as chunks.
- Fixed size (64 MB).

Reliability through replication.
- Each chunk is replicated across 3+ chunkservers.

Single master to coordinate access and keep metadata.
- Simple centralized management.

No data caching.
- Little benefit due to large datasets and streaming reads.
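The chunking and replication decisions above can be sketched in a few lines; the chunk size matches the slide, but the server names and placement policy are illustrative, not GFS's actual mechanism:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks, as in GFS

def split_into_chunks(file_size):
    """Return the number of fixed-size chunks needed for a file."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division

def place_replicas(num_chunks, chunkservers, replication=3):
    """Assign each chunk to `replication` distinct chunkservers."""
    return {chunk: random.sample(chunkservers, replication)
            for chunk in range(num_chunks)}

servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]          # hypothetical chunkservers
chunks = split_into_chunks(200 * 1024 * 1024)          # a 200 MB file -> 4 chunks
placement = place_replicas(chunks, servers)            # chunk index -> 3 servers
```

The real master also rebalances replicas and prefers spreading them across racks; this sketch only shows the basic chunk-count and 3-way-replication arithmetic.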

7/38

    GFS Architecture

8/38

    Single Master

From distributed systems we know it is a:
- Single point of failure.
- Scalability bottleneck.

GFS solutions:
- Shadow masters.
- Minimize master involvement.

Simple and good enough.

9/38

    Metadata (1/2)

Global metadata is stored on the master:
- File and chunk namespaces.
- Mapping from files to chunks.
- Locations of each chunk's replicas.

All in memory (64 bytes/chunk):
- Fast.
- Easily accessible.
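At roughly 64 bytes of metadata per 64 MB chunk, a quick back-of-the-envelope check shows why keeping all metadata in the master's memory is feasible (the petabyte figure is an illustrative workload, not from the slides):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
METADATA_PER_CHUNK = 64         # ~64 bytes of master-side metadata per chunk

def metadata_bytes(total_storage_bytes):
    """Estimate master memory needed to describe this much stored data."""
    num_chunks = total_storage_bytes // CHUNK_SIZE
    return num_chunks * METADATA_PER_CHUNK

one_petabyte = 2 ** 50
# 1 PB of data -> 16,777,216 chunks -> about 1 GiB of metadata
print(metadata_bytes(one_petabyte) // 2 ** 30)  # -> 1
```

So even a petabyte of file data needs only on the order of a gigabyte of master memory, which is what makes the single-master design workable.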

10/38

    Metadata (2/2)

The master keeps an operation log for persistent logging of critical metadata updates:
- Persistent on local disk.
- Replicated.
- Checkpoints for faster recovery.

11/38

12/38

    Conclusion of GFS

GFS demonstrates how to support large-scale processing workloads on commodity hardware:
- Designed to tolerate frequent component failures.
- Optimized for huge files that are mostly appended to and read.
- Go for simple solutions.

GFS has met Google's storage needs, so it must be good!

13/38

    Example for MapReduce

    Page 1: the weather is good

    Page 2: today is good

    Page 3: good weather is good.

14/38

    Map output

    Worker 1: (the 1), (weather 1), (is 1), (good 1).

    Worker 2: (today 1), (is 1), (good 1).

    Worker 3: (good 1), (weather 1), (is 1), (good 1).

15/38

    Reduce Input

Worker 1: (the 1)

Worker 2: (is 1), (is 1), (is 1)

Worker 3: (weather 1), (weather 1)

Worker 4: (today 1)

Worker 5: (good 1), (good 1), (good 1), (good 1)

16/38

    Reduce Output

Worker 1: (the 1)

Worker 2: (is 3)

Worker 3: (weather 2)

Worker 4: (today 1)

Worker 5: (good 4)
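The map, shuffle, and reduce phases shown on the last few slides can be replayed end to end with a short sketch (plain Python standing in for the MapReduce framework; function names are illustrative):

```python
from collections import defaultdict

pages = [
    "the weather is good",   # Page 1
    "today is good",         # Page 2
    "good weather is good",  # Page 3
]

def map_phase(text):
    """Emit an intermediate (word, 1) pair per occurrence, as each map worker does."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group intermediate values by key; this is what each reduce worker receives."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts, producing one output pair per key."""
    return {key: sum(values) for key, values in groups.items()}

intermediate = [pair for page in pages for pair in map_phase(page)]
counts = reduce_phase(shuffle(intermediate))
# counts == {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

This reproduces exactly the reduce output on the slide: (the 1), (is 3), (weather 2), (today 1), (good 4).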

17/38

    MapReduce Architecture

18/38

    Parallel Execution

19/38

    Fault Tolerance

Worker failure: detect failure via periodic heartbeats.
- Re-execute completed and in-progress map tasks.
- Re-execute in-progress reduce tasks.
- Task completion is committed through the master.

Master failure: could be handled, but isn't yet (master failure is unlikely).

20/38

    Refinement

    Different partitioning functions.

    Combiner function.

    Different input/output types.

    Skipping bad records.

    Local execution.

    Status info.

    Counters.
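Of these refinements, the combiner is the easiest to illustrate: it pre-aggregates a single map worker's output locally, shrinking the data shuffled over the network. This is a sketch of the idea, not Hadoop's actual Combiner API:

```python
from collections import Counter

def map_phase(text):
    """Emit a (word, 1) pair per occurrence, as on the earlier word-count slides."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Locally merge one map worker's pairs before they are sent to reducers."""
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

raw = map_phase("good weather is good")  # worker 3's page from the example
combined = combine(raw)
# 4 raw pairs shrink to 3: ('good', 2), ('weather', 1), ('is', 1)
```

Because word count's reduce is a sum (associative and commutative), combining locally never changes the final answer, only the volume of intermediate data.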

21/38

What's Hadoop?

A framework for running applications on large clusters of commodity hardware.

Scale: petabytes of data on thousands of nodes.

Includes:
- Storage: HDFS.
- Processing: MapReduce.
- Support for the Map/Reduce programming model.

Requirements:
- Economy: uses clusters of commodity computers.
- Easy to use: users need not deal with the complexity of distributed computing.
- Reliable: can handle node failures automatically.

22/38

What's Hadoop? (Contd.)

Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

    Here's what makes Hadoop especially useful:

    Scalable

    Economical

    Efficient

    Reliable

23/38

    HDFS

Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS).

MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.

http://wiki.apache.org/hadoop/HadoopMapReduce
24/38

    Hadoop Architecture

[Diagram: input data in a Hadoop cluster is split into DFS blocks (Block 1, 2, 3), each replicated across several nodes; Map tasks process the blocks in parallel and a Reduce step combines their output into the results.]

25/38

    Sample Hadoop Code

Sample text files as input:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

26/38

Contd.

Notice how the inputs differ from the first version we looked at, and how they affect the outputs.

Now, let's plug in a pattern file that lists the word patterns to be ignored, via the DistributedCache:

$ hadoop dfs -cat /user/joe/wordcount/patterns.txt
\.
\,
\!
to

Run it again, this time with more options:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=true /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

As expected, the output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1

27/38

Contd.

Run it once more, this time switching off case-sensitivity:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=false /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

Sure enough, the output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

28/38

    Hadoop

HDFS assumes that hardware is unreliable and will eventually fail.

It is similar to RAID, except that HDFS can replicate data across several machines, providing:
- Fault tolerance.
- Extremely high capacity storage.

29/38

    Hadoop

Moving computation is cheaper than moving data.

HDFS is said to be rack-aware.

30/38

    Who uses Hadoop?

Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site.

The tech team at The New York Times rented computing power on Amazon's cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.

31/38

    Who uses Hadoop?

Besides Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of them include:

A9.com, Facebook, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, Powerset, The New York Times, Rackspace, and Veoh.

32/38

Yahoo! Launches World's Largest Hadoop Production Application

Yahoo! recently launched what we believe is the world's largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more-than-10,000-core Linux cluster and produces data that is now used in every Yahoo! Web search query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the Internet, along with a vast array of data about every page and site. This derived data feeds the machine-learned ranking algorithms at the heart of Yahoo! Search.

http://hadoop.apache.org/
33/38

Yahoo's Hadoop

One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark, created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop, with 13,000+ nodes running hundreds of thousands of jobs a month, and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.

The cluster statistics were:

- 910 nodes, 2 quad-core Xeons @ 2.0 GHz per node.
- 4 SATA disks per node, 8 GB RAM per node.
- 1 gigabit Ethernet on each node, 40 nodes per rack.
- 8 gigabit Ethernet uplinks from each rack to the core.
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18).
- Sun Java JDK 1.6.0_05-b13.

http://hadoop.apache.org/core
http://www.hpl.hp.com/hosted/sortbenchmark/
http://wiki.apache.org/hadoop/PoweredBy
http://people.apache.org/~omalley/CorePatchesByBranch-18.png
34/38

    Process Diagram

35/38

    Map/Reduce Processes

Launching application: user application code submits a specific kind of Map/Reduce job.

JobTracker: handles all jobs and makes all scheduling decisions.

TaskTracker: manager for all tasks on a given node.

Task: runs an individual map or reduce fragment for a given job; forked from the TaskTracker.

36/38

    Hadoop Map-Reduce Architecture

Master-slave architecture.

Map-Reduce master: the JobTracker.
- Accepts MR jobs submitted by users.
- Assigns Map and Reduce tasks to TaskTrackers.
- Monitors task and TaskTracker status; re-executes tasks upon failure.

Map-Reduce slaves: the TaskTrackers.
- Run Map and Reduce tasks upon instruction from the JobTracker.
- Manage storage and transmission of intermediate output.

37/38

Important Links

    http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html

    http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related

    http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related

    http://labs.google.com/papers/gfs-sosp2003.pdf

38/38

Thank you!