MapReduceppt

Jun 04, 2018
Transcript
  • 8/13/2019 MapReduceppt

    1/38

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Presented by: Venkataramana Chunduru

2/38

    AGENDA

    GFS

    MAP REDUCE

    HADOOP

3/38

    Motivation

Input data is large: the whole Web, billions of pages.

There are lots of machines; use them efficiently.

Google needed a good distributed file system.

Why not use an existing file system? Google's problems are different from anyone else's. GFS is designed for Google's apps and workloads, and Google's apps are designed for GFS.

4/38

    NFS Disadvantages

Network congestion: heavy disk activity on the NFS server adversely affects NFS performance.

When a client attempts to mount an unavailable export, the client system hangs, although this can be mitigated with specific mount options.

If the server hosting the exported file system becomes unavailable for any reason, no one can access the resource.

NFS has security problems because its design assumes a trusted network.

5/38

    GFS Assumptions

High component failure rates: inexpensive commodity components fail all the time.

Modest number of huge files: just a few million, each 100 MB or larger; multi-GB files are typical.

Files are write-once and mostly appended to, perhaps concurrently.

Large streaming reads.

6/38

    GFS Design Decisions

Files are stored as chunks.
- Fixed size (64 MB).

Reliability through replication.
- Each chunk is replicated across 3+ chunkservers.

Single master to coordinate access and keep metadata.
- Simple centralized management.

No data caching.
- Little benefit due to large datasets and streaming reads.
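The chunking and replication decisions above can be sketched in a few lines; the chunk size matches the slide, but the server names and placement policy are illustrative, not GFS's actual mechanism:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks, as in GFS

def split_into_chunks(file_size):
    """Return the number of fixed-size chunks needed for a file."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division

def place_replicas(num_chunks, chunkservers, replication=3):
    """Assign each chunk to `replication` distinct chunkservers."""
    return {chunk: random.sample(chunkservers, replication)
            for chunk in range(num_chunks)}

servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]          # hypothetical chunkservers
chunks = split_into_chunks(200 * 1024 * 1024)          # a 200 MB file -> 4 chunks
placement = place_replicas(chunks, servers)            # chunk index -> 3 servers
```

The real master also rebalances replicas and prefers spreading them across racks; this sketch only shows the basic chunk-count and 3-way-replication arithmetic.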

7/38

    GFS Architecture

8/38

    Single Master

From distributed systems we know it is a:
- Single point of failure.
- Scalability bottleneck.

GFS solutions:
- Shadow masters.
- Minimize master involvement.

Simple and good enough.

9/38

    Metadata (1/2)

Global metadata is stored on the master:
- File and chunk namespaces.
- Mapping from files to chunks.
- Locations of each chunk's replicas.

All in memory (64 bytes/chunk):
- Fast.
- Easily accessible.
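At roughly 64 bytes of metadata per 64 MB chunk, a quick back-of-the-envelope check shows why keeping all metadata in the master's memory is feasible (the petabyte figure is an illustrative workload, not from the slides):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
METADATA_PER_CHUNK = 64         # ~64 bytes of master-side metadata per chunk

def metadata_bytes(total_storage_bytes):
    """Estimate master memory needed to describe this much stored data."""
    num_chunks = total_storage_bytes // CHUNK_SIZE
    return num_chunks * METADATA_PER_CHUNK

one_petabyte = 2 ** 50
# 1 PB of data -> 16,777,216 chunks -> about 1 GiB of metadata
print(metadata_bytes(one_petabyte) // 2 ** 30)  # -> 1
```

So even a petabyte of file data needs only on the order of a gigabyte of master memory, which is what makes the single-master design workable.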

10/38

    Metadata (2/2)

The master keeps an operation log for persistent logging of critical metadata updates:
- Persistent on local disk.
- Replicated.
- Checkpoints for faster recovery.

11/38

12/38

    Conclusion of GFS

GFS demonstrates how to support large-scale processing workloads on commodity hardware:
- Designed to tolerate frequent component failures.
- Optimized for huge files that are mostly appended to and read.
- Go for simple solutions.

GFS has met Google's storage needs, so it must be good!

13/38

    Example for MapReduce

    Page 1: the weather is good

    Page 2: today is good

    Page 3: good weather is good.

14/38

    Map output

    Worker 1: (the 1), (weather 1), (is 1), (good 1).

    Worker 2: (today 1), (is 1), (good 1).

    Worker 3: (good 1), (weather 1), (is 1), (good 1).

15/38

    Reduce Input

Worker 1: (the 1)

Worker 2: (is 1), (is 1), (is 1)

Worker 3: (weather 1), (weather 1)

Worker 4: (today 1)

Worker 5: (good 1), (good 1), (good 1), (good 1)

16/38

    Reduce Output

Worker 1: (the 1)

Worker 2: (is 3)

Worker 3: (weather 2)

Worker 4: (today 1)

Worker 5: (good 4)
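The map, shuffle, and reduce phases shown on the last few slides can be replayed end to end with a short sketch (plain Python standing in for the MapReduce framework; function names are illustrative):

```python
from collections import defaultdict

pages = [
    "the weather is good",   # Page 1
    "today is good",         # Page 2
    "good weather is good",  # Page 3
]

def map_phase(text):
    """Emit an intermediate (word, 1) pair per occurrence, as each map worker does."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group intermediate values by key; this is what each reduce worker receives."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts, producing one output pair per key."""
    return {key: sum(values) for key, values in groups.items()}

intermediate = [pair for page in pages for pair in map_phase(page)]
counts = reduce_phase(shuffle(intermediate))
# counts == {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

This reproduces exactly the reduce output on the slide: (the 1), (is 3), (weather 2), (today 1), (good 4).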

17/38

    MapReduce Architecture

18/38

    Parallel Execution

19/38

    Fault Tolerance

Worker failure: detect failure via periodic heartbeats.
- Re-execute completed and in-progress map tasks.
- Re-execute in-progress reduce tasks.
- Task completion is committed through the master.

Master failure: could be handled, but isn't yet (master failure is unlikely).

20/38

    Refinement

    Different partitioning functions.

    Combiner function.

    Different input/output types.

    Skipping bad records.

    Local execution.

    Status info.

    Counters.
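Of these refinements, the combiner is the easiest to illustrate: it pre-aggregates a single map worker's output locally, shrinking the data shuffled over the network. This is a sketch of the idea, not Hadoop's actual Combiner API:

```python
from collections import Counter

def map_phase(text):
    """Emit a (word, 1) pair per occurrence, as on the earlier word-count slides."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Locally merge one map worker's pairs before they are sent to reducers."""
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

raw = map_phase("good weather is good")  # worker 3's page from the example
combined = combine(raw)
# 4 raw pairs shrink to 3: ('good', 2), ('weather', 1), ('is', 1)
```

Because word count's reduce is a sum (associative and commutative), combining locally never changes the final answer, only the volume of intermediate data.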

21/38

What's Hadoop?

A framework for running applications on large clusters of commodity hardware.

Scale: petabytes of data on thousands of nodes.

Includes:
- Storage: HDFS.
- Processing: MapReduce.
- Support for the Map/Reduce programming model.

Requirements:
- Economy: uses clusters of commodity computers.
- Easy to use: users need not deal with the complexity of distributed computing.
- Reliable: can handle node failures automatically.

22/38

What's Hadoop? (Contd.)

Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

    Here's what makes Hadoop especially useful:

    Scalable

    Economical

    Efficient

    Reliable

23/38

    HDFS

Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS).

MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.

http://wiki.apache.org/hadoop/HadoopMapReduce
24/38

    Hadoop Architecture

[Diagram: input data in a Hadoop cluster is split into DFS blocks (Block 1, 2, 3), each replicated across several nodes; Map tasks process the blocks in parallel and a Reduce step combines their output into the results.]

25/38

    Sample Hadoop Code

Sample text files as input:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

26/38

Contd.

Notice how the inputs differ from the first version we looked at, and how they affect the outputs.

Now, let's plug in a pattern file that lists the word patterns to be ignored, via the DistributedCache:

$ hadoop dfs -cat /user/joe/wordcount/patterns.txt
\.
\,
\!
to

Run it again, this time with more options:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=true /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

As expected, the output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1

27/38

Contd.

Run it once more, this time switching off case-sensitivity:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=false /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

Sure enough, the output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

28/38

    Hadoop

HDFS assumes that hardware is unreliable and will eventually fail.

It is similar to RAID, except that HDFS can replicate data across several machines, providing:
- Fault tolerance.
- Extremely high capacity storage.

29/38

    Hadoop

Moving computation is cheaper than moving data.

HDFS is said to be rack-aware.

30/38

    Who uses Hadoop?

Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site.

The tech team at The New York Times rented computing power on Amazon's cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.

31/38

    Who uses Hadoop?

Besides Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of them include:

A9.com, Facebook, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, Powerset, The New York Times, Rackspace, and Veoh.

32/38

Yahoo! Launches World's Largest Hadoop Production Application

Yahoo! recently launched what we believe is the world's largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more-than-10,000-core Linux cluster and produces data that is now used in every Yahoo! Web search query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the Internet, along with a vast array of data about every page and site. This derived data feeds the machine-learned ranking algorithms at the heart of Yahoo! Search.

http://hadoop.apache.org/
33/38

Yahoo's Hadoop

One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark, created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop, with 13,000+ nodes running hundreds of thousands of jobs a month, and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.

The cluster statistics were:

- 910 nodes, 2 quad-core Xeons @ 2.0 GHz per node.
- 4 SATA disks per node, 8 GB RAM per node.
- 1 gigabit Ethernet on each node, 40 nodes per rack.
- 8 gigabit Ethernet uplinks from each rack to the core.
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18).
- Sun Java JDK 1.6.0_05-b13.

http://hadoop.apache.org/core
http://www.hpl.hp.com/hosted/sortbenchmark/
http://wiki.apache.org/hadoop/PoweredBy
http://people.apache.org/~omalley/CorePatchesByBranch-18.png
34/38

    Process Diagram

35/38

    Map/Reduce Processes

Launching application: user application code submits a specific kind of Map/Reduce job.

JobTracker: handles all jobs and makes all scheduling decisions.

TaskTracker: manager for all tasks on a given node.

Task: runs an individual map or reduce fragment for a given job; forked from the TaskTracker.

36/38

    Hadoop Map-Reduce Architecture

Master-slave architecture.

Map-Reduce master: the JobTracker.
- Accepts MR jobs submitted by users.
- Assigns Map and Reduce tasks to TaskTrackers.
- Monitors task and TaskTracker status; re-executes tasks upon failure.

Map-Reduce slaves: the TaskTrackers.
- Run Map and Reduce tasks upon instruction from the JobTracker.
- Manage storage and transmission of intermediate output.

37/38

Important Links

    http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html

    http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related

    http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related

    http://labs.google.com/papers/gfs-sosp2003.pdf

38/38

Thank you!