
What does a Hadoop Process do on Your Machine

Wang Xu

[email protected]

Feb, 2011


Outline

1. Hadoop: a Clone of Google Infrastructure

2. What’s MapReduce

3. How HDFS Supports MapReduce and Others

4. What’s the DataNode Doing

5. What’s the TaskTracker Doing


Apache Hadoop: History & Dreams

Nutch, Lucene... Yahoo and search engines... Doug Cutting... Yahoo, Cloudera, and Facebook


The Hadoop Family

Projects and Their Relatives in Google

- Common: IPC, utils, and other common stuff
- HDFS ⇔ Google GFS: Distributed File System
- MapReduce ⇔ Google MapReduce: Framework of Distributed Computing
- HBase ⇔ BigTable: Column Family based Non-Relational Database
- ZooKeeper ⇔ Chubby: Distributed Lock Service, for Quorum...
- Avro ⇔ Protocol Buffers: Cross-language Data Serialization and Exchange
- Hive & Pig: Data Warehouse based on the MapReduce Platform
- Oozie: Data flow engine


How Hadoop Helps Your Business

Usages of Hadoop

- Search Engine: the Nutch project, Yahoo (now Bing based), and some others
- Log Analysis: for user behavior, network signalling, etc.
- The new messaging system of Facebook is based on HBase
- Advertising: Yahoo and other companies
- Hive is used at Facebook


The Nature of MapReduce

Map in Functional Programming

- Map: map({1,2,3,4}, (×2)) ⇒ {2,4,6,8}
- Every element is processed with the given function
- Elements do not affect each other
- The input is immutable, and the output is a new list
- Fit for parallel processing

Reduce in Functional Programming

- Reduce: reduce({1,2,3,4}, (×)) ⇒ {24}
- All the elements in the list are processed together
- The input is immutable, and the output is a new list
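
The two functional primitives above can be tried directly in Python. This is only a minimal sketch of the idea on the slide's own numbers; the variable names are chosen for illustration.

```python
from functools import reduce
from operator import mul

numbers = (1, 2, 3, 4)  # a tuple, so the input cannot be modified

# map: apply the same function to every element independently
doubled = list(map(lambda x: x * 2, numbers))

# reduce: fold all elements together with one binary function
product = reduce(mul, numbers)

print(doubled)  # [2, 4, 6, 8]
print(product)  # 24
```

Because each element is mapped independently, the map step could be split across machines without changing the result.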


Distributed MapReduce

A Map Task’s Life

- Input: a segment of the input records (from DFS)
- Job: process the records one by one, emitting zero, one, or more K-V pairs per record
- Then: work as a server, waiting for the Reduces’ K-V retrieval requests

A Reduce Task’s Life

- Shuffle: retrieve the K-V pairs for its specific keys from all Map tasks
- Sort: group and merge the K-V pairs
- Reduce: run reduce() on each group and write the result file back to DFS
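
As a rough single-process illustration of these two life cycles, the sketch below counts words. It is not Hadoop code: `map_task`, `reduce_task`, and `wanted_keys` are hypothetical names, and plain lists stand in for DFS splits and network transfers.

```python
from collections import defaultdict

def map_task(records):
    """Process records one by one; each record may emit zero or more K-V pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_task(map_outputs, wanted_keys):
    """Shuffle, sort, and reduce for the keys this reducer is responsible for."""
    # Shuffle: pull the pairs for our keys from every map task's output.
    fetched = [(k, v) for output in map_outputs for (k, v) in output
               if k in wanted_keys]
    # Sort: order the pairs and group the values by key.
    groups = defaultdict(list)
    for k, v in sorted(fetched):
        groups[k].append(v)
    # Reduce: one result per key (here, a sum).
    return {k: sum(vs) for k, vs in groups.items()}

split_a = ["hadoop runs map tasks", ""]   # the empty record emits nothing
split_b = ["map tasks feed reduce tasks"]
map_outputs = [list(map_task(split_a)), list(map_task(split_b))]

result = reduce_task(map_outputs, wanted_keys={"map", "tasks"})
print(result)  # {'map': 2, 'tasks': 3}
```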


The Landscape of MapReduce

Figure: Data Flow of MapReduce (Map 1 to Map 4 feeding Reduce 1 to Reduce 3)

1. Maps read data from DFS separately
2. Maps process the data and do not communicate with each other
3. Maps keep their results in node-local storage (local disk)
4. Each Reduce retrieves data from all the Maps
5. Reduces do not communicate with each other either
6. Reduces write the results back to DFS
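
One way to see why Maps and Reduces never need to talk to their peers is to partition keys with a stable hash, so every Map independently agrees on which Reduce owns a key. The sketch below assumes a CRC32-based partition function, chosen only for illustration and not taken from the slides; all names are hypothetical.

```python
import zlib

NUM_REDUCES = 3

def partition(key, num_reduces=NUM_REDUCES):
    """Pick a reduce for a key with a stable hash (crc32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces

def run_map(records):
    """Steps 1-3: a map buckets its output per reduce in its own local storage."""
    local = {r: [] for r in range(NUM_REDUCES)}
    for key, value in records:
        local[partition(key)].append((key, value))
    return local

def run_reduce(reduce_id, all_map_outputs):
    """Steps 4-6: a reduce fetches only its own bucket, but from every map."""
    totals = {}
    for local in all_map_outputs:
        for key, value in local[reduce_id]:
            totals[key] = totals.get(key, 0) + value
    return totals

maps = [run_map([("a", 1), ("b", 1)]), run_map([("a", 1), ("c", 1)])]
results = [run_reduce(r, maps) for r in range(NUM_REDUCES)]
print(results)
```

Each key ends up in exactly one reduce's result, even though no map or reduce exchanged any state with its peers.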


Hadoop Distributed File System

Commodity PC based Massive Data Storage System

- Redundancy: blocks are replicated to different nodes in different racks
- Location awareness: tasks can be scheduled to the nodes storing the data
- Write once, read many times
- Large files are split into blocks
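
To make the block splitting concrete, the following sketch computes block boundaries for a file. The function name is hypothetical, and the 64 MB default is simply one of the block sizes mentioned later in this talk.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)
print(len(blocks))          # 4
print(blocks[-1][1] // MB)  # 8
```

A 200 MB file becomes three full 64 MB blocks plus one 8 MB block; the last block only occupies the bytes it needs.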


The Role of a DataNode

Block (chunk) container of HDFS

- Manages its directories as a soft RAID0: writes block files round-robin
- Keeps a block-to-directory map in memory
- DataNodeProtocol (served by the NameNode): communicates with the NameNode to report, send heartbeats, and get commands
- DataTransferProtocol: communicates with clients and other DataNodes to transfer blocks
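
The first two points can be pictured with a toy model. This is a sketch under assumptions, not the DataNode's actual code: `BlockStore`, the directory names, and the block ids are all made up for illustration.

```python
import itertools

class BlockStore:
    """Toy DataNode storage: round-robin over directories plus an in-memory block map."""

    def __init__(self, dirs):
        self._next_dir = itertools.cycle(dirs)  # round-robin iterator
        self.block_to_dir = {}                  # block id -> directory

    def add_block(self, block_id):
        directory = next(self._next_dir)
        self.block_to_dir[block_id] = directory
        return directory

    def locate(self, block_id):
        return self.block_to_dir[block_id]

store = BlockStore(["/data1", "/data2", "/data3"])  # hypothetical directories
placed = [store.add_block(f"blk_{i}") for i in range(4)]
print(placed)                  # ['/data1', '/data2', '/data3', '/data1']
print(store.locate("blk_2"))   # /data3
```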


DataNode on Disk

Block Files

- The blk_XXX files
- 64MB or 128MB blocks

Meta Files

- The blk_XXX.meta files
- Header: layout version and bytes per checksum
- Checksums
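
The idea of a meta file (a small header followed by one checksum per fixed-size chunk of the block) can be sketched as below. The byte layout, the version number, and the 512-byte chunk size are assumptions made for this example; this is not the real HDFS on-disk format.

```python
import struct
import zlib

LAYOUT_VERSION = 1          # hypothetical value
BYTES_PER_CHECKSUM = 512    # hypothetical value

def build_meta(block_data, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Header (version, bytes per checksum) followed by one CRC32 per chunk."""
    meta = struct.pack(">hi", LAYOUT_VERSION, bytes_per_checksum)
    for start in range(0, len(block_data), bytes_per_checksum):
        chunk = block_data[start:start + bytes_per_checksum]
        meta += struct.pack(">I", zlib.crc32(chunk))
    return meta

def verify(block_data, meta):
    """Recompute the checksums and compare them with the stored meta."""
    _, bytes_per_checksum = struct.unpack(">hi", meta[:6])
    return build_meta(block_data, bytes_per_checksum) == meta

data = b"x" * 1300                    # 3 chunks: 512 + 512 + 276 bytes
meta = build_meta(data)
print(len(meta))                      # 18 (6-byte header + 3 * 4 bytes)
print(verify(data, meta))             # True
print(verify(b"y" + data[1:], meta))  # False
```

Keeping checksums per chunk rather than per block means a reader can validate just the part of the block it actually reads.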


Block Writing To DataNode

The Pipeline

- Set up the pipeline: Client → DataNode1 → DataNode2 → DataNode3
- DataNode: receives a packet and forwards it to the next DataNode
- DataNode writes the received data buffer
- DataNode then writes the corresponding meta data
- DataNode flushes the file stream
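
The steps above can be mimicked in memory. In this sketch the `DataNode` class is a toy stand-in: a bytearray plays the block file, a list of CRC32 values plays the meta file, and a counter plays the flush. None of it reflects the real wire protocol.

```python
import zlib

class DataNode:
    """Toy DataNode: forwards each packet downstream, then stores it locally."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.block = bytearray()   # stands in for the block file
        self.meta = []             # stands in for the meta file
        self.flushes = 0

    def receive(self, packet):
        if self.downstream is not None:
            self.downstream.receive(packet)   # forward to the next DataNode
        self.block.extend(packet)             # write the received data buffer
        self.meta.append(zlib.crc32(packet))  # then write the corresponding meta
        self.flushes += 1                     # flush the file stream

# Set up the pipeline: Client -> DataNode1 -> DataNode2 -> DataNode3
dn3 = DataNode("DataNode3")
dn2 = DataNode("DataNode2", downstream=dn3)
dn1 = DataNode("DataNode1", downstream=dn2)

for packet in (b"hello ", b"hdfs"):   # the client only talks to DataNode1
    dn1.receive(packet)

print([bytes(dn.block) for dn in (dn1, dn2, dn3)])
# [b'hello hdfs', b'hello hdfs', b'hello hdfs']
```

The client sends each packet once, and replication happens as a side effect of the forwarding chain.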


The Role of a TaskTracker

Local Commander of a Node

- Runs from the beginning to the end
- Gets tasks from the JobTracker, the Big BOSS
- Both Map and Reduce tasks are run by the TaskTracker
- Assigns tasks to Mapper and Reducer processes
- Works as an HTTP server (Jetty) for data transfer between TTs


Daily Life of a Mapper

Direct Mapper Output

- Run map() against every record, and collect the K-Vs
- Write each K-V into the file (in OutputFormat) as soon as the pair is produced
- Flush the file

Buffered Mapper (The Normal Case)

- Run map() against every record, and collect the K-Vs
- Collect the K-Vs into a buffer whose size is set by io.sort.mb
- Spill to an external file when the Map output fills the buffer
- Finally, do an external sort (with an optional Combiner) and write to the final file
- file: $local/taskTracker/jobcache/jobid/taskid/file.out
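
The buffered path is essentially an external merge sort. The sketch below is a simplified model, assuming a capacity counted in pairs rather than megabytes and in-memory lists in place of spill files; `MapOutputBuffer` is a hypothetical name.

```python
import heapq

class MapOutputBuffer:
    """Toy buffered mapper output: in-memory buffer, sorted spills, final merge."""

    def __init__(self, capacity):
        self.capacity = capacity   # stands in for io.sort.mb, counted in pairs here
        self.buffer = []
        self.spills = []           # each spill stands in for an external file

    def collect(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.capacity:
            self.spill()

    def spill(self):
        if self.buffer:
            self.spills.append(sorted(self.buffer))  # every spill is sorted
            self.buffer = []

    def finish(self):
        """Merge the sorted spills into one sorted sequence (the final file)."""
        self.spill()
        return list(heapq.merge(*self.spills))

out = MapOutputBuffer(capacity=2)
for word in ["pig", "hive", "avro", "hbase", "hdfs"]:
    out.collect(word, 1)

print(len(out.spills))  # 2 (the fifth pair is still in the buffer)
print(out.finish())
# [('avro', 1), ('hbase', 1), ('hdfs', 1), ('hive', 1), ('pig', 1)]
```

Sorting each spill when it is written keeps the final step a cheap streaming merge instead of a full sort of all map output.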


Illustration of Map and Combiner from Yahoo

Combiner step inserted into the MapReduce data flow.

Figure: http://developer.yahoo.com/hadoop/tutorial/module4.html
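
Since the figure itself is not reproduced here, a small word-count sketch may help show what a combiner buys: it pre-aggregates pairs with the same key on the map side, so fewer pairs have to be transferred to the reducers. The function names are hypothetical.

```python
from collections import Counter

def map_words(lines):
    """Emit (word, 1) for every word."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Pre-aggregate pairs that share a key, on the map side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

pairs = map_words(["to be or not to be"])
combined = combine(pairs)
print(len(pairs), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

This only works when the reduce operation can be applied to partial results, as a sum can.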


Life of a Reducer

Shuffle & Sort

- Copy the map results from all Maps
- Store the map outputs on disk or in memory
- file: $local/taskTracker/jobcache/jobid/taskid/output/maplocationid.out
- Sort: merge the map outputs (like the Combiner in Map, hmmm... it should rather be that the Combiner is like Sort)

Reduce

- Write the result out with the OutputFormat to HDFS
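
Because every copied map output is already sorted by key, the reducer's "sort" can be a streaming merge. The sketch below assumes in-memory lists in place of the copied files and uses `sum` as an example reduce function; `reduce_side` is a hypothetical name.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(sorted_map_outputs, reduce_fn):
    """Merge already-sorted map outputs, group by key, and reduce each group."""
    merged = heapq.merge(*sorted_map_outputs)   # Sort: a merge, not a full re-sort
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)

# Each list stands in for one copied map output file, already sorted by key.
from_map_1 = [("avro", 1), ("hive", 2)]
from_map_2 = [("hive", 1), ("pig", 4)]

result = list(reduce_side([from_map_1, from_map_2], sum))
print(result)  # [('avro', 1), ('hive', 3), ('pig', 4)]
```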


Q & A
