Transcript
Page 1: Apache hadoop

Apache Hadoop

Presented by Darpan Dekivadiya (09BCE008)

Page 2: Apache hadoop

What is Hadoop?

• A framework for storing and processing big data on lots of commodity machines.
  o Up to 4,000 machines in a cluster
  o Up to 20 PB in a cluster
• Open-source Apache project
• High reliability done in software
  o Automated fail-over for data and computation
• Implemented in Java


Page 3: Apache hadoop

Hadoop development

• Hadoop was created by Doug Cutting.
• It is named after his son's toy elephant.
• It was originally developed to support the Nutch search engine project.
• Since then, many companies have adopted it and contributed to the project.


Page 4: Apache hadoop

Hadoop Ecosystem

• Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• HDFS: A distributed file system that provides high-throughput access to application data.
• MapReduce: A software framework for distributed processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution framework for parallel computation.
• HBase: A scalable, distributed database that supports structured data storage for large tables.


Page 5: Apache hadoop


Page 6: Apache hadoop

Hadoop, Why?

• Need to process multi-petabyte datasets.
• Expensive to build reliability into each application.
• Nodes fail every day
  – Failure is expected, rather than exceptional.
  – The number of nodes in a cluster is not constant.
• Need a common infrastructure
  – Efficient, reliable, open source (Apache License).
• The goals above are the same as Condor's, but
  o Workloads are I/O bound, not CPU bound.


Page 7: Apache hadoop

Hadoop History

• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• May 2009 – Hadoop sorts a petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at Facebook
  o 2900 nodes, 30+ petabytes


Page 8: Apache hadoop

Who uses Hadoop? • Amazon/A9

• Facebook

• Google

• IBM

• Joost

• Last.fm

• New York Times

• PowerSet

• Veoh

• Yahoo!


Page 9: Apache hadoop

Applications of Hadoop

• Search
  o Yahoo!, Amazon, Zvents
• Log processing
  o Facebook, Yahoo!, ContextWeb, Joost, Last.fm
• Recommendation systems
  o Facebook
• Data warehouse
  o Facebook, AOL
• Video and image analysis
  o New York Times, Eyealike


Page 10: Apache hadoop

Who generates the data?

• Lots of data is generated on Facebook
  o 500+ million active users
  o 30 billion pieces of content shared every month (news stories, photos, blogs, etc.)
• Lots of data is generated for the Yahoo! search engine.
• Lots of data is generated at the Amazon S3 cloud service.


Page 11: Apache hadoop

Data usage

• Statistics per day:
  o 20 TB of compressed new data added per day
  o 3 PB of compressed data scanned per day
  o 20K jobs on the production cluster per day
  o 480K compute hours per day
• Barrier to entry is significantly reduced:
  o New engineers go through a Hadoop/Hive training session
  o 300+ people run jobs on Hadoop
  o Analysts (non-engineers) use Hadoop through Hive


Page 12: Apache hadoop

HDFS Hadoop Distributed File System


Page 13: Apache hadoop


Based on Google File System

Page 14: Apache hadoop


Redundant storage

Page 15: Apache hadoop

Commodity Hardware

• Typically a 2-level architecture
  o Nodes are commodity PCs
  o 20-40 nodes per rack
  o The default HDFS block size in Apache Hadoop is 64 MB.
  o Relational databases typically store data blocks in sizes ranging from 4 KB to 32 KB.


Page 16: Apache hadoop

How does HDFS maintain everything?

• Two types of nodes
  o A single NameNode and a number of DataNodes
• NameNode
  o File names, permissions, modification flags, etc.
  o Data locations are exposed so that computations can be moved close to the data.
• DataNode
  o Stores and retrieves blocks when told to.
  o HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.


Page 17: Apache hadoop

How does HDFS work?


Page 18: Apache hadoop

• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

• The DataNodes are responsible for serving read and write requests from the file system's clients.
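To make this division of labour concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API to write and read a file; the path and text are made up for illustration. The client asks the NameNode where blocks live, but the bytes themselves flow to and from DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the NameNode address from core-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle; namespace calls go to the NameNode
        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        FSDataOutputStream out = fs.create(file);      // NameNode allocates blocks, data is written to DataNodes
        out.writeUTF("Hello, HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);          // NameNode returns block locations, reads hit DataNodes
        System.out.println(in.readUTF());
        in.close();
    }
}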


Page 19: Apache hadoop

MapReduce: Google's MapReduce Technique


Page 20: Apache hadoop

MapReduce Overview

• Provides a clean abstraction for programmers to write distributed applications.
• Factors out many reliability concerns from application logic.
• A batch data processing system.
• Automatic parallelization & distribution.
• Fault tolerance.
• Status and monitoring tools.


Page 21: Apache hadoop

Programming Model

• The programmer has to implement two functions:
  – map (in_key, in_value) -> (out_key, intermediate_value) list
  – reduce (out_key, intermediate_value list) -> out_value list


Page 22: Apache hadoop

MapReduce Flow


Page 23: Apache hadoop

Mapper (indexing example)

• Input is the line number and the actual line.
• Input 1: ("100", "I Love India")
• Output 1: ("I", "100"), ("Love", "100"), ("India", "100")
• Input 2: ("101", "I Love eBay")
• Output 2: ("I", "101"), ("Love", "101"), ("eBay", "101")


Page 24: Apache hadoop

Reducer (indexing example)

• Input is a word and the line numbers it appears on.
• Input 1: ("I", "100", "101")
• Input 2: ("Love", "100", "101")
• Input 3: ("India", "100")
• Input 4: ("eBay", "101")
• Output: each word is stored along with the line numbers it appears on.
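A minimal sketch of how this indexing mapper and reducer might look in Hadoop's Java MapReduce API. Class and variable names are illustrative, and the byte offset supplied by the default TextInputFormat stands in for the line number used in the example above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LineIndex {

    // Mapper: for every word in a line, emit (word, line identifier).
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable lineId, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                context.write(new Text(tok.nextToken()), new Text(lineId.toString()));
            }
        }
    }

    // Reducer: concatenate all line identifiers seen for a word into one posting list.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> lineIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text id : lineIds) {
                if (postings.length() > 0) postings.append(",");
                postings.append(id.toString());
            }
            context.write(word, new Text(postings.toString()));
        }
    }
}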


Page 25: Apache hadoop

Google PageRank example

• Mapper
  o Input is a link and the HTML content of the page.
  o Output is a list of (outgoing link, PageRank contribution of this page) pairs.
• Reducer
  o Input is a link and a list of PageRanks of pages linking to this page.
  o Output is the PageRank of this page, which is the weighted average of all input PageRanks.
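A simplified single-iteration sketch of this job, under an assumed input format (one tab-separated line per page: url, current rank, comma-separated outlinks; all names hypothetical). The mapper credits each outgoing link with an equal share of the page's rank and re-emits the link structure; the reducer sums the incoming shares and applies the usual damping factor, which is the standard refinement of the "weighted average" wording above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankStep {

    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed line format: url <TAB> rank <TAB> out1,out2,...
            String[] parts = line.toString().split("\t");
            String url = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = parts.length > 2 ? parts[2].split(",") : new String[0];
            for (String target : outlinks) {
                // Each outgoing link receives an equal share of this page's rank.
                ctx.write(new Text(target), new Text(Double.toString(rank / outlinks.length)));
            }
            // Re-emit the link structure so the next iteration keeps the same input format.
            ctx.write(new Text(url), new Text("links:" + (parts.length > 2 ? parts[2] : "")));
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("links:")) {
                    links = s.substring("links:".length());
                } else {
                    sum += Double.parseDouble(s);
                }
            }
            double newRank = 0.15 + 0.85 * sum;   // damping factor 0.85
            ctx.write(url, new Text(newRank + "\t" + links));
        }
    }
}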


Page 26: Apache hadoop

Continued

• Limited atomicity and transaction support.
  o HBase supports multiple batched mutations of single rows only.
  o Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
  o Programmatic access via Java, REST, or Thrift APIs.
  o Scripting via JRuby.


Page 27: Apache hadoop

Introduction to HBase

Page 28: Apache hadoop

Overview

• HBase is an Apache open-source project whose goal is to provide storage for the Hadoop distributed computing environment.
• Data is logically organized into tables, rows and columns.


Page 29: Apache hadoop

Outline • Data Model

• Architecture and Implementation

• Examples & Tests


Page 30: Apache hadoop

Conceptual View

• A data row has a sortable row key and an arbitrary number of columns.
• A timestamp is assigned automatically if one is not supplied explicitly.
• Columns are named <family>:<label>.

Row key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                                "anchor:apache.com" = "APACHE"
"com.cnn.www"      t15                                "anchor:cnnsi.com" = "CNN"
                   t13                                "anchor:my.look.ca" = "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"

Page 31: Apache hadoop

Physical Storage View

• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore.

Row key            TS    Column "contents:"
"com.apache.www"   t12   "<html>…"
                   t11   "<html>…"
"com.cnn.www"      t6    "<html>…"
                   t5    "<html>…"
                   t3    "<html>…"

Row key            TS    Column "anchor:"
"com.apache.www"   t10   "anchor:apache.com" = "APACHE"
"com.cnn.www"      t9    "anchor:cnnsi.com" = "CNN"
                   t8    "anchor:my.look.ca" = "CNN.com"

[Diagram: each HStore consists of an in-memory Memcache plus on-disk data MapFiles and index MapFiles (key/value data and index keys).]

Page 32: Apache hadoop

Row Ranges: Regions

• Row key / column ascending, timestamp descending.
• Physically, tables are broken into row ranges (regions) that contain rows from start-key to end-key.

Row key   Time Stamp   Column "contents:"   Column "anchor:"
aaaa      t15                               anchor:cc = value
          t13          ba
          t12          bb
          t11                               anchor:cd = value
          t10          bc
aaab      t14
aaac                                        anchor:be = value
aaad                                        anchor:ad = value
aaae      t5           ae
          t3           af

Page 33: Apache hadoop

Outline • Data Model

• Architecture and Implementation

• Examples & Tests

Page 34: Apache hadoop

Three major components • The HBaseMaster

• The HRegionServer

• The HBase client

Page 35: Apache hadoop

HBaseMaster

• Assigns regions to HRegionServers.
  1. The ROOT region locates all the META regions.
  2. A META region maps a number of user regions.
  3. User regions are assigned to HRegionServers.
• Enables/disables tables and changes table schema.
• Monitors the health of each HRegionServer.

[Diagram: the Master (1) points to the server holding the ROOT region; the ROOT region (2) points to the servers holding META regions; the META regions (3) point to the servers holding the USER regions.]

Page 36: Apache hadoop

HBase Client

Page 37: Apache hadoop

HBase Client ROOT Region

Page 38: Apache hadoop

HBase Client META Region

Page 39: Apache hadoop

HBase Client User Region

Region location information is cached by the client.

Page 40: Apache hadoop

Outline • Data Model

• Architecture and Implementation

• Examples & Tests

Page 41: Apache hadoop

Create MyTable

// Administrative API of the HBase 0.20-era client, as shown on the slide.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

HBaseConfiguration config = new HBaseConfiguration();
HBaseAdmin admin = new HBaseAdmin(config);

// Define two column families for the new table.
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");

HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);

Resulting (empty) table:

Row Key   Timestamp   columnFamily1:   columnFamily2:

Page 42: Apache hadoop

Insert Values

// BatchUpdate is the row-mutation API of the same HBase 0.20-era client.
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

HTable table = new HTable(config, "MyTable");
long timestamp = System.currentTimeMillis();

BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
table.commit(batchUpdate);

Resulting table:

Row Key   Timestamp   columnFamily1:
myRow     ts1         labela = "labela value"
          ts2         labelb = "labelb value"

Page 43: Apache hadoop

Search

Row key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" = "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" = "CNN"
                   t8           "anchor:my.look.ca" = "CNN.com"
                   t6
                   t5
                   t3

Select value from table where key='com.apache.www' AND label='anchor:apache.com'
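As a sketch, the same single-row lookup expressed with the HBase Java client. This uses the Get API from later HBase releases rather than the BatchUpdate-era calls shown earlier, and the table name "webtable" is assumed for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AnchorLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");   // assumed table name

        // Fetch only the anchor:apache.com cell of row com.apache.www.
        Get get = new Get(Bytes.toBytes("com.apache.www"));
        get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));

        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
        System.out.println(Bytes.toString(value));     // prints "APACHE"
        table.close();
    }
}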

Page 44: Apache hadoop

Search Scanner

Select value from table where anchor='cnnsi.com'

Row key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" = "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" = "CNN"
                   t8           "anchor:my.look.ca" = "CNN.com"
                   t6
                   t5
                   t3
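A comparable sketch with the scanner API (again the later client calls, with the table name "webtable" assumed): the scan walks every row but asks only for the anchor:cnnsi.com column, so only rows carrying that cell come back.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AnchorScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");   // assumed table name

        // Scan only the anchor:cnnsi.com column across all rows.
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                byte[] value = row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
                System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(value));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}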

Page 45: Apache hadoop

Pig: A Programming Language for the Hadoop Framework


Page 46: Apache hadoop

Introduction

• Pig was initially developed at Yahoo!
• The Pig programming language is designed to handle any kind of data - hence the name!
• Pig is made of two components:
  o The language itself, called Pig Latin.
  o The runtime environment where Pig Latin programs are executed.


Page 47: Apache hadoop

Why Pig Latin?

• MapReduce is very powerful, but:
  o It requires a Java programmer.
  o The user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than procedural or declarative.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Support for nested types.
• Operates on files in HDFS.


Page 48: Apache hadoop

Pig Latin Overview

• Pig provides a higher-level language, Pig Latin, that:
  o Increases productivity.
    - In one test, 10 lines of Pig Latin ≈ 200 lines of Java.
    - What took 4 hours to write in Java took 15 minutes in Pig Latin.
  o Opens the system to non-Java programmers.
  o Provides common operations like join, group, filter, sort.


Page 49: Apache hadoop

Load Data

• The objects that are being worked on by Hadoop are stored in HDFS.
• To access this data, the program must first tell Pig what file (or files) it will use.
• That is done through the LOAD 'data_file' command.
• If the data is stored in a file format that is not natively accessible to Pig, add the "USING" function to the LOAD statement to specify a user-defined function that can read in and interpret the data.


Page 50: Apache hadoop

Transform Data

• The transform logic is where all the data manipulation happens. For example:
  o FILTER out rows that are not of interest.
  o JOIN two sets of data files.
  o GROUP data to build aggregations.
  o ORDER results.


Page 51: Apache hadoop

Example of a Pig Program

• Loads a file composed of Twitter feeds, selects only those tweets that use the en (English) iso_language code, groups them by the user who is tweeting, and computes the sum of the number of retweets of that user's tweets.

L = LOAD 'hdfs://node/tweet_data';
FL = FILTER L BY iso_language_code EQ 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);


Page 52: Apache hadoop

DUMP and STORE

• The DUMP or STORE command generates the results of a Pig program.
• The DUMP command sends the output to the screen while debugging Pig programs.
• The DUMP command can be used anywhere in a program to dump intermediate result sets to the screen.
• The STORE command stores results from running programs in a file for further processing and analysis.


Page 53: Apache hadoop

Pig Runtime Environment

• The Pig runtime is used when a Pig program needs to run in the Hadoop environment.
• There are three ways to run a Pig program:
  o Embedded in a script.
  o Embedded in a Java program (see the sketch below).
  o From the Pig command line, called Grunt.
• The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the analysis of large amounts of data.
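A minimal sketch of the "embedded in a Java program" option using the org.apache.pig.PigServer class; the input path, field names, and output directory are made up for illustration and mirror the Twitter example from the earlier slide.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // Run the Pig Latin statements as MapReduce jobs on the cluster
        // (ExecType.LOCAL would run them locally instead).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        pig.registerQuery("L = LOAD 'tweet_data' AS (from_user:chararray, iso_language_code:chararray, retweets:int);");
        pig.registerQuery("FL = FILTER L BY iso_language_code == 'en';");
        pig.registerQuery("G = GROUP FL BY from_user;");
        pig.registerQuery("RT = FOREACH G GENERATE group, SUM(FL.retweets);");

        // Equivalent of STORE: write the result relation to HDFS.
        pig.store("RT", "retweet_counts");
    }
}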


Page 54: Apache hadoop

What is Pig used for?

• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.


Page 55: Apache hadoop

Hadoop@BIG: Statistics of Hadoop usage at large organizations


Page 56: Apache hadoop

Hadoop@Facebook

• Production cluster
  o 4800 cores, 600 machines, 16 GB per machine – April 2009
  o 8000 cores, 1000 machines, 32 GB per machine – July 2009
  o 4 SATA disks of 1 TB each per machine
  o 2-level network hierarchy, 40 machines per rack
  o Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
  o 800 cores, 16 GB each


Page 57: Apache hadoop

Hadoop@Yahoo

• World's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000-core Linux cluster.
• Biggest contributor to Hadoop.
• Converting all its batch processing to Hadoop.


Page 58: Apache hadoop

Hadoop@Amazon

• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
• Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.


Page 59: Apache hadoop

Thank You
