Transcript

Apache Hadoop

Presented by Darpan Dekivadiya (09BCE008)

What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines.
o Up to 4,000 machines in a cluster
o Up to 20 PB in a cluster
• Open-source Apache project
• High reliability done in software
o Automated fail-over for data and computation
• Implemented in Java


Hadoop development
• Hadoop was created by Doug Cutting.
• He named it Hadoop after his son's toy elephant.
• It was originally developed to support the Nutch search engine project.
• Since then, many companies have adopted it and contributed to the project.


Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
• Hadoop Common: the common utilities that support the other Hadoop subprojects.
• HDFS: a distributed file system that provides high-throughput access to application data.
• MapReduce: a software framework for distributed processing of large data sets on compute clusters.
• Pig: a high-level data-flow language and execution framework for parallel computation.
• HBase: a scalable, distributed database that supports structured data storage for large tables.


Hadoop, Why?
• Need to process multi-petabyte datasets.
• It is expensive to build reliability into each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need a common infrastructure
– Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but
o workloads are IO-bound rather than CPU-bound.


Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• May 2009 – Hadoop sorts a petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at Facebook
o 2900 nodes, 30+ petabytes


Who uses Hadoop?

• Amazon/A9

• Facebook

• Google

• IBM

• Joost

• Last.fm

• New York Times

• PowerSet

• Veoh

• Yahoo!


Applications of Hadoop
• Search
o Yahoo, Amazon, Zvents
• Log processing
o Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation systems
o Facebook
• Data warehousing
o Facebook, AOL
• Video and image analysis
o New York Times, Eyealike


Who generates the data?
• Lots of data is generated on Facebook
o 500+ million active users
o 30 billion pieces of content shared every month (news stories, photos, blogs, etc.)
• Lots of data is generated for the Yahoo! search engine.
• Lots of data is generated at the Amazon S3 cloud service.


Data usage
• Statistics per day:
o 20 TB of compressed new data added per day
o 3 PB of compressed data scanned per day
o 20K jobs on the production cluster per day
o 480K compute hours per day
• Barrier to entry is significantly reduced:
o New engineers go through a Hadoop/Hive training session.
o 300+ people run jobs on Hadoop.
o Analysts (non-engineers) use Hadoop through Hive.


HDFS
Hadoop Distributed File System

• Based on the Google File System
• Redundant storage on commodity hardware
• Typically a 2-level architecture
o Nodes are commodity PCs
o 20-40 nodes per rack
o The default Apache Hadoop block size is 64 MB.
o Relational databases typically store data blocks in sizes ranging from 4 KB to 32 KB.

How does HDFS maintain everything?
• Two types of nodes
o A single NameNode and a number of DataNodes
• NameNode
o Holds file names, permissions, modified flags, etc.
o Data locations are exposed so that computations can be moved to the data.
• DataNode
o Stores and retrieves blocks when told to.
o HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.


How does HDFS work?


• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

• The DataNodes are responsible for serving read and write requests from the file system's clients.
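To make this division of labor concrete, the following is a minimal, hypothetical client sketch against Hadoop's Java FileSystem API (it is not part of the original deck, and the path /demo/hello.txt is invented for illustration). The create and open calls go to the NameNode for metadata, while the file bytes stream to and from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);            // client handle; namespace calls go to the NameNode

        Path file = new Path("/demo/hello.txt");         // hypothetical path, for illustration only
        try (FSDataOutputStream out = fs.create(file)) { // NameNode allocates blocks; data streams to DataNodes
            out.writeUTF("Hello HDFS");
        }
        try (FSDataInputStream in = fs.open(file)) {     // block locations come from the NameNode;
            System.out.println(in.readUTF());            // the bytes are read directly from DataNodes
        }
    }
}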


MapReduce
Google's MapReduce technique


MapReduce Overview
• Provides a clean abstraction for programmers to write distributed applications.
• Factors out many reliability concerns from application logic.
• A batch data processing system.
• Automatic parallelization and distribution.
• Fault tolerance.
• Status and monitoring tools.


Programming Model
• The programmer has to implement two functions:
– map (in_key, in_value) -> (out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) -> out_value list


MapReduce Flow


Mapper (indexing example)
• Input is the line number and the actual line.
• Input 1: ("100", "I Love India")
• Output 1: ("I", "100"), ("Love", "100"), ("India", "100")
• Input 2: ("101", "I Love eBay")
• Output 2: ("I", "101"), ("Love", "101"), ("eBay", "101")


Reducer (indexing example)
• Input is a word and the line numbers on which it appears.
• Input 1: ("I", "100", "101")
• Input 2: ("Love", "100", "101")
• Input 3: ("India", "100")
• Input 4: ("eBay", "101")
• Output: each word is stored along with its line numbers.
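A minimal Java sketch of this indexing job is shown below, written against the org.apache.hadoop.mapreduce API (Hadoop 2.x style); it is illustrative rather than part of the original deck. With the default TextInputFormat the map key is actually the byte offset of the line, which stands in for the line number used in the slides, and the input and output paths are assumed to be given on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineIndex {

    // Map: (line id, line text) -> one (word, line id) pair per word.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable lineId, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    ctx.write(new Text(word), new Text(lineId.toString()));
                }
            }
        }
    }

    // Reduce: (word, [line ids]) -> (word, comma-separated list of line ids).
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> lineIds, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder ids = new StringBuilder();
            for (Text id : lineIds) {
                if (ids.length() > 0) ids.append(",");
                ids.append(id);
            }
            ctx.write(word, new Text(ids.toString()));
        }
    }

    // Driver: wires the mapper and reducer into a job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line index");
        job.setJarByClass(LineIndex.class);
        job.setMapperClass(IndexMapper.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}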


Google PageRank example
• Mapper
o Input is a link and the HTML content of the page.
o Output is a list of (outgoing link, PageRank of this page) pairs.
• Reducer
o Input is a link and a list of the PageRanks of the pages linking to it.
o Output is the PageRank of this page, which is the weighted average of all input PageRanks.
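The sketch below shows one iteration of this idea as Hadoop Java code. It simplifies the slides: each input record is assumed to have already been parsed from HTML into the form "url <TAB> rank <TAB> comma-separated outlinks", and the reducer applies the usual damping-factor formula rather than a plain weighted average.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankStep {

    // Map: a page sends rank/n to each of its n outgoing links.
    public static class RankMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");   // url, rank, outlinks
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = parts[2].split(",");
            for (String link : outlinks) {
                ctx.write(new Text(link), new DoubleWritable(rank / outlinks.length));
            }
        }
    }

    // Reduce: a page's new rank combines the contributions of every page that links to it.
    public static class RankReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text page, Iterable<DoubleWritable> contributions, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable c : contributions) {
                sum += c.get();
            }
            ctx.write(page, new DoubleWritable(0.15 + 0.85 * sum));  // damping factor 0.85
        }
    }
}

In practice the link structure must also be passed through to the next iteration; that bookkeeping is omitted here to keep the sketch short.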


HBase (contd.)
• Limited atomicity and transaction support.
o HBase supports batched mutations of single rows only.
o Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
o Programmatic access via Java, REST, or Thrift APIs.
o Scripting via JRuby.


Introduction to HBase

Overview
• HBase is an Apache open-source project whose goal is to provide storage for the Hadoop distributed computing environment.
• Data is logically organized into tables, rows, and columns.


Outline
• Data Model
• Architecture and Implementation
• Examples & Tests


Conceptual View
• A data row has a sortable row key and an arbitrary number of columns.
• A timestamp is assigned automatically if one is not supplied explicitly.
• Columns are named <family>:<label>.

Row key          | Timestamp | Column "contents:" | Column "anchor:"
"com.apache.www" | t12       | "<html>…"          |
                 | t11       | "<html>…"          |
                 | t10       |                    | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t15       |                    | "anchor:cnnsi.com" -> "CNN"
                 | t13       |                    | "anchor:my.look.ca" -> "CNN.com"
                 | t6        | "<html>…"          |
                 | t5        | "<html>…"          |
                 | t3        | "<html>…"          |

Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore.

Column family "contents:"
Row key          | TS  | Column "contents:"
"com.apache.www" | t12 | "<html>…"
                 | t11 | "<html>…"
"com.cnn.www"    | t6  | "<html>…"
                 | t5  | "<html>…"
                 | t3  | "<html>…"

Column family "anchor:"
Row key          | TS  | Column "anchor:"
"com.apache.www" | t10 | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9  | "anchor:cnnsi.com" -> "CNN"
                 | t8  | "anchor:my.look.ca" -> "CNN.com"

[Diagram: an HStore consists of an in-memory Memcache plus data MapFiles and their index MapFiles (key/value and index key).]

Row Ranges: Regions
• Rows are sorted by row key and column ascending, and by timestamp descending.
• Physically, tables are broken into row ranges (regions) that contain rows from a start key to an end key.

[Table: an example region spanning row keys "aaaa" through "aaae", showing "contents:" and "anchor:" cells at timestamps t3 through t15.]

Outline
• Data Model
• Architecture and Implementation
• Examples & Tests

Three major components
• The HBaseMaster
• The HRegionServer
• The HBase client

HBaseMaster
• Assigns regions to HRegionServers.
1. The ROOT region locates all the META regions.
2. A META region maps a number of user regions.
3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes table schemas.
• Monitors the health of each region server.

[Diagram: the Master points to the server holding the single ROOT region; the ROOT region points to the servers holding the META regions; each META region maps the USER regions hosted on the region servers.]

HBase Client
• To find a row, the client consults the ROOT region, then the appropriate META region, then the user region itself, and caches the location information it learns along the way.

Outline
• Data Model
• Architecture and Implementation
• Examples & Tests

Create MyTable

HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);

Resulting (empty) table:
Row Key | Timestamp | columnFamily1: | columnFamily2:

Insert Values

BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
table.commit(batchUpdate);

Resulting row:
Row Key | Timestamp | columnFamily1:
myRow   | ts1       | labela = "labela value"
        | ts2       | labelb = "labelb value"

Search

Row key          | Timestamp | Column "anchor:"
"com.apache.www" | t12       |
                 | t11       |
                 | t10       | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9        | "anchor:cnnsi.com" -> "CNN"
                 | t8        | "anchor:my.look.ca" -> "CNN.com"
                 | t6        |
                 | t5        |
                 | t3        |

Select value from table where key='com.apache.www' AND label='anchor:apache.com'
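The statement above is pseudo-SQL for the lookup (as noted earlier, HBase is not queried via SQL). A minimal sketch of the same lookup with the later Get/Result client API (classes from org.apache.hadoop.hbase.client, plus Bytes from org.apache.hadoop.hbase.util); the table name "webtable" is assumed for illustration:

Configuration config = HBaseConfiguration.create();
HTable table = new HTable(config, "webtable");                        // hypothetical table name
Get get = new Get(Bytes.toBytes("com.apache.www"));                   // row key
get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));  // family "anchor", qualifier "apache.com"
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
System.out.println(Bytes.toString(value));                            // prints "APACHE"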

Search Scanner

Select value from table where anchor='cnnsi.com'

Row key          | Timestamp | Column "anchor:"
"com.apache.www" | t12       |
                 | t11       |
                 | t10       | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9        | "anchor:cnnsi.com" -> "CNN"
                 | t8        | "anchor:my.look.ca" -> "CNN.com"
                 | t6        |
                 | t5        |
                 | t3        |
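Again, the SQL is only illustrative. With the same assumed "webtable" and client API as in the previous sketch, a scanner restricted to the anchor:cnnsi.com column returns only the rows that actually contain that cell:

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));  // request only this column
ResultScanner scanner = table.getScanner(scan);
try {
    // Rows without an "anchor:cnnsi.com" cell are skipped by the scan.
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()) + " -> "
                + Bytes.toString(row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"))));
    }
} finally {
    scanner.close();
}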

Pig
A programming language for the Hadoop framework


Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed to handle any kind of data (hence the name!).
• Pig is made of two components:
o the language itself, which is called Pig Latin, and
o the runtime environment where Pig Latin programs are executed.


Why Pig Latin?
• MapReduce is very powerful, but:
o it requires a Java programmer, and
o the user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than a procedural or declarative one.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Supports nested types.
• Operates on files in HDFS.


Pig Latin Overview
• Pig provides a higher-level language, Pig Latin, that:
o increases productivity: in one test, 10 lines of Pig Latin ≈ 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin;
o opens the system to non-Java programmers;
o provides common operations like join, group, filter, and sort.


Load Data
• The objects that Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig what file (or files) it will use.
• That is done through the LOAD 'data_file' command.
• If the data is stored in a file format that is not natively accessible to Pig, add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.


Transform Data
• The transform logic is where all the data manipulation happens. For example:
o FILTER out rows that are not of interest,
o JOIN two sets of data files,
o GROUP data to build aggregations,
o ORDER results.


Example of a Pig program
• The program below reads a file composed of Twitter feeds, selects only those tweets that use the en (English) iso_language_code, groups them by the user who is tweeting, and computes the sum of the retweets of that user's tweets.

L = LOAD 'hdfs://node/tweet_data';
FL = FILTER L BY iso_language_code EQ 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);


DUMP and STORE
• The DUMP or STORE command generates the results of a Pig program.
• DUMP sends the output to the screen, which is useful when debugging Pig programs.
• DUMP can be used anywhere in a program to dump intermediate result sets to the screen.
• STORE writes the results of a run to a file for further processing and analysis.


Pig Runtime Environment
• The Pig runtime is used when a Pig program needs to run in the Hadoop environment.
• There are three ways to run a Pig program (see the sketch below):
o embedded in a script,
o embedded in a Java program, or
o from the Pig command line, called Grunt.
• The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the analysis of large amounts of data.
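As a sketch of the second option, "embedded in a Java program", the Twitter example from the earlier slide could be driven through Pig's PigServer class. The input path 'tweet_data' and the schema in the LOAD are assumptions added here so the query is complete; pig.store() plays the role of STORE, and PigServer.openIterator() would be the programmatic counterpart of DUMP.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunPigEmbedded {
    public static void main(String[] args) throws Exception {
        // Compile the Pig Latin into MapReduce jobs and run them on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        pig.registerQuery("L = LOAD 'tweet_data' AS (from_user:chararray, iso_language_code:chararray, retweets:int);");
        pig.registerQuery("FL = FILTER L BY iso_language_code == 'en';");
        pig.registerQuery("G = GROUP FL BY from_user;");
        pig.registerQuery("RT = FOREACH G GENERATE group, SUM(FL.retweets);");

        // Equivalent of STORE: execute the plan and write the result to HDFS.
        pig.store("RT", "retweet_counts");
    }
}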


What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.


Hadoop@BIG
Statistics of Hadoop usage at large organizations


Hadoop@Facebook
• Production cluster
o 4800 cores, 600 machines, 16 GB per machine – April 2009
o 8000 cores, 1000 machines, 32 GB per machine – July 2009
o 4 SATA disks of 1 TB each per machine
o 2-level network hierarchy, 40 machines per rack
o Total cluster size of 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
o 800 cores, 16 GB each


Hadoop@Yahoo
• Runs the world's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores.
• Yahoo! is the biggest contributor to Hadoop.
• It is converting all of its batch processing to Hadoop.


Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
• Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.


Thank You
