Transcript

Apache Hadoop

Presented by Darpan Dekivadiya (09BCE008)

What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines.
o Up to 4,000 machines in a cluster
o Up to 20 PB in a cluster
• Open-source Apache project
• High reliability done in software
o Automated fail-over for data and computation
• Implemented in Java


Hadoop development
• Hadoop was created by Doug Cutting.
• He named it Hadoop after his son's toy elephant.
• It was originally developed to support the Nutch search engine project.
• Since then, many companies have adopted it and contributed to the project.


Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
• Hadoop Common: the common utilities that support the other Hadoop subprojects.
• HDFS: a distributed file system that provides high-throughput access to application data.
• MapReduce: a software framework for distributed processing of large data sets on compute clusters.
• Pig: a high-level data-flow language and execution framework for parallel computation.
• HBase: a scalable, distributed database that supports structured data storage for large tables.


Hadoop, Why?
• Need to process multi-petabyte datasets.
• It is expensive to build reliability into each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need a common infrastructure
– Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but
o workloads are IO-bound rather than CPU-bound.


Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• May 2009 – Hadoop sorts a petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at Facebook
o 2900 nodes, 30+ petabytes


Who uses Hadoop?

• Amazon/A9

• Facebook

• Google

• IBM

• Joost

• Last.fm

• New York Times

• PowerSet

• Veoh

• Yahoo!


Applications of Hadoop
• Search
o Yahoo, Amazon, Zvents
• Log processing
o Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation systems
o Facebook
• Data warehousing
o Facebook, AOL
• Video and image analysis
o New York Times, Eyealike


Who generates the data?
• Lots of data is generated on Facebook
o 500+ million active users
o 30 billion pieces of content shared every month (news stories, photos, blogs, etc.)
• Lots of data is generated for the Yahoo! search engine.
• Lots of data is generated at the Amazon S3 cloud service.


Data usage
• Statistics per day:
o 20 TB of compressed new data added per day
o 3 PB of compressed data scanned per day
o 20K jobs on the production cluster per day
o 480K compute hours per day
• Barrier to entry is significantly reduced:
o New engineers go through a Hadoop/Hive training session.
o 300+ people run jobs on Hadoop.
o Analysts (non-engineers) use Hadoop through Hive.


HDFS
Hadoop Distributed File System

• Based on the Google File System
• Redundant storage on commodity hardware
• Typically a 2-level architecture
o Nodes are commodity PCs
o 20-40 nodes per rack
o The default Apache Hadoop block size is 64 MB.
o Relational databases typically store data blocks in sizes ranging from 4 KB to 32 KB.

How does HDFS maintain everything?
• Two types of nodes
o A single NameNode and a number of DataNodes
• NameNode
o Holds file names, permissions, modified flags, etc.
o Data locations are exposed so that computations can be moved to the data.
• DataNode
o Stores and retrieves blocks when told to.
o HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.


How does HDFS work?


• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

• The DataNodes are responsible for serving read and write requests from the file system's clients.
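To make this division of labor concrete, the following is a minimal, hypothetical client sketch against Hadoop's Java FileSystem API (it is not part of the original deck, and the path /demo/hello.txt is invented for illustration). The create and open calls go to the NameNode for metadata, while the file bytes stream to and from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);            // client handle; namespace calls go to the NameNode

        Path file = new Path("/demo/hello.txt");         // hypothetical path, for illustration only
        try (FSDataOutputStream out = fs.create(file)) { // NameNode allocates blocks; data streams to DataNodes
            out.writeUTF("Hello HDFS");
        }
        try (FSDataInputStream in = fs.open(file)) {     // block locations come from the NameNode;
            System.out.println(in.readUTF());            // the bytes are read directly from DataNodes
        }
    }
}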


MapReduce
Google's MapReduce technique


MapReduce Overview
• Provides a clean abstraction for programmers to write distributed applications.
• Factors out many reliability concerns from application logic.
• A batch data processing system.
• Automatic parallelization and distribution.
• Fault tolerance.
• Status and monitoring tools.


Programming Model
• The programmer has to implement two functions:
– map (in_key, in_value) -> (out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) -> out_value list


MapReduce Flow


Mapper (indexing example)
• Input is the line number and the actual line.
• Input 1: ("100", "I Love India")
• Output 1: ("I", "100"), ("Love", "100"), ("India", "100")
• Input 2: ("101", "I Love eBay")
• Output 2: ("I", "101"), ("Love", "101"), ("eBay", "101")


Reducer (indexing example)
• Input is a word and the line numbers on which it appears.
• Input 1: ("I", "100", "101")
• Input 2: ("Love", "100", "101")
• Input 3: ("India", "100")
• Input 4: ("eBay", "101")
• Output: each word is stored along with its line numbers.
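A minimal Java sketch of this indexing job is shown below, written against the org.apache.hadoop.mapreduce API (Hadoop 2.x style); it is illustrative rather than part of the original deck. With the default TextInputFormat the map key is actually the byte offset of the line, which stands in for the line number used in the slides, and the input and output paths are assumed to be given on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineIndex {

    // Map: (line id, line text) -> one (word, line id) pair per word.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable lineId, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    ctx.write(new Text(word), new Text(lineId.toString()));
                }
            }
        }
    }

    // Reduce: (word, [line ids]) -> (word, comma-separated list of line ids).
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> lineIds, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder ids = new StringBuilder();
            for (Text id : lineIds) {
                if (ids.length() > 0) ids.append(",");
                ids.append(id);
            }
            ctx.write(word, new Text(ids.toString()));
        }
    }

    // Driver: wires the mapper and reducer into a job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line index");
        job.setJarByClass(LineIndex.class);
        job.setMapperClass(IndexMapper.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}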


Google PageRank example
• Mapper
o Input is a link and the HTML content of the page.
o Output is a list of (outgoing link, PageRank of this page) pairs.
• Reducer
o Input is a link and a list of the PageRanks of the pages linking to it.
o Output is the PageRank of this page, which is the weighted average of all input PageRanks.
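The sketch below shows one iteration of this idea as Hadoop Java code. It simplifies the slides: each input record is assumed to have already been parsed from HTML into the form "url <TAB> rank <TAB> comma-separated outlinks", and the reducer applies the usual damping-factor formula rather than a plain weighted average.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankStep {

    // Map: a page sends rank/n to each of its n outgoing links.
    public static class RankMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");   // url, rank, outlinks
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = parts[2].split(",");
            for (String link : outlinks) {
                ctx.write(new Text(link), new DoubleWritable(rank / outlinks.length));
            }
        }
    }

    // Reduce: a page's new rank combines the contributions of every page that links to it.
    public static class RankReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text page, Iterable<DoubleWritable> contributions, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable c : contributions) {
                sum += c.get();
            }
            ctx.write(page, new DoubleWritable(0.15 + 0.85 * sum));  // damping factor 0.85
        }
    }
}

In practice the link structure must also be passed through to the next iteration; that bookkeeping is omitted here to keep the sketch short.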


HBase (contd.)
• Limited atomicity and transaction support.
o HBase supports batched mutations of single rows only.
o Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
o Programmatic access via Java, REST, or Thrift APIs.
o Scripting via JRuby.


Introduction to HBase

Overview
• HBase is an Apache open-source project whose goal is to provide storage for the Hadoop distributed computing environment.
• Data is logically organized into tables, rows, and columns.


Outline
• Data Model
• Architecture and Implementation
• Examples & Tests


Conceptual View
• A data row has a sortable row key and an arbitrary number of columns.
• A timestamp is assigned automatically if one is not supplied explicitly.
• Columns are named <family>:<label>.

Row key          | Timestamp | Column "contents:" | Column "anchor:"
"com.apache.www" | t12       | "<html>…"          |
                 | t11       | "<html>…"          |
                 | t10       |                    | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t15       |                    | "anchor:cnnsi.com" -> "CNN"
                 | t13       |                    | "anchor:my.look.ca" -> "CNN.com"
                 | t6        | "<html>…"          |
                 | t5        | "<html>…"          |
                 | t3        | "<html>…"          |

Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore.

Column family "contents:"
Row key          | TS  | Column "contents:"
"com.apache.www" | t12 | "<html>…"
                 | t11 | "<html>…"
"com.cnn.www"    | t6  | "<html>…"
                 | t5  | "<html>…"
                 | t3  | "<html>…"

Column family "anchor:"
Row key          | TS  | Column "anchor:"
"com.apache.www" | t10 | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9  | "anchor:cnnsi.com" -> "CNN"
                 | t8  | "anchor:my.look.ca" -> "CNN.com"

[Diagram: an HStore consists of an in-memory Memcache plus data MapFiles and their index MapFiles (key/value and index key).]

Row Ranges: Regions
• Rows are sorted by row key and column ascending, and by timestamp descending.
• Physically, tables are broken into row ranges (regions) that contain rows from a start key to an end key.

[Table: an example region spanning row keys "aaaa" through "aaae", showing "contents:" and "anchor:" cells at timestamps t3 through t15.]

Outline
• Data Model
• Architecture and Implementation
• Examples & Tests

Three major components
• The HBaseMaster
• The HRegionServer
• The HBase client

HBaseMaster
• Assigns regions to HRegionServers.
1. The ROOT region locates all the META regions.
2. A META region maps a number of user regions.
3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes table schemas.
• Monitors the health of each region server.

[Diagram: the Master points to the server holding the single ROOT region; the ROOT region points to the servers holding the META regions; each META region maps the USER regions hosted on the region servers.]

HBase Client
• To find a row, the client consults the ROOT region, then the appropriate META region, then the user region itself, and caches the location information it learns along the way.

Outline
• Data Model
• Architecture and Implementation
• Examples & Tests

Create MyTable

HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);

Resulting (empty) table:
Row Key | Timestamp | columnFamily1: | columnFamily2:

Insert Values

BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
table.commit(batchUpdate);

Resulting row:
Row Key | Timestamp | columnFamily1:
myRow   | ts1       | labela = "labela value"
        | ts2       | labelb = "labelb value"

Search

Row key          | Timestamp | Column "anchor:"
"com.apache.www" | t12       |
                 | t11       |
                 | t10       | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9        | "anchor:cnnsi.com" -> "CNN"
                 | t8        | "anchor:my.look.ca" -> "CNN.com"
                 | t6        |
                 | t5        |
                 | t3        |

Select value from table where key='com.apache.www' AND label='anchor:apache.com'
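The statement above is pseudo-SQL for the lookup (as noted earlier, HBase is not queried via SQL). A minimal sketch of the same lookup with the later Get/Result client API (classes from org.apache.hadoop.hbase.client, plus Bytes from org.apache.hadoop.hbase.util); the table name "webtable" is assumed for illustration:

Configuration config = HBaseConfiguration.create();
HTable table = new HTable(config, "webtable");                        // hypothetical table name
Get get = new Get(Bytes.toBytes("com.apache.www"));                   // row key
get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));  // family "anchor", qualifier "apache.com"
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
System.out.println(Bytes.toString(value));                            // prints "APACHE"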

Search Scanner

Select value from table where anchor='cnnsi.com'

Row key          | Timestamp | Column "anchor:"
"com.apache.www" | t12       |
                 | t11       |
                 | t10       | "anchor:apache.com" -> "APACHE"
"com.cnn.www"    | t9        | "anchor:cnnsi.com" -> "CNN"
                 | t8        | "anchor:my.look.ca" -> "CNN.com"
                 | t6        |
                 | t5        |
                 | t3        |
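Again, the SQL is only illustrative. With the same assumed "webtable" and client API as in the previous sketch, a scanner restricted to the anchor:cnnsi.com column returns only the rows that actually contain that cell:

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));  // request only this column
ResultScanner scanner = table.getScanner(scan);
try {
    // Rows without an "anchor:cnnsi.com" cell are skipped by the scan.
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()) + " -> "
                + Bytes.toString(row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"))));
    }
} finally {
    scanner.close();
}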

Pig
A programming language for the Hadoop framework


Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed to handle any kind of data (hence the name!).
• Pig is made of two components:
o the language itself, which is called Pig Latin, and
o the runtime environment where Pig Latin programs are executed.


Why Pig Latin?
• MapReduce is very powerful, but:
o it requires a Java programmer, and
o the user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than a procedural or declarative one.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Supports nested types.
• Operates on files in HDFS.


Pig Latin Overview
• Pig provides a higher-level language, Pig Latin, that:
o increases productivity: in one test, 10 lines of Pig Latin ≈ 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin;
o opens the system to non-Java programmers;
o provides common operations like join, group, filter, and sort.


Load Data
• The objects that Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig what file (or files) it will use.
• That is done through the LOAD 'data_file' command.
• If the data is stored in a file format that is not natively accessible to Pig, add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.


Transform Data
• The transform logic is where all the data manipulation happens. For example:
o FILTER out rows that are not of interest,
o JOIN two sets of data files,
o GROUP data to build aggregations,
o ORDER results.


Example of a Pig program
• The program below reads a file composed of Twitter feeds, selects only those tweets that use the en (English) iso_language_code, groups them by the user who is tweeting, and computes the sum of the retweets of that user's tweets.

L = LOAD 'hdfs://node/tweet_data';
FL = FILTER L BY iso_language_code EQ 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);


DUMP and STORE
• The DUMP or STORE command generates the results of a Pig program.
• DUMP sends the output to the screen, which is useful when debugging Pig programs.
• DUMP can be used anywhere in a program to dump intermediate result sets to the screen.
• STORE writes the results of a run to a file for further processing and analysis.


Pig Runtime Environment
• The Pig runtime is used when a Pig program needs to run in the Hadoop environment.
• There are three ways to run a Pig program (see the sketch below):
o embedded in a script,
o embedded in a Java program, or
o from the Pig command line, called Grunt.
• The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the analysis of large amounts of data.
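As a sketch of the second option, "embedded in a Java program", the Twitter example from the earlier slide could be driven through Pig's PigServer class. The input path 'tweet_data' and the schema in the LOAD are assumptions added here so the query is complete; pig.store() plays the role of STORE, and PigServer.openIterator() would be the programmatic counterpart of DUMP.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunPigEmbedded {
    public static void main(String[] args) throws Exception {
        // Compile the Pig Latin into MapReduce jobs and run them on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        pig.registerQuery("L = LOAD 'tweet_data' AS (from_user:chararray, iso_language_code:chararray, retweets:int);");
        pig.registerQuery("FL = FILTER L BY iso_language_code == 'en';");
        pig.registerQuery("G = GROUP FL BY from_user;");
        pig.registerQuery("RT = FOREACH G GENERATE group, SUM(FL.retweets);");

        // Equivalent of STORE: execute the plan and write the result to HDFS.
        pig.store("RT", "retweet_counts");
    }
}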


What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.


Hadoop@BIG
Statistics of Hadoop usage at large organizations


Hadoop@Facebook
• Production cluster
o 4800 cores, 600 machines, 16 GB per machine – April 2009
o 8000 cores, 1000 machines, 32 GB per machine – July 2009
o 4 SATA disks of 1 TB each per machine
o 2-level network hierarchy, 40 machines per rack
o Total cluster size of 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
o 800 cores, 16 GB each


Hadoop@Yahoo
• Runs the world's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores.
• Yahoo! is the biggest contributor to Hadoop.
• It is converting all of its batch processing to Hadoop.


Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
• Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.


Thank You
