Apache Hadoop
Presented by Darpan Dekivadiya (09BCE008)
28-10-2012
What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines.
  o Up to 4,000 machines in a cluster
  o Up to 20 PB in a cluster
• Open-source Apache project
• High reliability done in software
  o Automated fail-over for data and computation
• Implemented in Java
Hadoop Development
• Hadoop was created by Doug Cutting.
• Cutting named it after his son's toy elephant.
• It was originally developed to support the Nutch search-engine project.
• Many companies have since adopted it and contributed to the project.
Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• HDFS: A distributed file system that provides high-throughput access to application data.
• MapReduce: A software framework for distributed processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution framework for parallel computation.
• HBase: A scalable, distributed database that supports structured data storage for large tables.
Why Hadoop?
• Need to process multi-petabyte datasets.
• Expensive to build reliability into each application.
• Nodes fail every day
  – Failure is expected rather than exceptional.
  – The number of nodes in a cluster is not constant.
• Need common infrastructure
  – Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but
  o Workloads are I/O bound, not CPU bound.
Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• May 2009 – Hadoop sorts a petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster, at Facebook
  o 2900 nodes, 30+ petabytes
Who uses Hadoop?
• Amazon/A9
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Applications of Hadoop
• Search
  o Yahoo, Amazon, Zvents
• Log processing
  o Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation systems
  o Facebook
• Data warehousing
  o Facebook, AOL
• Video and image analysis
  o New York Times, Eyealike
Who generates the data?
• Lots of data is generated on Facebook
  o 500+ million active users
  o 30 billion pieces of content shared every month (news stories, photos, blogs, etc.)
• Lots of data is generated for the Yahoo! search engine.
• Lots of data is generated at the Amazon S3 cloud service.
Data Usage
• Statistics per day:
  o 20 TB of compressed new data added
  o 3 PB of compressed data scanned
  o 20K jobs on the production cluster
  o 480K compute hours
• Barrier to entry is significantly reduced:
  o New engineers go through a Hadoop/Hive training session
  o 300+ people run jobs on Hadoop
  o Analysts (non-engineers) use Hadoop through Hive
HDFS
Hadoop Distributed File System
Based on Google File System
Commodity Hardware
• Redundant storage on commodity machines.
• Typically a 2-level architecture
  o Nodes are commodity PCs
  o 20-40 nodes per rack
  o The default Apache Hadoop block size is 64 MB.
  o Relational databases typically store data blocks in sizes ranging from 4 KB to 32 KB.
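The impact of the large block size can be illustrated with a small back-of-the-envelope sketch; the file and block sizes below are hypothetical examples, not Hadoop API code:

```java
public class BlockCount {
    // Number of blocks needed for a file: ceil(fileSize / blockSize).
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // A 1 GB file with the 64 MB HDFS default: 16 blocks to track.
        System.out.println(blocksFor(1024 * MB, 64 * MB));   // 16
        // The same file with a 4 KB database-style block: 262,144 blocks.
        System.out.println(blocksFor(1024 * MB, 4 * 1024));  // 262144
    }
}
```

Fewer, larger blocks keep the per-block metadata the NameNode must hold in memory manageable, which is why HDFS picks 64 MB rather than a database-style 4-32 KB block.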
How does HDFS maintain everything?
• Two types of nodes
  o A single NameNode and a number of DataNodes
• NameNode
  o Holds file names, permissions, modified flags, etc.
  o Data locations are exposed so that computations can be moved near the data.
• DataNode
  o Stores and retrieves blocks when told to.
  o HDFS is built in Java; any machine that supports Java can run the NameNode or the DataNode software.
How does HDFS work?
• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
• The DataNodes are responsible for serving read and write requests from the file system's clients.
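To make the division of labor concrete, here is a minimal plain-Java sketch of the NameNode-style bookkeeping described above. It has no Hadoop dependency and all names (file paths, block IDs, DataNode names) are illustrative:

```java
import java.util.*;

// Illustrative only: a NameNode-like map from file -> ordered block list,
// and block -> DataNode replica locations (HDFS defaults to 3 replicas).
public class MiniNameNode {
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    private final Map<String, List<String>> blockToDataNodes = new HashMap<>();

    void addBlock(String file, String blockId, List<String> dataNodes) {
        fileToBlocks.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
        blockToDataNodes.put(blockId, dataNodes);
    }

    // A client asks the NameNode where a file's blocks live, then reads
    // the block data directly from the DataNodes.
    List<List<String>> locateBlocks(String file) {
        List<List<String>> locations = new ArrayList<>();
        for (String blockId : fileToBlocks.getOrDefault(file, List.of())) {
            locations.add(blockToDataNodes.get(blockId));
        }
        return locations;
    }

    public static void main(String[] args) {
        MiniNameNode nn = new MiniNameNode();
        nn.addBlock("/logs/day1", "blk_1", List.of("dn1", "dn2", "dn3"));
        nn.addBlock("/logs/day1", "blk_2", List.of("dn2", "dn3", "dn4"));
        System.out.println(nn.locateBlocks("/logs/day1"));
    }
}
```

The key point the sketch captures: the NameNode only answers "where are the blocks?"; the actual block bytes flow between clients and DataNodes.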
MapReduce
Google's MapReduce Technique
MapReduce Overview
• Provides a clean abstraction for programmers to write distributed applications.
• Factors many reliability concerns out of the application logic.
• A batch data processing system
• Automatic parallelization & distribution
• Fault tolerance
• Status and monitoring tools
Programming Model
• The programmer has to implement two functions:
  – map (in_key, in_value) -> (out_key, intermediate_value) list
  – reduce (out_key, intermediate_value list) -> out_value list
MapReduce Flow
Mapper (indexing example)
• Input is the line number and the actual line.
• Input 1: ("100", "I Love India")
• Output 1: ("I", "100"), ("Love", "100"), ("India", "100")
• Input 2: ("101", "I Love eBay")
• Output 2: ("I", "101"), ("Love", "101"), ("eBay", "101")
Reducer (indexing example)
• Input is a word and the line numbers it appears on.
• Input 1: ("I", ["100", "101"])
• Input 2: ("Love", ["100", "101"])
• Input 3: ("India", ["100"])
• Input 4: ("eBay", ["101"])
• Output: each word is stored along with its line numbers.
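The indexing example can be simulated end to end in plain Java, with the shuffle phase (grouping intermediate pairs by key) modeled explicitly. This is an illustrative sketch, not Hadoop API code:

```java
import java.util.*;

// Sketch of the inverted-index MapReduce from the slides, without Hadoop.
public class IndexingExample {
    // map(lineNo, line) -> list of (word, lineNo) pairs
    static List<Map.Entry<String, String>> map(String lineNo, String line) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            out.add(Map.entry(word, lineNo));
        }
        return out;
    }

    // reduce(word, lineNos) -> the index entry for that word
    static List<String> reduce(String word, List<String> lineNos) {
        return lineNos; // here the "reduction" is just collecting the list
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("100", "I Love India");
        input.put("101", "I Love eBay");

        // Shuffle phase: group intermediate pairs by key (the word).
        Map<String, List<String>> grouped = new TreeMap<>();
        input.forEach((no, line) ->
            map(no, line).forEach(p ->
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                       .add(p.getValue())));

        grouped.forEach((word, nos) ->
            System.out.println(word + " -> " + reduce(word, nos)));
    }
}
```

Running it prints each word with the lines it occurs on, e.g. "I -> [100, 101]", matching the reducer inputs on the slide. In real Hadoop the shuffle is performed by the framework between the map and reduce phases.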
Google PageRank example
• Mapper
  o Input is a link and the HTML content.
  o Output is a list of outgoing links and the PageRank of this page.
• Reducer
  o Input is a link and a list of PageRanks of pages linking to this page.
  o Output is the PageRank of this page, which is the weighted average of all input PageRanks.
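The slide describes the reducer output as a weighted average of the incoming PageRanks; one common concrete formulation is a damped sum of rank/out-degree contributions, sketched below. The damping factor 0.85 and the sample numbers are illustrative assumptions, not from the slides:

```java
import java.util.*;

// Sketch of the PageRank reducer step: combine the contributions of all
// pages linking to one page. Classic formulation: damped sum of each
// linker's rank divided by its out-degree.
public class PageRankReducer {
    static double reduce(List<double[]> incoming) {
        // each element: {rankOfLinkingPage, outDegreeOfLinkingPage}
        double sum = 0.0;
        for (double[] c : incoming) {
            sum += c[0] / c[1];
        }
        return 0.15 + 0.85 * sum;
    }

    public static void main(String[] args) {
        // Two pages link to us: one with rank 1.0 and 2 outlinks,
        // one with rank 0.5 and 1 outlink.
        double rank = reduce(List.of(new double[]{1.0, 2}, new double[]{0.5, 1}));
        System.out.println(rank); // 0.15 + 0.85 * (0.5 + 0.5) = 1.0
    }
}
```

In the MapReduce version, the mapper emits (outgoingLink, rank/outDegree) pairs, and the framework's shuffle delivers each page's incoming contributions to one reducer call like the one above.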
HBase limitations
• Limited atomicity and transaction support.
  o HBase supports batched mutations of single rows only.
  o Data is unstructured and untyped.
• Data is not accessed or manipulated via SQL.
  o Programmatic access via Java, REST, or Thrift APIs.
  o Scripting via JRuby.
Introduction to HBase
Overview
• HBase is an Apache open-source project whose goal is to provide storage for the Hadoop distributed computing environment.
• Data is logically organized into tables, rows and columns.
Outline
• Data Model
• Architecture and Implementation
• Examples & Tests
Conceptual View
• A data row has a sortable row key and an arbitrary number of columns.
• A timestamp is assigned automatically if not given explicitly.
• Columns are named <family>:<label>.

Row key           Timestamp  Column "contents:"  Column "anchor:"
"com.apache.www"  t12        "<html>…"
                  t11        "<html>…"
                  t10                            "anchor:apache.com" = "APACHE"
"com.cnn.www"     t15                            "anchor:cnnsi.com" = "CNN"
                  t13                            "anchor:my.look.ca" = "CNN.com"
                  t6         "<html>…"
                  t5         "<html>…"
                  t3         "<html>…"
Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore.

Row key           TS   Column "contents:"
"com.apache.www"  t12  "<html>…"
                  t11  "<html>…"
"com.cnn.www"     t6   "<html>…"
                  t5   "<html>…"
                  t3   "<html>…"

Row key           TS   Column "anchor:"
"com.apache.www"  t10  "anchor:apache.com" = "APACHE"
"com.cnn.www"     t9   "anchor:cnnsi.com" = "CNN"
                  t8   "anchor:my.look.ca" = "CNN.com"

• An HStore keeps the key/value data in a MapFile with an accompanying index MapFile, plus an in-memory Memcache.
Row Ranges: Regions
• Rows are ordered by row key and column ascending, timestamp descending.
• Physically, tables are broken into row ranges ("regions") that contain the rows from a start key to an end key.

Row key  Timestamp  Column "contents:"  Column "anchor:"
aaaa     t15                            anchor:cc = value
         t13        ba
         t12        bb
         t11                            anchor:cd = value
         t10        bc
aaab     t14
aaac                                    anchor:be = value
aaad                                    anchor:ad = value
aaae     t5         ae
         t3         af
Outline
• Data Model
• Architecture and Implementation
• Examples & Tests
Three major components
• The HBaseMaster
• The HRegionServer
• The HBase client
HBaseMaster
• Assigns regions to HRegionServers.
  1. The ROOT region locates all the META regions.
  2. A META region maps a number of user regions.
  3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes table schemas.
• Monitors the health of each HRegionServer.

[Diagram: the Master holds the location of the ROOT region; region servers host the META regions, which map the USER regions. An HBase client resolves ROOT → META → user region and caches the locations it has looked up.]
Outline
• Data Model
• Architecture and Implementation
• Examples & Tests
Create MyTable

// HBase 0.20-era admin API: define two column families, then create the table.
HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);

Resulting (empty) table:
Row Key | Timestamp | columnFamily1: | columnFamily2:
Insert Values

// Old-style BatchUpdate API: write two cells into row "myRow".
BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
table.commit(batchUpdate);

Row Key | Timestamp | columnFamily1:
myRow   | ts1       | labela = "labela value"
        | ts2       | labelb = "labelb value"
Search

Select value from table where key='com.apache.www' AND label='anchor:apache.com'

Row key           Timestamp  Column "anchor:"
"com.apache.www"  t12
                  t11
                  t10        "anchor:apache.com" = "APACHE"
"com.cnn.www"     t9         "anchor:cnnsi.com" = "CNN"
                  t8         "anchor:my.look.ca" = "CNN.com"
                  t6
                  t5
                  t3
Search Scanner

Select value from table where anchor='cnnsi.com'

Row key           Timestamp  Column "anchor:"
"com.apache.www"  t12
                  t11
                  t10        "anchor:apache.com" = "APACHE"
"com.cnn.www"     t9         "anchor:cnnsi.com" = "CNN"
                  t8         "anchor:my.look.ca" = "CNN.com"
                  t6
                  t5
                  t3
PIG
A Programming Language for the Hadoop Framework
Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed to handle any kind of data, hence the name!
• Pig is made of two components:
  – The language itself, called Pig Latin.
  – The runtime environment where Pig Latin programs are executed.
Why Pig Latin?
• MapReduce is very powerful, but:
  o It requires a Java programmer.
  o The user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than a procedural or declarative one.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Support for nested types.
• Operates on files in HDFS.
Pig Latin Overview
• Pig provides a higher-level language, Pig Latin, that:
  o Increases productivity.
    – In one test, 10 lines of Pig Latin ≈ 200 lines of Java.
    – What took 4 hours to write in Java took 15 minutes in Pig Latin.
  o Opens the system to non-Java programmers.
  o Provides common operations like join, group, filter, sort.
Load Data
• The objects Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig what file (or files) it will use.
• That's done through the LOAD 'data_file' command.
• If the data is stored in a file format that is not natively accessible to Pig, add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
Transform Data
• The transform logic is where all the data manipulation happens. For example:
  – FILTER out rows that are not of interest.
  – JOIN two sets of data files.
  – GROUP data to build aggregations.
  – ORDER results.
Example of a Pig Program
• Loads a file of Twitter feeds, selects only those tweets whose iso_language_code is en (English), groups them by the tweeting user, and emits the sum of the retweets of each user's tweets.

L = LOAD 'hdfs://node/tweet_data';
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);
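For readers coming from Java, the four Pig Latin statements above can be sketched in plain Java. The Tweet record and the sample data are invented for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java equivalent of the Pig example: filter English tweets,
// group by user, and sum the retweets per user.
public class TweetRetweets {
    record Tweet(String fromUser, String isoLanguageCode, long retweets) {}

    static Map<String, Long> retweetsByUser(List<Tweet> tweets) {
        return tweets.stream()
            .filter(t -> t.isoLanguageCode().equals("en"))       // FILTER ... BY ... == 'en'
            .collect(Collectors.groupingBy(Tweet::fromUser,      // GROUP ... BY from_user
                     Collectors.summingLong(Tweet::retweets)));  // SUM(FL.retweets)
    }

    public static void main(String[] args) {
        List<Tweet> tweets = List.of(
            new Tweet("alice", "en", 3),
            new Tweet("alice", "en", 2),
            new Tweet("bob", "fr", 7),
            new Tweet("bob", "en", 1));
        System.out.println(retweetsByUser(tweets)); // alice=5, bob=1 (order may vary)
    }
}
```

The point of the comparison: each Pig statement maps to a whole stream operation, and Pig additionally compiles these steps into distributed map and reduce tasks over HDFS, which the Java sketch does not.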
DUMP and STORE
• The DUMP or STORE command produces the results of a Pig program.
• DUMP sends the output to the screen, which is useful when debugging; it can be used anywhere in a program to dump intermediate result sets.
• STORE writes the results of a run to a file for further processing and analysis.
Pig Runtime Environment
• The Pig runtime is used to run Pig programs in the Hadoop environment.
• There are three ways to run a Pig program:
  – Embedded in a script.
  – Embedded in a Java program.
  – From the Pig command line, called Grunt.
• The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with analyzing large amounts of data.
What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.
Hadoop@BIG
Hadoop usage statistics at large organizations
Hadoop@Facebook
• Production cluster
  o 4800 cores, 600 machines, 16 GB per machine – April 2009
  o 8000 cores, 1000 machines, 32 GB per machine – July 2009
  o 4 SATA disks of 1 TB each per machine
  o 2-level network hierarchy, 40 machines per rack
  o Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
  o 800 cores, 16 GB each
Hadoop@Yahoo
• World's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores.
• Yahoo! is the biggest contributor to Hadoop.
• It is converting all of its batch processing to Hadoop.
Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
• Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
Thank You