Hadoop
Rahul Jain, Software Engineer
http://www.linkedin.com/in/rahuldausa
May 07, 2015
Agenda
• Hadoop
  – Introduction
  – Hadoop (Why)
  – Hadoop History
  – Uses of Hadoop
  – High Level Architecture
  – Map-Reduce
• HDFS
  – GFS (Google File System)
  – HDFS Architecture
• Installation/Configuration
• Examples
Introduction
• An open-source software framework.
• Supports data-intensive distributed applications.
• Enables applications to work with thousands of computationally independent computers and petabytes of data.
• Derived from Google's Map-Reduce and Google File System papers.
• Written in the Java programming language.
• Started by Doug Cutting, who named it after his son's toy elephant, to support distribution for Nutch (a sub-project of Lucene).
Hadoop (Why)
• Need to process huge datasets on a large number of computers.
• It is expensive to build reliability into each application.
• Nodes fail every day:
  – Failure is expected, rather than exceptional.
• Need a common infrastructure:
  – Efficient, reliable, and easy to use.
  – Open-sourced, under the Apache License.
Hadoop History
• Dec 2004 – Google publishes the GFS paper
• July 2005 – Nutch uses Map-Reduce
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Becomes a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• Feb 2008 – Yahoo! production search index
What is Hadoop Used for?
• Searching (Yahoo!)
• Log processing
• Recommendation systems (Facebook, LinkedIn, eBay, Amazon)
• Analytics (Facebook, LinkedIn)
• Video and image analysis (NASA)
• Data retention
Hadoop High Level Architecture
Map-Reduce
Multiple Map-Reduce phases
A framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a:
• Cluster: if all the nodes are on the same local network and use similar hardware, or a
• Grid: if the nodes are spread across geographically distributed sites and use more heterogeneous hardware.

Consists of two steps (a minimal example job follows this list):
1. Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
2. Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
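To make the two steps concrete, here is a sketch of the classic word-count job, written against the org.apache.hadoop.mapreduce API available in the Hadoop 1.0.x line used later in these slides (the class and variable names follow the standard Hadoop tutorial example):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for every word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum all the counts collected for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");     // Job.getInstance(conf, ...) in Hadoop 2+
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it can be run with bin/hadoop jar wordcount.jar WordCount <input dir> <output dir>. The combiner is an optional optimization that pre-aggregates counts on the map side before they cross the network.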
Map-Reduce Life-Cycle
Credit: http://code.google.com/edu/parallel/mapreduce-tutorial.html
HDFS
Hadoop Distributed File System
Let's understand GFS (Google File System) first …
GFS Architecture
Goals of HDFS
1. Very large distributed file system
   – 10K nodes, 100 million files, 10 PB.
2. Assumes commodity hardware
   – Files are replicated to handle hardware failure.
   – Detects failures and recovers from them.
3. Optimized for batch processing
   – Data locations are exposed so that computation can move to where the data resides.
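The "data locations exposed" goal is visible directly in the client API. Here is a minimal sketch (the input path is illustrative) that asks the NameNode which hosts hold each block of a file; this is the same information the Map-Reduce framework uses to schedule map tasks close to their data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]);          // e.g. an HDFS file path
    FileStatus status = fs.getFileStatus(file);

    // Ask for the locations of every block in the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + java.util.Arrays.toString(block.getHosts()));
    }
  }
}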
Installation / Configuration

[[email protected] hadoop-1.0.3]$ vi conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/rjain/rahul/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/rjain/rahul/hdfs/name</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ vi conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ pwd
/home/rjain/hadoop-1.0.3

Start all daemons at once:

[[email protected] hadoop-1.0.3]$ bin/start-all.sh

Or start HDFS and Map-Reduce separately:

[[email protected] hadoop-1.0.3]$ bin/start-dfs.sh
[[email protected] hadoop-1.0.3]$ bin/start-mapred.sh

Verify the daemons are running:

[[email protected] hadoop-1.0.3]$ jps
29756 SecondaryNameNode
19847 TaskTracker
18756 Jps
29483 NameNode
29619 DataNode
19711 JobTracker
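One step the slides do not show: on a fresh install, the NameNode's storage directory must be formatted once, before the daemons are started for the first time; otherwise HDFS will not come up:

[[email protected] hadoop-1.0.3]$ bin/hadoop namenode -format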
[[email protected] hadoop-1.0.3]$ bin/hadoop fs
Usage: java FsShell
  [-ls <path>]
  [-lsr <path>]  (recursive version of ls; similar to Unix ls -R)
  [-du <path>]  (displays aggregate length of files contained in the directory, or the length of a file)
  [-dus <path>]  (displays a summary of file lengths)
  [-count[-q] <path>]
  [-mv <src> <dst>]
  [-cp <src> <dst>]
  [-rm [-skipTrash] <path>]
  [-rmr [-skipTrash] <path>]  (recursive version of rm)
  [-expunge]  (empty the trash)
  [-put <localsrc> ... <dst>]  (copy single src, or multiple srcs, from the local file system to the destination file system)
  [-copyFromLocal <localsrc> ... <dst>]
  [-moveFromLocal <localsrc> ... <dst>]
  [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]]
  [-cat <src>]
  [-text <src>]  (takes a source file and outputs it in text format; the allowed formats are zip and TextRecordInputStream)
  [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
  [-moveToLocal [-crc] <src> <localdst>]
  [-mkdir <path>]
  [-setrep [-R] [-w] <rep> <path/file>]  (changes the replication factor of a file)
  [-touchz <path>]  (create a file of zero length)
  [-test -[ezd] <path>]  (-e: return 0 if the file exists; -z: return 0 if the file is zero length; -d: return 0 if the path is a directory)
  [-stat [format] <path>]  (returns stat information on the path, e.g. creation time of a directory)
  [-tail [-f] <file>]  (displays the last kilobyte of the file to stdout)
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
  [-chown [-R] [OWNER][:[GROUP]] PATH...]
  [-chgrp [-R] GROUP PATH...]
  [-help [cmd]]
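A few typical invocations (the local file and HDFS paths here are illustrative):

[[email protected] hadoop-1.0.3]$ bin/hadoop fs -mkdir /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -put words.txt /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -ls /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -cat /user/rjain/input/words.txt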
HDFS Read/Write Example

// Obtain a handle to the file system described by the configuration.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Given input/output file names as strings, we construct inFile/outFile Path objects. Most of the FileSystem APIs accept Path objects.
Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);
Validate the input/output paths before reading/writing.
if (!fs.exists(inFile)) printAndExit("Input file not found");
if (!fs.isFile(inFile)) printAndExit("Input should be a file");
if (fs.exists(outFile)) printAndExit("Output already exists");
Open inFile for reading.
FSDataInputStream in = fs.open(inFile);
Open outFile for writing.
FSDataOutputStream out = fs.create(outFile);
Read from input stream and write to output stream until EOF.
byte[] buffer = new byte[4096];  // buffer and bytesRead are not declared on the slide
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}
Close the streams when done.
in.close(); out.close();
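Pulled together, the snippets above form one small, compilable utility. The class name HdfsCopy and the printAndExit helper are assumptions, since the slides leave them out:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {

  // Helper assumed by the snippets above: report the error and quit.
  private static void printAndExit(String message) {
    System.err.println(message);
    System.exit(1);
  }

  public static void main(String[] argv) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    if (!fs.exists(inFile)) printAndExit("Input file not found");
    if (!fs.isFile(inFile)) printAndExit("Input should be a file");
    if (fs.exists(outFile)) printAndExit("Output already exists");

    FSDataInputStream in = fs.open(inFile);
    FSDataOutputStream out = fs.create(outFile);

    // Copy from the input stream to the output stream until EOF.
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) > 0) {
      out.write(buffer, 0, bytesRead);
    }

    in.close();
    out.close();
  }
}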
Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
Questions?