Hadoop
Rahul Jain, Software Engineer
http://www.linkedin.com/in/rahuldausa
May 07, 2015
Agenda
• Hadoop
  – Introduction
  – Hadoop (Why)
  – Hadoop History
  – Uses of Hadoop
  – High Level Architecture
  – Map-Reduce
• HDFS
  – GFS (Google File System)
  – HDFS Architecture
• Installation/Configuration
• Examples
Introduction
• An open-source software framework.
• Supports data-intensive distributed applications.
• Enables applications to work with thousands of computationally independent computers and petabytes of data.
• Derived from Google's Map-Reduce and Google File System papers.
• Written in the Java programming language.
• Started by Doug Cutting, who named it after his son's toy elephant, to support distribution for Nutch (a sub-project of Lucene).
Hadoop (Why)
• Need to process huge datasets on a large number of computers.
• It is expensive to build reliability into each application.
• Nodes fail every day:
  – Failure is expected, rather than exceptional.
• Need a common infrastructure:
  – Efficient, reliable, and easy to use.
  – Open-sourced, under the Apache License.
Hadoop History
• Dec 2004 – Google publishes the GFS paper
• July 2005 – Nutch uses Map-Reduce
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Becomes a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• Feb 2008 – Yahoo! production search index
What is Hadoop Used for?
• Searching (Yahoo!)
• Log processing
• Recommendation systems (Facebook, LinkedIn, eBay, Amazon)
• Analytics (Facebook, LinkedIn)
• Video and image analysis (NASA)
• Data retention
Hadoop High Level Architecture
Map-Reduce
Multiple Map-Reduce phases
A framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a:
• Cluster: if all the nodes are on the same local network and use similar hardware, or a
• Grid: if the nodes are spread across geographically distributed sites and use more heterogeneous hardware.

Consists of two steps (a minimal example job follows this list):
1. Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
2. Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
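To make the two steps concrete, here is a sketch of the classic word-count job, written against the org.apache.hadoop.mapreduce API available in the Hadoop 1.0.x line used later in these slides (the class and variable names follow the standard Hadoop tutorial example):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for every word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum all the counts collected for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");     // Job.getInstance(conf, ...) in Hadoop 2+
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it can be run with bin/hadoop jar wordcount.jar WordCount <input dir> <output dir>. The combiner is an optional optimization that pre-aggregates counts on the map side before they cross the network.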
Map-Reduce Life-Cycle
Credit: http://code.google.com/edu/parallel/mapreduce-tutorial.html
HDFS
Hadoop Distributed File System
Let's understand GFS (Google File System) first …
GFS Architecture
Goals of HDFS
1. Very large distributed file system
   – 10K nodes, 100 million files, 10 PB.
2. Assumes commodity hardware
   – Files are replicated to handle hardware failure.
   – Detects failures and recovers from them.
3. Optimized for batch processing
   – Data locations are exposed so that computation can move to where the data resides.
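The "data locations exposed" goal is visible directly in the client API. Here is a minimal sketch (the input path is illustrative) that asks the NameNode which hosts hold each block of a file; this is the same information the Map-Reduce framework uses to schedule map tasks close to their data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]);          // e.g. an HDFS file path
    FileStatus status = fs.getFileStatus(file);

    // Ask for the locations of every block in the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + java.util.Arrays.toString(block.getHosts()));
    }
  }
}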
Installation / Configuration

[[email protected] hadoop-1.0.3]$ vi conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/rjain/rahul/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/rjain/rahul/hdfs/name</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ vi conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
[[email protected] hadoop-1.0.3]$ pwd
/home/rjain/hadoop-1.0.3

Start all daemons at once:

[[email protected] hadoop-1.0.3]$ bin/start-all.sh

Or start HDFS and Map-Reduce separately:

[[email protected] hadoop-1.0.3]$ bin/start-dfs.sh
[[email protected] hadoop-1.0.3]$ bin/start-mapred.sh

Verify the daemons are running:

[[email protected] hadoop-1.0.3]$ jps
29756 SecondaryNameNode
19847 TaskTracker
18756 Jps
29483 NameNode
29619 DataNode
19711 JobTracker
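One step the slides do not show: on a fresh install, the NameNode's storage directory must be formatted once, before the daemons are started for the first time; otherwise HDFS will not come up:

[[email protected] hadoop-1.0.3]$ bin/hadoop namenode -format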
[[email protected] hadoop-1.0.3]$ bin/hadoop fs
Usage: java FsShell
  [-ls <path>]
  [-lsr <path>]  (recursive version of ls; similar to Unix ls -R)
  [-du <path>]  (displays aggregate length of files contained in the directory, or the length of a file)
  [-dus <path>]  (displays a summary of file lengths)
  [-count[-q] <path>]
  [-mv <src> <dst>]
  [-cp <src> <dst>]
  [-rm [-skipTrash] <path>]
  [-rmr [-skipTrash] <path>]  (recursive version of rm)
  [-expunge]  (empty the trash)
  [-put <localsrc> ... <dst>]  (copy single src, or multiple srcs, from the local file system to the destination file system)
  [-copyFromLocal <localsrc> ... <dst>]
  [-moveFromLocal <localsrc> ... <dst>]
  [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]]
  [-cat <src>]
  [-text <src>]  (takes a source file and outputs it in text format; the allowed formats are zip and TextRecordInputStream)
  [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
  [-moveToLocal [-crc] <src> <localdst>]
  [-mkdir <path>]
  [-setrep [-R] [-w] <rep> <path/file>]  (changes the replication factor of a file)
  [-touchz <path>]  (create a file of zero length)
  [-test -[ezd] <path>]  (-e: return 0 if the file exists; -z: return 0 if the file is zero length; -d: return 0 if the path is a directory)
  [-stat [format] <path>]  (returns stat information on the path, e.g. creation time of a directory)
  [-tail [-f] <file>]  (displays the last kilobyte of the file to stdout)
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
  [-chown [-R] [OWNER][:[GROUP]] PATH...]
  [-chgrp [-R] GROUP PATH...]
  [-help [cmd]]
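A few typical invocations (the local file and HDFS paths here are illustrative):

[[email protected] hadoop-1.0.3]$ bin/hadoop fs -mkdir /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -put words.txt /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -ls /user/rjain/input
[[email protected] hadoop-1.0.3]$ bin/hadoop fs -cat /user/rjain/input/words.txt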
HDFS Read/Write Example

// Obtain a handle to the file system described by the configuration.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Given input/output file names as strings, we construct inFile/outFile Path objects. Most of the FileSystem APIs accept Path objects.
Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);
Validate the input/output paths before reading/writing.
if (!fs.exists(inFile)) printAndExit("Input file not found");
if (!fs.isFile(inFile)) printAndExit("Input should be a file");
if (fs.exists(outFile)) printAndExit("Output already exists");
Open inFile for reading.
FSDataInputStream in = fs.open(inFile);
Open outFile for writing.
FSDataOutputStream out = fs.create(outFile);
Read from input stream and write to output stream until EOF.
byte[] buffer = new byte[4096];  // buffer and bytesRead are not declared on the slide
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}
Close the streams when done.
in.close(); out.close();
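Pulled together, the snippets above form one small, compilable utility. The class name HdfsCopy and the printAndExit helper are assumptions, since the slides leave them out:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {

  // Helper assumed by the snippets above: report the error and quit.
  private static void printAndExit(String message) {
    System.err.println(message);
    System.exit(1);
  }

  public static void main(String[] argv) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    if (!fs.exists(inFile)) printAndExit("Input file not found");
    if (!fs.isFile(inFile)) printAndExit("Input should be a file");
    if (fs.exists(outFile)) printAndExit("Output already exists");

    FSDataInputStream in = fs.open(inFile);
    FSDataOutputStream out = fs.create(outFile);

    // Copy from the input stream to the output stream until EOF.
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) > 0) {
      out.write(buffer, 0, bytesRead);
    }

    in.close();
    out.close();
  }
}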
Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
Questions?