Transcript
Page 1: A Start to Hadoop

A Start to Hadoop

By: Ayush Mittal, Krupa Varughese, Parag Sahu

Page 2: A Start to Hadoop

Major Focus

• What is Hadoop?
• Why Hadoop?
• What is MapReduce?
• Phases in MapReduce
• What is HDFS?
• Setting up Hadoop in pseudo-distributed mode
• A simple pattern matcher program

Page 3: A Start to Hadoop

The Elementary Item: DATA

• What is data? In the computer era, data is the relevant information that flows from one machine to another. It can be:

Structured: identifiable because of its organized structure. Example: a database (information stored in columns and rows).

Unstructured: has no predefined data model; generally text-heavy (may also contain dates, numbers, logs); irregular and ambiguous.

Page 4: A Start to Hadoop

What is the problem with unstructured DATA?

• Hides important insights

• Storage

• Updates

• Time consuming

Page 5: A Start to Hadoop

Big Data

It consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.

• Sizes range from terabytes to exabytes and zettabytes

• Difficulties include capture, storage, search, sharing, analytics, and visualizing

• Examples include web logs, RFID, sensor networks, social data (due to the social data revolution), Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical data, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Page 6: A Start to Hadoop

Here comes Hadoop !!!

• The Apache Hadoop software library is a framework that allows for the “distributed processing of large data sets across clusters of computers using a simple programming model”.

• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Page 7: A Start to Hadoop
Page 8: A Start to Hadoop

History of Hadoop

• Hadoop was created by Doug Cutting, the creator of Apache Lucene.

• In 2004, the Nutch developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).

• In 2004, Google introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

• In 2006, Doug Cutting joined Yahoo!, which built its site index using a 10,000-core Hadoop cluster.

• Hadoop can now sort 1 terabyte of data in 62 seconds.

Page 9: A Start to Hadoop

Hadoop ecosystem

Page 10: A Start to Hadoop

• Core: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

• Avro: a data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)

Page 11: A Start to Hadoop

MapReduce

• MapReduce is a distributed data processing model introduced by Google to support distributed computing on large data sets on clusters of computers.

• Hadoop can run MapReduce programs written in various languages.

Page 12: A Start to Hadoop

HDFS: Reliable Storage

• Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS.

• HDFS is able to store huge amounts of information and to scale up incrementally.

• It can survive the failure of significant parts of the storage infrastructure without losing data.

• Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

• HDFS manages storage on the cluster by breaking incoming files into pieces, called "blocks," and storing each of the blocks redundantly across the pool of servers.

• If the NameNode fails, the whole filesystem becomes unavailable.
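Block layout on a running cluster can be inspected from the command line. A quick illustration using Hadoop's fsck tool (the path is illustrative, and this assumes the hadoop script is on the PATH):

$ hadoop fsck /user/hduser/input -files -blocks -locations
# lists each file, its blocks, and the DataNodes holding each replica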

Page 13: A Start to Hadoop

HDFS example

Page 14: A Start to Hadoop

ZooKeeper
• Distributed consensus engine
• Provides well-defined concurrent access semantics:
– Distributed locking / mutual exclusion
– Message board / mailboxes

Page 15: A Start to Hadoop

Pig
• Data-flow oriented language, "Pig Latin"
• Data types include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Developed at Yahoo!

Hive
• SQL-based data warehousing app
• Feature set is similar to Pig
• Language is more strictly SQL-esque
• Supports SELECT, JOIN, GROUP BY, etc.

Page 16: A Start to Hadoop

HBase
• Column-store database
• Based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
– (key, value) lookup
– Limited transactions (single row only)

Chukwa
• A distributed data collection and analysis system
• Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports

Page 17: A Start to Hadoop

Fuse-dfs
• Allows mounting of HDFS volumes via the Linux FUSE filesystem
– Does not imply HDFS can be used as a general-purpose file system
– Does allow easy integration with other systems for data import/export

Page 18: A Start to Hadoop

Properties of the Hadoop System

• Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services

• Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

• Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

• Simple—Hadoop allows users to quickly write efficient parallel code.

Page 19: A Start to Hadoop

Comparing SQL databases and Hadoop

• SCALE-OUT INSTEAD OF SCALE-UP Scaling commercial relational databases is expensive. Their design is more friendly to scaling up. To run a bigger database you need to buy a bigger machine. In fact, it’s not unusual to see server vendors market their expensive high-end machines as “database-class servers.” Unfortunately, at some point there won’t be a big enough machine available for the larger data sets.

• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES A fundamental tenet of relational databases is that data resides in tables having a relational structure defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don't fit well into this model. Text documents, images, and XML files are popular examples.

Page 20: A Start to Hadoop

• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)

SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and programs.

• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS

Hadoop is designed for offline processing and analysis of large-scale data. It doesn't work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world.

Page 21: A Start to Hadoop

Building blocks of hadoop

• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker

Page 22: A Start to Hadoop

[Diagram: NameNode and DataNodes]

Page 23: A Start to Hadoop

MapReduce

• A MapReduce program processes data by manipulating key/value pairs in the general form:

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

• With a combiner, the full pipeline is:

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Page 24: A Start to Hadoop

MapReduce

Page 25: A Start to Hadoop

Mapper

• Mapper maps input key/value pairs to a set of intermediate key/value pairs.

• The MapReduce framework spawns one map task for each InputSplit (input splits are a logical division of your records, 64 MB by default, and can be customized) generated by the InputFormat for the job.

• The mapper class extends the Mapper base class.

• map: (K1, V1) → list(K2, V2)

Page 26: A Start to Hadoop

InputSplit

• Each map task processes a single split. Each split is divided into records, and the map processes each record (a key/value pair) in turn.

Page 27: A Start to Hadoop

Mapper Example

static class FriendMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String data = value.toString();
        String[] friends = data.split(" ");
        String friendpair = friends[0];
        // Pair the first name on the line with each of the others, value 0
        // (the key is built in a canonical order so "A,B" and "B,A" collide)
        for (int i = 1; i < friends.length; i++) {
            if (friendpair.compareTo(friends[i]) > 0) {
                friendpair = friendpair + "," + friends[i];
            } else {
                friendpair = friends[i] + "," + friendpair;
            }
            context.write(new Text(friendpair), new IntWritable(0));
            friendpair = friends[0];
        }
        // Pair every two names on the line with each other, value 1
        for (int j = 0; j < friends.length; j++) {
            friendpair = friends[j];
            for (int i = j + 1; i < friends.length; i++) {
                if (friendpair.compareTo(friends[i]) > 0) {
                    friendpair = friendpair + "," + friends[i];
                } else {
                    friendpair = friends[i] + "," + friendpair;
                }
                context.write(new Text(friendpair), new IntWritable(1));
                friendpair = friends[j];
            }
        }
    }
}

Page 28: A Start to Hadoop

Reducer

• Reducer reduces a set of intermediate values that share a key to a smaller set of values.
• Reducer has 3 primary phases: shuffle, sort and reduce.

Shuffle: in this phase the framework fetches the relevant partition of the output of all the mappers.
Sort: the framework sorts reducer inputs by key, lexicographically.
Reduce: the reduce(WritableComparable, Iterator, Context) method is called for each <key, (list of values)> pair in the grouped inputs.

Page 29: A Start to Hadoop

Reducer Example

static class FriendReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        boolean mark = false;      // set if the pair is already directly connected
        for (IntWritable v : values) {
            if (v.get() == 0) {
                mark = true;
            }
            count++;
        }
        // Only emit pairs that are not already friends; count is the number
        // of common connections seen for the pair
        if (!mark) {
            context.write(key, new IntWritable(count));
        }
    }
}

Page 30: A Start to Hadoop

Combiner-local reduce

• Running combiners makes map output more compact , so there is less data to write to local disk and to transfer to the reducer.

• If a combiner is used, the map key/value pairs are not immediately written to the output. Instead they are collected in lists, one list per key. When a certain number of key/value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key/value pairs of the combine operation as if they had been created by the original map operation.
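As a concrete sketch (not from the original slides), a sum-style combiner has the same shape as a reducer and is registered on the Job; the class below is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs on each map task's output before it is sent to the reducers.
// Input and output types are both (Text, IntWritable), so applying it
// zero or more times does not change the final result of a sum.
public class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // partial sum for this key on this map task
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the job driver:
// job.setCombinerClass(IntSumCombiner.class);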

Page 31: A Start to Hadoop

A quick view of hadoop

Page 32: A Start to Hadoop

Setting up Hadoop in pseudo-distributed mode

• Prerequisites: Linux with JDK 1.6 or later; the Hadoop jar/tarball

• Configuration used: Linux RHEL 5.5 running in VMware Player; JDK 1.6

Page 33: A Start to Hadoop

Log in as root (username: root, password: root@123).

Add a group:
$ groupadd hadoop

Create a user (hduser) and add it to the group:
$ useradd -G hadoop hduser

Change the password of hduser:
$ passwd hduser

Add hduser to the sudoers list; this gives hduser the privilege to run root commands using sudo:

$ visudo

The file opens in vi; scroll down to the section where user privileges are specified.
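The entry itself appears only as a screenshot in the original; the usual addition (an assumption here) is a line for hduser next to the existing root entry:

hduser  ALL=(ALL) ALL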

Page 34: A Start to Hadoop

Save the file using esc+:wq!

Page 35: A Start to Hadoop

Run the ifconfig command to get the IP address, then add a localhost entry to the hosts file:

$ ifconfig
$ vi /etc/hosts

Switch to the hduser account:
$ su hduser

Page 36: A Start to Hadoop

Generate an SSH public/private key pair. SSH is used by the Hadoop scripts to log in to the local machine automatically.

When it asks where to save the key, accept the default file name id_rsa in your home directory ($HOME/.ssh/id_rsa, which is /home/hduser/.ssh/id_rsa here).
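The key-generation command itself appears only as a screenshot in the original; the usual invocation (an assumption, with an empty passphrase to match the passwordless login below) is:

$ ssh-keygen -t rsa -P ""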

Page 37: A Start to Hadoop

Add the public key to the authorized_keys file:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
This command creates the file authorized_keys (if needed) and appends your public key to it.

Restrict the permissions of /home/hduser/.ssh and /home/hduser/.ssh/authorized_keys. This is done so that key authentication (with the empty passphrase set above) is used instead of a password:

$ chmod 0700 /home/hduser/.ssh
$ chmod 0600 /home/hduser/.ssh/authorized_keys

Test that SSH no longer prompts for a password:
$ ssh localhost
It should log in without asking for a password.

Extract the Hadoop tarball in /usr/local:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop          # rename the folder
$ sudo chown -R hduser hadoop           # grant ownership of the folder to hduser

Page 38: A Start to Hadoop

Configuring the environment variables:
$ vi /home/hduser/.bashrc

Add the following to the file
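The exact lines appear only as a screenshot in the original; a typical set of entries (the JDK path is an assumption and should match your installation) is:

# Hadoop installation directory
export HADOOP_HOME=/usr/local/hadoop
# JDK installation directory (adjust to your system)
export JAVA_HOME=/usr/java/jdk1.6.0_24
# Put the hadoop scripts on the PATH
export PATH=$PATH:$HADOOP_HOME/bin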

Save the file using esc+:wq!

Page 39: A Start to Hadoop

Configuring the Hadoop server. Files to be edited:

hadoop-env.sh
core-site.xml
mapred-site.xml
hdfs-site.xml

Change directory to conf:
$ cd /usr/local/hadoop/conf

Edit the file hadoop-env.sh:
$ vi hadoop-env.sh
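The edited content is shown as a screenshot in the original; the usual change is to uncomment and set JAVA_HOME (the path below is an assumption):

# The java implementation to use
export JAVA_HOME=/usr/java/jdk1.6.0_24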

Page 40: A Start to Hadoop

Save the file using esc+:wq!

Create a temp directory for Hadoop:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

Edit the core-site.xml file:
$ vi core-site.xml
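The XML on the slide is a screenshot; a typical pseudo-distributed core-site.xml (the temp directory matches the one created above, the port 54310 is an assumption) looks like:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base for other temporary directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>URI of the default (HDFS) filesystem</description>
  </property>
</configuration>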

Page 41: A Start to Hadoop

Edit the mapred-site.xml file:
$ vi mapred-site.xml
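A typical pseudo-distributed mapred-site.xml (the slide shows a screenshot; the JobTracker port 54311 is an assumption):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>Host and port of the MapReduce JobTracker</description>
  </property>
</configuration>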

Page 42: A Start to Hadoop

Edit the hdfs-site.xml file:
$ vi hdfs-site.xml
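A typical pseudo-distributed hdfs-site.xml (shown as a screenshot on the slide) sets the replication factor to 1, since there is only one DataNode:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of copies of each block</description>
  </property>
</configuration>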

Page 43: A Start to Hadoop

Formatting the namenode:
$ /usr/local/hadoop/bin/hadoop namenode -format

Page 44: A Start to Hadoop

Starting the cluster:
$ /usr/local/hadoop/bin/start-all.sh
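To check that all five daemons came up, the JDK's jps tool can be used (a quick sanity check, not from the original slides):

$ jps
# should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker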

Page 45: A Start to Hadoop

Stopping the cluster:
$ /usr/local/hadoop/bin/stop-all.sh

Page 46: A Start to Hadoop

Program to find a pattern in a text file.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Page 47: A Start to Hadoop

public class PatternFinder {

    static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context) {
            String line = value.toString();
            String[] lineReadArray = line.split(" ");
            Configuration conf = context.getConfiguration();
            String pattern = conf.get("pattern");   // set by the driver from args[2]
            String pat = "[\\w]*" + pattern + "[\\w]*";
            for (String val : lineReadArray) {
                if (Pattern.matches(pat, val)) {
                    try {
                        context.write(new Text(val), new IntWritable(1));
                    }

Page 48: A Start to Hadoop

                    catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    static class PatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> value, Context context) {
            int sum = 0;
            for (IntWritable i : value) {
                sum = sum + i.get();      // count occurrences of this matching word
            }
            try {
                context.write(key, new IntWritable(sum));
            } catch (IOException e) {
                e.printStackTrace();
            }

Page 49: A Start to Hadoop

            catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: PatternFinder <input path> <output path> <pattern>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("pattern", args[2]);            // make the pattern available to the mappers
        Job job = new Job(conf);
        job.setJarByClass(PatternFinder.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PatternMapper.class);
        job.setReducerClass(PatternReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

Page 50: A Start to Hadoop

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Create a jar of the class files and store it in /home/krupa/inputjar/.
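One way to build that jar (the class-file location is an assumption) is with the JDK's jar tool, run from the directory containing the compiled classes:

$ jar cvf /home/krupa/inputjar/patternfinder.jar PatternFinder*.class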

Copying a test file to HDFS.

Start the cluster if it is not already started:
$ /usr/local/hadoop/bin/start-all.sh

Create an input directory:
$ /usr/local/hadoop/bin/hadoop fs -mkdir input

Copy a file to the input directory:
$ /usr/local/hadoop/bin/hadoop fs -put /home/krupa/abc.txt input

Running the jar:
$ $HADOOP_HOME/bin/hadoop jar /home/krupa/inputjar/patternfinder.jar PatternFinder input output <pattern>

(The third argument is the pattern to search for, as required by the program's usage check.)

Copying the output:
$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/krupa/output/part-r-00000 /home/krupa/output/out.txt