Transcript
Page 1: A Start to Hadoop

A Start to Hadoop

By: Ayush Mittal, Krupa Varughese, Parag Sahu

Page 2: A Start to Hadoop

Major Focus

• What is Hadoop?
• Why Hadoop?
• What is MapReduce?
• Phases in MapReduce
• What is HDFS?
• Setting up Hadoop in pseudo-distributed mode
• A simple pattern matcher program

Page 3: A Start to Hadoop

The Elementary Item: DATA

• What is data? In the computer era, data is the relevant information that flows from one machine to another. It can be:

Structured: identifiable because of its organized structure. Example: a database (information stored in columns and rows).

Unstructured: has no predefined data model; generally text-heavy (may also contain dates, numbers, logs); irregular and ambiguous.

Page 4: A Start to Hadoop

What is the problem with unstructured DATA?

• Hides important insights

• Storage

• Updates

• Time consuming

Page 5: A Start to Hadoop

Big Data

It consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.

• Sizes range from terabytes to exabytes and zettabytes

• Difficulties include capture, storage, search, sharing, analytics, and visualizing

• Examples include web logs, RFID, sensor networks, social data (due to the social data revolution), Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical data, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Page 6: A Start to Hadoop

Here comes Hadoop !!!

• The Apache Hadoop software library is a framework that allows for the “distributed processing of large data sets across clusters of computers using a simple programming model”.

• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Page 7: A Start to Hadoop
Page 8: A Start to Hadoop

History of Hadoop

• Hadoop was created by Doug Cutting, the creator of Apache Lucene.

• In 2004, the Nutch developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).

• In 2004, Google introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

• In 2006, Doug Cutting joined Yahoo!, which built its site index using a 10,000-core Hadoop cluster.

• Hadoop can now sort 1 terabyte of data in 62 seconds.

Page 9: A Start to Hadoop

Hadoop ecosystem

Page 10: A Start to Hadoop

• Core: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

• Avro: a data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)

Page 11: A Start to Hadoop

MapReduce

• MapReduce is a distributed data processing model introduced by Google to support distributed computing on large data sets on clusters of computers.

• Hadoop can run MapReduce programs written in various languages.

Page 12: A Start to Hadoop

HDFS: Reliable Storage

• Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS.

• HDFS is able to store huge amounts of information and to scale up incrementally.

• It can survive the failure of significant parts of the storage infrastructure without losing data.

• Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

• HDFS manages storage on the cluster by breaking incoming files into pieces, called "blocks," and storing each of the blocks redundantly across the pool of servers.

• If the NameNode fails, the whole filesystem becomes unavailable.
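Block layout on a running cluster can be inspected from the command line. A quick illustration using Hadoop's fsck tool (the path is illustrative, and this assumes the hadoop script is on the PATH):

$ hadoop fsck /user/hduser/input -files -blocks -locations
# lists each file, its blocks, and the DataNodes holding each replica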

Page 13: A Start to Hadoop

HDFS example

Page 14: A Start to Hadoop

ZooKeeper
• Distributed consensus engine
• Provides well-defined concurrent access semantics:
– Distributed locking / mutual exclusion
– Message board / mailboxes

Page 15: A Start to Hadoop

Pig
• Data-flow oriented language, "Pig Latin"
• Data types include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Developed at Yahoo!

Hive
• SQL-based data warehousing app
• Feature set is similar to Pig
• Language is more strictly SQL-esque
• Supports SELECT, JOIN, GROUP BY, etc.

Page 16: A Start to Hadoop

HBase
• Column-store database
• Based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
– (key, value) lookup
– Limited transactions (single row only)

Chukwa
• A distributed data collection and analysis system
• Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports

Page 17: A Start to Hadoop

Fuse-dfs
• Allows mounting of HDFS volumes via the Linux FUSE filesystem
– Does not imply HDFS can be used as a general-purpose file system
– Does allow easy integration with other systems for data import/export

Page 18: A Start to Hadoop

Properties of the Hadoop System

• Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services

• Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

• Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

• Simple—Hadoop allows users to quickly write efficient parallel code.

Page 19: A Start to Hadoop

Comparing SQL databases and Hadoop

• SCALE-OUT INSTEAD OF SCALE-UP Scaling commercial relational databases is expensive. Their design is more friendly to scaling up. To run a bigger database you need to buy a bigger machine. In fact, it’s not unusual to see server vendors market their expensive high-end machines as “database-class servers.” Unfortunately, at some point there won’t be a big enough machine available for the larger data sets.

• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES A fundamental tenet of relational databases is that data resides in tables having a relational structure defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don't fit well into this model. Text documents, images, and XML files are popular examples.

Page 20: A Start to Hadoop

• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)

SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and programs.

• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS

Hadoop is designed for offline processing and analysis of large-scale data. It doesn't work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world.

Page 21: A Start to Hadoop

Building blocks of hadoop

• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker

Page 22: A Start to Hadoop

[Diagram: NameNode and DataNodes]

Page 23: A Start to Hadoop

MapReduce

• A MapReduce program processes data by manipulating key/value pairs in the general form:

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

• With a combiner, the full pipeline is:

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Page 24: A Start to Hadoop

MapReduce

Page 25: A Start to Hadoop

Mapper

• Mapper maps input key/value pairs to a set of intermediate key/value pairs.

• The MapReduce framework spawns one map task for each InputSplit (input splits are a logical division of your records, 64 MB by default, and can be customized) generated by the InputFormat for the job.

• The mapper class extends the Mapper base class.

• map: (K1, V1) → list(K2, V2)

Page 26: A Start to Hadoop

InputSplit

• Each map task processes a single split. Each split is divided into records, and the map processes each record (a key/value pair) in turn.

Page 27: A Start to Hadoop

Mapper Example

static class FriendMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String data = value.toString();
        String[] friends = data.split(" ");
        String friendpair = friends[0];
        // Pair the first name on the line with each of the others, value 0
        // (the key is built in a canonical order so "A,B" and "B,A" collide)
        for (int i = 1; i < friends.length; i++) {
            if (friendpair.compareTo(friends[i]) > 0) {
                friendpair = friendpair + "," + friends[i];
            } else {
                friendpair = friends[i] + "," + friendpair;
            }
            context.write(new Text(friendpair), new IntWritable(0));
            friendpair = friends[0];
        }
        // Pair every two names on the line with each other, value 1
        for (int j = 0; j < friends.length; j++) {
            friendpair = friends[j];
            for (int i = j + 1; i < friends.length; i++) {
                if (friendpair.compareTo(friends[i]) > 0) {
                    friendpair = friendpair + "," + friends[i];
                } else {
                    friendpair = friends[i] + "," + friendpair;
                }
                context.write(new Text(friendpair), new IntWritable(1));
                friendpair = friends[j];
            }
        }
    }
}

Page 28: A Start to Hadoop

Reducer

• Reducer reduces a set of intermediate values that share a key to a smaller set of values.
• Reducer has 3 primary phases: shuffle, sort and reduce.

Shuffle: in this phase the framework fetches the relevant partition of the output of all the mappers.
Sort: the framework sorts reducer inputs by key, lexicographically.
Reduce: the reduce(WritableComparable, Iterator, Context) method is called for each <key, (list of values)> pair in the grouped inputs.

Page 29: A Start to Hadoop

Reducer Example

static class FriendReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        boolean mark = false;      // set if the pair is already directly connected
        for (IntWritable v : values) {
            if (v.get() == 0) {
                mark = true;
            }
            count++;
        }
        // Only emit pairs that are not already friends; count is the number
        // of common connections seen for the pair
        if (!mark) {
            context.write(key, new IntWritable(count));
        }
    }
}

Page 30: A Start to Hadoop

Combiner-local reduce

• Running combiners makes map output more compact , so there is less data to write to local disk and to transfer to the reducer.

• If a combiner is used, the map key/value pairs are not immediately written to the output. Instead they are collected in lists, one list per key. When a certain number of key/value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key/value pairs of the combine operation as if they had been created by the original map operation.
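As a concrete sketch (not from the original slides), a sum-style combiner has the same shape as a reducer and is registered on the Job; the class below is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs on each map task's output before it is sent to the reducers.
// Input and output types are both (Text, IntWritable), so applying it
// zero or more times does not change the final result of a sum.
public class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // partial sum for this key on this map task
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the job driver:
// job.setCombinerClass(IntSumCombiner.class);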

Page 31: A Start to Hadoop

A quick view of hadoop

Page 32: A Start to Hadoop

Setting up Hadoop in pseudo-distributed mode

• Prerequisites: Linux with JDK 1.6 or later; the Hadoop jar/tarball

• Configuration used: Linux RHEL 5.5 running in VMware Player; JDK 1.6

Page 33: A Start to Hadoop

Log in as root (username: root, password: root@123).

Add a group:
$ groupadd hadoop

Create a user (hduser) and add it to the group:
$ useradd -G hadoop hduser

Change the password of hduser:
$ passwd hduser

Add hduser to the sudoers list; this gives hduser the privilege to run root commands using sudo:

$ visudo

The file opens in vi; scroll down to the section where user privileges are specified.
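The entry itself appears only as a screenshot in the original; the usual addition (an assumption here) is a line for hduser next to the existing root entry:

hduser  ALL=(ALL) ALL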

Page 34: A Start to Hadoop

Save the file using esc+:wq!

Page 35: A Start to Hadoop

Run the ifconfig command to get the IP address, then add a localhost entry to the hosts file:

$ ifconfig
$ vi /etc/hosts

Switch to the hduser account:
$ su hduser

Page 36: A Start to Hadoop

Generate an SSH public/private key pair. SSH is used by the Hadoop scripts to log in to the local machine automatically.

When it asks where to save the key, accept the default file name id_rsa in your home directory ($HOME/.ssh/id_rsa, which is /home/hduser/.ssh/id_rsa here).
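The key-generation command itself appears only as a screenshot in the original; the usual invocation (an assumption, with an empty passphrase to match the passwordless login below) is:

$ ssh-keygen -t rsa -P ""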

Page 37: A Start to Hadoop

Add the public key to the authorized_keys file:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
This command creates the file authorized_keys (if needed) and appends your public key to it.

Restrict the permissions of /home/hduser/.ssh and /home/hduser/.ssh/authorized_keys. This is done so that key authentication (with the empty passphrase set above) is used instead of a password:

$ chmod 0700 /home/hduser/.ssh
$ chmod 0600 /home/hduser/.ssh/authorized_keys

Test that SSH no longer prompts for a password:
$ ssh localhost
It should log in without asking for a password.

Extract the Hadoop tarball in /usr/local:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop          # rename the folder
$ sudo chown -R hduser hadoop           # grant ownership of the folder to hduser

Page 38: A Start to Hadoop

Configuring the environment variables:
$ vi /home/hduser/.bashrc

Add the following to the file
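The exact lines appear only as a screenshot in the original; a typical set of entries (the JDK path is an assumption and should match your installation) is:

# Hadoop installation directory
export HADOOP_HOME=/usr/local/hadoop
# JDK installation directory (adjust to your system)
export JAVA_HOME=/usr/java/jdk1.6.0_24
# Put the hadoop scripts on the PATH
export PATH=$PATH:$HADOOP_HOME/bin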

Save the file using esc+:wq!

Page 39: A Start to Hadoop

Configuring the Hadoop server. Files to be edited:

hadoop-env.sh
core-site.xml
mapred-site.xml
hdfs-site.xml

Change directory to conf:
$ cd /usr/local/hadoop/conf

Edit the file hadoop-env.sh:
$ vi hadoop-env.sh
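The edited content is shown as a screenshot in the original; the usual change is to uncomment and set JAVA_HOME (the path below is an assumption):

# The java implementation to use
export JAVA_HOME=/usr/java/jdk1.6.0_24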

Page 40: A Start to Hadoop

Save the file using esc+:wq!

Create a temp directory for Hadoop:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

Edit the core-site.xml file:
$ vi core-site.xml
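The XML on the slide is a screenshot; a typical pseudo-distributed core-site.xml (the temp directory matches the one created above, the port 54310 is an assumption) looks like:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base for other temporary directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>URI of the default (HDFS) filesystem</description>
  </property>
</configuration>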

Page 41: A Start to Hadoop

Edit the mapred-site.xml file:
$ vi mapred-site.xml
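A typical pseudo-distributed mapred-site.xml (the slide shows a screenshot; the JobTracker port 54311 is an assumption):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>Host and port of the MapReduce JobTracker</description>
  </property>
</configuration>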

Page 42: A Start to Hadoop

Edit the hdfs-site.xml file:
$ vi hdfs-site.xml
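A typical pseudo-distributed hdfs-site.xml (shown as a screenshot on the slide) sets the replication factor to 1, since there is only one DataNode:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of copies of each block</description>
  </property>
</configuration>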

Page 43: A Start to Hadoop

Formatting the namenode:
$ /usr/local/hadoop/bin/hadoop namenode -format

Page 44: A Start to Hadoop

Starting the cluster:
$ /usr/local/hadoop/bin/start-all.sh
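To check that all five daemons came up, the JDK's jps tool can be used (a quick sanity check, not from the original slides):

$ jps
# should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker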

Page 45: A Start to Hadoop

Stopping the cluster:
$ /usr/local/hadoop/bin/stop-all.sh

Page 46: A Start to Hadoop

Program to find a pattern in a text file.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Page 47: A Start to Hadoop

public class PatternFinder {

    static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context) {
            String line = value.toString();
            String[] lineReadArray = line.split(" ");
            Configuration conf = context.getConfiguration();
            String pattern = conf.get("pattern");   // set by the driver from args[2]
            String pat = "[\\w]*" + pattern + "[\\w]*";
            for (String val : lineReadArray) {
                if (Pattern.matches(pat, val)) {
                    try {
                        context.write(new Text(val), new IntWritable(1));
                    }

Page 48: A Start to Hadoop

                    catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    static class PatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> value, Context context) {
            int sum = 0;
            for (IntWritable i : value) {
                sum = sum + i.get();      // count occurrences of this matching word
            }
            try {
                context.write(key, new IntWritable(sum));
            } catch (IOException e) {
                e.printStackTrace();
            }

Page 49: A Start to Hadoop

            catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: PatternFinder <input path> <output path> <pattern>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("pattern", args[2]);            // make the pattern available to the mappers
        Job job = new Job(conf);
        job.setJarByClass(PatternFinder.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PatternMapper.class);
        job.setReducerClass(PatternReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

Page 50: A Start to Hadoop

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Create a jar of the class files and store it in /home/krupa/inputjar/.
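One way to build that jar (the class-file location is an assumption) is with the JDK's jar tool, run from the directory containing the compiled classes:

$ jar cvf /home/krupa/inputjar/patternfinder.jar PatternFinder*.class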

Copying a test file to HDFS.

Start the cluster if it is not already started:
$ /usr/local/hadoop/bin/start-all.sh

Create an input directory:
$ /usr/local/hadoop/bin/hadoop fs -mkdir input

Copy a file to the input directory:
$ /usr/local/hadoop/bin/hadoop fs -put /home/krupa/abc.txt input

Running the jar:
$ $HADOOP_HOME/bin/hadoop jar /home/krupa/inputjar/patternfinder.jar PatternFinder input output <pattern>

(The third argument is the pattern to search for, as required by the program's usage check.)

Copying the output:
$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/krupa/output/part-r-00000 /home/krupa/output/out.txt