Bigdata - Installation and Configuration


    HDFS

    HDFS is a filesystem designed for storing very large files with streaming data access

    patterns, running on clusters of commodity hardware.

WHERE WE CAN'T USE HDFS:

    1)Low-latency data access

    Applications that require low-latency access to data, in the tens of milliseconds

    range, will not work well with HDFS.

    2)Lots of small files

    Because the namenode holds filesystem metadata in memory, the limit to the

    number of files in a filesystem is governed by the amount of memory on the namenode.

    3)Multiple writers, arbitrary file modifications

    Files in HDFS may be written to by a single writer. Writes are always made at the

    end of the file. There is no support for multiple writers or for modifications at

    arbitrary offsets in the file.

HDFS Concepts: Blocks

A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default.

Namenodes and Datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode

    (the master) and a number of datanodes (workers). The namenode manages the

    filesystem namespace. It maintains the filesystem tree and the metadata for all the files

    and directories in the tree. This information is stored persistently on the local disk in

    the form of two files: the namespace image and the edit log. The namenode also knows

    the datanodes on which all the blocks for a given file are located; however, it does

    not store block locations persistently, because this information is reconstructed from

    datanodes when the system starts.
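Clients work with this filesystem through the Java FileSystem API: metadata requests go to the namenode, while the file data itself is streamed to and from the datanodes. A minimal sketch of writing and reading a file (the path /tmp/hello.txt is only an illustrative assumption):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml (fs.default.name) from the classpath
        FileSystem fs = FileSystem.get(conf);       // metadata operations go to the namenode
        Path file = new Path("/tmp/hello.txt");     // illustrative path

        FSDataOutputStream out = fs.create(file);   // file data is streamed to datanodes in blocks
        out.writeBytes("hello hdfs\n");
        out.close();

        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());          // prints: hello hdfs
        in.close();
    }
}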

    Parallel Copying with distcp:-

    The canonical use case for distcp is for transferring data between two HDFS clusters.

    If the clusters are running identical versions of Hadoop, the hdfs scheme is

    appropriate:

% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

This will copy the /foo directory (and its contents) from the first cluster to the /bar


    directory on the second cluster, so the second cluster ends up with the directory structure

/bar/foo. If /bar doesn't exist, it will be created first. You can specify multiple source

    paths, and all will be copied to the destination. Source paths must be absolute.

    By default, distcp will skip files that already exist in the destination, but they can be

    overwritten by supplying the -overwrite option. You can also update only the files that

    have changed using the -update option.

Hadoop Archives

HDFS stores small files inefficiently, since each file is stored in a block, and block

    metadata is held in memory by the namenode. Thus, a large number of small files can

    eat up a lot of memory on the namenode.

    Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS

blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.

    Archive command:

    % hadoop archive -archiveName files.har /my/files /my

The first option is the name of the archive, here files.har. HAR files always have

    a .har extension.

Limitations

There are a few limitations to be aware of with HAR files. Creating an archive creates

    a copy of the original files, so you need as much disk space as the files you are archiving

    to create the archive (although you can delete the originals once you have created the

    archive). There is currently no support for archive compression, although the files that

    go into the archive can be compressed (HAR files are like tar files in this respect).

    Archives are immutable once they have been created. To add or remove files, you must

    re-create the archive.

    Hadoop-2.2.0 Installation Steps for Single-Node Cluster (On Ubuntu

    12.04)

1) Read chapter seven of Hadoop: The Definitive Guide

    Or

http://thecodeway.blogspot.com/2013/11/hadoop-220-installation-steps-for.html

FOLLOW THIS LINK -> http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1

    Or

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

    1. Download and install VMware Player depending on your Host OS (32 bit or 64 bit)

https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0

    2. Download the .iso image file of Ubuntu 12.04 LTS (32-bit or 64-bit depending on

    your requirements)

    http://www.ubuntu.com/download/desktop

3. Install Ubuntu from the image in VMware. (For efficient use, configure the virtual machine to have at least 2 GB of RAM (4 GB preferred) and at least 2 cores of processor.)

Note: Install it using any user id and password you prefer to keep for your Ubuntu installation. We will create a separate user for the Hadoop installation later.

    4. After Ubuntu is installed, login to it and go to User Accounts (right-top corner) to

    create a new user for Hadoop

    5. Click on Unlock and unlock the settings by entering your administrator password.


    6. Then click on + at the bottom-left to add a new user. Add the user type as

    Administrator (I prefer this but you can also select as Standard) and then add the

    username as hduser and create it.

    Note:After creating the account you may see it as disabled. Click on the Dropdown

    where Disabled is written and select Set Password to set the password for this

    account or select Enable to enable this account without password.

    7. Your account is set. Now login into your new hduser account.

    8. Open terminal window by pressing Ctrl + Alt + T

    9. Install openJDK using the following command

    $ sudo apt-get install openjdk-7-jdk

    10. Verify the java version installed

    $ java -version

    java version "1.7.0_25"

    OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)

    OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

    11. Create a symlink from openjdk default name to jdk using the following commands:

    $ cd /usr/lib/jvm

    $ ln -s java-7-openjdk-amd64 jdk


    12. Install ssh server:

    $ sudo apt-get install openssh-client

    $ sudo apt-get install openssh-server

    13. Add hadoop group and user

    $ sudo addgroup hadoop

$ sudo usermod -a -G hadoop hduser

    To verify that hduser has been added to the group hadoop use the command:

    $ groups hduser

    which will display the groups hduser is in.

    14. Configure SSH:

$ ssh-keygen -t rsa -P ""

    ...

    Your identification has been saved in /home/hduser/.ssh/id_rsa

    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub

    ...

    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    $ ssh localhost

15. Disable IPv6, because it creates problems in Hadoop. Run the following command:

    $ gksudo gedit /etc/sysctl.conf

    16. Add the following line to the end of the file:

    # disable ipv6

    net.ipv6.conf.all.disable_ipv6 = 1


    net.ipv6.conf.default.disable_ipv6 = 1

    net.ipv6.conf.lo.disable_ipv6 = 1

    Save and close the file. Then restart the system and login with hduser again.

    17. Download Hadoop - 2.2.0 from the following link to your Downloads folder

    http://www.dsgnwrld.com/am/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz

    18. Extract Hadoop and move it to /usr/local and make this user own it:

    $ cd Downloads

    $ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local

    $ cd /usr/local

    $ sudo mv hadoop-2.2.0 hadoop

    $ sudo chown -R hduser:hadoop hadoop

    19. Open the .bashrc file to edit it:

    $ cd ~

    $ gksudo gedit .bashrc

    20. Add the following lines to the end of the file:

    #Hadoop variables

    export JAVA_HOME=/usr/lib/jvm/jdk/

    export HADOOP_INSTALL=/usr/local/hadoop

    export PATH=$PATH:$HADOOP_INSTALL/bin

    export PATH=$PATH:$HADOOP_INSTALL/sbin

    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL


    export HADOOP_COMMON_HOME=$HADOOP_INSTALL

    export HADOOP_HDFS_HOME=$HADOOP_INSTALL

    export YARN_HOME=$HADOOP_INSTALL

    #end of paste

    Save and close the file.

    21. Open hadoop-env.sh to edit it:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

    and modify the JAVA_HOME variable in the File:

    export JAVA_HOME=/usr/lib/jvm/jdk/

    Save and close the file

    Restart the system and re-login

    22. Verify the Hadoop Version installed using the following command in the terminal:

    $ hadoop version

    The output should be like:

    Hadoop 2.2.0

    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768

    Compiled by hortonmu on 2013-10-07T06:28Z


    Compiled with protoc 2.5.0

    From source with checksum 79e53ce7994d1628b240f09af91e1af4

    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-

    common-2.2.0.jar

    This makes sure that Hadoop is installed and we just have to configure it now.

    23. Run the following command:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

24. Add the following between the <configuration> ... </configuration> tags:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

Then save and close the file.

    25. In extended terminal write:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

and paste the following between the <configuration> ... </configuration> tags:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

    26. Run the following command:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

27. Add the following between the <configuration> ... </configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Note: In the folder that contains the slaves file (/usr/local/hadoop/etc/hadoop), create an empty file named masters, write localhost in it, and save it.

    28. Instead of saving the file directly, Save As and then set the filename as

mapred-site.xml. Verify that the file is being saved to the /usr/local/hadoop/etc/hadoop/ directory only.

    29. Type following commands to make folders for namenode and datanode:

    $ cd ~

    $ mkdir -p mydata/hdfs/namenode


    $ mkdir -p mydata/hdfs/datanode

    30. Run the following:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

31. Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>

    32. Format the namenode with HDFS:


    $ hdfs namenode -format

    33. Start Hadoop Services:

    $ start-dfs.sh

    ....

    $ start-yarn.sh

    .

    34. Run the following command to verify that hadoop services are running

    $ jps

If everything was successful, you should see the following services running:

2583 DataNode
2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode

    Run Hadoop Example

    hduser@ubuntu: cd /usr/local/hadoop

    hduser@ubuntu:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5

    Number of Maps = 2

    Samples per Map = 5

    13/10/21 18:41:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for

    your platform... using builtin-java classes where applicable

    Wrote input for Map #0


    Wrote input for Map #1

    Starting Job

    13/10/21 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

    13/10/21 18:41:04 INFO input.FileInputFormat: Total input paths to process : 2

    13/10/21 18:41:04 INFO mapreduce.JobSubmitter: number of splits:2

    13/10/21 18:41:04 INFO Configuration.deprecation: user.name is deprecated. Instead, use

    mapreduce.job.user.name

    ...

Installation of Hadoop on Linux in pseudo-distributed mode:

http://akbarahmed.com/2012/06/26/install-cloudera-cdh4-with-yarn-mrv2-in-pseudo-mode-on-ubuntu-12-04-lts/

http://www.analyticsby.me/2013/09/hadoop-pseudo-distributed-mode.html

    Map Reduce

MapReduce is a language-independent programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

    Data Flow


A MapReduce job is a unit of work that the client wants to be performed: it consists of the

    input data, the MapReduce program, and configuration information. Hadoop runs the job by

    dividing it into tasks, of which there are two types:

    map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and

a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress

    reports to the jobtracker, which keeps a record of the overall progress of each job. If a

    task fails, the jobtracker can reschedule it on a different tasktracker.

    Hadoop divides the input to a MapReduce job into fixed-size pieces called input

splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined

    map function for each record in the split.

    MapReduce data flow with a single reduce task

HOW TO RUN A MAPREDUCE PROGRAM IN HADOOP:

Steps to run a program in Hadoop:

1) hadoop dfs -copyFromLocal /home/cloudera/yourfilename /

example:

hadoop dfs -copyFromLocal /home/cloudera/test /


    ***(where test is the name of the input file)

    2)hadoop dfs -ls /

    3)/usr/lib/hadoop-0.20$ hadoop jar yourjarfile.jar classname /InputFileName

    /OutputFileName.

    example:

1) /usr/lib/hadoop-0.20$ hadoop jar hadoop-0.20.2-cdh3u0-examples.jar wordcount /test /output

    2)/home/cloudera/ hadoop jar edu.jar count /test /output

***(where yourjarfile.jar = hadoop-0.20.2-cdh3u0-examples.jar, classname = wordcount, InputFileName = test, OutputFileName = output)

    4)hadoop fs -ls /OutputFileName

5)hadoop dfs -cat /OutputFilename/part-r-00000 (to see the output).

CASE STUDIES

    Case I: Word-Count problem

Problem: In this problem we have to calculate the number of occurrences of each word in a text file.

Solution: Here I am just giving the logic of the map and reduce functions rather than writing the whole code.

// Map function:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

// Reduce function:
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
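The map and reduce functions above are wired together by a driver. A minimal sketch using the same old (org.apache.hadoop.mapred) API, assuming the functions above sit in inner classes Map and Reduce of a WordCount class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);              // key type emitted by map and reduce
    conf.setOutputValueClass(IntWritable.class);     // value type emitted by map and reduce
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. /test
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. /output
    JobClient.runJob(conf);
}

This driver is what the "hadoop jar yourjarfile.jar classname /InputFileName /OutputFileName" command described earlier actually invokes.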

    Input:

    how how are you

    where have you been so long

    i was worried about you

    since the day you had gone

    Output:

    Case II: Max temperature problem

    Problem:

We have been provided with a file having year and temperature. In this case you have more than one temperature value corresponding to a year. You have to calculate the maximum temperature, the sum of the temperatures, and the minimum temperature year-wise. Let's understand it by an example. In your file you have more than one value of temperature for the year 1900, such as 32, 45, 67, 12, 68, 43. So this problem will give you output as

    1900 68

    1900 12


    1900 267

Map function:

public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    long airTemp;
    String line = value.toString();
    int a = line.length();
    int spec = line.indexOf(' ');
    String year = line.substring(0, 4);
    airTemp = Long.parseLong(line.substring(spec + 1, a));
    output.collect(new Text(year), new LongWritable(airTemp));
}

Reduce function:

public static class OldMaxTemperatureReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterator<LongWritable> values, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        long maxvalue = 0L;
        long minvalue = Long.MAX_VALUE;
        long sum = 0;
        long s = 0;
        while (values.hasNext()) {
            s = values.next().get();
            sum += s;
            if (s < minvalue) { minvalue = s; }
            if (s > maxvalue) { maxvalue = s; }
        }
        output.collect(key, new LongWritable(sum));
        output.collect(key, new LongWritable(maxvalue));
        output.collect(key, new LongWritable(minvalue));
    }
}

Note: You can take DoubleWritable as well if you have double input.

    Output:


    Case III: Prime Number problem

Problem: In this problem you have a file containing numbers, and you have to find only the prime numbers from the given file.

// Map function:
// (number is a LongWritable field and bool is a BooleanWritable field of the Map class)
public void map(LongWritable key, Text value, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    Boolean flag;
    long num = 0L;
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        num = Long.parseLong(tokenizer.nextToken());
        number.set(num);
        flag = true;
        // trial division: any divisor between 2 and num-1 means num is not prime
        for (long i = 2; i < num; i++) {
            if (num % i == 0) {
                flag = false;
                break;
            }
        }
        if (flag) {
            bool.set(true);
            output.collect(bool, number);
        }
    }
}

// Reduce function:
public void reduce(BooleanWritable key, Iterator<LongWritable> values, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}

    Output:

Case IV: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. It seems very similar to the WordCount problem, but there is a twist: the file is not a simple text file having only words. It also contains numeric data as well as punctuation symbols such as . , $ @ 1 2 3 4 5 etc., so it would be difficult to calculate the length of a word.

Let me elaborate. Suppose there is a field in the text file such as "world.". If you calculate its length, it will give 6, but it is very easy to see that the length of the word is 5. What is happening is that "." is counted as a character included in "world", while our problem is to calculate the length of each word. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.
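A minimal sketch of a map function for this case, in the same old-API style as the other cases (cleaning the line first, then emitting each word with its length):

// Map function (sketch): strip digits and punctuation, then emit (word, length)
// (uses the same classes as the other examples: org.apache.hadoop.io.*, org.apache.hadoop.mapred.*, java.util.StringTokenizer)
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    line = line.replaceAll("[\\p{Punct}]", "");   // drop punctuation such as . , $ @
    line = line.replaceAll("[0-9]", "");          // drop numeric data
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String w = tokenizer.nextToken();
        output.collect(new Text(w), new IntWritable(w.length()));
    }
}

An identity reduce (or a map-only job) is enough here, since the length is already computed in the map.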


Case V: Alphabets problem (Arrangements)

Problem: We have the same input file as above, but the problem statement is different. Here you have to list all the distinct words of the same length against that length. It means that if your file contains (hi how are you.i have sent a mail to you), then the output would be like this

    1 i,a

    2 hi,to

3 how,are,you (we have "you" twice in the file, but the output has it only once since only distinct words are listed)

    4 have,sent,mail

Map function:

public class mapreduce1 {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        String b;
        private IntWritable word = new IntWritable();
        public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            line = line.replaceAll("[\\p{Punct}]", "");
            line = line.replaceAll("[0-9]", "");
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                b = tokenizer.nextToken();
                int a = b.length();
                word.set(a);
                output.collect(word, new Text(b));
            }
        }
    }

Reduce function:

public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String s = "";
        while (values.hasNext()) {
            String var = values.next().toString();
            if (!s.contains(var)) {
                s = s + var + ',';
            }
        }
        output.collect(key, new Text(s));
    }
}

    Output:


    Case VI: Weather data problem

Problem: This is one of the good problems. The task is to calculate the maximum temperature and the minimum temperature for each date, and to describe a day as a hot day if the temperature is greater than 40 and as a cold day if the temperature is less than 10.

The main difficulty arises when you analyse the input data. In the input file you have the date as the 2nd field and the remaining fields as temperatures. For simple analysis one can think of it as having many columns, where the 2nd column is the date and the remaining columns are temperatures. Though it is a flat file, one can think of it the way I have described. So you have to extract the date and the temperatures as described in the problem above.

Map function:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    double temp;
    public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        line = line.replaceAll("U", "");
        int a = line.length();
        if (a > 2) {
            int spec = line.indexOf(' ');
            String s = line.substring(spec, spec + 9);     // the date field
            String b = line.substring(spec + 10, a);       // the temperature fields
            StringTokenizer tokenizer = new StringTokenizer(b);
            while (tokenizer.hasMoreTokens()) {
                temp = Double.valueOf(tokenizer.nextToken());
                output.collect(new Text(s), new DoubleWritable(temp));
            }
        }
    }
}

Reduce function:

public static class Reduce extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        Double maxValue = Double.MIN_VALUE;
        Double minvalue = Double.MAX_VALUE;
        Double a;
        while (values.hasNext()) {
            a = values.next().get();
            maxValue = Math.max(maxValue, a);
            minvalue = Math.min(minvalue, a);
        }
        if (maxValue > 40) {        // hot day
            output.collect(key, new DoubleWritable(maxValue));
        }
        if (minvalue < 10) {        // cold day
            output.collect(key, new DoubleWritable(minvalue));
        }
    }
}


Pig (Batch Processing)

Pig is a scripting language for exploring large datasets. Pig isn't suitable for all data

    processing tasks, however. Like MapReduce, it is designed for batch processing of data. If

    you want to perform a query that touches only a small amount of data in a large dataset,

    then Pig will not perform well, because it is set up to scan the whole dataset, or at least

    large portions of it.

    Pig is made up of two pieces:

    The language used to express data flows, called Pig Latin.

    The execution environment to run Pig Latin programs. There are currently two

    environments: local execution in a single JVM and distributed execution on a Hadoop

    cluster.


    A Pig Latin program is made up of a series of operations, or transformations, that are

    applied to the input data to produce output.
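Besides the grunt shell and script files, Pig Latin can also be run from a Java program through the PigServer class. A minimal sketch, where the two-statement pipeline and the local file input.txt are only illustrative assumptions:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // local mode, like "pig -x local"
        pig.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        Iterator<Tuple> it = pig.openIterator("B");      // builds and runs the data flow, then iterates over the result
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}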

    Installation of Pig

1) Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your workstation:

    % tar xzf pig-x.y.z.tar.gz

2) It's convenient to add Pig's binary directory to your command-line path. For example:

    % export PIG_INSTALL=/home/tom/pig-x.y.z

    % export PATH=$PATH:$PIG_INSTALL/bin

3) In the home folder, open the View menu and select Show Hidden Files. Then search for the .bashrc file and open it. Set the home path at the end of the file as

export JAVA_HOME=$HOMEPATH/programs/jdk1.7

export PATH=.:$JAVA_HOME/bin:$PATH

4) Then run the pig command to check whether the home path is set or not.

    Running Pig Programs

On the terminal, use the command pig -x local (to load files from the /user/cloudera folder); otherwise it will try to load data from HDFS.

    OR

Alternatively, you can set these two properties in the pig.properties file in Pig's conf directory (or the directory specified by PIG_CONF_DIR). Here's an example for a

    pseudodistributed

    setup:

    fs.default.name=hdfs://localhost/

    mapred.job.tracker=localhost:8021

1) You can also create a script file and save it with a .pig extension. Then run it as

$ pig -x local script.pig


    It will give the same result.

    2) Parameter substitution is also possible in pig.

    Parameter Substitution

If you have a Pig script that you run on a regular basis, it's quite common to want to

    be able to run the same script with different parameters. For example, a script that runs

daily may use the date to determine which input files it runs over. Pig supports parameter

    substitution, where parameters in the script are substituted with values supplied at

    runtime. Parameters are denoted by identifiers prefixed with a $ character; for example,

    $input and $output are used in the following script to specify the input and output paths:

    -- max_temp_param.pig

    records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);

    filtered_records = FILTER records BY temperature != 9999 AND

    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

    grouped_records = GROUP filtered_records BY year;

    max_temp = FOREACH grouped_records GENERATE group,

    MAX(filtered_records.temperature);

    STORE max_temp into '$output';

    Parameters can be specified when launching Pig, using the -param option, one for each

    parameter:

% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \

    > ch11/src/main/pig/max_temp_param.pig

    You can also put parameters in a file and pass them to Pig using the -param_file option.

    For example, we can achieve the same result as the previous command by placing the

    parameter definitions in a file:

    # Input file

    input=/user/tom/input/ncdc/micro-tab/sample.txt

    # Output file

    output=/tmp/out

The pig invocation then becomes:

% pig -param_file ch11/src/main/pig/max_temp_param.param \

    > ch11/src/main/pig/max_temp_param.pig

    You can specify multiple parameter files using -param_file repeatedly. You can also

    use a combination of -param and -param_file options, and if any parameter is defined

    in both a parameter file and on the command line, the last value on the command line

    takes precedence.

    Dynamic parameters

    For parameters that are supplied using the -param option, it is easy to make the value

    dynamic by running a command or script. Many Unix shells support command substitution

for a command enclosed in backticks, and we can use this to make the output directory date-based.


Case III: Weather data problem

Problem: This is one of the good problems. The task is to calculate the maximum temperature. The main difficulty arises when you analyse the input data. In the input file you have the date as the 2nd field and the remaining fields as temperatures. For simple analysis one can think of it as having many columns, where the 2nd column is the date and the remaining columns are temperatures. Though it is a flat file, one can think of it the way I have described. So you have to extract the date and the temperatures as described in the problem above.

    Solution:

a = LOAD 'a1' AS (line:chararray);

b = FOREACH a GENERATE REPLACE(line,'U','') AS line;

c = FOREACH b GENERATE TRIM(SUBSTRING(line,6,14)) AS date, TRIM(SUBSTRING(line,15,216)) AS temp;

d = FOREACH c GENERATE date, FLATTEN(TOKENIZE(temp)) AS temp;

e = FOREACH d GENERATE date, (double)temp AS temp;

f = FILTER e BY temp >= 25;

g = GROUP f BY date;

h = FOREACH g GENERATE group, MAX(f.temp);

DUMP h;


    Output:

Case IV: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. It seems very similar to the WordCount problem, but there is a twist: the file is not a simple text file having only words. It also contains numeric data as well as punctuation symbols such as . , $ @ 1 2 3 4 5 etc., so it would be difficult to calculate the length of a word.

Let me elaborate. Suppose there is a field in the text file such as "world.". If you calculate its length, it will give 6, but it is very easy to see that the length of the word is 5. What is happening is that "." is counted as a character included in "world", while our problem is to calculate the length of each word. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.

    Solution:

A = LOAD 'alpha2' USING PigStorage('\t') AS (line:chararray);

B = FOREACH A GENERATE REPLACE(REPLACE(line,'[\\p{Punct}]',''),'[0-9]','') AS line;

C = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) AS line;


Note: You can also see the same result on the grunt shell using DUMP g;

    Output:

    Hive

    Hive was created to make it possible for analysts with strong SQL skills (but meager

    Java programming skills) to run queries on the huge volumes of data that Facebook

    stored in HDFS.

Installing Hive

In normal use, Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore. When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you create will be local to your machine, so you can't share them with other users.

    1) Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a

    suitable place on your workstation:

    % tar xzf hive-x.y.z.tar.gz

2) It's handy to put Hive on your path to make it easy to launch:

% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev

% export PATH=$PATH:$HIVE_INSTALL/bin

    3)Now type hive to launch the Hive shell:

    % hive local

    hive>

4) Sometimes it will say it is unable to open the metastore. In that case, use the following lines on the command prompt and then run it again:

sudo chmod -R 777 /var/lib/hive/metastore/metastore_db

chmod -R a+rwx /var/lib/hive/metastore/metastore_db


    rm /var/lib/hive/metastore/metastore_db/*.lck

    Running Script on Hive

NOTE - Loading data from HDFS to the local system in Hive:

hadoop dfs -cat /user/hive/warehouse/filename/* > ~/output.txt

hadoop dfs -copyToLocal /user/hive/warehouse/filename/* ~/a.txt

We can create a script, save it with a .q extension, and use the command

$ hive -f a1.q

It will run successfully.

In order to drop a table from the command prompt:

$ hive -e 'drop table a1'    // no need of a semicolon (;)

We can create a table on the command prompt:

1) $ hive -e 'create table new(line string)'

load data local inpath 's2' overwrite into table new; (ENTER)

2) To view data:

$ hive -S -e 'select * from new' (ENTER)

-S is used to suppress output. It will suppress the "time taken" message shown with the data. You can do it without using -S too.
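The same queries can also be issued from a Java program over JDBC. This is only a sketch under the assumption that a HiveServer is running on localhost:10000 (the steps above use only the hive CLI):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // HiveServer1 JDBC driver
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select * from new");    // the table created above
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        con.close();
    }
}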

    CREATE TABLE ...

    ROW FORMAT DELIMITED

    FIELDS TERMINATED BY '\001'

    COLLECTION ITEMS TERMINATED BY '\002'

    MAP KEYS TERMINATED BY '\003'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;


Case III: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. The file is not a simple text file having only words: it also contains numeric data and punctuation symbols. If a field in the file is "world.", its length comes out as 6 because "." is counted as a character of "world", while the length of the word is 5. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.

    Solution:

CREATE TABLE a (line STRING);

LOAD DATA LOCAL INPATH 'alpha2' OVERWRITE INTO TABLE a;

FROM a INSERT OVERWRITE TABLE a SELECT REGEXP_REPLACE(line, '[0-9]+', '');

FROM a INSERT OVERWRITE TABLE a SELECT REGEXP_REPLACE(line, '\\p{Punct}', '');

SELECT word, LENGTH(word) FROM a LATERAL VIEW EXPLODE(SPLIT(line, ' ')) ltable AS word;

Create another table kk and store the output:

CREATE TABLE kk AS SELECT word, LENGTH(word) AS length FROM a LATERAL VIEW EXPLODE(SPLIT(line, ' ')) ltable AS word;

Load the output file from HDFS to the local system:

hadoop dfs -cat /user/hive/warehouse/kk/* > ~/a.txt

    Output:


    CASE IV: PASS-FAIL PROBLEM

    Problem:

In this problem you have two files: one is student and another is result. You have to find out the names of those students who have passed the exam.

    Solution:

CREATE TABLE student (name STRING, id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'student' OVERWRITE INTO TABLE student;

CREATE TABLE result (id INT, status STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'results' OVERWRITE INTO TABLE result;

SELECT student.id, student.name, result.status FROM student JOIN result ON (student.id = result.id)
WHERE result.status = 'pass';

    Output:


    input

    Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591

    Haryana Ambala 404 20 80 30021 76.76746 30.373488 404-20-80-30021

    Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591

    pig script

    A = LOAD 'input.txt' AS line;

    B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+')) AS

    (f1:chararray,f2:chararray,f3:int,f4:int,f5:int,f6:int,f7:double,f8:double,f9:chararray);

    C = FOREACH B GENERATE f1,f2,f8,f9;

    DUMP C;

    output

    (Haryana,Ambala,30.373488,404-20-80-37591)

    (Haryana,Ambala,30.373488,404-20-80-30021)

    (Haryana,Ambala,30.373488,404-20-80-37591)

HBase


    HBase is a distributed column-oriented database built on top of HDFS. HBase is the

    Hadoop application to use when you require real-time read/write random access to

    very large datasets.

Installation

1) Download a stable release from an Apache Download Mirror and unpack it on your local filesystem. For example:

% tar xzf hbase-x.y.z.tar.gz

2) As with Hadoop, you first need to tell HBase where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, then that will be used, and you don't have to configure anything further. Otherwise, you can set the Java installation that HBase uses by editing HBase's conf/hbase-env.sh and specifying the JAVA_HOME variable to point to version 1.6.0 of Java.

    3)For convenience, add the HBase binary directory to your command-line path. For

    example:

    % export HBASE_HOME=/home/hbase/hbase-x.y.z

    % export PATH=$PATH:$HBASE_HOME/bin

    1) To administer your HBase instance, launch the HBase shell by typing:

    % hbase shell

    HBase Shell; enter 'help' for list of supported commands.

    Type "exit" to leave the HBase Shell

    Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18

    15:01:50 PDT 2010 hbase(main):001:0>

    This will bring up a JRuby IRB interpreter that has had some HBase-specific commands

    added to it. Type help and then press RETURN to see the list of shell commands grouped

    into categories. Type help COMMAND_GROUP for help by category or help COMMAND for

    help on a specific command and example usage.

    2) Create a table:

To create a table, you must name your table and define its schema. A table's schema comprises table attributes and the list of table column families. Column families themselves have attributes that you in turn set at schema definition time. Examples of column family attributes include whether the family content should be compressed on the filesystem and how many versions of a cell to keep. Schemas can be edited later by offlining the table using the shell disable command, making the necessary alterations using alter, then putting the table back online with enable.

    To create a table named test with a single column family named data using defaults

    for table and column family attributes, enter:

    hbase(main):007:0> create 'test', 'data'

    0 row(s) in 1.3066 seconds

    3)To prove the new table was created successfully, run the list command. This willoutput all tables in user space:

    hbase(main):019:0> list

    test

    1 row(s) in 0.1485 seconds

    4)To insert data into three different rows and columns in the data column family, and

    then list the table content, do the following:

    hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'

    0 row(s) in 0.0454 seconds

    hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'

    0 row(s) in 0.0035 seconds

hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'

0 row(s) in 0.0090 seconds

    hbase(main):024:0> scan 'test'

    ROW COLUMN+CELL

    row1 column=data:1, timestamp=1240148026198, value=value1

    row2 column=data:2, timestamp=1240148040035, value=value2

    row3 column=data:3, timestamp=1240148047497, value=value3

    3 row(s) in 0.0825 seconds

Notice how we added three new columns without changing the schema.

To remove the table, you must first disable it before dropping it:

    hbase(main):025:0> disable 'test'

    09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test

    0 row(s) in 6.0426 seconds

    hbase(main):026:0> drop 'test'

    09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test

    0 row(s) in 0.0210 seconds

    hbase(main):027:0> list

    0 row(s) in 2.0645 seconds


5) To get data from a specific row:

get 'tablename', 'rowname' (ENTER)

get 'wc', 'row2' (it will print the data of the second row)

6) Delete a cell from a row:

delete 'wc', 'row1', 'data:1' (ENTER)
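The same put/get operations can also be performed from Java. A minimal sketch using the classic (pre-1.0) HBase client API, assuming a table like the test table with its data column family created earlier still exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "test");

        Put put = new Put(Bytes.toBytes("row4"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("4"), Bytes.toBytes("value4"));
        table.put(put);                                      // like the shell: put 'test', 'row4', 'data:4', 'value4'

        Get get = new Get(Bytes.toBytes("row4"));
        Result result = table.get(get);                      // like the shell: get 'test', 'row4'
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("data"), Bytes.toBytes("4"))));

        table.close();
    }
}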

Import data from a flat file to an HBase table

1. Create an HBase table:

$ hbase shell
hbase> create 'noaastation', 'd'

    This creates a table called noaastation with one column family d.

2. Create a temporary HDFS folder to hold the data for the bulk load:

$ hdfs dfs -mkdir /user/john/hbase

3. Run importtsv to generate data in the temporary folder for the bulk load:

$ hadoop jar /usr/lib/hbase/hbase.jar importtsv '-Dimporttsv.separator=|' -Dimporttsv.bulk.output=/user/john/hbase/tmp -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2,d:c3,d:c4,d:c5,d:c6,d:c7,d:c8,d:c9,d:c10,d:c11,d:c12,d:c13,d:c14 noaastation /user/john/noaa/201212station.txt

    NOAA station data in file 201212station.txt has 15 fields separated by |. The first field is

    station id, which will be the row key in HBase. The rest of fields will be added as HBase

    columns.

4. Change the temporary folder permission:

$ hdfs dfs -chmod -R +rwx /user/john/hbase

5. Run the bulk load:

$ hadoop jar /usr/lib/hbase/hbase.jar completebulkload /user/john/hbase/tmp noaastation


Now that the data is loaded, run some HBase shell commands to query it:

hbase> scan 'noaastation'
hbase> get 'noaastation', '94994', {COLUMN => 'd:c7'}

    CASSANDRA

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Brewer's CAP Theorem

The theorem states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.

    Consistency

    All database clients will read the same value for the same query, even given concurrent

    updates.

    Availability

    All database clients will always be able to read and write data.

    Partition Tolerance

    The database can be split into multiple machines; it can continue functioning in

    the face of network segmentation breaks.

Brewer's theorem is that in any given system, you can strongly support only two of the three. This is analogous to the saying you may have heard in software development: "You can have it good, you can have it fast, you can have it cheap: pick two."


sudo mkdir /var/lib/cassandra (CREATING DIRECTORY)

sudo chown -R cloudera /var/lib/cassandra (CHANGING OWNER FROM ROOT TO CLOUDERA)

OR

sudo chown -R `whoami` /var/lib/cassandra (assigning to the user who will be working on the server)

    4)Start cassandra daemon

    /home/cloudera/cassandra/bin/cassandra -f

5) Run commands on the COMMAND LINE INTERFACE and in CQLSH:

    /home/cloudera/cassandra/bin/cassandra-cli

    /home/cloudera/cassandra/bin/cqlsh

    6)netstat -ano |grep 9160 |grep LISTEN or netstat -nl|grep 9160

    To check whether a port is working or not.

    7)you can also use

    netstat -tulpen

8) Run commands on the CLI:

    a) CREATE KEYSPACE my_keyspace

    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'

    and strategy_options = {replication_factor:1};

then run USE my_keyspace;

    b)create column family account1

    with key_validation_class = UTF8Type

    and comparator = 'AsciiType'

    and default_validation_class = UTF8Type;

    update column family User with

    column_metadata =


e) deleting a cell

del account1['123']['first_name'];    // delete column first_name

    f)update column metadata

    update column family address

    with comparator = 'AsciiType'

    and key_validation_class = 'UTF8Type'

    and default_validation_class = 'UTF8Type'

    and column_metadata =

    [{column_name : city,

    validation_class : utf8},

    {column_name : zip,

    validation_class : utf8}];

    g) create super column family

    create column family student

    with column_type = 'Super'

    and key_validation_class = UTF8Type

and comparator = 'AsciiType'

and default_validation_class = UTF8Type;

    h)update super column metadata

    update column family student

    with comparator = 'AsciiType'

    and key_validation_class = 'UTF8Type'

    and default_validation_class = 'UTF8Type'

    and column_metadata =

    [{column_name : city,

    validation_class : utf8},

    {column_name : zip,

    validation_class : utf8}];

    i)execute a script

    bin/cassandra-cli -host localhost -port 9160 -f cassandrascript.txt


    WORKING ON CQLSH

(http://www.datastax.com/documentation/cql/3.1/webhelp/index.html#cql/cql_using/use_ttl_t.html)

    NOTE: FOCUS ON LIST,SET,MAP DATATYPE OF CASSANDRA

    1) CREATE KEYSPACE key

    WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 2};

    2)USE key;

    3) ALTER KEYSPACE Key WITH REPLICATION =

    { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

    4)Creating TABLE(COLUMN-FAMILY IN CLI) WITHIN A KEYSPACE

    a)CREATE TABLE users (

    user_name varchar,

    password varchar,

    gender varchar,

    session_token varchar,

    state varchar,

birth_year bigint,

PRIMARY KEY (user_name));    // PRIMARY KEY

    b)CREATE TABLE emp (

    empID int,

    deptID int,

    first_name varchar,

    last_name varchar,

    PRIMARY KEY (empID, deptID)); //COMPOSITE PRIMARY KEY

    5) INSERTION IN TABLE

    INSERT INTO emp (empID, deptID, first_name, last_name)

    VALUES (104, 15, 'jane', 'smith');
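The same CQL statements can be executed from a Java application as well. A minimal sketch using the DataStax Java driver (2.x-style API; the driver itself is not covered by this document, so treat its usage as an assumption), against the emp table above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();  // local Cassandra node
        Session session = cluster.connect("key");                                  // the keyspace created above
        session.execute("INSERT INTO emp (empID, deptID, first_name, last_name) VALUES (105, 16, 'john', 'doe')");
        ResultSet rs = session.execute("SELECT * FROM emp WHERE empID = 105 AND deptID = 16");
        for (Row row : rs) {
            System.out.println(row.getString("first_name") + " " + row.getString("last_name"));
        }
        cluster.close();
    }
}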

    6)SYSTEM KEYSPACE ::

The system keyspace includes a number of tables that contain details about your Cassandra database objects and cluster configuration. Cassandra populates these tables and others in the system keyspace; they hold keyspace, table, and column information. An alternative to the Thrift API describe_keyspaces function is querying system.schema_keyspaces directly. You can also retrieve information about tables by querying system.schema_columnfamilies and about column metadata by querying system.schema_columns.


    Procedure

    Query the defined keyspaces using the SELECT statement

    SELECT * from system.schema_keyspaces;

Note: To kill all processes running under CassandraDaemon, use:

pgrep -u user -f cassandra | xargs kill -9    (here user is cloudera)

    MAHOUT

    Mahout is an open source machine learning library from Apache. The algorithms it

    implements fall under the broad umbrella of machine learning or collective intelligence. This

    can mean many things, but at the moment for Mahout it means primarily recommender

engines (collaborative filtering), clustering, and classification. It's also scalable. Mahout aims

    to be the machine learning tool of choice when the collection of data to be processed is very

    large, perhaps far too large for a single machine. In its current incarnation, these scalable

    machine learning implementations in Mahout are written in Java, and some portions are

built upon Apache's Hadoop distributed computation project.

Recommender engines

Recommender engines are the most immediately recognizable machine learning technique in use today. You'll have seen services or sites that attempt to recommend books or movies or articles based on your past actions. They try to infer tastes and preferences and identify unknown items that are of interest:

    1) Amazon.com is perhaps the most famous e-commerce site to deploy recommendations.

    Based on purchases and site activity, Amazon recommends books and other items likely to

    be of interest. See figure 1.2.

    2) Netflix similarly recommends DVDs that may be of interest, and famously offered a

    $1,000,000 prize to researchers who could improve the quality of their recommendations.

3) Dating sites like Líbímseti (discussed later) can even recommend people to people.

    4) Social networking sites like Facebook use variants on recommender techniques to identify

    people most likely to be as-yet-unconnected friends.
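As a concrete illustration of a recommender engine, a minimal user-based recommender with Mahout's Taste API might look like the sketch below; the file name ratings.csv, the neighbourhood size, and the user/item IDs are assumptions (the input is the usual userID,itemID,preference CSV format):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));          // userID,itemID,preference per line
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);   // top 3 items for user 1
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}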

    Clustering


    Clustering is less apparent, but it turns up in equally well-known contexts. As its name

    implies, clustering techniques attempt to group a large number of things together into

    clusters that share some similarity.

    1) Google News groups news articles by topic using clustering techniques, in order to

    present news grouped by logical story, rather than presenting a raw listing of all articles.

2) Search engines like Clusty group their search results for similar reasons.

3) Consumers may be grouped into segments (clusters) using clustering techniques based on

    attributes like income, location, and buying habits.

    Classification

Classification techniques decide how much a thing is or isn't part of some type or category, or how much it does or doesn't have some attribute. Classification, like clustering, is ubiquitous, but it's even more behind the scenes. Often these systems learn by reviewing

    many instances of items in the categories in order to deduce classification rules. This general

    idea has many applications:

1) Yahoo! Mail decides whether or not incoming messages are spam based on prior emails and spam reports from users, as well as on characteristics of the email itself.

2) Google's Picasa and other photo-management applications can decide when a region of

    an image contains a human face.

    3) Optical character recognition software classifies small regions of scanned text into

    individual characters.

4) Apple's Genius feature in iTunes reportedly uses classification to classify songs into

    potential playlists for users.