HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
WHERE WE CAN'T USE HDFS:
1)Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS.
2)Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
3)Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers or for modifications at
arbitrary offsets in the file.
HDFS Concepts: Blocks
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an
integral multiple of the disk block size. HDFS, too, has the concept of a block, but it is a
much larger unit: 64 MB by default.
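You can see the blocks that make up a particular file with the fsck tool; for example (a sketch, assuming a running cluster; the path is illustrative):
% hdfs fsck /user/hduser/sample.txt -files -blocks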
Namenodes and Datanodes:
An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode
(the master) and a number of datanodes (workers). The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files
and directories in the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log. The namenode also knows
the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, because this information is reconstructed from
datanodes when the system starts.
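You can see this master-worker layout on a running cluster: the dfsadmin report prints the namenode's view of every live datanode and its capacity.
% hdfs dfsadmin -report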
Parallel Copying with distcp:-
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is
appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This will copy the /foo directory (and its contents) from the first cluster to the /bar
directory on the second cluster, so the second cluster ends up with the directory structure
/bar/foo. If /bar doesn't exist, it will be created first. You can specify multiple source
paths, and all will be copied to the destination. Source paths must be absolute.
By default, distcp will skip files that already exist in the destination, but they can be
overwritten by supplying the -overwrite option. You can also update only the files that
have changed using the -update option.
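For example, to copy only the files that have changed into the /bar/foo directory created by the first copy (a sketch, reusing the paths above):
% hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo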
Hadoop Archives
HDFS stores small files inefficiently, since each file is stored in a block, and block
metadata is held in memory by the namenode. Thus, a large number of small files can
eat up a lot of memory on the namenode.
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files.
Archive command:
% hadoop archive -archiveName files.har /my/files /my
The first option is the name of the archive, here files.har. HAR files always have
a .har extension.
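Files inside the archive can then be read through the har filesystem scheme; for example, to list the contents of the archive created above:
% hadoop fs -ls har:///my/files.har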
Limitations:
There are a few limitations to be aware of with HAR files. Creating an archive creates
a copy of the original files, so you need as much disk space as the files you are archiving
to create the archive (although you can delete the originals once you have created the
archive). There is currently no support for archive compression, although the files that
go into the archive can be compressed (HAR files are like tar files in this respect).
Archives are immutable once they have been created. To add or remove files, you must
re-create the archive.
Hadoop-2.2.0 Installation Steps for Single-Node Cluster (On Ubuntu
12.04)
1) Read chapter seven of Hadoop: The Definitive Guide
Or
http://thecodeway.blogspot.com/2013/11/hadoop-220-installation-steps-for.html
FOLLOW THIS LINK ->(http://codesfusion.blogspot.com/2013/10/setup-
hadoop-2x-220-on-ubuntu.html?m=1)
Or
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-
linux-single-node-cluster/
1. Download and install VMware Player depending on your Host OS (32 bit or 64 bit)
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player
/6_0
2. Download the .iso image file of Ubuntu 12.04 LTS (32-bit or 64-bit depending on
your requirements)
http://www.ubuntu.com/download/desktop
3. Install Ubuntu from the image in VMware. (For efficient use, configure the Virtual
Machine to have at least 2GB (4GB preferred) of RAM and at least 2 cores of
processor.)
Note: Install it using any user id and password you prefer to keep for your Ubuntu
installation. We will create a separate user for Hadoop installation later.
4. After Ubuntu is installed, login to it and go to User Accounts (right-top corner) to
create a new user for Hadoop
5. Click on Unlock and unlock the settings by entering your administrator password.
6. Then click on + at the bottom-left to add a new user. Add the user type as
Administrator (I prefer this, but you can also select Standard) and then add the
username as hduser and create it.
Note: After creating the account you may see it as disabled. Click on the dropdown
where Disabled is written and select Set Password to set the password for this
account, or select Enable to enable this account without a password.
7. Your account is set. Now login into your new hduser account.
8. Open terminal window by pressing Ctrl + Alt + T
9. Install openJDK using the following command
$ sudo apt-get install openjdk-7-jdk
10. Verify the java version installed
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
11. Create a symlink from openjdk default name to jdk using the following commands:
$ cd /usr/lib/jvm
$ ln -s java-7-openjdk-amd64 jdk
12. Install ssh server:
$ sudo apt-get install openssh-client
$ sudo apt-get install openssh-server
13. Add hadoop group and user
$ sudo addgroup hadoop
$ sudo usermod -a -G hadoop hduser
To verify that hduser has been added to the group hadoop use the command:
$ groups hduser
which will display the groups hduser is in.
14. Configure SSH:
$ ssh-keygen -t rsa -P ""
...
Your identification has been saved in /home/hduser/.ssh/id_rsa
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
15. Disable IPv6 because it creates problems in Hadoop. Run the following command:
$ gksudo gedit /etc/sysctl.conf
16. Add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and close the file. Then restart the system and login with hduser again.
17. Download Hadoop - 2.2.0 from the following link to your Downloads folder
http://www.dsgnwrld.com/am/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
18. Extract Hadoop and move it to /usr/local and make this user own it:
$ cd Downloads
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
19. Open the .bashrc file to edit it:
$ cd ~
$ gksudo gedit .bashrc
20. Add the following lines to the end of the file:
#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#end of paste
Save and close the file.
21. Open hadoop-env.sh to edit it:
$ gksudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and modify the JAVA_HOME variable in the File:
export JAVA_HOME=/usr/lib/jvm/jdk/
Save and close the file
Restart the system and re-login
22. Verify the Hadoop Version installed using the following command in the terminal:
$ hadoop version
The output should be like:
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-
common-2.2.0.jar
This makes sure that Hadoop is installed and we just have to configure it now.
23. Run the following command:
$ gksudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
24. Add the following between the <configuration> ... </configuration> tags:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Then save and close the file.
25. In the terminal, write:
$ gksudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml
and paste the following between the <configuration> ... </configuration> tags:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
26. Run the following command:
$ gksudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
27. Add the following between the <configuration> ... </configuration> tags:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Note) In the folder where the slaves file is (/usr/local/hadoop/etc/hadoop), make
an empty file named masters, write localhost in it, and save it.
28. Instead of saving the file directly, use Save As and set the filename as
mapred-site.xml. Verify that the file is saved to the /usr/local/hadoop/etc/hadoop/ directory only.
29. Type following commands to make folders for namenode and datanode:
$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode
30. Run the following:
$ gksudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
31. Add the following between the <configuration> ... </configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
32. Format the namenode with HDFS:
$ hdfs namenode -format
33. Start Hadoop Services:
$ start-dfs.sh
....
$ start-yarn.sh
.
34. Run the following command to verify that hadoop services are running
$ jps
If everything was successful, you should see the following services running:
2583 DataNode
2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode
Run Hadoop Example
hduser@ubuntu: cd /usr/local/hadoop
hduser@ubuntu:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
Number of Maps = 2
Samples per Map = 5
13/10/21 18:41:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/10/21 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/10/21 18:41:04 INFO input.FileInputFormat: Total input paths to process : 2
13/10/21 18:41:04 INFO mapreduce.JobSubmitter: number of splits:2
13/10/21 18:41:04 INFO Configuration.deprecation: user.name is deprecated. Instead, use
mapreduce.job.user.name
...
Installation of hadoop on linux in Pseudo mode:
http://akbarahmed.com/2012/06/26/install-cloudera-cdh4-with-yarn-mrv2-in-pseudo-mode-on-
ubuntu-12-04-lts/
http://www.analyticsby.me/2013/09/hadoop-pseudo-distributed-mode.html
Map Reduce
MapReduce is a language-independent programming model for data processing.
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which may
be chosen by the programmer. The programmer also specifies two functions: the map
function and the reduce function.
Data Flow
A MapReduce job is a unit of work that the client wants to be performed: it consists of the
input data, the MapReduce program, and configuration information. Hadoop runs the job by
dividing it into tasks, of which there are two types:
map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and
a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress
reports to the jobtracker, which keeps a record of the overall progress of each job. If a
task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
MapReduce data flow with a single reduce task
HOW TO RUN A MAPREDUCE PROGRAM IN HADOOP:
HADOOP Step to run a program--
1) hadoop dfs -copyFromLocal /home/cloudera/yourfilename /
example:
hadoop dfs -copyFromLocal /home/cloudera/test /
***(where test is the name of the input file)
2)hadoop dfs -ls /
3)/usr/lib/hadoop-0.20$ hadoop jar yourjarfile.jar classname /InputFileName
/OutputFileName.
example:
1)/usr/lib/hadoop-0.20$ hadoop jar hadoop-0.20.2-cdh3u0-examples.jar
wordcount /test /output
2)/home/cloudera$ hadoop jar edu.jar count /test /output
***(where yourjarfile.jar = hadoop-0.20.2-cdh3u0-examples.jar,
classname = wordcount, InputFileName = test, OutputFileName = output)
4)hadoop fs -ls /OutputFileName
5) hadoop dfs -cat /OutputFileName/part-r-00000 (to see the output).
CASE STUDIES
Case I: Word-Count problem
Problem: In this problem we have to calculate the number of occurrences of a particular word in a text file.
Solution: Here I am just using the logic of the map and reduce functions rather than writing the whole
code.
// Map function:
// (word is a Text field and one is an IntWritable(1) field of the mapper class)
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}
// Reduce function:
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
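These map and reduce functions still need a driver to run as a job. The original does not show one, so here is a minimal sketch using the same old (org.apache.hadoop.mapred) API as the examples above; the class name WordCount is chosen for illustration, and Map/Reduce refer to the classes above.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);        // key/value types emitted by the reducer
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);            // the Map and Reduce classes defined above
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input/output paths from the command line
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}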
Input:
how how are you
where have you been so long
i was worried about you
since the day you had gone
Output:
Case II: Max temperature problem
Problem:
We are given a file containing years and temperatures. In this case you have more than one
temperature value corresponding to a year. You have to calculate the maximum temperature, the sum of the
temperatures, and the minimum temperature year-wise. Let's understand it by an example. In your file
you have more than one temperature value for the year 1900, say 32, 45, 67, 12, 68, 43. So this
problem will give you output as
1900 68
1900 12
1900 267
Map function:
public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    long airTemp;
    String line = value.toString();
    int a = line.length();
    int spec = line.indexOf(' ');
    String year = line.substring(0, 4);
    airTemp = Long.parseLong(line.substring(spec + 1, a));
    output.collect(new Text(year), new LongWritable(airTemp));
}
Reduce function:
public static class OldMaxTemperatureReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterator<LongWritable> values, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        long maxvalue = 0L;
        long minvalue = Long.MAX_VALUE;
        long sum = 0;
        long s = 0;
        while (values.hasNext()) {
            s = values.next().get();
            sum += s;
            if (s < minvalue) { minvalue = s; }   // the two comparisons were fused in the original
            if (s > maxvalue) { maxvalue = s; }
        }
        output.collect(key, new LongWritable(sum));
        output.collect(key, new LongWritable(maxvalue));
        output.collect(key, new LongWritable(minvalue));
    }
}
Note: You can take DoubleWritable as well if you have double input.
Output:
Case III: Prime Number problem
Problem: In this problem you have a file containing numbers, and you have to find only the prime numbers
from the given file.
//Map function:
// (number is a LongWritable field and bool is a BooleanWritable field of the mapper class)
public void map(LongWritable key, Text value, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    Boolean flag = true;
    long num = 0L;
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        num = Long.parseLong(tokenizer.nextToken());
        number.set(num);
        flag = true;
        // trial division: num is prime if no i in [2, num) divides it
        // (the loop body was cut off in the original and is reconstructed here)
        for (long i = 2; i < num; i++) {
            if (num % i == 0) {
                flag = false;
                break;
            }
        }
        if (flag) {
            bool.set(true);
            output.collect(bool, number);
        }
    }
}
//Reduce function:
public void reduce(BooleanWritable key, Iterator<LongWritable> values, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}
Output:
Case IV: Alphabets problem(Length)
Problem: In this problem we have a long text file and you have to calculate the length of each word
in the file. It seems very similar to the word-count problem, but there is a twist: the file is not a
simple text file containing only words. It also contains numeric data and punctuation symbols
(such as . , $ @ 1 2 3 4 5), which makes it harder to calculate the length of a word.
Let me elaborate. Suppose there is a field in the text file such as "world."; if you
calculate its length, you will get 6, but the length of the word is clearly 5. The '.' is
counted as a character of "world", whereas our task is to calculate the length of each word.
So we first have to remove all numeric and punctuation symbols from the file and then count
the characters in each word.
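The original shows the solution only as screenshots. For reference, a minimal sketch of the map function, using the same cleaning approach as Case V below; the reduce side can simply pass each (word, length) pair through.
// Map: strip punctuation and digits, then emit (word, length)
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    line = line.replaceAll("[\\p{Punct}]", "");   // remove punctuation such as . , $ @
    line = line.replaceAll("[0-9]", "");          // remove numeric data
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String w = tokenizer.nextToken();
        output.collect(new Text(w), new IntWritable(w.length()));
    }
}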
Case V: Alphabets problem(Arrangements)
Problem: We have the same input file as above, but the problem statement is different. Here you
have to group all distinct words by their length, showing the length followed by the words. That is,
if your file contains (hi how are you.i have sent a mail to you), then the output would be like this:
1 i,a
2 hi,to
3 how,are,you (we have "you" twice in the file, but the output has only one, i.e., the distinct word)
4 have,sent,mail
Map function:
public class mapreduce1 {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        String b;
        private IntWritable word = new IntWritable();

        public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            line = line.replaceAll("[\\p{Punct}]", "");
            line = line.replaceAll("[0-9]", "");
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                b = tokenizer.nextToken().toString();
                int a = b.length();
                word.set(a);
                output.collect(word, new Text(b));
            }
        }
    }
Reduce function:
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String s = "";
        while (values.hasNext()) {
            String var = values.next().toString();
            if (!s.contains(var)) {
                s = s + var + ',';
            }
        }
        output.collect(key, new Text(s));
    }
}
Output:
Case VI: Weather data problem
Problem: This is a good problem. The task is to calculate the maximum temperature and the
minimum temperature, and to describe a day as hot if the temperature is greater than 40, and as
cold if the temperature is less than 10, according to the date.
But the main difficulty arises when you analyse the input data. In the input file
you have the date as the 2nd field and the remaining fields as temperatures. For simple analysis,
think of it as a table in which the 2nd column is the date and the remaining columns are
temperatures. Though it is a flat file, you can think of it the way I have described. So you have
to extract the date and the temperatures as described in the problem above.
Map function:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    double temp;

    public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        line = line.replaceAll("U", "");
        int a = line.length();
        if (a > 2) {
            int spec = line.indexOf(' ');
            String s = line.substring(spec, spec + 9);      // the date field
            String b = line.substring(spec + 10, a);        // the temperature fields
            StringTokenizer tokenizer = new StringTokenizer(b);
            while (tokenizer.hasMoreTokens()) {
                temp = Double.valueOf(tokenizer.nextToken().toString());
                output.collect(new Text(s), new DoubleWritable(temp));
            }
        }
    }
}
Reduce function:
public static class Reduce extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        Double maxValue = Double.MIN_VALUE;
        Double minvalue = Double.MAX_VALUE;
        Double a;
        while (values.hasNext()) {
            a = values.next().get();
            maxValue = Math.max(maxValue, a);
            minvalue = Math.min(minvalue, a);
        }
        if (maxValue > 40) {      // hot day
            output.collect(key, new DoubleWritable(maxValue));
        }
        if (minvalue < 10) {      // cold day (this branch was truncated in the original and is reconstructed from the problem statement)
            output.collect(key, new DoubleWritable(minvalue));
        }
    }
}
Pig (Batch Processing)
Pig is a scripting language for exploring large datasets. Pig isn't suitable for all data
processing tasks, however. Like MapReduce, it is designed for batch processing of data. If
you want to perform a query that touches only a small amount of data in a large dataset,
then Pig will not perform well, because it is set up to scan the whole dataset, or at least
large portions of it.
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a Hadoop
cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are
applied to the input data to produce output.
Installation of Pig
1) Download a stable release from http://pig.apache.org/releases.html, and unpack
the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
2) It's convenient to add Pig's binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
3) In the home folder, open View and select Show Hidden Files. Then search for the .bashrc
file and open it. Set the home path at the end of the file as:
export JAVA_HOME=$HOMEPATH/programs/jdk1.7
export PATH=.:$JAVA_HOME/bin:$PATH
4) Then run the pig command to check whether the home path is set or not.
Running Pig Programs
On the terminal, use the command pig -x local (to load files from the /user/cloudera folder);
otherwise it will try to load data from HDFS.
OR
Alternatively, you can set these two properties in the pig.properties file in Pig's conf
directory (or the directory specified by PIG_CONF_DIR). Here's an example for a
pseudo-distributed setup:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
1) You can also create a script file and save it with a .pig extension. Then run it as:
$ pig -x local script.pig
It will give the same result.
2) Parameter substitution is also possible in pig.
Parameter Substitution
If you have a Pig script that you run on a regular basis, it's quite common to want to
be able to run the same script with different parameters. For example, a script that runs
daily may use the date to determine which input files it runs over. Pig supports parameter
substitution, where parameters in the script are substituted with values supplied at
runtime. Parameters are denoted by identifiers prefixed with a $ character; for example,
$input and $output are used in the following script to specify the input and output paths:
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
STORE max_temp into '$output';
Parameters can be specified when launching Pig, using the -param option, one for each
parameter:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \
> ch11/src/main/pig/max_temp_param.pig
You can also put parameters in a file and pass them to Pig using the -param_file option.
For example, we can achieve the same result as the previous command by placing the
parameter definitions in a file:
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
The pig invocation then becomes:
% pig -param_file ch11/src/main/pig/max_temp_param.param \
> ch11/src/main/pig/max_temp_param.pig
You can specify multiple parameter files using -param_file repeatedly. You can also
use a combination of -param and -param_file options, and if any parameter is defined
in both a parameter file and on the command line, the last value on the command line
takes precedence.
Dynamic parameters
For parameters that are supplied using the -param option, it is easy to make the value
dynamic by running a command or script. Many Unix shells support command substitution
for a command enclosed in backticks, and we can use this to make the output directory date-based:
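The example that followed appears only as a screenshot in the original; a minimal sketch of the idea, reusing the script above, with the date command substituted into the output path:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/`date "+%Y-%m-%d"`/out \
> ch11/src/main/pig/max_temp_param.pig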
Case III: Weather data problem
Problem: This is a good problem. The task is to calculate the maximum temperature.
But the main difficulty arises when you analyse the input data. In the input file you have the date as
the 2nd field and the remaining fields as temperatures. For simple analysis, think of it as a table
in which the 2nd column is the date and the remaining columns are temperatures. Though it is a flat
file, you can think of it the way I have described. So you have to extract the date and the
temperatures as described in the problem above.
Solution:
a = LOAD 'a1' AS (line:chararray);
b = FOREACH a GENERATE REPLACE(line,'U','') AS line;
c = FOREACH b GENERATE TRIM(SUBSTRING(line,6,14)) AS date, TRIM(SUBSTRING(line,15,216)) AS temp;
d = FOREACH c GENERATE date, FLATTEN(TOKENIZE(temp)) AS temp;
e = FOREACH d GENERATE date, (double)temp;
f = FILTER e BY temp >= 25;
g = GROUP f BY date;
h = FOREACH g GENERATE group, MAX(f.temp);
DUMP h;
Output:
Case IV: Alphabets problem(Length)
Problem: In this problem we have a long text file and you have to calculate the length of each word
in the file. It seems very similar to the word-count problem, but there is a twist: the file is not a
simple text file containing only words. It also contains numeric data and punctuation symbols
(such as . , $ @ 1 2 3 4 5), which makes it harder to calculate the length of a word.
Let me elaborate. Suppose there is a field in the text file such as "world."; if you
calculate its length, you will get 6, but the length of the word is clearly 5. The '.' is
counted as a character of "world", whereas our task is to calculate the length of each word.
So we first have to remove all numeric and punctuation symbols from the file and then count
the characters in each word.
Solution:
A = LOAD 'alpha2' USING PigStorage('\t') AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'[\\p{Punct}0-9]+','') AS line;
C = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) AS line;
Note: You can also see the same result in the grunt shell using DUMP g;
Output:
Hive
Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS.
Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a series
of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which
provide a means for attaching structure to data stored in HDFS. Metadata, such as table
schemas, is stored in a database called the metastore. When starting out with Hive, it
is convenient to run the metastore on your local machine. In this configuration, which is the
default, the Hive table definitions that you create will be local to your machine, so you can't
share them with other users.
1) Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a
suitable place on your workstation:
% tar xzf hive-x.y.z.tar.gz
2) It's handy to put Hive on your path to make it easy to launch:
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
3) Now type hive to launch the Hive shell:
% hive
hive>
4) Sometimes it will say it is unable to open the metastore. In that case, run the following lines at the
command prompt, then run hive again:
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
chmod -R a+rwx /var/lib/hive/metastore/metastore_db
rm /var/lib/hive/metastore/metastore_db/*.lck
Running Script on Hive
NOTE - Loading data from HDFS to the local system in Hive:
hadoop dfs -cat /user/hive/warehouse/filename/* > ~/output.txt
hadoop dfs -copyToLocal /user/hive/warehouse/filename/* ~/a.txt
We can create a script, save it with a .q extension, and use the command:
$ hive -f a1.q
It will run successfully.
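The contents of a1.q are not shown in the original; a minimal sketch of what such a script might contain (the table name and input path are illustrative):
-- a1.q
create table if not exists a1 (line string);
load data local inpath 's2' overwrite into table a1;
select count(*) from a1;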
In order to delete a table from the command prompt:
$ hive -e 'drop table a1'    // no need of a semicolon (;)
We can create a table at the command prompt:
1) $ hive -e 'create table new(line string)'
load data local inpath 's2' overwrite into table new; (ENTER)
2) To view the data:
$ hive -S -e 'select * from new' (ENTER)
-S is used to suppress output. It will suppress messages such as the time taken to show the data.
You can do it without using -S too.
CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
What it is doing is counting '.' as a character included in 'world', whereas our problem is to calculate
the length of each word. So what we have to do here is first remove all numeric and punctuation symbols
from the file and then count the characters in each word.
Solution:
CREATE TABLE A (LINE STRING);
LOAD DATA LOCAL INPATH 'ALPHA2' OVERWRITE INTO TABLE A;
FROM A INSERT OVERWRITE TABLE A SELECT REGEXP_REPLACE(LINE,'[0-9]','');
FROM A INSERT OVERWRITE TABLE A SELECT REGEXP_REPLACE(LINE,'\\p{Punct}','');
SELECT WORD, LENGTH(WORD) FROM A LATERAL VIEW EXPLODE(SPLIT(LINE,' ')) lTable AS WORD;
Create another table KK and store the output:
CREATE TABLE KK AS
SELECT WORD, LENGTH(WORD) AS LENGTH FROM A LATERAL VIEW EXPLODE(SPLIT(LINE,' ')) lTable AS WORD;
Load the output file from HDFS to the local system:
hadoop dfs -cat /user/hive/warehouse/kk/* > ~/a.txt
Output:
CASE IV: PASS-FAIL PROBLEM
Problem:
In this problem you have two files, one is student and another is result. You have to find out the names
of those students who have passed the exam.
Solution:
Create table student(name string, id int) row format delimited fields terminated by '\t';
Load data local inpath 'student' overwrite into table student;
Create table result(id int, status string) row format delimited fields terminated by '\t';
Load data local inpath 'results' overwrite into table result;
Select student.id, student.name, result.status from student join result on (student.id = result.id)
where result.status = 'pass';
Output:
input
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
Haryana Ambala 404 20 80 30021 76.76746 30.373488 404-20-80-30021
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
pig script
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+')) AS
(f1:chararray,f2:chararray,f3:int,f4:int,f5:int,f6:int,f7:double,f8:double,f9:chararray);
C = FOREACH B GENERATE f1,f2,f8,f9;
DUMP C;
output
(Haryana,Ambala,30.373488,404-20-80-37591)
(Haryana,Ambala,30.373488,404-20-80-30021)
(Haryana,Ambala,30.373488,404-20-80-37591)
Hbase
HBase is a distributed column-oriented database built on top of HDFS. HBase is the
Hadoop application to use when you require real-time read/write random access to
very large datasets.
Installation
1) Download a stable release from an Apache Download Mirror and unpack it on your
local filesystem. For example:
% tar xzf hbase-x.y.z.tar.gz
2) As with Hadoop, you first need to tell HBase where Java is located on your system. If
you have the JAVA_HOME environment variable set to point to a suitable Java installation,
then that will be used, and you don't have to configure anything further. Otherwise,
you can set the Java installation that HBase uses by editing HBase's conf/hbase-env.sh
and specifying the JAVA_HOME variable to point to version 1.6.0 of Java.
3)For convenience, add the HBase binary directory to your command-line path. For
example:
% export HBASE_HOME=/home/hbase/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
1) To administer your HBase instance, launch the HBase shell by typing:
% hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18
15:01:50 PDT 2010 hbase(main):001:0>
This will bring up a JRuby IRB interpreter that has had some HBase-specific commands
added to it. Type help and then press RETURN to see the list of shell commands grouped
into categories. Type help COMMAND_GROUP for help by category or help COMMAND for
help on a specific command and example usage.
2) Create a table:
To create a table, you must name your table and define its schema. A table's schema
comprises table attributes and the list of table column families. Column families
themselves have attributes that you in turn set at schema definition time. Examples of
column family attributes include whether the family content should be compressed on
the filesystem and how many versions of a cell to keep. Schemas can be edited later by
offlining the table using the shell disable command, making the necessary alterations
using alter, then putting the table back online with enable.
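For example, the edit cycle on the test table created below would look like this (a sketch; the VERSIONS attribute is illustrative):
hbase(main):005:0> disable 'test'
hbase(main):006:0> alter 'test', {NAME => 'data', VERSIONS => 1}
hbase(main):007:0> enable 'test'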
To create a table named test with a single column family named data using defaults
for table and column family attributes, enter:
hbase(main):007:0> create 'test', 'data'
0 row(s) in 1.3066 seconds
3) To prove the new table was created successfully, run the list command. This will
output all tables in user space:
hbase(main):019:0> list
test
1 row(s) in 0.1485 seconds
4)To insert data into three different rows and columns in the data column family, and
then list the table content, do the following:
hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
hbase(main):024:0> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
Notice how we added three new columns without changing the schema.
To remove the table, you must first disable it before dropping it:
hbase(main):025:0> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
hbase(main):026:0> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
hbase(main):027:0> list
0 row(s) in 2.0645 seconds
5) To get data from a specific row:
get 'tablename', 'rowname' (ENTER)
get 'wc', 'row2' (it will print the data of the second row)
6) Delete a row:
delete 'wc', 'row1', 'data:1' (ENTER)
Import data from a flat file to an HBase table
1. Create an HBase table:
$ hbase shell
hbase> create 'noaastation', 'd'
This creates a table called noaastation with one column family, d.
2. Create a temporary HDFS folder to hold the data for bulk load:
$ hdfs dfs -mkdir /user/john/hbase
3. Run importtsv to generate data in the temporary folder for bulk load.
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv '-
Dimporttsv.separator=|' -Dimporttsv.bulk.output=/user/john/hbase/tmp -
Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2,d:c3,d:c4,d:c5,d:c6,d:c
7,d:c8,d:c9,d:c10,d:c11,d:c12,d:c13,d:c14 noaastation
/user/john/noaa/201212station.txt
NOAA station data in file 201212station.txt has 15 fields separated by |. The first field is
the station id, which will be the row key in HBase. The rest of the fields will be added as
HBase columns.
4. Change the temporary folder permission:
$ hdfs dfs -chmod -R +rwx /user/john/hbase
5. Run bulk load:
$ hadoop jar /usr/lib/hbase/hbase.jar completebulkload
/user/john/hbase/tmp noaastation
Now that the data is loaded, run some HBase shell commands to query it:
hbase> scan 'noaastation'
hbase> get 'noaastation', '94994', {COLUMN => 'd:c7'}
CASSANDRA
Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that
bases its distribution design on Amazon's Dynamo and its data model on Google's
Bigtable. Created at Facebook, it is now used at some of the most popular sites on the
Web.
Brewer's CAP Theorem:
The theorem states that within a large-scale distributed data system, there are three
requirements that have a relationship of sliding dependency: Consistency, Availability,
and Partition Tolerance.
Consistency
All database clients will read the same value for the same query, even given concurrent
updates.
Availability
All database clients will always be able to read and write data.
Partition Tolerance
The database can be split into multiple machines; it can continue functioning in
the face of network segmentation breaks.
Brewer's theorem is that in any given system, you can strongly support only two of the
three. This is analogous to the saying you may have heard in software development:
"You can have it good, you can have it fast, you can have it cheap: pick two."
sudo mkdir /var/lib/cassandra (CREATING DIRECTORY)
sudo chown -R cloudera /var/lib/cassandra (CHANGING OWNER FROM ROOT TO
CLOUDERA) OR
sudo chown -R `whoami` /var/lib/cassandra (assigning to the user who will be
working on the server)
4)Start cassandra daemon
/home/cloudera/cassandra/bin/cassandra -f
5) Run commands on the COMMAND LINE INTERFACE and in the CQL SHELL:
/home/cloudera/cassandra/bin/cassandra-cli
/home/cloudera/cassandra/bin/cqlsh
6) netstat -ano | grep 9160 | grep LISTEN or netstat -nl | grep 9160
to check whether the port is listening or not.
7) You can also use:
netstat -tulpen
8) Run commands on the CLI:
a) CREATE KEYSPACE my_keyspace
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = {replication_factor:1};
then run USE my_keyspace;
b)create column family account1
with key_validation_class = UTF8Type
and comparator = 'AsciiType'
and default_validation_class = UTF8Type;
update column family User with
column_metadata =
e) Deleting a cell:
del account1['123']['first_name'];    // delete column first_name
f)update column metadata
update column family address
with comparator = 'AsciiType'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and column_metadata =
[{column_name : city,
validation_class : utf8},
{column_name : zip,
validation_class : utf8}];
g) create super column family
create column family student
with column_type = 'Super'
and key_validation_class = UTF8Type
and comparator = 'AsciiType'
and default_validation_class = UTF8Type;
h)update super column metadata
update column family student
with comparator = 'AsciiType'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and column_metadata =
[{column_name : city,
validation_class : utf8},
{column_name : zip,
validation_class : utf8}];
i)execute a script
bin/cassandra-cli -host localhost -port 9160 -f cassandrascript.txt
WORKING ON CQLSH
(http://www.datastax.com/documentation/cql/3.1/webhelp/index.html#cql/cql_usin
g/use_ttl_t.html)
NOTE: FOCUS ON THE LIST, SET, MAP DATATYPES OF CASSANDRA
1) CREATE KEYSPACE key
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 2};
2)USE key;
3) ALTER KEYSPACE Key WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
4)Creating TABLE(COLUMN-FAMILY IN CLI) WITHIN A KEYSPACE
a)CREATE TABLE users (
user_name varchar,
password varchar,
gender varchar,
session_token varchar,
state varchar,
birth_year bigint,
PRIMARY KEY (user_name));    // PRIMARY KEY
b)CREATE TABLE emp (
empID int,
deptID int,
first_name varchar,
last_name varchar,
PRIMARY KEY (empID, deptID)); //COMPOSITE PRIMARY KEY
5) INSERTION IN TABLE
INSERT INTO emp (empID, deptID, first_name, last_name)
VALUES (104, 15, 'jane', 'smith');
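Rows can be read back with SELECT, much as in SQL. Two illustrative queries (the user_name value is hypothetical; the emp query matches the row inserted above):
SELECT * FROM users WHERE user_name = 'jsmith';
SELECT first_name, last_name FROM emp WHERE empID = 104;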
6) SYSTEM KEYSPACE:
The system keyspace includes a number of tables that contain details about your Cassandra
database objects and cluster configuration. Cassandra populates these tables and others in
the system keyspace with keyspace, table, and column information. An alternative to the
Thrift API describe_keyspaces function is querying system.schema_keyspaces directly. You can
also retrieve information about tables by querying system.schema_columnfamilies and
about column metadata by querying system.schema_columns.
Procedure
Query the defined keyspaces using the SELECT statement
SELECT * from system.schema_keyspaces;
Note: To kill all processes running under CassandraDaemon, use:
pgrep -u user -f cassandra | xargs kill -9 (here user is cloudera)
MAHOUT
Mahout is an open source machine learning library from Apache. The algorithms it
implements fall under the broad umbrella of machine learning or collective intelligence. This
can mean many things, but at the moment for Mahout it means primarily recommender
engines (collaborative filtering), clustering, and classification. It's also scalable. Mahout aims
to be the machine learning tool of choice when the collection of data to be processed is very
large, perhaps far too large for a single machine. In its current incarnation, these scalable
machine learning implementations in Mahout are written in Java, and some portions are
built upon Apache's Hadoop distributed computation project.
Recommender engines
Recommender engines are the most immediately recognizable machine learning technique
in use today. You'll have seen services or sites that attempt to recommend books or movies
or articles based on your past actions. They try to infer tastes and preferences and identify
unknown items that are of interest:
1) Amazon.com is perhaps the most famous e-commerce site to deploy recommendations.
Based on purchases and site activity, Amazon recommends books and other items likely to
be of interest. See figure 1.2.
2) Netflix similarly recommends DVDs that may be of interest, and famously offered a
$1,000,000 prize to researchers who could improve the quality of their recommendations.
3) Dating sites like Líbímseti (discussed later) can even recommend people to people.
4) Social networking sites like Facebook use variants on recommender techniques to identify
people most likely to be as-yet-unconnected friends.
Clustering
Clustering is less apparent, but it turns up in equally well-known contexts. As its name
implies, clustering techniques attempt to group a large number of things together into
clusters that share some similarity.
1) Google News groups news articles by topic using clustering techniques, in order to
present news grouped by logical story, rather than presenting a raw listing of all articles.
2) Search engines like Clusty group their search results for similar reasons.
3) Consumers may be grouped into segments (clusters) using clustering techniques based on
attributes like income, location, and buying habits.
Classification
Classification techniques decide how much a thing is or isn't part of some type or category,
or how much it does or doesn't have some attribute. Classification, like clustering, is
ubiquitous, but it's even more behind the scenes. Often these systems learn by reviewing
many instances of items in the categories in order to deduce classification rules. This general
idea has many applications:
1) Yahoo! Mail decides whether or not incoming messages are spam based on prior emails
and spam reports from users, as well as on characteristics of the email itself.
2) Google's Picasa and other photo-management applications can decide when a region of
an image contains a human face.
3) Optical character recognition software classifies small regions of scanned text into
individual characters.
4) Apple's Genius feature in iTunes reportedly uses classification to classify songs into
potential playlists for users.