Faculty of computers and information
Information System Department
BIGDATA ANALYTICAL DISTRIBUTED SYSTEM
Under Supervision
Prof. Dr. Ali H. El-Bastawissy Eng. Ali Zidane
Educational Sponsor
Team Members
Name ID Group
Adel Awd El Agawany 20090183 IS-DS
Ayman Mohamed Mahmood 20090067 IS-DS
Hazem Ahmed Talat 20090101 IS-DS
Mohamed Abd el-aal El-Tantawy 20090282 IS-DS
Mohamed Ahmed Saber 20090250 IS-DS
Mohamed Fayez Khater 20090293 IS-DS
Acknowledgment
We would like to express our deepest appreciation to all those who gave us the possibility to complete this project. Special gratitude goes to our final-year project supervisors, Prof. Dr. Ali H. El-Bastawissy and Eng. Ali Zidane, whose stimulating suggestions and encouragement helped us to coordinate our project.
Furthermore, we would also like to acknowledge with much appreciation the crucial role of our educational sponsor ' ', which gave us permission to use all required and necessary materials to complete the project "DataLytics". Special thanks go to all our teammates.
Abstract
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers" in a cluster.
We compared more than one tool for clustering and for processing this very large volume of data to produce analysis and visualization. We decided to use the Apache Hadoop software for clustering and managing machines, and then integrated Hadoop with the R programming language to perform our analytics and solve the big data problem.
1.1 Introduction
In a broad range of application areas, data is being collected at unprecedented
scale. Decisions that previously were based on guesswork, or on painstakingly
constructed models of reality, can now be made based on the data itself. Such Big
Data analysis now drives nearly every aspect of our modern society, including mobile
services, retail, manufacturing, financial services, life sciences, and physical sciences.
Big Data has the potential to revolutionize not just research, but also education. A
recent detailed quantitative comparison of different approaches taken by 35 charter
schools in NYC has found that one of the top five policies correlated with measurable
academic effectiveness was the use of data to guide instruction. Imagine a world in
which we have access to a huge database where we collect every detailed measure of
every student's academic performance. This data could be used to design the most
effective approaches to education, starting from reading, writing, and math, to
advanced, college-level, courses. We are far from having access to such data, but
there are powerful trends in this direction. In particular, there is a strong trend for
massive Web deployment of educational activities, and this will generate an increasingly
large amount of detailed data about students' performance.
There have been persuasive cases made for the value of Big Data for urban
WHERE user_uuid = 88b8fd18-b1ed-4e96-bf79-4280797cba80;
TRUNCATE <column_family>;
A TRUNCATE statement results in the immediate, irreversible removal of all data in the named column family.
Example
TRUNCATE user_activity;
USE
Connects the current client session to a keyspace.
Synopsis
USE <keyspace_name>;
Description
A USE statement tells the current client session and the connected Cassandra instance which keyspace you will be working in.
Example
USE PortfolioDemo;
DROP INDEX
Drops the named secondary index.
Synopsis
DROP INDEX <name>;
Description
A DROP INDEX statement removes an existing secondary index. If the index was not given a name during creation, the index name is <columnfamily_name>_<column_name>_idx.
Example
DROP INDEX user_state;
DROP INDEX users_zip_idx;
DROP TABLE
Removes the named column family.
Synopsis
DROP TABLE <name>;
Description
A DROP TABLE statement results in the immediate, irreversible removal of a column family, including all data contained in the column family. You can also use the alias DROP COLUMNFAMILY.
Example
DROP TABLE worldSeriesAttendees;
CQL (Platforms) Support:
PHP
Python
Java
Ruby
CQL (IDEs) Support:
Eclipse
Netbeans
Python
PART Four: Work Environment
4.1 Clustering
4.1.1 Concept of Clustering
A computer cluster consists of a set of loosely connected or tightly connected machines that work together so that in many respects they can be viewed as a single system.
The components of a cluster are usually connected to each other through a fast local area network ("LAN"), with each node (a computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.
Clusters are usually deployed to improve performance and availability over that of a single machine, while typically being much more cost-effective than a single machine of comparable speed or availability.
The desire to get more computing power and better reliability by orchestrating a number of low cost commercial off-the-shelf computers has given rise to a variety of architectures and configurations.
The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast local area network. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit.
Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer to peer or grid computing which also uses many nodes, but with a far more distributed nature.
A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast supercomputer. A basic approach to building a cluster is that of a Beowulf cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional high performance computing.
Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance.
Figure 4.1 Computer clustering
4.1.2 Clustering characteristics:
Consists of many of the same or similar type of machines
Tightly coupled, using dedicated network connections
All machines share resources such as a common home directory
They must trust each other, so that RSH or SSH does not require a password; otherwise you would need to do a manual start on each machine.
4.1.3 Clustering attributes:
Computer clusters may be configured for different purposes ranging from general-purpose business needs such as web-service support, to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach.
Note that the attributes described below are not exclusive, and a "compute cluster" may also use a high-availability approach, etc.
Load balancing clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized. However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster which may just use a simple round-robin method by assigning each new request to a different node.
"Compute clusters" are used for computation-intensive purposes, rather than handling IO-oriented operations such as web service or databases. For instance, a computer cluster might support computational simulation of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach supercomputing.
4.1.4 Clustering Benefits:
Low Cost: Customers can eliminate the cost and complexity of procuring, configuring and operating HPC clusters with low, pay-as-you-go pricing. Further, you can optimize costs by leveraging one of several pricing models: On Demand, Reserved or Spot Instances.
Elasticity: You can add and remove compute resources to meet the size and time requirements for your workloads.
Run Jobs Anytime, Anywhere: You can launch compute jobs using simple APIs or management tools and automate workflows for maximum efficiency and scalability. You can increase your speed of innovation by accessing compute resources in minutes instead of spending time in queues.
4.1.5 Clustering management:
One of the challenges in the use of a computer cluster is the cost of administering it, which can at times be as high as the cost of administering N independent machines if the cluster has N nodes. In some cases this provides an advantage to shared-memory architectures with lower administration costs. This has also made virtual machines popular, due to the ease of administration.
Task scheduling
When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge. In a heterogeneous CPU-GPU cluster, which has a complex application environment, the performance of each job depends on the characteristics of the underlying cluster, and mapping tasks onto CPU cores and GPU devices provides significant challenges. This is an area of ongoing research, and algorithms that combine and extend MapReduce and Hadoop have been proposed and studied.
Node failure management
When a node in a cluster fails, strategies such as "fencing" may be employed to keep the rest of the system operational. Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods; one disables a node itself, and the other disallows access to resources such as shared disks.
4.1.6 Problems solved using Clustering:
Clusters can be used to solve three typical problems in a data center environment:
Need for High Availability. High availability refers to the ability to provide end user access to a service for a high percentage of scheduled time while attempting to reduce unscheduled outages. A solution is highly available if it meets the organization's scheduled uptime goals. Availability goals are achieved by reducing unplanned downtime and then working to improve total hours of service operation.
Need for High Reliability. High reliability refers to the ability to reduce the frequency of system failure, while attempting to provide fault tolerance in case of failure. A solution is highly reliable if it minimizes the number of single points of failure and reduces the risk that failure of a single component/system will result in the outage of the entire service offering. Reliability goals are achieved using redundant, fault tolerant hardware components, application software and systems.
Need for High Scalability. High scalability refers to the ability to add resources and computers while attempting to improve performance. A solution is highly scalable if it can be scaled up and out. Individual systems in a service offering can be scaled up by adding more resources (for example, CPUs, memory, disks, etc.). The service can be scaled out by adding additional computers.
4.1.7 Setting up single node cluster using Hadoop
Prerequisites
Sun Java 6
Hadoop requires a working Java 1.5+ installation. However, using Java 1.6 is recommended for running Hadoop.
# Add the Ferramosca Roberto's repository to your apt repositories
# See https://launchpad.net/~ferramroberto/
#
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java
# Update the source list
$ sudo apt-get update
# Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk
# Select Sun's Java as the default on your machine.
# See 'sudo update-alternatives --config java' for more information.
#
$ sudo update-java-alternatives -s java-6-sun
The full JDK will be placed in /usr/lib/jvm/java-6-sun.
After installation, make a quick check whether Sun's JDK is correctly set up:
user@ubuntu:~# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation
from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
This will add the user hduser and the group hadoop to your local machine.
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote command execution, and other secure network services between two networked computers.
We assume that you have SSH up and running on your machine and have configured it to allow SSH public key authentication.
First, we have to generate an SSH key for the hduser user.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The ssh-keygen command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686
GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled.
Hadoop installation:
After downloading Hadoop from the Apache Download Mirrors, extract the contents of the Hadoop package to a location of your choice. For example /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
export PATH=$PATH:$HADOOP_HOME/bin
Configurations:
Hadoop-env.sh
The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Next we will configure the directory where Hadoop stores its data files and the network ports it listens on. Our setup will use Hadoop's Distributed File System, HDFS.
Now we create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the
cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
Running a MapReduce job
We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred.
Download example input data
We will use three text files as example input:
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
hduser@ubuntu:~$
Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
Copy the local example data to HDFS before running the job:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
Now we run the WordCount job. The command below will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.
Example console session for running the job and retrieving its output:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
[...snipp...]
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
hduser@ubuntu:/usr/local/hadoop$
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
Figure 4.2 NameNode web interface
JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the ‘‘local machine’s’’ Hadoop log files (the machine on which the web UI is running on).
By default, it’s available at http://localhost:50030/.
Figure 4.3 Job tracker web interface
TaskTracker Web Interface (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access to the ‘‘local machine’s’’ Hadoop log files.
By default, it’s available at http://localhost:50060/.
Figure 4.4 Task tracker web interface
4.1.8 Setting up a multi cluster computing using Hadoop:
From two single-node clusters to a multi-node cluster – we will build a multi-node cluster using two Ubuntu boxes. In my humble opinion, the best way to do this for starters is to install, configure and test a "local" Hadoop setup for each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master (but also act as a slave with regard to data storage and processing), and the other box will become only a slave. It's much easier to track down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on each machine.
Figure 4.5 Multi node cluster components
Prerequisites:
Firstly, the single-node cluster must be installed and configured on both machines as shown before, and it is recommended to use the same installation settings and folder paths because the two machines will be merged. One of the Ubuntu boxes will act as a master and slave simultaneously, and the other box will act as a slave only.
Network configuration:
This should hardly come as a surprise, but for the sake of completeness I have to point out that both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network with regard to hardware and software configuration. To make it simple, we will assign the IP address 192.168.0.1 to the master machine and 192.168.0.2 to the slave machine.
90 | P a g e
SSH configuration:
The hduser user on the master (hduser@master) must be able to connect:
a. to its own user account on the master, and
b. to the hduser user account on the slave (hduser@slave)
via a password-less SSH login. You just have to add hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in this user's $HOME/.ssh/authorized_keys). You can do this manually or use the following SSH command:
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
The final step is to test the SSH setup by connecting with user hduser from the master to the user account hduser on the slave. The step is also needed to save slave’s host key fingerprint to the hduser@master’s known_hosts file.
So, connecting from master to master…
hduser@master:~$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (RSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
hduser@master:~$
And from master to slave.
hduser@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (RSA) to the list of known hosts.
Ubuntu 10.04
...
hduser@slave:~$
Hadoop configuration:
We will prepare one Ubuntu box to act as both master and slave, and the other box to act as a slave only.
Figure 4.6 multi node cluster layers
The master node will run the “master” daemons for each layer: Name Node for the HDFS storage layer and Job Tracker for the Map Reduce processing layer. Both machines will run the “slave” daemons: Data Node for the HDFS layer and Task Tracker for Map Reduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.
Configuration
Conf/masters (master only)
One machine will act as the master, running the NameNode and the JobTracker; the remaining machines, however many there are, will act as slaves, each running a DataNode and a TaskTracker.
On master, update conf/masters so that it looks like this:
master
Conf/slaves (master only):
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (Data Nodes and Task Trackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
On master, update conf/slaves so that it looks like this:
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.
You must change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.
First, we have to change the fs.default.name parameter (in conf/core-site.xml), which specifies the Name Node (the HDFS master) host and port. In our case, this is the master machine.
Conf/core-site.xml (ALL machines)
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
Second, we have to change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
Conf/mapred-site.xml (ALL machines)
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
Third, we change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of available slave nodes (more precisely, the number of DataNodes), you will start seeing a lot of “(Zero targets found, forbidden1.size=1)” type errors in the log files.
The default value of dfs.replication is 3. However, we have only two nodes available, so we set dfs.replication to 2.
Conf/hdfs-site.xml (ALL machines)
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Formatting the HDFS file system via the Name Node
Before we start our new multi-node cluster, we must format Hadoop’s distributed file system (HDFS) via the NameNode. You need to do this the first time you set up a Hadoop cluster.
Warning: Do not format a running cluster because this will erase all existing data in the HDFS file system!
To format the file system (which simply initializes the directory specified by the dfs.name.dir variable on the NameNode), run the command:
hduser@master:/usr/local/hadoop$ bin/hadoop namenode -format
Example output:
... INFO dfs.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully
formatted.
hduser@master:/usr/local/hadoop$
Starting the multi-node cluster
Starting the cluster is performed in two steps.
1. We begin with starting the HDFS daemons: the NameNode daemon is started on master, and DataNode daemons are started on all slaves (here: master and slave).
2. Then we start the MapReduce daemons: the JobTracker is started on master, and TaskTracker daemons are started on all slaves (here: master and slave).
HDFS daemons
Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.
In our case, we will run bin/start-dfs.sh on master:
Start the HDFS layer
hduser@master:/usr/local/hadoop$ bin/start-dfs.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-
master.out
slave: Ubuntu 10.04
slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-
slave.out
master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-
master.out
master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-
secondarynamenode-master.out
hduser@master:/usr/local/hadoop$
On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-datanode-slave.log.
Example output:
... INFO org.apache.hadoop.dfs.Storage: Storage directory /app/hadoop/tmp/dfs/data is not
formatted.
... INFO org.apache.hadoop.dfs.Storage: Formatting ...
... INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
... INFO org.mortbay.util.Credential: Checking Resource aliases
... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
... INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of
3538203msec
As you can see in slave’s output above, it will automatically format its storage directory (specified by the dfs.data.dir parameter) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master…
Java processes on master after starting HDFS daemons
hduser@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$
And the following on slave:
Java processes on slave after starting HDFS daemons
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps
hduser@slave:/usr/local/hadoop$
MapReduce daemons
Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master. On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-tasktracker-slave.log. Example output:
... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50060
... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@1e63e3d
... INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50050: starting
... INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050: starting
... INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: 50050
... INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_slave:50050
... INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050: starting
... INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all
reduce tasks on tracker_slave:50050
At this point, the following Java processes should run on master…
Java processes on master after starting MapReduce daemons
hduser@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$
And the following on slave:
Java processes on slave after starting MapReduce daemons
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
hduser@slave:/usr/local/hadoop$
Stopping the multi-node cluster
Like starting the cluster, stopping it is done in two steps. The workflow however is the opposite of starting.
1. We begin with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped on all slaves (here: master and slave).
2. Then we stop the HDFS daemons: the NameNode daemon is stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).
MapReduce daemons
Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
hduser@master:/usr/local/hadoop$ bin/stop-mapred.sh
stopping jobtracker
slave: Ubuntu 10.04
slave: stopping tasktracker
master: stopping tasktracker
hduser@master:/usr/local/hadoop$
Note: The output above might suggest that the JobTracker was running and stopped on
“slave“, but you can be assured that the JobTracker ran on “master“.
At this point, the following Java processes should run on master…
Java processes on master after stopping MapReduce daemons
hduser@master:/usr/local/hadoop$ jps
14799 NameNode
18386 Jps
14880 DataNode
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$
And the following on slave:
Java processes on slave after stopping MapReduce daemons
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
18636 Jps
hduser@slave:/usr/local/hadoop$
HDFS daemons
Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
Stopping the HDFS layer
hduser@master:/usr/local/hadoop$ bin/stop-dfs.sh
stopping namenode
slave: Ubuntu 10.04
slave: stopping datanode
master: stopping datanode
master: stopping secondarynamenode
hduser@master:/usr/local/hadoop$
(Again, the output above might suggest that the NameNode was running and stopped on slave, but you can be assured that the NameNode ran on master)
At this point, only the following Java processes should run on master…
Java processes on master after stopping HDFS daemons
hduser@master:/usr/local/hadoop$ jps
18670 Jps
hduser@master:/usr/local/hadoop$
And the following on slave:
Java processes on slave after stopping HDFS daemons
hduser@slave:/usr/local/hadoop$ jps
18894 Jps
hduser@slave:/usr/local/hadoop$
4.2 R
4.2.1 What is R?
R is a free software programming language and a software environment
for statistical computing and graphics. The R language is widely used among
statisticians and data miners for developing statistical software and data analysis. Polls
and surveys of data miners are showing R's popularity has increased substantially in
recent years.
R is an implementation of the S programming language combined with lexical
scoping semantics inspired by Scheme. S was created by John Chambers while at Bell
Labs. R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Development Core Team,
of which Chambers is a member. R is named partly after the first names of the first two
R authors and partly as a play on the name of S.
R is a GNU project. The source code for the R software environment is written
primarily in C, Fortran, and R. R is freely available under the GNU General Public
License, and pre-compiled binary versions are provided for various operating systems.
R uses a command line interface; however, several graphical user interfaces are
available for use with R.
4.2.2 Statistical features
R provides a wide variety of statistical and graphical techniques,
including linear and nonlinear modeling, classical statistical tests, time-series analysis,
classification, clustering, and others. R is easily extensible through functions and
extensions, and the R community is noted for its active contributions in terms of
packages. There are some important differences, but much code written for S runs
unaltered. Many of R's standard functions are written in R itself, which makes it easy for
users to follow the algorithmic choices made. For computationally intensive
tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users
can write C or Java code to manipulate R objects directly.
R is highly extensible through the use of user-submitted packages for specific
functions or specific areas of study. Due to its S heritage, R has stronger object-oriented
programming facilities than most statistical computing languages. Extending R is also
eased by its lexical scoping rules.
A strength of R is static graphics, which can produce publication-quality graphs,
including mathematical symbols. Dynamic and interactive graphics are available
through additional packages.
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hard copy.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
An effective data handling and storage facility,
A suite of operators for calculations on arrays, in particular matrices,
A large, coherent, integrated collection of intermediate tools for data analysis,
Graphical facilities for data analysis and display either on-screen or on hardcopy, and
A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
4.2.4 Basic syntax of the language
> x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
> y <- x^2            # Square the elements of x
> print(y)            # Print (vector) y
[1]  1  4  9 16 25 36
> mean(y)             # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y)              # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x)   # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
>                     # and store the results as lm_1
> print(lm_1)         # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)      x
     -9.333  7.000
> summary(lm_1)       # Compute and print statistics for the fit
4.3 PostgreSQL
PostgreSQL, often simply Postgres, is an object-relational database
management system (ORDBMS) available for many platforms including Linux,
FreeBSD, Solaris, Microsoft Windows and Mac OS X. It is released under the
PostgreSQL License, which is an MIT-style license, and is thus free and open source
software. PostgreSQL is developed by the PostgreSQL Global Development Group,
consisting of a handful of volunteers employed and supervised by companies such
as Red Hat and EnterpriseDB. It implements the majority of
the SQL:2008 standard, is ACID-compliant, is fully transactional (including all DDL
statements), has extensible data types, operators, index methods, functions,
aggregates, procedural languages, and has a large number of extensions written by
third parties.
The vast majority of Linux distributions have PostgreSQL available in supplied
packages. Mac OS X, starting with Lion, has PostgreSQL server as its standard default
database in the server edition, and PostgreSQL client tools in the desktop edition.
4.3.1 Connecting R with PostgreSQL using DBI
RPostgreSQL provides a DBI-compliant database connection from GNU R to PostgreSQL. Development of RPostgreSQL was supported via the Google Summer of Code 2008 program. The package is now available on the CRAN mirror network and can be installed via install.packages() from within R. We use it to retrieve data from the Postgres store into R in order to run MapReduce jobs and statistical analysis on the data.
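A minimal sketch of this connection (the database name, credentials and table below are hypothetical placeholders, not our actual configuration):
# Load the DBI-compliant PostgreSQL driver
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
# Connect to a (hypothetical) local database
con <- dbConnect(drv, dbname = "datalytics", host = "localhost",
                 user = "postgres", password = "postgres")
# Pull a table into an R data frame for analysis and MapReduce preparation
df <- dbGetQuery(con, "SELECT * FROM hashtags")
dbDisconnect(con)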
4.4 HBase
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.
HBase is not a direct replacement for a classic SQL database, although recently its performance has improved, and it is now serving several data-driven websites, including Facebook's Messaging Platform. In the parlance of Eric Brewer’s CAP theorem, HBase is a CP type system.
4.4.1 Connect R with HBase Using RHbase
This R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase; the functions for these operations are part of this package.
Installing the package requires that you first install and build Thrift. Once you have the libraries built, be sure they are in a path where the R client can find them (e.g. /usr/lib). This package was built and tested using Thrift 0.8.
The Thrift server is started with [hbase-root]/bin/hbase thrift start and by default listens on port 9090.
hb.init(host = "127.0.0.1", port = 9090)
4.5 RHadoop: Connecting R with Hadoop
The 'Big Data' explosion of the last few years has led to new infrastructure investments around storage and data architectures. Apache Hadoop has rapidly become a leading option for storing and performing operations on big data. Meanwhile, R has emerged as the tool of choice for data scientists modeling and running advanced analytics. Revolution Analytics brings R to Hadoop, giving companies a way to get better returns on their big data investments and extract unique, competitive insights from advanced analytics with the most cost-effective solution on the market. R users can:
Interface directly with the HDFS filesystem from R.
Import big-data tables into R from Hadoop filestores via Apache HBase.
Create big-data analytics by writing MapReduce tasks directly in the R language.
4.5.1 Using R with Hadoop
You can access data stored in Hadoop HDFS, and run MapReduce jobs, using standard R functions. Reading and writing are supported via the pipe() command, which either receives information from R or transmits data to R. A MapReduce job can be run synchronously or asynchronously as a shell script; both stdout and stderr can be redirected and captured if required.
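For example, a minimal sketch of moving data between HDFS and R through pipe(); the HDFS paths here are hypothetical:
# Stream a file out of HDFS via the hadoop CLI and read it into R
con <- pipe("hadoop fs -cat /user/hduser/gutenberg-output/part-00000")
counts <- read.table(con, col.names = c("word", "count"))
# Conversely, write R data into HDFS by piping into 'hadoop fs -put -'
out <- pipe("hadoop fs -put - /user/hduser/from-r.txt", open = "w")
writeLines(c("hello", "world"), out)
close(out)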
Revolution Analytics has released a set of R packages called RHadoop that allow direct access to Hadoop HDFS, HBase, and MapReduce. The HBase and HDFS interfaces provide a number of access functions; the goal of the project was to mirror the Java interfaces as much as possible. mapreduce() is the exception: the input and output directories are specified as they would be for a Hadoop MapReduce job, but the Map function and the Reduce function are both written in R.
RHadoop is a bridge between R, a language and environment to statistically
explore data sets, and Hadoop, a framework that allows for the distributed processing of
large data sets across clusters of computers. RHadoop is built out of 3 components
which are R packages: rmr, rhdfs and rhbase. Below, we will present each of those R
packages and cover their installation and basic usage.
4.5.3 RMR
The rmr package offers Hadoop MapReduce functionality in R. For Hadoop users, writing MapReduce programs in R may be considered easier, more productive and more elegant, with much less code than in Java and easier deployment. It is great for prototyping and research. For R users, it opens the doors of MapReduce programming and access to big data analysis.
The rmr package must not be seen as mere Hadoop streaming, even if internally it uses the streaming architecture. You can do Hadoop streaming with R without any of these packages, since the language supports stdin and stdout access. Also, rmr programs are not meant to be more efficient than those written in Java and other languages.
Finally:
RMR does not provide a MapReduce version of any of the more than 3000 packages available for R. Nor does it solve the problem of parallel programming: you still have to write parallel algorithms for any problem you need to solve, but you can focus only on the interesting aspects. Some problems are believed not to be amenable to a parallel solution, and using the MapReduce paradigm or rmr does not change that.
4.5.4 Rhdfs
The rhdfs package offers basic connectivity to the Hadoop Distributed File System. It comes with convenient functions to browse, read, write, and modify files stored in HDFS.
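A minimal sketch of the kind of calls involved (the paths are hypothetical; rhdfs assumes the HADOOP_CMD environment variable points at the hadoop binary):
library(rhdfs)
hdfs.init()                      # initialize the connection to HDFS
hdfs.ls("/user/hduser")          # browse a directory
hdfs.put("/tmp/local.txt", "/user/hduser/remote.txt")   # copy local -> HDFS
hdfs.get("/user/hduser/remote.txt", "/tmp/copy.txt")    # copy HDFS -> local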
4.5.5 Rhbase
The rhbase package offers basic connectivity to HBase. It comes with convenient functions to browse, read, write, and modify tables stored in HBASE.
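A minimal sketch, assuming the HBase Thrift server is running locally and that a table to inspect exists (the table name is hypothetical):
library(rhbase)
hb.init(host = "127.0.0.1", port = 9090)   # connect via the HBase Thrift server
hb.list.tables()                           # browse the existing tables
hb.describe.table("hashtags")              # inspect one table's column families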
You must have at your disposal a working installation of Hadoop. RHadoop is recommended and tested with the Cloudera CDH3 distribution. Consult the RHadoop wiki for alternative installations and future evolution. At the time of this writing, the Cloudera CDH4 distribution is not yet compatible and is documented as a work in progress. All the common Hadoop services must be started, as well as the HBase Thrift server in case you want to test rhbase.
Below I have translated my Chef recipes into shell commands. Please contact me directly if you find an error or wish to read the original Chef recipes.
Thrift dependency
Note, if my memory is accurate, maven (apt-get install maven2) might also be required.
R installation, environment and package dependencies:
Figure 4.9 Installing R base
RMR
Figure 4.10 Installing RMR package
Rhdfs
Figure 4.11 Installing Rhdfs Package
Rhbase
Figure 4.12 Installing Rhbase package
Usage
We are now ready to test our installation. Let's use the second example present in the tutorial of the RHadoop wiki. This example starts with a standard R script which generates a list of values and counts their occurrences:
Figure 4.13 Standard R script 1
It then translates that script into a scalable MapReduce script:
Figure 4.14 Converting R script into Mapreduce
The result is now stored inside the ‘/tmp’ folder of HDFS. Here are two commands to print the file path and the file content:
Figure 4.15 Results commands
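Since those figures are screenshots, here is a hedged textual sketch of the same flow, following the RHadoop wiki example of that era (exact argument names varied between rmr versions):
library(rmr)
# Plain R: generate random values and count the occurrences of each one
groups <- rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)
# The scalable equivalent: push the data to HDFS and count with MapReduce
groups.dfs <- to.dfs(groups)
out <- mapreduce(input  = groups.dfs,
                 map    = function(k, v) keyval(v, 1),
                 reduce = function(k, vv) keyval(k, length(vv)))
from.dfs(out)   # read the result back from HDFS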
If you want to go through big data analysis and want to implement and run MapReduce jobs deployed on a distributed system, you should start from here.
As a first step, we configure and establish the Hadoop distributed file system and start the NameNode, secondary NameNode, DataNode, JobTracker and TaskTracker.
Figure 4.16 Starting Hadoop
The second step is to copy the files you want from the local directory to the Hadoop distributed file system, for example: bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/input. You can browse the files in HDFS only through the Hadoop web interface.
Then we have to establish a connection between Hadoop and RStudio, where we will write and run the MapReduce job and visualize the resulting output. To do this, we must install some packages in R to handle the connection:
Packages courtesy of Revolution Analytics:
Rhdfs – access to HDFS functions
Rhbase – access to HBase
RMR – run MapReduce jobs
The next step is to implement a MapReduce job in R that counts the number of occurrences of each word. The job is placed on the master, and the processing required for the job is divided into tasks which run on the slaves.
The MapReduce job runs through the rmr package using Hadoop streaming, which converts the R script into a form that Hadoop can work with.
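A hedged sketch of what such a word-count job looks like in rmr (the input/output paths are ours from above; the text-input argument name varied between rmr versions, e.g. textinputformat in early rmr versus input.format in rmr2):
library(rmr)
wordcount <- function(input, output = NULL) {
  mapreduce(
    input  = input,
    output = output,
    textinputformat = rawtextinputformat,   # treat each input line as raw text
    # map: split each line on spaces and emit a (word, 1) pair per word
    map    = function(k, line) lapply(strsplit(line, " ")[[1]],
                                      function(w) keyval(w, 1)),
    # reduce: sum all counts emitted for one word
    reduce = function(word, counts) keyval(word, sum(unlist(counts))))
}
out <- wordcount("/user/hduser/gutenberg", "/user/hduser/gutenberg-output")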
Wait until the map and reduce phases are complete; this will take some time.
Figure 4.17 RStudio interface
After completion of the MapReduce job, we go to the output folder to see the result of the word count; you can explore the file and download it from the Hadoop web interface.
Figure 4.18 Word count example output
Figure 4.19 MapReduce word count example
Figure 4.20 Data visualization using R
Now we can go to the last step: data visualization.
In the given example we want to find the five most repeated words in the data.
In the next graph you will see the top 5 words marked with red points.
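A sketch of how such a plot can be produced in base R from the word-count output (the part-file name is an assumption; streaming output file names can differ):
# Read the word/count pairs back out of HDFS
con <- pipe("hadoop fs -cat /user/hduser/gutenberg-output/part-00000")
counts <- read.table(con, col.names = c("word", "count"),
                     stringsAsFactors = FALSE)
# Keep the five most frequent words
top5 <- head(counts[order(-counts$count), ], 5)
# Plot their counts, marking each word with a red point as in the figure
plot(top5$count, xaxt = "n", xlab = "word", ylab = "occurrences",
     col = "red", pch = 19)
axis(1, at = 1:5, labels = top5$word)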
Big Data analytics Case Study [Social Network #tags analysis]:
We have two files: the first one is [Facebook hash tags] and the other is [Twitter hash tags]. They contain hash tags that people write; every hash tag written on Facebook or Twitter is recorded in these files. What is the benefit of making analysis on this?
The First file
The second file
After running the MapReduce job that implements word count on these input files, the output will be:
Visualizing the results gives the most popular hash tags in the social networks. This is useful for ranking hash tags in the list of recommendations when a user searches for a specific hash tag.
4.6 Developing the graphical user interface
1- This is the window that appears after running the GUI
Figure 4.21 Window that appears after running the GUI.
2- Press the "Run hadoop" button
Figure 4.22 Window shown while the GUI is running
3- The "Run hadoop" button is active
Figure 4.23 Hadoop starts running when "Run hadoop" is pressed.
4- Now Hadoop is running
Figure 4.24 Hadoop is now running
5- Run a jar file and get output (e.g. the word count example)
Figure 4.25 Running the word count example.
6- A new window opens to get the output file name
Figure 4.26 New window opened for entering the output file name.
7- After running the jar file, we call the JobTracker web interface to monitor the job's progress
Figure 4.27 Calling the JobTracker web interface.
7.1- Mapping and reducing in progress for job ID job_201306211046_0002
Figure 4.28 This snapshot shows the progress of job ID job_201306211046_0002
7.2- The map/reduce job is completed (100% map and 100% reduce) for job ID job_201306211046_0002
Figure 4.29 Completed MapReduce job
7.3- Press the job ID to show its summary
Figure 4.30 Press the job ID link to show its summary.
7.4- Map/reduce job summary of job ID job_201306211046_0002
Figure 4.31 Job summary
8- We can show the progress of all task trackers through the JobTracker web interface as follows:
Figure 4.32 Show task trackers
9- Show all task trackers in the cluster and select a task tracker to show its progress
Figure 4.33 Show all task tracker links.
10- The slave task tracker while applying job ID job_201306211046_0002
Figure 4.34 Slave task tracker
11- The progress of one slave task tracker while applying job ID job_201306211046_0002
Figure 4.35 Slave task tracker progress
12- After the slave task tracker completes its task
Figure 4.36 Slave task tracker completes its task.
13- The master task tracker while applying job ID job_201306211046_0002
Figure 4.37 Master task tracker
14- The progress of one master task tracker while applying job ID job_201306211046_0002
Figure 4.38 Master task tracker progress
15- After the master task tracker completes its job
Figure 4.39 Master task tracker completes its task.
16- We call the NameNode web interface to browse the file systems of both master and slaves
Figure 4.40 Calling the NameNode web interface
17- The NameNode interface to browse the file systems of all nodes (masters and slaves)
Figure 4.41 NameNode web interface.
18- The active nodes link
Figure 4.42 Active nodes link.
19- Active nodes
Figure 4.43 Browsing active nodes.
19.1- If we browse the master machine, press User to get the user name of Hadoop on the master machine
Figure 4.44 Browsing the master file system
19.2- The user of the master node is called "dola"; press it to show the directory in which HDFS stores its data
Figure 4.45 Browsing the master user to show the file system.
19.3- The content of HDFS, which includes the output file called "Adel_awd_el-agawany" that was entered before
Figure 4.46 Master file system content including the output file name entered before.
19.4- Open the output file and show the output data
Figure 4.47 Display output file content.
19.5- The output file content.
Figure 4.48 Display output file content
20- Call CMD to write any command you want
Figure 4.49 Calling CMD to write commands
21- Stopping Hadoop
Figure 4.50 Click stop Hadoop
22- Hadoop is stopped
Figure 4.51 Hadoop is stopped
1) We wanted to run Hadoop and control each node through the GUI. We downloaded the eclipse-hadoop plugin (https://code.google.com/p/hadoop-eclipse-plugin/downloads/detail?name=hadoop-0.20.3-dev-eclipse-plugin.jar&can=2&q=) but it didn't work; we also tried to create a new plugin (http://iredlof.com/part-4-compile-hadoop-v1-0-4-eclipse-plugin-on-ubuntu-12-10/) but that didn't work either, we think because we used hadoop-1.1.0.
2) We tried to change the layout of the output web-based interface, but we could change only the CSS file.
3) While running a MapReduce job, the error we got was that the JobTracker was in safe mode.
We set up a single-node Hadoop cluster (v 1.0.3) on Ubuntu and installed R (2.15) and the RHadoop packages rmr, rhdfs and rhbase. When we tried to run the first tutorial program, we ran into errors:
22/6/13 11:49:20 ERROR streaming.StreamJob: Job not successful. Error: # of
12/08/07 11:49:20 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if
(is.list(input)) { :
hadoop streaming failed with error code 1
Solution:
Reinstall rmr in a system directory rather than a user-specific directory. It could very well be that when R is started as part of Hadoop its .libPaths() is more limited than when R is used interactively. Then check the Hadoop streaming package and that streaming works under the Hadoop environment.
How Streaming Works:
A typical streaming invocation (from the Hadoop streaming documentation) looks like this:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it
converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
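To make the protocol concrete, here is a hedged sketch of a word-count mapper and reducer written as plain R streaming executables (no rmr involved; the file names mapper.R and reducer.R are our own). Each reads stdin line by line and writes tab-separated key/value lines to stdout, exactly as the protocol above describes:
#!/usr/bin/env Rscript
# mapper.R -- emit "word<TAB>1" for every word on every input line
input <- file("stdin", open = "r")
while (length(line <- readLines(input, n = 1)) > 0) {
  for (w in strsplit(line, " ")[[1]])
    cat(w, "\t", 1, "\n", sep = "")
}

#!/usr/bin/env Rscript
# reducer.R -- lines arrive sorted by key, so sum counts of consecutive equal keys
input <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(input, n = 1)) > 0) {
  kv <- strsplit(line, "\t")[[1]]
  if (!identical(kv[1], current)) {
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    current <- kv[1]; total <- 0
  }
  total <- total + as.numeric(kv[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
These would be passed to the streaming jar with the -mapper and -reducer options shown above.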
You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
A user can specify stream.non.zero.exit.is.failure as true or false to make a streaming task that exits with a non-zero status be Failure or Success, respectively. By default, streaming tasks exiting with non-zero status are considered failed tasks.
1) Change the block size to improve the MapReduce job. Also create more than one HDFS home directory, which can help save data in different directories instead of saving it all in one directory, to avoid losing all the data in case of a hard disk failure.
2) Create a second master machine and cluster it with the base master machine to avoid a single point of failure. First we tried to use the Ubuntu MPICH package: we created the second machine on external USB storage, ran the job from it and then unplugged it, but it didn't work. Secondly, we decided the second master machine should be on a separate virtual server (VS) clustered with the other VS that holds the first master virtual machine, so we used an ESXi 5 server and connected to it using VMware vSphere Client to add, remove and edit the virtual machines. We also used VMware vCenter Converter Standalone Client to convert the pre-defined virtual machines (the master and slave machines) and upload them to the ESXi server, and used vCenter from Windows Server 2008 R2 to cluster the virtual ESXi hosts. But because of lack of resources we could not test it; also, when we put the two master machines on the same server, we could not test it because of slow responses from the machines.
wouldnt test it because of slowe respond from the machines